Patent Issued for Self-contained system for de-identifying unstructured data in healthcare records (USPTO 11537748): Datavant Inc.

2023 JAN 17 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- From Alexandria, Virginia, NewsRx journalists report that a patent by the inventors Austin, Joseph (Sterling, MA, US), Bayless, Paul J. (Burke, VA, US), Kassam-Adams, Shahir (Lovingston, VA, US), LaBonte, Jason A. (Natick, MA, US), filed on January 23, 2019, was published online on December 27, 2022.

The assignee for patent number 11537748 is Datavant Inc. (San Francisco, California, United States).

News editors obtained the following quote from the background information supplied by the inventors: “There exists a vast store of information within the unstructured data fields of healthcare records that can be critical to properly understanding the effectiveness and safety of clinical treatment. However, due to their inherent lack of structure, de-identification of these data fields is a challenge due to the lack of an automated solution, leaving much of this information absent from analytical datasets. This challenge is one that was created by the implementation of databases and other data storage and analysis technologies combined with the unstructured nature of healthcare records combined with the need for meeting privacy regulations and requirements. Conventional healthcare data systems are limited in their ability to provide information from individual records in healthcare data sets because each record contains protected health information (“PHI”) or personal identification information (PII) (e.g., names, addresses, dates of birth, dates of death, social security numbers, etc.). It is a potential Health Insurance Portability and Accountability Act (HIPAA) violation to incorporate PHI elements into a healthcare data set. Accordingly, to be compliant with government regulations, all PHI data elements must be removed and/or de-identified before being incorporated into any healthcare data set. However, once PHI data elements are removed from record, users have no way to understand which individuals in the data set match particular structured or unstructured data relevant for analysis.

“Generally, conventional (systems, devices, methods) are unable to identify, flag, and remove protected health information (PHI) or personal identification information (PII) (e.g. names, addresses, dates of birth, dates of death, social security numbers, etc.) from unstructured data in an automated fashion. Instead, the current practice is to remove PHI or PII manually based on human determinations, or to not incorporate any unstructured data fields into healthcare data sets being used for non-clinical purposes, because of the inability to efficiently and accurately identify, flag, and remove PHI or PII using existing technologies.”

As a supplement to the background information on this patent, NewsRx correspondents also obtained the inventors’ summary information for this patent: “There is a need for improvements for enabling healthcare data sets within healthcare records of individuals to be accessible and useable without exposing protected healthcare information of the individual. There is a need for improvements for a system whereby unstructured data can be de-identified in an automated manner, but still be able to be matched to individual records in healthcare data sets without exposure of PHI or PII. Additionally, the de-identification process needs to be “tune-able” to control redaction of sensitive information while allowing information that looks like PHI or PII, but is not, to remain in the data set. Moreover, it would be desirable to have the de-identified text to remain coherent after the personal identifying information has been removed.

“The present invention is directed toward further solutions to address this need, in addition to having other desirable characteristics. Specifically, the present invention provides an advancement made in computer technology that consists of improvements defined by logical structures and processes directed to a specific implementation of a solution to a problem in software, data structures and data management, wherein the existing data structure technology relies upon unacceptable reproduction of protected health information, personal identification information, or other private information to transmit data for data processing purposes that cannot meet or be used under current requirements of the Health Insurance Portability and Accountability Act (HIPAA) (42 U.S.C. 1301 et seq.), and other laws, regulations, rules and standards governing privacy and data security (e.g. General Data Protection Regulation (Regulation (EU) 2016/679), Federal Trade Commission Act (15 U.S.C. 41-58), Children’s Online Privacy Protection Act (COPPA) (15 U.S.C. 6501-6506), Financial Services Modernization Act (Gramm-Leach-Bliley Act (GLB)) (15 U.S.C. 6801-6827) and California Consumer Privacy Act of 2018), by providing a system and method in which unstructured data is processed from individual records in a healthcare data set without exposing PHI. 
The present invention provides a system and method that creates a specific, non-abstract improvement to computer functionality previously incapable of merging certain data sets without exposing PHI and PII, that de-identifies data by removing protected health information and personal identification information from the record, adds a unique encrypted person token to each record, and merges the record with other healthcare data sets that have likewise been de-identified and tokenized by matching the unique encrypted person tokens in data sets to one another, thus maintaining the ability to match disparate data (e.g., unstructured data and structured healthcare data) from disparate sources for a same individual. In particular, the present invention implements a self-contained, dictionary-based system containing tunable lists of PHI or PII entities and data formats (e.g., names, birth dates, phone numbers, addresses, etc.) to be utilized to de-identify unstructured data within healthcare and other data sets. The dictionaries include blacklisted terms (e.g., first names, last names, etc.) to be redacted from the unstructured data and also include blacklisted standard number formats (e.g., social security numbers, telephone numbers, etc.) to be removed from records, a project specific whitelist of terms not to be removed (terms to be allowed to remain in the data despite being included in the blacklist), and a record-specific blacklist created from the PHI or PII present in a specific record (e.g., individual patient records). Taken together, these three dictionaries create an adjusted blacklist that can be “tuned” to control the level of de-identification and scrub a set of records. The present invention uses the dictionaries to remove all elements determined to be PII or PHI, but also replaces the removed elements with a case-type tag identifying the type of information being removed (e.g., “first name”, “address”, etc.). 
The addition of the case-type tag ensures that the unstructured data are still coherent even after information has been removed/redacted.
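
As an illustration only, the dictionary logic described in the summary can be sketched in a few lines of Python. This is not code from the patent: the term lists, the bracketed tag format, and the function names build_record_blacklist and scrub are assumptions made for the example, and simple word matching stands in for whatever matching the actual system performs.

```python
import re

# Hypothetical dictionaries; the patent does not publish its actual term lists.
BLACKLIST = {
    "john": "FIRST NAME",
    "smith": "LAST NAME",
    "may": "FIRST NAME",
    "boston": "CITY",
}
WHITELIST = {"may"}  # project-specific terms allowed to remain in the text


def build_record_blacklist(blacklist, whitelist, record_terms):
    """Remove whitelisted terms, then add the record-specific terms."""
    adjusted = {term: tag for term, tag in blacklist.items() if term not in whitelist}
    adjusted.update({term.lower(): tag for term, tag in record_terms.items()})
    return adjusted


def scrub(text, record_blacklist):
    """Replace each blacklisted word with a bracketed case-type tag."""
    def replace(match):
        word = match.group(0)
        tag = record_blacklist.get(word.lower())
        return f"[{tag}]" if tag else word

    return re.sub(r"[A-Za-z']+", replace, text)


note = "John Smith of Boston may need follow-up; spoke with daughter Avery."
record_terms = {"Avery": "FIRST NAME"}  # assumed to be known PHI for this record
record_blacklist = build_record_blacklist(BLACKLIST, WHITELIST, record_terms)
print(scrub(note, record_blacklist))
```

In this sketch the whitelist is applied before the record-specific terms are added, mirroring the order given in the summary, so a term known to be PHI for a particular record would still be redacted even if it also appears on the whitelist.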

“Additionally, to connect the resulting de-identified data to other data sets, the invention works in a manner consistent with other de-identification systems and methods. In particular, the records are tokenized in a standardized format to include encrypted patient tokens with every record. This “tokenized” data can then be merged with structured healthcare data sets that have also been de-identified and tokenized by matching the tokens in each data set against each other. In this way, users can connect individuals across healthcare data sets without ever seeing or using PHI or PII.
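
The token matching described in this passage can be sketched as follows. The patent excerpt does not say how the encrypted patient tokens are derived, so this example substitutes a keyed hash (HMAC-SHA256) over normalized identifying fields as a stand-in. The key, field names, sample rows, and the functions person_token and merge_by_token are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def person_token(first, last, dob):
    """Derive a deterministic token from normalized identifying fields."""
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def merge_by_token(notes, claims):
    """Inner-join two de-identified data sets on their shared token."""
    claims_by_token = {row["token"]: row for row in claims}
    return [
        {**claims_by_token[row["token"]], **row}
        for row in notes
        if row["token"] in claims_by_token
    ]


# Made-up sample rows: the tokens are computed before the identifying fields are dropped.
notes = [
    {"token": person_token("John", "Smith", "1970-01-02"),
     "note": "[FIRST NAME] reports improved sleep."},
]
claims = [
    {"token": person_token(" john", "SMITH", "1970-01-02"), "diagnosis_code": "G47.00"},
    {"token": person_token("Jane", "Doe", "1980-05-06"), "diagnosis_code": "E11.9"},
]
merged = merge_by_token(notes, claims)
print(merged)
```

Because both data sets derive the token from the same normalized fields with the same key, rows for the same person line up on the token alone, and the merged output contains no names or birth dates.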

“In accordance with example embodiments of the present invention, a method for de-identifying unstructured data within data sets is provided. The method includes initializing a blacklist dictionary and a whitelist dictionary, modifying the blacklist dictionary by removing terms included within the whitelist dictionary to create an adjusted blacklist dictionary, and augmenting the adjusted blacklist dictionary with a record-specific blacklist for each individual record within the data sets. The method also includes scrubbing personally identifiable information (PII) and protected health information (PHI) from each individual record utilizing the record-specific adjusted blacklist dictionary. The scrubbing includes removing all elements within each individual record determined to be PII or PHI according to terms in the record-specific adjusted blacklist dictionary, replacing removed elements with a case-type tag identifying a type of information being removed according to the record-specific adjusted blacklist dictionary, and repeating the removing and replacing step for each individual record within the data sets.

“In accordance with aspects of the present invention, the method can further include tokenizing and merging each individual record in the data sets. The blacklist dictionary can include standard terms and standard number formats to be removed from records within the data sets. The standard number formats can include social security numbers, telephone numbers, URLs, zip codes, email addresses, IP addresses, dates, patient IDs, record numbers, and insurance IDs. The standard formats can include cities, counties, first names, last names, prefixes, and medical terms. The whitelist dictionary can include terms allowed to remain in the data sets despite being included in the blacklist dictionary. The record-specific adjusted blacklist dictionary can include terms created from the PII or PHI present in specific individual records within the data sets.”
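
The standard number formats listed above lend themselves to pattern matching rather than term lookup. The following is a minimal sketch under that assumption: the regular expressions, tag names, and the function scrub_formats are illustrative choices, cover only four of the listed formats, and are far narrower than what a production system would need.

```python
import re

# Hypothetical patterns for a few of the listed formats; real-world formats vary widely.
NUMBER_FORMATS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def scrub_formats(text):
    """Replace every match of each pattern with its bracketed case-type tag."""
    for tag, pattern in NUMBER_FORMATS:
        text = pattern.sub(f"[{tag}]", text)
    return text


sample = "Pt called 555-867-5309 on 03/14/2021; SSN 123-45-6789, email j.smith@example.com."
print(scrub_formats(sample))
```

Patterns are applied in list order, so a more specific format can be placed ahead of a looser one if the two could match the same text.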

The claims supplied by the inventors are:

“1. A method for de-identifying unstructured data within data sets, the method comprising: initializing, using a computing device comprising a processor and memory, a blacklist dictionary having a structure that facilitates identification of personally identifiable information (PII) and protected health information (PHI) and that facilitates removal and replacement of PII and PHI with a case-type tag, and a whitelist dictionary comprising terms selected to remain in the data sets despite being included in the blacklist dictionary; modifying, using the processor and a register module, the blacklist dictionary by removing terms included within the whitelist dictionary to create an adjusted blacklist dictionary; augmenting, using the processor, the adjusted blacklist dictionary with a record-specific blacklist for each individual record within the data sets to create a record-specific adjusted blacklist dictionary; scrubbing, using the processor and a de-identification engine, PII and PHI from each individual record utilizing the record-specific adjusted blacklist dictionary, the scrubbing comprising: removing all elements within each individual record determined to be PII or PHI according to terms in the record-specific adjusted blacklist dictionary; replacing removed elements with a case-type tag identifying a type of information being removed according to the record-specific adjusted blacklist dictionary; and repeating, using the de-identification engine, the scrubbing of PII and PHI comprising removing steps and replacing steps for each individual record within the data sets.

“2. The method of claim 1, further comprising tokenizing and merging, using a merging module, each individual record in the data sets.

“3. The method of claim 1, wherein the blacklist dictionary comprises standard terms and standard number formats to be removed from records within the data sets.

“4. The method of claim 3, wherein the standard number formats comprise social security numbers, telephone numbers, URLs, zip codes, email addresses, IP addresses, dates, patient IDs, record numbers, and insurance IDs.

“5. The method of claim 3, wherein the standard terms comprise cities, counties, first names, last names, prefixes, and medical terms.

“6. The method of claim 1, wherein the record-specific adjusted blacklist dictionary comprises terms created from the PII or PHI present in specific individual records within the data sets.

“7. The method of claim 1, further comprising tuning terms in the whitelist dictionary and the record-specific adjusted blacklist dictionary to adjust a level of de-identification of records within the data sets.

“8. The method of claim 7, wherein the whitelist dictionary and the record-specific adjusted blacklist dictionary include a tunable list of names, birth dates, phone numbers, addresses and other forms of PII and PHI present in data store records or within the data sets.

“9. The method of claim 7, wherein augmenting comprises adjusting the adjusted blacklist dictionary to include known PII and PHI terms, for an individual associated with an individual record, according to the record-specific adjusted blacklist dictionary, that are designated to be removed.

“10. A system for de-identifying unstructured data within data sets, comprising: memory and a processor configured for accessing the data sets from data sources and parsing data within the data sets and identifying elements of unstructured data within the data sets and de-identifying unstructured data using a de-identification engine initializing a blacklist dictionary having a structure that facilitates identification of personally identifiable information (PII) and protected health information (PHI) and that facilitates removal and replacement of PII and PHI with a case-type tag, and a whitelist dictionary comprising terms selected to remain in the data sets despite being included in the blacklist dictionary, and comprising: a register module configured to modify the blacklist dictionary by removing terms included within the whitelist dictionary to generate an adjusted blacklist dictionary, and augmenting the adjusted blacklist dictionary with a record-specific blacklist for each individual record within the data sets to create a record-specific adjusted blacklist dictionary; a de-identification module configured to determine which unstructured elements are to be removed from each individual record of a data set by scrubbing PII and PHI from each individual record utilizing the record-specific adjusted blacklist dictionary, the scrubbing comprising: removing all elements within each individual record determined to be PII or PHI according to terms in the record-specific adjusted blacklist dictionary; and replacing removed elements with a case-type tag identifying a type of information being removed according to the record-specific adjusted blacklist dictionary; wherein the de-identification module repeats the scrubbing PII and PHI for each individual record within the data sets; and one or more user devices configured to receive input from one or more users by an input-output interface and communicate with the processor and the de-identification engine over a telecommunication network, providing functionality for the register module, de-identification module, and merging module that share a secured network connection.

“11. The system of claim 10, further comprising a merging module configured to tokenize and merge each individual record in the data sets by identifying a same token in both data sets and join together matched to individual records in healthcare data sets without exposure of PHI or PII.

“12. The system of claim 10, wherein the blacklist dictionary comprises standard terms and standard number formats to be removed from records within the data sets.

“13. The system of claim 12, wherein the standard number formats comprise social security numbers, telephone numbers, URLs, zip codes, email addresses, IP addresses, dates, patient IDs, record numbers, and insurance IDs.

“14. The system of claim 12, wherein the standard terms comprise cities, counties, first names, last names, prefixes, and medical terms.

“15. The system of claim 10, wherein the record-specific adjusted blacklist dictionary comprises terms created from the PII or PHI present in specific individual records within the data sets.

“16. The system of claim 10, further comprising tuning terms in the whitelist dictionary and the record-specific adjusted blacklist dictionary to adjust a level of de-identification of records within the data sets.

“17. The system of claim 16, wherein the whitelist dictionary and the record-specific adjusted blacklist dictionary include a tunable list of names, birth dates, phone numbers, addresses and other forms of PII and PHI present in data source records or within the data sets.

“18. The system of claim 16, wherein augmenting comprises adjusting the adjusted blacklist dictionary to include known PII and PHI terms for an individual associated with an individual record, according to the record-specific adjusted blacklist dictionary, that are designated to be removed.”

For additional information on this patent, see: Austin, Joseph. Self-contained system for de-identifying unstructured data in healthcare records. U.S. Patent Number 11537748, filed January 23, 2019, and published online on December 27, 2022. Patent URL (for desktop use only): https://ppubs.uspto.gov/pubwebapp/external.html?q=(11537748)&db=USPAT&type=ids
