Researchers Submit Patent Application, “Free Text De-Identification”, for Approval (USPTO 20210303791): Patent Application

2021 OCT 18 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- From Washington, D.C., NewsRx journalists report that a patent application by the inventors KOSTER, ROBERT PAUL (Eindhoven, NL); PLETEA, DANIEL (Eindhoven, NL); VAN LIESDONK, PETER PETRUS (EINDHOVEN, NL), filed on October 10, 2019, was made available online on September 30, 2021.

No assignee has been named for this patent application.

News editors obtained the following quote from the background information supplied by the inventors: “Recent regulations, e.g. GDPR “General Data Protection Regulation, Council of European Union, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 Apr. 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC, April 2016”, HIPAA “The health insurance portability and accountability act; U.S. Dept. of Labor, Employee Benefits Security Administration, 2004”, put strict requirements on the handling of personally identifiable information (PII), while also putting huge fines on noncompliance.

“Text-based patient medical records are a vital resource in medical research and data analytics. In order to preserve patient privacy and confidentiality, regulations like HIPAA and the GDPR require protected health information (PHI) to be removed from medical records before they can be used for secondary purposes. The de-identification of unstructured text documents is often realized manually and requires significant resources.

“While there has been significant research done in the area of de-identification of structured clinical data (e.g. hospital databases, relational data warehouses), research on de-identifying data like free text clinical notes, discharge summaries and handover notes is less mature due to the unstructured nature of the data. Solutions for this problem use a multidisciplinary approach involving the domain knowledge on medical science, natural language processing (e.g. see “Hui Yang and Jonathan M. Garibaldi. Automatic detection of protected health information from clinic narratives. J. of Biomedical Informatics, 58(S):S30-S38, December 2015”), clinical text mining, machine learning (e.g. see “K. Rajput, G. Chetty, and R. Davey. Phis (protected health information) identification from free text clinical records based on machine learning; 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1-9, Nov 2017”) and recurrent neural networks (e.g. see “Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits: De-identification of patient notes with recurrent neural networks; Journal of the American Medical Informatics Association, 24(3):596-606, 2017”).

“However, blacklisting-based methods have a significant number of [false negatives] due to the unstructured nature of the data. For example, they cannot cover exceptions (e.g. “Summer” is both a name and a time indicator/season), misspellings (e.g. “Jonh” instead of “John”) or just the free nature of unstructured data (e.g. “Christmas” actually means December 25).

“Additionally, de-identification of unstructured text is domain dependent and relies on domain specific dictionaries, which in most of the cases are not available. An example of such a domain specific dictionary is the MIMIC database (see “Ishna Neamatullah, Margaret M. Douglass, Li wei H. Lehman, Andrew T. Reisner, Mauricio Villarroel, William J. Long, Peter Szolovits, George B. Moody, Roger G. Mark, and Gari D. Clifford: Automated de-identification of free-text medical records; BMC Medical Informatics and Decision Making, 8:32-32, 2008”), while most of the other state-of-the-art de-identification methods rely on using blacklisting (e.g. see “Stephane M. Meystre, F. Jeffrey Friedlin, Brett R. South, Shuying Shen, and Matthew H. Samore: Automatic de-identification of textual documents in the electronic health record: a review of recent research; BMC medical research methodology, 2010”).

“Machine learning techniques need training data, which in addition needs to be annotated. Such requirements may be hard to satisfy, at least in a short time, and would need to be repeated for different domains. Furthermore, the amount of data that is needed for training is much larger than that needed for, for example, a simple one-time de-identification task.

“However, current free-text de-identification methods do not mask identifiers that are not covered by blacklists, and also have the following problems:

“Domain language. De-identification of unstructured text may require knowledge of the domain (e.g. MIMIC database, domain-specific words) and in many cases domain-based white-lists are not available due to not being yet built. De-identification experts may also be slowed down by the specificity of the domain.

“[False negatives]. Misspellings are part of the PHI that should be masked in the de-identified output, but they slip past the usual methods of de-identification.

“Inefficiency. Current methods need building domain knowledge and white-listing based on manual review. The de-identification of unstructured text documents is often realized manually, and requires significant resources.”

As a supplement to the background information on this patent application, NewsRx correspondents also obtained the inventors’ summary information for this patent application: “It is an object of the invention to provide a method and system for free text de-identification that takes into account at least one of the preceding issues.

“For this purpose, devices and methods for generating de-identified output from a data set of patient data are provided as defined in the appended claims. According to an aspect of the invention a method for generating de-identified output from a data set of patient data of multiple patients is provided as defined in claim 1. A system is provided as defined in claim 13. According to a further aspect of the invention there is provided a computer program product downloadable from a network and/or stored on a computer-readable medium and/or microprocessor-executable medium, the product comprising program code instructions for implementing the above method when executed on a computer.

“To overcome these disadvantages, the de-identification method for unstructured text masks or removes (blacks out) word items which do not occur often in the text and blacklisted word items. Thereto the unstructured text is de-identified by performing a word count and allowing in the de-identified output only words occurring in the text more than a minimum number of occurrences. The method further suppresses or replaces words that are blacklisted (e.g. the 18 HIPAA Identifiers). The word count provides a list of low-rate word items that have a number of occurrences (k) in the unstructured text below a threshold. Then, the low-rate word items and the blacklist word items are removed from, or masked, in the unstructured text to generate the de-identified output. Word items may include, next to words as-is, word sequences, word stems, and word patterns.

“Advantageously, the method and system do not require initial domain knowledge input and are able to lower the [number of false negatives] in comparison with state of the art solutions.

“In an embodiment of the invention the word items in the word-count and/or blacklist entries are associated to the syntactic category (verb, noun, etc.) that the word has in the text, as determined by natural language processing (NLP). This increases the quality of the blacklist by discovering words that are potential identifiers, but not covered by the blacklist due to known limitations of static blacklists.

“In another embodiment of the invention a domain-specific white-list word list is created from the words that passed the word count. These words can be later allowed in the de-identified output even if in some cases their occurrence is not high.

“The methods according to the invention may be implemented on a computer as a computer implemented method, or in dedicated hardware, or in a combination of both. Executable code for a method according to the invention may be stored on a computer program product. Examples of computer program products include memory devices such as a memory stick, optical storage devices such as an optical disc, integrated circuits, servers, online software, etc.

“The computer program product in a non-transient form may comprise non-transitory program code means stored on a computer readable medium for performing a method according to the invention when said program product is executed on a computer. In an embodiment, the computer program comprises computer program code means adapted to perform all the steps or stages of a method according to the invention when the computer program is run on a computer. Preferably, the computer program is embodied on a computer readable medium. There is also provided a computer program product in a transient form downloadable from a network and/or stored in a volatile computer-readable memory and/or microprocessor-executable medium, the product comprising program code instructions for implementing a method as described above when executed on a computer.

“Another aspect of the invention provides a method of making the computer program in a transient form available for downloading. This aspect is used when the computer program is uploaded into, e.g., Apple’s App Store, Google’s Play Store, or Microsoft’s Windows Store, and when the computer program is available for downloading from such a store.

“Further preferred embodiments of the devices and methods according to the invention are given in the appended claims, disclosure of which is incorporated herein by reference.

“The figures are purely diagrammatic and not drawn to scale. In the Figures, elements which correspond to elements already described may have the same reference numerals.”
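
The word-count approach outlined in the inventors' summary can be illustrated with a minimal Python sketch. This is not the inventors' implementation. The function name `deidentify`, the `\w+` tokenizer, the `[MASKED]` placeholder, the case-insensitive matching, the rule that the blacklist overrides the whitelist, and the sample notes are all assumptions made for illustration. Word items are treated as single words only; word sequences, stems, and patterns are not modeled.

```python
import re
from collections import Counter

# Assumed tokenizer: a word item is a run of letters, digits, or underscores.
WORD = re.compile(r"\w+")


def deidentify(notes, blacklist, threshold, whitelist=(), mask="[MASKED]"):
    """Mask blacklisted words and words occurring fewer than `threshold` times.

    Counts are taken over the whole data set (all notes together).
    Whitelisted low-rate words are kept; the blacklist always wins (assumption).
    """
    counts = Counter(w.lower() for note in notes for w in WORD.findall(note))
    blocked = {w.lower() for w in blacklist}
    allowed = {w.lower() for w in whitelist}

    def replace(match):
        word = match.group(0).lower()
        if word in blocked:
            return mask
        if counts[word] < threshold and word not in allowed:
            return mask
        return match.group(0)

    # Substituting word by word leaves punctuation and spacing untouched.
    return [WORD.sub(replace, note) for note in notes]


# Hypothetical sample data, loosely based on the examples in the background text.
notes = [
    "Patient John reports chest pain.",
    "Patient Jonh reports chest pain since Summer.",
    "Patient reports no chest pain.",
]
result = deidentify(notes, blacklist={"john"}, threshold=2)
for line in result:
    print(line)
```

In this toy data set the misspelling “Jonh” and the ambiguous “Summer” are masked because each occurs only once, even though neither is on the blacklist. The sketch also shows the cost of the approach: harmless rare words such as “no” are masked as well unless they are passed in through `whitelist`.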

The claims supplied by the inventors are:

“1. A computer-implemented method for generating de-identified output from a data set of patient data of multiple patients, the patient data comprising unstructured text, the unstructured text comprising word items of words, numbers and symbols arranged in natural language phrases, and a blacklist comprising blacklist word items that are not allowed in the de-identified output, the method comprising the steps of: processing the unstructured text to determine a word count comprising a list of low-rate word items that have a number of occurrences (k) in the unstructured text below a threshold, and removing or masking the low-rate word items and the blacklist word items in the unstructured text to generate the de-identified output.

“2. The method according to claim 1, wherein the processing comprises setting the threshold above a minimum threshold in dependence of a desired percentage of the unstructured text that is allowed in the de-identified output.

“3. The method according to claim 1, wherein the method comprises determining, as word items, separate word items for a same word having different syntactic positions in the phrases.

“4. The method according to claim 1, wherein the method comprises determining, as word items, word patterns, a word pattern comprising in a phrase at least one word in combination with an adjacent pattern of numbers or symbols.

“5. The method according to claim 1, wherein the method comprises determining, as word items, word strings, a word string comprising a specific sequence of words.

“6. The method according to claim 1, wherein the method comprises determining, as word items, word stems, a word stem comprising a set of different words having a similar semantic function in different phrases.

“7. The method according to claim 1, wherein the processing comprises determining the blacklist using the word items as claimed in claim 1.

“8. The method according to claim 1, wherein the processing comprises determining the word count using the word items as claimed in claim 1.

“9. The method according to claim 1, wherein the processing comprises determining a whitelist comprising word items that are allowed in the de-identified output, and preventing said removing or masking the low-rate word items by allowing in the de-identified output low-rate word items that are in the whitelist.

“10. The method according to claim 1, wherein the processing comprises determining a confidence list comprising a confidence score for confidence word items based on word count results in previous de-identification events, and adapting the word count for the confidence word items by adjusting, in dependence of the confidence score, the number of occurrences (k) or the threshold.

“11. The method according to claim 10, wherein the confidence score represents in a percentage how many times the confidence word item was above the threshold in the word count in the previous de-identification events.

“12. A computer program product for generating de-identified output from a data set of patient data of multiple patients, the computer program product comprising instructions which when carried out on a computer cause the computer to perform a method as claimed in claim 1.

“13. A system for generating de-identified output from a data set of patient data of multiple patients, the system comprising: a data interface configured to receive patient data of multiple patients, the patient data comprising unstructured text, the unstructured text comprising word items of words, numbers and symbols arranged in natural language phrases, and a blacklist comprising blacklist word items that are not allowed in the de-identified output; and a processor arranged to: process the unstructured text to determine a word count comprising a list of low-rate word items that have a number of occurrences (k) in the unstructured text below a threshold, and remove or mask the low-rate word items and the blacklist word items in the unstructured text to generate the de-identified output.

“14. Use of the method according to claim 1, the computer program product and/or the system in one selected from the group consisting of genomics, genetics, bioinformatics research, transcriptomics, proteomics and systems biology or diagnosis.”
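
Claims 2 and 11 describe two calculations that a short Python sketch may make more concrete. It reflects one possible reading of the claim language, not the inventors' implementation. The function names, the choice to return the largest threshold that still meets the target, the reading of “above the threshold” as “at least the threshold”, and all sample counts are assumptions. Claim 10 does not state how the confidence score changes the count (k) or the threshold, so that step is left out.

```python
from collections import Counter


def retained_fraction(counts, threshold):
    """Share of all word occurrences whose word item occurs >= threshold times."""
    total = sum(counts.values())
    kept = sum(n for n in counts.values() if n >= threshold)
    return kept / total if total else 0.0


def pick_threshold(counts, min_threshold, target_fraction):
    """Claim 2, one reading: the largest threshold >= min_threshold that still
    leaves at least target_fraction of the text in the output.

    Falls back to min_threshold if even that retains too little.
    """
    best = min_threshold
    for t in range(min_threshold, max(counts.values(), default=0) + 1):
        # The retained fraction never grows as t rises, so stop at the first miss.
        if retained_fraction(counts, t) < target_fraction:
            break
        best = t
    return best


def confidence_score(word, history):
    """Claim 11, one reading: percentage of previous de-identification events
    in which `word` reached that event's threshold.

    `history` is a list of (counts, threshold) pairs, one per previous event.
    """
    if not history:
        return 0.0
    hits = sum(1 for counts, threshold in history if counts.get(word, 0) >= threshold)
    return 100.0 * hits / len(history)


# Hypothetical word counts for one data set (86 word occurrences in total).
counts = Counter({"patient": 50, "pain": 30, "aspirin": 4, "summer": 1, "jonh": 1})
print(pick_threshold(counts, min_threshold=2, target_fraction=0.95))

# Hypothetical history: "aspirin" reached the threshold in 3 of 4 earlier events.
history = [
    (Counter({"aspirin": 12}), 5),
    (Counter({"aspirin": 3}), 5),
    (Counter({"aspirin": 9}), 5),
    (Counter({"aspirin": 7}), 5),
]
print(confidence_score("aspirin", history))
```

With these sample counts, a threshold of 4 keeps 84 of 86 word occurrences, while a threshold of 5 would drop “aspirin” and fall below the 95% target.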

For additional information on this patent application, see: KOSTER, ROBERT PAUL; PLETEA, DANIEL; VAN LIESDONK, PETER PETRUS. Free Text De-Identification. Filed October 10, 2019 and posted September 30, 2021. Patent URL: https://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220210303791%22.PGNR.&OS=DN/20210303791&RS=DN/20210303791

(Our reports deliver fact-based news of research and discoveries from around the world.)
