Researchers Submit Patent Application, “Systems And Methods For Data Normalization”, for Approval (USPTO 20230147366): Patent Application

2023 MAY 31 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- From Washington, D.C., NewsRx journalists report that a patent application by the inventors Biglari, Mehrdad (Redmond, WA, US); Burugapalli, Srinivas R. R. (Redmond, WA, US); Culpin, Patty (Bellevue, WA, US); Gopalakrishnan, Nishant (Kirkland, WA, US); Kolavennu, Ramesh (Sammamish, WA, US); Marcjan, Cezary A. (Sammamish, WA, US); Nachimuthu, Senthil (Salt Lake City, UT, US); Nanduri, Jayaram (Issaquah, WA, US); Subash, Swarna (Bellevue, WA, US); Tsui, Wing (Issaquah, WA, US); Zarandioon, Saman (Seattle, WA, US), filed on November 8, 2022, was made available online on May 11, 2023.

No assignee has been named for this patent application.

News editors obtained the following quote from the background information supplied by the inventors: “Medical research has come a long way since paper records were digitized. Researchers now have access to more health data than ever before. But limitations persist. Research is still often conducted on relatively small data sets that may be weeks or even months old and not represent the full diversity of a population. This can result in biased insights that can compromise patient care.

“Healthcare entities such as hospitals, clinics, and laboratories produce enormous volumes of health data. This health data can provide valuable insights for research and improving patient care. However, the patient records and other health data received from health system members can arrive from different databases in multiple formats, often incorporating a wide variety of terminologies and medical code sets. The structure of these records can also vary widely. Additionally, even with standard medical terminology, the way in which that terminology is used can also vary widely. A heart attack in one record, for example, may be described as acute myocardial infarction or AMI in another. All of these different structures, terminologies, and semantics can make it difficult to work across health data records and identify meaningful trends and insights. There has been much progress made to arrive at a set of standards and processes that can help address this inconsistency, but the larger and more diverse the dataset, the more complex and time consuming the processing.

“The HIPAA Privacy Rule does not restrict the use or disclosure of de-identified health information -- health information that neither identifies nor provides a reasonable basis for identifying a patient or individual. However, conventional techniques for de-identifying health data may remove too much information from the patient record, resulting in data that has limited utility for subsequent applications. Additionally, conventional de-identification techniques may not be well-suited for handling patient data that is received at different times or from different health systems because, for example, they are not stored in a uniform format. Accordingly, improved systems and methods for de-identifying patient data are needed.”
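To illustrate the terminology problem the inventors describe, where one record says "heart attack" and another says "acute myocardial infarction" or "AMI", here is a minimal Python sketch of a synonym lookup. It is not taken from the patent application: the table, the `normalize_term` function, and the `CONCEPT_MI` identifier are all hypothetical, and a production system would map to a real medical code set rather than a hand-built dictionary.

```python
from typing import Optional

# Hypothetical synonym table: every key and concept identifier below is
# illustrative, not taken from the patent application or a real code set.
SYNONYMS = {
    "heart attack": "CONCEPT_MI",
    "acute myocardial infarction": "CONCEPT_MI",
    "ami": "CONCEPT_MI",
}


def normalize_term(term: str) -> Optional[str]:
    """Return the shared concept identifier for a free-text term, if known."""
    key = " ".join(term.lower().split())  # collapse case and extra whitespace
    return SYNONYMS.get(key)


records = ["Heart attack", "acute  myocardial infarction", "AMI", "migraine"]
normalized = [normalize_term(r) for r in records]
```

The first three inputs resolve to the same identifier, while the unknown term returns `None` and would need another path, such as a model or a human reviewer.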

As a supplement to the background information on this patent application, NewsRx correspondents also obtained the inventors’ summary information for this patent application: “The present technology relates to systems and methods for data normalization. In some embodiments, a health data platform is configured to consolidate multiple and disparate data streams into a common data model for effective research. For example, the health data platform can interface with health system members providing more than 16% of care in the United States in tens of thousands of clinical care sites in 42 states, representing the full diversity of the country across age, geography, race, ethnicity, and gender. Billions of clinical data points from this care can be brought together in the health data platform to enable research on any drug, disease, or device across the full diversity of the United States. The health data platform can assemble millions of patient records from multiple health provider members. In some embodiments, data flows into the system daily, providing researchers with virtually real time updates. However, the speed, volume, and diversity of this data can pose significant management challenges. For example, the data received from the health system members can include all Electronic Health Record (EHR) data, such as labs, vitals, diagnosis codes, procedure codes, physician notes, imaging reports, pathology reports, images, and/or genomics information. The structure of these records can vary widely, as well as the terminology used in the records. Accordingly, there is a need for systems and methods that can make sense of a large and diverse flow of health data without compromising the diversity and accuracy of that data, or the speed of its delivery for research and/or other purposes.

“The present technology provides a process for making data useful from health system members, referred to herein as “normalization.” Data normalization can refer to the practice of converting a diverse flow of data into a unified and consistent data model. Conventionally, the task of interpreting health data and mapping to standard models is done by an expert team of annotators, informaticists, and other clinical experts. Given the size and speed of the data flow that health data platform manages, this process is not practical or scalable. Instead, the present technology provides a unique system that combines artificial intelligence (AI), machine learning, and natural language processing with expert analysis. In this way, the present technology can automate much of the normalization process at massive scale while leveraging clinical experts to monitor, update and evolve the system.

“In some embodiments, the disclosed techniques provide a network-based patient data management method that acquires, aggregates, and normalizes patient information from various sources into a uniform or common format, stores the aggregated patient information, and notifies health care providers and/or patients after information is updated via one or more communication channels. In some cases, the acquired patient information may be provided by one or more users through an interface, such as a graphical user interface, that provides remote access to users over a network so that any one or more of the users can provide at least one updated patient record in real time, such as a patient record in a format other than the uniform or common format, including formats that are dependent on a hardware and/or software platform used by a user providing the patient information.

“In some embodiments, the disclosed techniques employ a data catalog that facilitates data governance and analyzing and adding records to a repository. The data catalog can capture metadata for multi-modal data, thereby providing a single place to track data in the system. Furthermore, metadata driven transforms provide for data normalization that allows data modelers and analysts to work independent of target data platform where data is processed. Metadata driven processing improves consistency, debugging and reduces maintenance while capturing data lineage. Metrics and alerts related to any data and quality of data can be authored and persisted in data catalog while schema and transforms can be versioned, ensuring backward compatibility.

“Embodiments of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings in which like numerals represent like elements throughout the several figures, and in which example embodiments are shown. Embodiments of the claims may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The examples set forth herein are non-limiting examples and are merely examples among other possible examples.

“The headings provided herein are for convenience only and do not interpret the scope or meaning of the claimed present technology. Embodiments under any one heading may be used in conjunction with embodiments under any other heading.

“I. Health Data Platform

“FIGS. 1A and 1B provide a general overview of a health data platform configured in accordance with embodiments of the present technology. Specifically, FIG. 1A is a schematic diagram of a computing environment 100a in which a health data platform 102 can operate, and FIG. 1B is a schematic diagram of a data architecture 100b that can be implemented by the health data platform 102.

“Referring first to FIG. 1A, the health data platform 102 is configured to receive health data from a plurality of health systems 104, aggregate the health data into a common data repository 106, and allow one or more users 108 to access the health data stored in the common data repository 106. As described in further detail below, the common data repository 106 can store health data from multiple different health systems 104 and/or other data sources in a uniform schema, thus allowing for rapid and convenient searching, analytics, modeling, and/or other applications that would benefit from access to large volumes of health data.

“The health data platform 102 can be implemented by one or more computing systems or devices having software and hardware components (e.g., processors, memory) configured to perform the various operations described herein. For example, the health data platform 102 can be implemented as a distributed “cloud” server across any suitable combination of hardware and/or virtual computing resources. The health data platform 102 can communicate with the health system 104 and/or the users 108 via a network 110. The network 110 can be or include one or more communications networks, such as any of the following: a wired network, a wireless network, a metropolitan area network (MAN), a local area network (LAN), a wide area network (WAN), a virtual local area network (VLAN), an internet, an extranet, an intranet, and/or any other suitable type of network or combinations thereof.

“The health data platform 102 can be configured to receive and process many different types of health data, such as patient data. Examples of patient data include, but are not limited to, any of the following: age, gender, height, weight, demographics, symptoms (e.g., types and dates of symptoms), diagnoses (e.g., types of diseases or conditions, date of diagnosis), medications (e.g., type, formulation, prescribed dose, actual dose taken, timing, dispensation records), treatment history (e.g., types and dates of treatment procedures, the healthcare facility or provider that administered the treatment), vitals (e.g., body temperature, pulse rate, respiration rate, blood pressure), laboratory measurements (e.g., complete blood count, metabolic panel, lipid panel, thyroid panel, disease biomarker levels), test results (e.g., biopsy results, microbiology culture results), genetic data, diagnostic imaging data (e.g., X-ray, ultrasound, MRI, CT), clinical notes and/or observations, other medical history (e.g., immunization records, death records), insurance information, personal information (e.g., name, date of birth, social security number (SSN), address), familial medical history, and/or any other suitable data relevant to a patient’s health. In some embodiments, the patient data is provided in the form of electronic health record (EHR) data, such as structured EHR data (e.g., schematized tables representing orders, results, problem lists, procedures, observations, vitals, microbiology, death records, pharmacy dispensation records, lab values, medications, allergies, etc.) and/or unstructured EHR data (e.g., patient records including clinical notes, pathology reports, imaging reports, etc.). A set of patient data relating to the health of an individual patient may be referred to herein as a ‘patient record.’”

There is additional summary information. Please visit full patent to read further.
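The summary's mention of metadata-driven transforms, versioned schemas, and captured data lineage can be pictured with a short Python sketch. Everything in it is an assumption made for illustration and does not come from the patent application: the `CATALOG` structure, the `FieldRule` class, the `hospital_a` source, and the field names are invented. The only fixed fact used is that one pound equals 0.45359237 kilograms.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class FieldRule:
    source: str                    # field name used by the sending system
    target: str                    # field name in the common data model
    convert: Callable[[Any], Any]  # value conversion to apply


# Hypothetical catalog entry keyed by (source system, transform version).
# The source name, version number, and rules are all invented.
CATALOG = {
    ("hospital_a", 1): [
        FieldRule("pt_wt_lb", "weight_kg", lambda v: round(float(v) * 0.45359237, 1)),
        FieldRule("dx", "diagnosis_text", lambda v: str(v).strip().lower()),
    ],
}


def apply_transform(source: str, version: int, record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one source record into the common model and record its lineage."""
    out: Dict[str, Any] = {}
    lineage: List[str] = []
    for rule in CATALOG[(source, version)]:
        if rule.source in record:
            out[rule.target] = rule.convert(record[rule.source])
            lineage.append(f"{rule.source}->{rule.target}")
    # Keep track of which rules produced this row, and under which version.
    out["_lineage"] = {"source": source, "transform_version": version, "fields": lineage}
    return out


row = apply_transform("hospital_a", 1, {"pt_wt_lb": "150", "dx": " AMI "})
```

Because the mapping lives in data rather than in code, a new version of the rules can be added under a new key while older versions remain available, which is one plausible reading of the backward compatibility the summary describes.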

The claims supplied by the inventors are:

“1. A method for normalizing medical records, the method comprising: receiving a plurality of annotated medical records, each annotated medical record corresponding to a patient; training a machine learning model based on the received plurality of annotated medical records; receiving a first medical record; normalizing the first medical record at least in part by applying the trained machine learning model to the received first medical record, wherein applying the trained machine learning model to the received first medical record generates a first concept code for the first medical record and a detection confidence score for the first concept code; determining that the detection confidence score for the first concept code exceeds a predetermined threshold; and after determining that the detection confidence score for the first concept code exceeds the predetermined threshold, appending the first concept code to the first medical record.

“2. The method of claim 1, wherein applying the trained machine learning model to the received first medical record generates a plurality of concept codes for the first medical record and a detection confidence score for each generated concept code.

“3. The method of claim 2, further comprising: randomly selecting one or more of the generated concept codes; for each of the randomly selected one or more generated concept codes, receiving, from a user, an indication of whether the generated concept code has a false high confidence score; and re-training the machine learning model based on one or more of the received indications.

“4. The method of claim 1, further comprising: in response to determining that the detection confidence score for the first concept code exceeds a first threshold, inserting the first concept code into the first medical record.

“5. The method of claim 1, further comprising: in response to determining that the detection confidence score for the first concept code does not exceed a first threshold, ranking the first concept code relative to one or more other concept codes.

“6. The method of claim 5, further comprising: receiving, from a user, a selection of one or more of the ranked concept codes; and inserting the selected one or more concept codes into the first medical record.

“7. The method of claim 1, wherein applying the trained machine learning model to the first medical record comprises generating a feature vector for the first medical record, the feature vector comprising a value for each of a plurality of attributes.

“8. The method of claim 7, wherein the feature vector for the first medical record includes at least one value for a metadata attribute of the first medical record.

“9. The method of claim 1, further comprising: providing remote access to users over a network so that any one or more of the users can provide at least one updated record in real time through an interface, wherein at least one of the users provides an updated record in a format other than a common format, wherein the format other than the common format is dependent on hardware and software platform used by the at least one user; converting the at least one updated record into the common format; generating a set of at least one normalized record from the at least one updated record; storing the generated set of at least one normalized record; after storing the generated set of at least one normalized record, generating a message containing the generated set of at least one normalized record; and transmitting the message to one or more users over the network in real time, so that the users have access to the updated record.

“10. The method of claim 1, further comprising: de-identifying each of the plurality of annotated medical records and the first medical record.

“11. The method of claim 1, wherein each of the annotated medical record includes a number of fields and a corresponding annotation, the corresponding annotation representing a field from a common schema model, the method further comprising: training a second machine learning model based on the received plurality of annotated medical records, wherein the trained second machine learning model is configured to identify at least one field from the common schema model for each of a plurality of fields associated with an input medical record; further normalizing the first medical record at least in part by applying the trained second machine learning model to the received first medical record, wherein applying the trained second machine learning model to the received first medical record identifies, for each of a plurality of fields associated with the first medical record, at least one field of the common schema model associated with the field associated with the first medical record; and appending each identified field of the common schema model to the first medical record.

“12. A computing system for normalizing records, the computing system comprising: at least one memory; at least one processor; a component configured to train a plurality of machine learning models based on a first plurality of annotated records; a component configured to, for each of a second plurality of records, apply one or more of the plurality of machine learning models to the record, wherein applying each trained machine learning model to the record generates a code for the record and a corresponding confidence score; a component configured to determine whether a first confidence score for a first code exceeds a predetermined threshold; and a component configured to, after determining that the confidence score for the first code exceeds the predetermined threshold, append the first code to one or more records of the second plurality of records, wherein each of the component comprises computer-executable instructions stored in the at least one memory for execution by the computing system.

“13. The computing system of claim 12, wherein one or more of the annotated records is annotated with one or more syntactic classifications, each syntactic classification having an associated code.

“14. The computing system of claim 12, further comprising: a component configured to rank a plurality of classifications generated for a first record based on the confidence score associated with each of the plurality of classifications.

“15. The computing system of claim 12, further comprising: a component configured to identify at least one ontology associated with a first annotated record of the first plurality of annotated records.

“16. The computing system of claim 12, further comprising: a component configured to, for each of a plurality of metadata attributes associated with a first record, determine a value for the metadata attribute; and a component configured to apply one or more trained models to the determined values.

“17. A computer-readable storage medium storing instructions that, when executed by a computing system having a memory and a processor, cause the computing system to perform a method for normalizing data, the method comprising: training a machine learning model based on a plurality of annotated records, each annotated record specifying at least one associated classification; receiving a new record; normalizing the new record at least in part by applying the trained machine learning model to the received new record, wherein applying the trained machine learning model to the new record generates a first classification for the new record and a confidence score for the first classification; determining that the confidence score for the first classification exceeds a predetermined threshold; and after determining that the confidence score for the first classification exceeds the predetermined threshold, appending the first record to include an indication of the first classification.

“18. The computer-readable storage medium of claim 17, wherein one or more of the annotated records is annotated with one or more semantic classifications, each semantic classification having an associated code.

“19. The computer-readable storage medium of claim 17, further comprising: for each of a plurality of classifications generated for a first record, determining whether a confidence score associated with the classification exceeds the predetermined threshold, and in response to determining that the confidence score associated with the classification exceeds the predetermined threshold, adding the classification to the first record.

“20. The computer-readable storage medium of claim 17, further comprising: randomly selecting one or more classifications associated with a first record; and for each of the randomly selected one or more classifications, receiving, from a user, an indication of whether the classification has a false high confidence score.

“21. The computer-readable storage medium of claim 20, further comprising: re-training at least one machine learning model based on one or more of the received indications.”

There are additional claims. Please visit full patent to read further.
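The confidence-score handling that runs through the claims above (append a code when its score exceeds a threshold, rank lower-scoring codes for a user to choose from, and randomly sample codes so a user can flag false high confidence scores) can be sketched in Python as follows. This is an illustration under stated assumptions, not the inventors' implementation: the trained model is replaced by a hard-coded list of predictions, and the function names, the `concept_codes` field, the codes, and the 0.9 threshold are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Prediction = Tuple[str, float]  # (concept code, confidence score)


def route_predictions(
    record: Dict, predictions: List[Prediction], threshold: float
) -> Tuple[Dict, List[Prediction]]:
    """Append codes scoring above the threshold; rank the rest for review."""
    accepted = [code for code, score in predictions if score > threshold]
    review = sorted(
        [(code, score) for code, score in predictions if score <= threshold],
        key=lambda p: p[1],
        reverse=True,  # highest-scoring candidates first
    )
    updated = dict(record)  # leave the caller's record untouched
    updated["concept_codes"] = list(record.get("concept_codes", [])) + accepted
    return updated, review


def sample_for_audit(codes: List[str], k: int, seed: int = 0) -> List[str]:
    """Randomly pick accepted codes for a reviewer to check."""
    rng = random.Random(seed)
    return rng.sample(codes, min(k, len(codes)))


# Stand-in for model output; a real system would obtain these from a trained model.
predictions = [("CODE_A", 0.97), ("CODE_B", 0.40), ("CODE_C", 0.62)]
updated, review = route_predictions({"note": "chest pain"}, predictions, threshold=0.9)
audit = sample_for_audit(updated["concept_codes"], k=3)
```

In this sketch only `CODE_A` is appended automatically, while `CODE_C` and `CODE_B` are queued in that order for a user to select from. Reviewer feedback on the audited codes would then feed the re-training step, which is not shown here.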

For additional information on this patent application, see: Biglari, Mehrdad; Burugapalli, Srinivas R. R.; Culpin, Patty; Gopalakrishnan, Nishant; Kolavennu, Ramesh; Marcjan, Cezary A.; Nachimuthu, Senthil; Nanduri, Jayaram; Subash, Swarna; Tsui, Wing; Zarandioon, Saman. Systems And Methods For Data Normalization. U.S. Patent Application Number 20230147366, filed November 8, 2022 and posted May 11, 2023. Patent URL (for desktop use only): https://ppubs.uspto.gov/pubwebapp/external.html?q=(20230147366)&db=US-PGPUB&type=ids

(Our reports deliver fact-based news of research and discoveries from around the world.)
