“De-Identification Of Protected Information In Multiple Modalities” in Patent Application Approval Process (USPTO 20200074101)
2020 MAR 25 (NewsRx) -- By a
This patent application is assigned to
The following quote was obtained by the news editors from the background information supplied by the inventors: “As technology advances, more and more data is being collected, e.g., from the ‘internet of things,’ as well as from more specialized data sources such as health care equipment and personnel. For example, with the advent of the Electronic Health Record (‘EHR’) system, there is an exponential growth in the volume of information (e.g., symptoms, diagnoses, procedures, medications etc.) collected from patients during the course of a treatment. A multi-specialty hospital has many departments resulting in the generation of hundreds of gigabytes of data every day. Also, more and more structured data is being made available for research. As data collection and proliferation becomes more and more ubiquitous, it becomes increasingly important to anonymize various types of protected data while also allowing the data to be leveraged to its full potential. For example, various types of data may be subjected to de-identification or anonymization processing in which data that are usable to identify an individual or group may be scrubbed while other data may be maintained in some form so that it can be used for various beneficial purposes.
“Patient healthcare data can be extremely useful for a variety of purposes, such as disease research, development of drugs and other treatments, etc. However, this data is typically considered highly sensitive, and therefore may be covered by national, regional, hospital, or business regulations. Examples include the Health Insurance Portability and Accountability Act (‘HIPAA’) requirements for data privacy in the US, Informatics for Integrating Biology and the Bedside (‘i2b2’),
In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “Given the many possible requirements for what constitutes PHI in a particular study or how that PHI is required to be handled, efforts to create a software system capable of producing de-identified output acceptable to all standards have failed. Instead, software systems have been created piecemeal that are tailored for each application. The problem is compounded by the requirement to process many different types of data, such as imaging data, electronic medical record (‘EMR’) extracts, waveforms, free text notes, etc., in a consistent manner such that the output of all systems may be linked to form a full multi-modal view of the patient. The traditional solution to this problem has been to create individual software systems that process each type of data, as well as each modality of a data type. Each new type of data to be processed requires re-implementation of the de-identification components, consistent configurations to ensure that all components are treating PHI in an identical way, and methods of ensuring that the output of each isolated processing layer is consistent. This is especially difficult if look-up tables are required (as they often are), and lookup tables must be synced between processing components.
“Accordingly, the present disclosure is directed to a framework for centralized de-identification of protected data associated with subjects in multiple modalities based on a hierarchal taxonomy of policies and corresponding handlers. For example, in the healthcare context, techniques described herein may be implemented to provide a centralized platform that is capable of processing multiple data streams containing multiple data types and/or data modalities. The platform may be easily configurable to perform de-identification in accordance with a variety of different regulations, as well as to facilitate other features such as deduplication, auditing, and/or discoverability. In some embodiments, the platform may make use of a hierarchal taxonomy to classify individual data points, as well as to select handlers to process the data points in accordance with their classifications. Techniques disclosed herein create a single software platform and framework to act as a single point of configuration and to perform centralized
“As used herein, a ‘data type’ refers to a type of data, e.g., a source of data. One example of a data type is a subject identifier. Subject identifiers can include what will be referred to herein as ‘external,’ ‘internal,’ and ‘system’ identifiers. An external identifier is a general-purpose identifier (although it may have been initially created for a specific context) that is used in a variety of circumstances beyond a particular context, such as a social security number, a driver’s license number, United States
“As used herein, a ‘data modality’ or ‘modality’ refers to a way of expressing a particular data type, e.g., with a particular level of granularity. For example, a datetime can be expressed in a number of ways (i.e., modalities), such as ISO 8601. As another example, a location data type can be expressed in various modalities and/or granularities, such as a ZIP code, a street address, a city/state, etc. As yet another example, phone numbers may be expressed in various ways, such as with or without area codes, with or without interspersed commas, and so forth. In various embodiments, various modalities may be captured by regular expressions or other similar means.
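As an editorial illustration of how modalities might be captured by regular expressions, the following is a minimal sketch only. The modality names, the patterns, and the `classify_modality` helper are all hypothetical and do not come from the patent application.

```python
import re

# Hypothetical modality patterns per data type. Both the names and the
# regular expressions are illustrative assumptions, not the inventors' own.
MODALITY_PATTERNS = {
    "datetime": {
        "iso8601_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
        "us_slash_date": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    },
    "location": {
        "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    },
    "phone": {
        "with_area_code": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-.]?\d{4}"),
        "without_area_code": re.compile(r"\d{3}[-.]?\d{4}"),
    },
}


def classify_modality(data_type, value):
    """Return the first modality whose pattern matches the whole value, else None."""
    for modality, pattern in MODALITY_PATTERNS.get(data_type, {}).items():
        if pattern.fullmatch(value):
            return modality
    return None
```

Under these assumed patterns, `classify_modality("datetime", "2020-03-25")` would resolve to the ISO 8601 modality, while a value that matches no pattern is left unclassified so that a caller can decide how to treat it.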
“Generally, in one aspect, a method for multi-modal, centralized de-identification, may include: receiving, one or more data sets associated with one or more subjects, each of the one or more data sets containing a plurality of data points associated with a respective subject of the one or more subjects, wherein at least some of the plurality of data points associated with the respective subject are usable to identify the respective subject, and wherein the plurality of data points associated with the respective subject include multiple data types; for each respective subject of the one or more subjects: determining a classification of each data point of the plurality of data points associated with the respective subject in accordance with a hierarchal taxonomy, wherein the hierarchal taxonomy defines, for each respective data type of the multiple data types, a sub-taxonomy of modalities associated with the respective data type; based on the classifications, identifying a plurality of respective handlers for the plurality of data points associated with the respective subject, wherein at least one of the handlers is configured to obfuscate or drop a data point of the plurality of data points associated with the respective subject; and processing each data point of the plurality of data points associated with the respective subject using the respective identified handler, thereby de-identifying the plurality of data points associated with the respective subject.
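The classify-then-dispatch flow described in this paragraph can be sketched roughly as follows. This is an editorial illustration under stated assumptions: classifications are represented as tuples that spell out a path through a hypothetical taxonomy, the handler names (`drop`, `mask`, `passthrough`) are invented for the example, and unclassified data points are dropped by default. None of these choices is specified by the application.

```python
from typing import Callable, Optional, Sequence, Tuple

Handler = Callable[[str], Optional[str]]


def drop(value: str) -> Optional[str]:
    """Remove the data point entirely (signalled by returning None)."""
    return None


def mask(value: str) -> Optional[str]:
    """Obfuscate the data point by replacing every character."""
    return "*" * len(value)


def passthrough(value: str) -> Optional[str]:
    """Leave the data point unchanged."""
    return value


# Hypothetical taxonomy: each key is a path (data type, then modality levels).
HANDLERS = {
    ("identifier",): drop,
    ("identifier", "external"): mask,
    ("identifier", "internal"): passthrough,
    ("measurement",): passthrough,
}


def resolve_handler(classification: Sequence[str]) -> Handler:
    """Pick the most specific handler by walking up the taxonomy path."""
    path = tuple(classification)
    while path:
        if path in HANDLERS:
            return HANDLERS[path]
        path = path[:-1]
    return drop  # assumption: anything unclassified is dropped


def deidentify(data_points: Sequence[Tuple[Tuple[str, ...], str]]):
    """Apply the resolved handler to each (classification, value) pair."""
    output = []
    for classification, value in data_points:
        result = resolve_handler(classification)(value)
        if result is not None:
            output.append((classification, result))
    return output
```

Keying handlers by path prefix is one way to let a policy set a default for a whole data type while overriding it for a specific modality; a real system might use a tree structure or a policy file instead.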
“In various embodiments, the one or more subjects may include one or more patients, and the one or more data sets associated with the one or more subjects may include medical records associated with the one or more patients. In various embodiments, the multiple data types of the plurality of data points associated with each respective patient of the one or more patients may include an external identification number associated with the respective patient and a physiological measurement of the respective patient.
“In various embodiments, the plurality of data points associated with each respective subject may be received from multiple different data sources, each data source storing a particular data type of the multiple data types. In various embodiments, identifying the plurality of respective handlers for the plurality of data points associated with the respective subject may include identifying, for each given data point of the plurality of data points, the respective handler based on the modality of the given data point. In various embodiments, a data type of the given data point may be a date, and the respective handler may be configured to apply a date shift to the given data point. In various embodiments, a data type of the given data point may be an external identifier that is usable to identify a subject of the one or more subjects, and the respective handler may be configured to obfuscate or drop the external identifier. In various embodiments, a data type of the given data point may be an internal identifier to which external access is limited, and the respective handler may be configured to allow the internal identifier to pass through.
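The application does not say how a date shift is computed. One common approach, shown here purely as an editorial sketch, derives a fixed per-subject offset from a keyed hash so that every date for the same subject moves by the same number of days and intervals between events are preserved. The function names, the secret key, and the 365-day bound are assumptions.

```python
import hashlib
import hmac
from datetime import date, timedelta


def subject_offset_days(subject_id: str, secret: bytes, max_days: int = 365) -> int:
    """Derive a stable offset in the range 1..max_days from a keyed hash."""
    digest = hmac.new(secret, subject_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_date(value: date, subject_id: str, secret: bytes) -> date:
    """Shift a date backwards by the subject's fixed offset."""
    return value - timedelta(days=subject_offset_days(subject_id, secret))
```

Because the offset depends only on the subject identifier and the secret, separate processing runs over different data sources would shift the same subject's dates consistently without sharing a look-up table.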
“In various embodiments, the method may further include generating a log to track the processing of each data point of the plurality of data points associated with the respective subject, wherein the log may be usable to audit the centralized de-identification.
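An audit log of this kind might record which handler processed each data point without storing the protected value itself. The sketch below is an editorial illustration; the field names and the `record_processing` helper are hypothetical and are not described in the application.

```python
from datetime import datetime, timezone


def record_processing(log, subject_token, classification, handler_name, action):
    """Append one audit entry; the raw data value is deliberately not stored."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "subject": subject_token,  # assumed to be a de-identified token
        "classification": "/".join(classification),
        "handler": handler_name,
        "action": action,  # e.g. "dropped", "obfuscated", "passed"
    }
    log.append(entry)
    return entry


def summarize_actions(log):
    """Count entries per action, which an auditor could compare against policy."""
    counts = {}
    for entry in log:
        counts[entry["action"]] = counts.get(entry["action"], 0) + 1
    return counts
```

Logging the classification and handler rather than the value keeps the log itself free of protected data while still letting a reviewer check that each data type was treated as the configured policy requires.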
“In addition, some implementations include one or more processors of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the aforementioned methods. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the aforementioned methods.
“It should be appreciated that all combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are contemplated as being part of the inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the inventive subject matter disclosed herein. It should also be appreciated that terminology explicitly employed herein that also may appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.”
The claims supplied by the inventors are:
“1. A method for multi-modal, centralized de-identification, the method implemented using one or more processors and comprising: receiving, one or more data sets associated with one or more subjects, each of the one or more data sets containing a plurality of data points associated with a respective subject of the one or more subjects, wherein at least some of the plurality of data points associated with the respective subject are usable to identify the respective subject, and wherein the plurality of data points associated with the respective subject include multiple data types; for each respective subject of the one or more subjects: determining a classification of each data point of the plurality of data points associated with the respective subject in accordance with a hierarchal taxonomy, wherein the hierarchal taxonomy defines, for each respective data type of the multiple data types, a sub-taxonomy of modalities associated with the respective data type; based on the classifications, identifying a plurality of respective handlers for the plurality of data points associated with the respective subject, wherein at least one of the handlers is configured to obfuscate or drop a data point of the plurality of data points associated with the respective subject; and processing each data point of the plurality of data points associated with the respective subject using the respective identified handler, thereby de-identifying the plurality of data points associated with the respective subject.
“2. The method of claim 1, wherein the one or more subjects comprise one or more patients, and the one or more data sets associated with the one or more subjects include medical records associated with the one or more patients.
“3. The method of claim 2, wherein the multiple data types of the plurality of data points associated with each respective patient of the one or more patients include an external identification number associated with the respective patient and a physiological measurement of the respective patient.
“4. The method of claim 1, wherein the plurality of data points associated with each respective subject are received from multiple different data sources, each data source storing a particular data type of the multiple data types.
“5. The method of claim 1, wherein identifying the plurality of respective handlers for the plurality of data points associated with the respective subject includes identifying, for each given data point of the plurality of data points, the respective handler based on the modality of the given data point.
“6. The method of claim 5, wherein a data type of the given data point is a date, and the respective handler is configured to apply a date shift to the given data point.
“7. The method of claim 5, wherein a data type of the given data point is an external identifier that is usable to identify a subject of the one or more subjects, and the respective handler is configured to obfuscate or drop the external identifier.
“8. The method of claim 5, wherein a data type of the given data point is an internal identifier to which external access is limited, and the respective handler is configured to allow the internal identifier to pass through.
“9. The method of claim 1, further comprising generating a log to track the processing of each data point of the plurality of data points associated with the respective subject, wherein the log is usable to audit the centralized de-identification.
“10. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations: receiving, one or more data sets associated with one or more subjects, each of the one or more data sets containing a plurality of data points associated with a respective subject of the one or more subjects, wherein at least some of the plurality of data points associated with the respective subject are usable to identify the respective subject, and wherein the plurality of data points associated with the respective subject include multiple data types; for each respective subject of the one or more subjects: determining a classification of each data point of the plurality of data points associated with the respective subject in accordance with a hierarchal taxonomy, wherein the hierarchal taxonomy defines, for each respective data type of the multiple data types, a sub-taxonomy of modalities associated with the respective data type; based on the classifications, identifying a plurality of respective handlers for the plurality of data points associated with the respective subject, wherein at least one of the handlers is configured to obfuscate or drop a data point of the plurality of data points associated with the respective subject; processing each data point of the plurality of data points associated with the respective subject using the respective identified handler, thereby de-identifying the plurality of data points associated with the respective subject.
“11. The at least one non-transitory computer-readable medium of claim 10, wherein the one or more subjects comprise one or more patients, and the one or more data sets associated with the one or more subjects include medical records associated with the one or more patients.
“12. The at least one non-transitory computer-readable medium of claim 11, wherein the multiple data types of the plurality of data points associated with each respective patient of the one or more patients include an external identification number associated with the respective patient and a physiological measurement of the respective patient.
“13. The at least one non-transitory computer-readable medium of claim 10, wherein the plurality of data points associated with each respective subject are received from multiple different data sources, each data source storing a particular data type of the multiple data types.
“14. The at least one non-transitory computer-readable medium of claim 10, wherein identifying the plurality of respective handlers for the plurality of data points associated with the respective subject includes identifying, for each given data point of the plurality of data points, the respective handler based on the modality of the given data point.
“15. The at least one non-transitory computer-readable medium of claim 14, wherein a data type of the given data point is a date, and the respective handler is configured to apply a date shift to the given data point.
“16. The at least one non-transitory computer-readable medium of claim 14, wherein a data type of the given data point is an external identifier that is usable to identify a subject of the one or more subjects, and the respective handler is configured to obfuscate or drop the external identifier.
“17. The at least one non-transitory computer-readable medium of claim 14, wherein a data type of the given data point is an internal identifier to which external access is limited, and the respective handler is configured to allow the internal identifier to pass through.
“18. The at least one non-transitory computer-readable medium of claim 10, further comprising instructions for generating a log to track the processing of each data point of the plurality of data points associated with the respective subject, wherein the log is usable to audit the centralized de-identification.
“19. A system comprising one or more processors and memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations: receiving, one or more data sets associated with one or more subjects, each of the one or more data sets containing a plurality of data points associated with a respective subject of the one or more subjects, wherein at least some of the plurality of data points associated with the respective subject are usable to identify the respective subject, and wherein the plurality of data points associated with the respective subject include multiple data types; for each respective subject of the one or more subjects: determining a classification of each data point of the plurality of data points associated with the respective subject in accordance with a hierarchal taxonomy, wherein the hierarchal taxonomy defines, for each respective data type of the multiple data types, a sub-taxonomy of modalities associated with the respective data type; based on the classifications, identifying a plurality of respective handlers for the plurality of data points associated with the respective subject, wherein at least one of the handlers is configured to obfuscate or drop a data point of the plurality of data points associated with the respective subject; processing each data point of the plurality of data points associated with the respective subject using the respective identified handler, thereby de-identifying the plurality of data points associated with the respective subject.
“20. The system of claim 19, wherein the one or more subjects comprise one or more patients, and the one or more data sets associated with the one or more subjects include medical records associated with the one or more patients.”
For the URL and more information on this patent application, see: Carlson,
(Our reports deliver fact-based news of research and discoveries from around the world.)


