Researchers Submit Patent Application, “Methods And Systems For Monitoring A Risk Of Re-Identification In A De-Identified Database”, for Approval (USPTO 20220129584): Mirador Analytics Limited
2022 MAY 17 (NewsRx) -- By a
The patent’s assignee is
News editors obtained the following quote from the background information supplied by the inventors: “A database, or dataset, is an organized collection of data, generally stored and accessed electronically from a computer system. Databases are often organized in tables where each row represents a database record and each column represents a database field. A record may correspond for example to an individual and a field may correspond for example to an attribute of a person, such as the person’s name, age, nationality and so on.
“With the advancement of big data analytics and data science, the number of data marketplaces and organizations selling or sharing databases has multiplied. Parallelly, the privacy of individuals whose information is contained in those databases has become an increasing concern. Various provisions, both national and international, have been introduced to ensure that organizations working with databases which contain personal data protect the privacy of individuals through sufficient levels of data de-identification.
“Organizations are generally required to de-identify databases before sharing them with third parties and/or the public. De-identification is the process of removing or obscuring fields that allow an individual to be identified. Typically, a dataset is de-identified by removing fields which comprise explicit personal information such as personal names or social security numbers. These are generally called “identifiers” or “direct identifiers”. However, a database may also comprise fields, referred to as “quasi-identifiers”, which are not direct identifiers but which in combination with other quasi-identifiers from the same or from other databases may lead to the identification of an individual. Examples of quasi-identifiers may be for example full zip codes, date of birth or death and so on. An attacker may manage to re-identify one or more records in a database where no direct identifiers are present by consulting public sources such as civil registries or Census databases and linking quasi-identifiers in the database to direct identifiers available in the public source.
“The risk of re-identification of a dataset, i.e. the risk that one or more records in the dataset may be re-identified and associated to a specific individual, is a big concern particularly for databases which contain healthcare data, such as databases managed by hospital systems, provider groups, insurance companies, analytics companies, and so on. Some regulations set a minimum standard for de-identification which such database owners must meet in order to ensure the risk of re-identification is kept at a minimum. In the US for example, the sharing of electronic medical records (EMR) is subject to the de-identification standard set forth by the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Despite the HIPAA Privacy Rule delineating several routes by which data can be rendered de-identified, most organizations rely on the “Safe Harbor” approach, which enumerates 18 identifiers that must be suppressed. However, this approach is often criticized as being too stringent because it may suppress attributes which are essential for epidemiological and population-based studies, thereby limiting the usefulness of the databases for research purposes.
“A dataset, such as the electronic medical records of a clinical trial, may be updated constantly, due to the removal, amending or addition of data. Therefore, it is likely that a user may need to re-check quite frequently that the risk of re-identification is still compliant with the provisions. This is problematic from a privacy perspective, since it relies on database owners repeatedly needing the service of a risk determination expert or relying on their own assessment to determine when the risk of a database needs to be re-assessed, which may result in an increased violation of the regulations in place. Even when a risk determination expert is regularly consulted, customers’ data will often drift toward higher risk levels between evaluations, thereby bringing the database compliance into question.
“For large datasets, even one assessment of risk is often quite expensive from a time, computational and economical point of view. Current methods to estimate a risk of re-identification are quite cumbersome and often involve a risk determination expert assessing each database individually with fairly limited automation. The risk determination expert may be required to go through various meetings and conversations with their customers, the database owners, before the assessment is possible. The process may be further slowed down by customers and experts often being in different time zones, customers providing inaccurate data and accompanying information, and so on.
“In conclusion, assessing the risk of re-identification can often turn out to be a time-consuming process which slows down the workflows of owners, researchers who need the database for their studies, and users in general; and the issues are even more problematic when multiple determinations are required over a short period of time. It would also be desirable to have systems and methods for automatically alerting the user when a new risk assessment is required; and it would be desirable to have systems and methods for assessing the risk of re-identification in a faster and fully or semi-automated way such that the time, costs and number of interactions between database owners/users and experts are minimized.
“The optimal de-identification strategy and the model used to estimate a risk of re-identification for a database may depend on the specific application. Different users may tolerate different levels of risk or they may prioritize certain attributes to be maintained in the database over others. In certain circumstances, a user may prefer to remove specific records which have a particularly high risk of being re-identified, rather than perturbing or removing a field for the entire database. Or, a user may prefer to sacrifice certain attributes or fields and remove them from the database altogether rather than stripping some records off the database. It would be desirable to have methods and systems which allow the users and the risk determination experts to take into account user needs and easily adapt the de-identification strategy and the risk model to each specific application.
“The risk of re-identification has traditionally been assessed by risk determination experts by measuring the level of violation of k-anonymity, i.e. by assessing how many records in the dataset have a k-value below a pre-determined threshold. A dataset is said to have k-anonymity if the information for each record in the dataset cannot be distinguished from at least k-1 other records in the dataset. Violation from k-anonymity is calculated as the percentage of records that have a k-value less than some threshold, e.g. 5. Generally, the accepted criteria for considering a dataset de-identified is having less than 1% of the records with k-value below 5. However, this approach presents some disadvantages. Firstly, it is based on a relative calculation in which the risk of each record is computed relative to other records in the dataset, and therefore it can only be applied to a dataset and not to individual records. Secondly, the k-anonymity approach implies that the risk of each record is affected by the size of the dataset. Thirdly, if some of the records are missing certain information, this affects calculation of the risk of the records which fall in the same k-anonymity group. Lastly, it does not allow for an easy understanding of how each variable contributes to the risk. Using k-anonymity to estimate re-identification risk can also result in an overestimation of the risk and in turn in unnecessary suppression of information contained in the database, thereby degrading the quality and utility of the dataset. High levels of privacy for individuals should be guaranteed in all databases comprising sensitive healthcare data while maximizing data utility to allow for innovation, efficiency, and development in healthcare. Therefore, de-identification criteria should be construed on the principle that the risk of re-identification should be kept small enough in order to ensure the privacy of individuals is protected whilst not removing useful data unnecessarily. It would be desirable to provide a method for estimating a risk of re-identification of a database which is not overly stringent and which takes into account the absolute risk of re-identification of a record.
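As an illustration only (this sketch is not taken from the application), the traditional k-anonymity check described above could be computed roughly as follows. The function name, the field names `zip3` and `birth_year`, and the sample records are hypothetical; the threshold of 5 and the 1% criterion are the figures quoted in the background.

```python
from collections import Counter

def k_anonymity_violation(records, quasi_identifiers, k_threshold=5):
    """Fraction of records whose k-value (size of the group of records
    sharing the same quasi-identifier values) is below k_threshold."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    violating = sum(1 for key in keys if group_sizes[key] < k_threshold)
    return violating / len(records)

# Hypothetical toy dataset: one group of 6 records and one group of 2.
records = (
    [{"zip3": "100", "birth_year": 1980}] * 6
    + [{"zip3": "945", "birth_year": 1975}] * 2
)
violation = k_anonymity_violation(records, ["zip3", "birth_year"])
# Criterion quoted above: fewer than 1% of records with a k-value below 5.
considered_deidentified = violation < 0.01
print(violation, considered_deidentified)
```

Because the k-value of a record depends on how many other records share its values, the result changes with the size and composition of the dataset, which is the first limitation the inventors point out.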
“Lastly, database owners may need to document statistical analysis and rationale for any residual disclosure risk to prove compliance to multiple regulatory bodies, for example if similar data is used in different countries or made available to different types of recipients or for different applications. Definitions of deidentification and anonymization may differ for different industries, countries, or regions meaning a company must perform determinations to align with the differing definitions. Database owners need to perform an expert determination each time there are changes in the data or the surrounding environment. These determinations can take time, and the multiple iterations can further contribute to times and delays. Between assessments, a database which is being regularly updated may have reached an unacceptable level of risk.”
As a supplement to the background information on this patent application, NewsRx correspondents also obtained the inventors’ summary information for this patent application: “It is an object of the disclosure to address one or more of the above-mentioned limitations.
“According to a first aspect of the disclosure there is provided: a method for monitoring a risk of re-identification for a dataset de-identified from a source database containing information identifiable to individuals, the method comprising: providing a user interface (UI) configured to receive as input the dataset and updates to said dataset; providing as input to the UI the dataset; estimating a risk of re-identification for the dataset or a subset of the database; providing as input to the UI the updates to said dataset; regularly monitoring whether the risk of re-identification for at least one of the updated dataset, the subset of the database and the updates is below a predetermined dataset risk threshold; and if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, notifying the user.
“Optionally, the database comprises a plurality of database records and a plurality of database fields, wherein each database record has a plurality of associated field values, each associated field value being related to a database field.
“Optionally, the dataset comprises a plurality of dataset records and a plurality of dataset fields, wherein each dataset record has a plurality of associated field values, each associated field value being related to a dataset field.
“Optionally, the plurality of dataset records is a subset of the database records and the plurality of dataset fields is a subset of the database fields.
“Optionally, each database record corresponds to an individual of a source population.
“Optionally, the database fields comprise one or more medical data fields.
“Optionally, the updates to the dataset comprise one or more of: removing one or more records from the dataset records; adding one or more records to the dataset records; and removing, adding or modifying one or more dataset fields.
“Optionally, one or more fields correspond to a categorical or numerical variable and modifying such fields comprises reducing the granularity of the field values relating to said fields.
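To make the idea of reducing the granularity of field values concrete, here is a minimal sketch, not drawn from the application. It assumes two hypothetical quasi-identifier fields, an age in years and a five-digit zip code, and coarsens each one; the function names and default parameters are illustrative choices.

```python
def generalize_age(age, band_width=10):
    """Replace an exact age with the band it falls in, e.g. 37 -> '30-39'."""
    lower = (age // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"

def truncate_zip(zip_code, keep_digits=3):
    """Keep only the leading digits of a zip code and mask the rest."""
    return zip_code[:keep_digits] + "*" * (len(zip_code) - keep_digits)

record = {"age": 37, "zip": "94110"}
coarse = {"age": generalize_age(record["age"]), "zip": truncate_zip(record["zip"])}
print(coarse)  # {'age': '30-39', 'zip': '941**'}
```

Coarser values are shared by more people, so each record becomes harder to single out, at the cost of some analytical detail.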
“The method may comprise providing in input one or more different risk estimation models and the risk of re-identification for a dataset is estimated according to one or more different risk estimation models.
“Optionally, estimating a risk of re-identification comprises estimating an individual risk of re-identification for each record; and determining how many records have an individual risk of re-identification above a pre-specified individual risk threshold.
“Optionally, estimating the individual risk of re-identification for each record comprises: selecting a subset of fields; and for each field in the subset, computing a population field statistical distribution.
“Optionally, estimating the individual risk of re-identification for each record further comprises: computing a combined statistical distribution of the subset of fields from the population field statistical distributions; and from said combined statistical distribution, computing the likely number of members of the source population that have the same field value as the record for each field in the subset of fields.
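The summary does not say how the combined distribution is built or how the expected number of matching individuals is turned into a risk figure. The sketch below is therefore only one possible reading, under two loudly stated assumptions that are not in the source: the selected fields are treated as statistically independent, so the combined probability is the product of the per-field probabilities, and the individual risk is taken as one over the expected number of matching population members. All names and the toy population are hypothetical.

```python
from collections import Counter

def field_distribution(population, field):
    """Population field distribution: share of the population per field value."""
    counts = Counter(person[field] for person in population)
    total = len(population)
    return {value: n / total for value, n in counts.items()}

def expected_matches(record, distributions, population_size):
    """Likely number of population members sharing the record's values.
    ASSUMPTION: fields are independent, so probabilities multiply."""
    probability = 1.0
    for field, dist in distributions.items():
        probability *= dist.get(record[field], 0.0)
    return probability * population_size

def individual_risk(record, distributions, population_size):
    """ASSUMPTION: risk = 1 / expected matches, capped at 1.0.
    A value never seen in the population yields the maximum risk."""
    matches = expected_matches(record, distributions, population_size)
    return 1.0 if matches <= 1.0 else 1.0 / matches

# Hypothetical source population of 1000 people.
population = (
    [{"sex": "F", "age_band": "30-39"}] * 100
    + [{"sex": "M", "age_band": "30-39"}] * 100
    + [{"sex": "F", "age_band": "40-49"}] * 400
    + [{"sex": "M", "age_band": "40-49"}] * 400
)
fields = ["sex", "age_band"]
distributions = {f: field_distribution(population, f) for f in fields}
risk = individual_risk({"sex": "F", "age_band": "30-39"}, distributions, len(population))
print(round(risk, 4))
```

Unlike a k-value, a figure computed this way depends on the reference population rather than on the other records in the dataset, which is the kind of absolute, per-record risk the background asks for.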
“Optionally, the fields in the subset of fields are selected such that all fields in the subset of fields are quasi-identifiers.
“Optionally, computing the population field statistical distribution comprises selecting the source database or a second database external to the source database which relates to the source population; and deriving the population field statistical distribution from the selected database.
“Optionally, estimating the risk of re-identification comprises computing a mean and standard deviation of the individual risk of re-identification for all dataset records.
“Optionally, the method comprises computing an internal statistical distribution of the dataset; and regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below the predetermined dataset risk threshold comprises: regularly monitoring the internal statistical distribution of the dataset; and if the internal statistical distribution varies beyond a predetermined accepted variation, re-computing the risk of re-identification for the dataset.
“Optionally, providing updates to the initial dataset comprises providing a set of de-identified records to be added to the dataset.
“Optionally, regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined dataset risk threshold comprises computing the risk of re-identification for the set of database records; and if the risk of re-identification for the set of database records is greater than the risk of re-identification for the dataset, re-computing the risk of re-identification for the updated dataset.
“Optionally, regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined dataset risk threshold comprises: each time a set of database records is added to the dataset, computing an internal statistical distribution of the set of database records; and if the internal statistical distribution of the set of database records differs from the internal statistical distribution of the dataset beyond the predetermined accepted variation, re-computing the risk of re-identification for the updated dataset.
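The drift check described here can be sketched as follows. The source does not specify how two internal distributions are compared, so this example assumes total variation distance over a single categorical field as the measure; that choice, the default `accepted_variation` of 0.1, and all names are hypothetical.

```python
from collections import Counter

def distribution(records, field):
    """Internal distribution of one field: share of records per value."""
    counts = Counter(r[field] for r in records)
    total = len(records)
    return {value: n / total for value, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    values = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)

def needs_reassessment(dataset, new_records, field, accepted_variation=0.1):
    """True if the added records differ from the dataset beyond the accepted
    variation, signalling that the risk should be re-computed."""
    distance = total_variation(distribution(dataset, field),
                               distribution(new_records, field))
    return distance > accepted_variation

dataset = [{"region": "A"}] * 80 + [{"region": "B"}] * 20
similar_batch = [{"region": "A"}] * 8 + [{"region": "B"}] * 2
skewed_batch = [{"region": "A"}] * 5 + [{"region": "B"}] * 5
print(needs_reassessment(dataset, similar_batch, "region"))  # False
print(needs_reassessment(dataset, skewed_batch, "region"))   # True
```

A cheap comparison of this kind lets the full risk estimate be skipped for batches that look like the existing data and run only when the data has shifted.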
“Optionally, the method further comprises providing as output a metric representing the absolute or proportional number of identifiable and non-identifiable records in the dataset.
“Optionally, the method further comprises providing as output a metric representing the absolute or proportional number of higher risk and lower risk records in the dataset.
“Optionally, estimating the risk of re-identification comprises: for each source database, providing a list of risk-determination rules; and automatically computing the risk of re-identification of the database based on the list of risk-determination rules.
“Optionally, the method further comprises: if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, identifying the dataset as non-compliant; iteratively modifying the non-compliant dataset until the risk of re-identification for the modified dataset is below the predetermined dataset risk threshold in order to generate a compliant dataset; and providing the compliant dataset in the user interface.
“Optionally, the method further comprises providing as input to the user interface a set of modification rules based on the source database.
“Optionally, the non-compliant dataset is modified according to the modification rules.
“Optionally, the method comprises providing in input one or more user field priority settings and/or other user priority settings and the modification rules take into account said settings.
“Optionally, modifying the non-compliant dataset comprises removing one or more records for which the individual risk of re-identification is above the pre-determined individual risk threshold.
“Optionally, generating a compliant dataset comprises identifying fields in the dataset which are contributing to the risk of re-identification and removing or modifying one or more of said fields.
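One of the modification strategies mentioned above, removing records whose individual risk is too high until the dataset is compliant, might look like the sketch below. It is an illustration rather than the applicants' procedure: it assumes the dataset risk is the proportion of records above the individual risk threshold, and the threshold values, the `risk` field and the function names are hypothetical.

```python
def dataset_risk(records, risk_of, individual_threshold):
    """ASSUMPTION: dataset risk = share of records above the individual threshold."""
    if not records:
        return 0.0
    high = sum(1 for r in records if risk_of(r) > individual_threshold)
    return high / len(records)

def make_compliant(records, risk_of, individual_threshold=0.2, dataset_threshold=0.01):
    """Iteratively drop the highest-risk record until the dataset risk
    is below the dataset threshold; returns the remaining records."""
    remaining = sorted(records, key=risk_of)  # lowest risk first
    while remaining and dataset_risk(remaining, risk_of, individual_threshold) >= dataset_threshold:
        remaining.pop()  # remove the current highest-risk record
    return remaining

# Hypothetical records carrying a pre-computed individual risk.
records = [{"id": i, "risk": 0.01} for i in range(98)] + [
    {"id": 98, "risk": 0.5},
    {"id": 99, "risk": 0.5},
]
compliant = make_compliant(records, risk_of=lambda r: r["risk"])
print(len(compliant))  # 98
```

A user who prefers to keep every record could instead coarsen or remove the fields that contribute most to the risk, as the last optional feature above describes.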
“Optionally, regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined threshold comprises estimating the risk of re-identification for the updated dataset at scheduled intervals.
“Optionally, the intervals are predetermined time intervals.
“Optionally, the intervals are a predetermined number of updates intervals.
“Optionally, the predetermined dataset risk threshold comprises a range set by a user, an assessor or a regulatory body.
“Optionally, the method comprises: if the risk of re-identification is close to reaching or exceeding the predetermined dataset risk threshold, providing an alert in the user interface.
“Optionally, the method comprises: if the risk of re-identification is close to reaching or exceeding the predetermined dataset risk threshold, providing an alert by email and/or text.
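The scheduling and alerting options above can be combined in a small monitor, sketched here under assumptions that are not in the source: the schedule is a fixed number of updates between checks, "close to reaching" the threshold is taken to mean at least 90% of it, and the returned strings stand in for whatever UI, email or text notification a real system would send. The class and parameter names are hypothetical.

```python
class RiskMonitor:
    """Re-estimates the risk after every `updates_per_check` updates."""

    def __init__(self, estimate_risk, threshold, updates_per_check=3, warn_fraction=0.9):
        self.estimate_risk = estimate_risk      # callable: dataset -> risk value
        self.threshold = threshold              # predetermined dataset risk threshold
        self.updates_per_check = updates_per_check
        self.warn_fraction = warn_fraction      # ASSUMPTION: defines "close to" the threshold
        self.pending_updates = 0

    def record_update(self, dataset):
        """Call once per update; returns None between scheduled checks."""
        self.pending_updates += 1
        if self.pending_updates < self.updates_per_check:
            return None
        self.pending_updates = 0
        risk = self.estimate_risk(dataset)
        if risk >= self.threshold:
            return "notify"   # threshold reached or exceeded
        if risk >= self.warn_fraction * self.threshold:
            return "alert"    # approaching the threshold
        return "ok"

# Hypothetical usage: the dataset is a dict carrying a pre-computed risk.
monitor = RiskMonitor(lambda d: d["risk"], threshold=0.01, updates_per_check=2)
results = [monitor.record_update({"risk": r})
           for r in (0.001, 0.001, 0.0095, 0.0095, 0.02, 0.02)]
print(results)  # [None, 'ok', None, 'alert', None, 'notify']
```

A time-based schedule would follow the same pattern, with the update counter replaced by a comparison against the time of the last check.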
“Optionally, the method further comprises providing in the GUI an automatically generated outcome report of the monitoring of the risk of re-identification.
“Optionally, the user interface comprises a graphical user interface (GUI).
“Optionally, the method further comprises providing a graphical representation of the fluctuations of the risk of re-identification over time.
“Optionally, the method further comprises providing in the GUI a certificate of compliance with the predetermined dataset risk threshold.
“According to a second aspect of the disclosure there is provided: a system for monitoring a risk of re-identification for a dataset de-identified from a source database containing information identifiable to individuals, the system comprising:
“a user interface (UI) configured to receive as input the dataset and updates to said dataset;
“a memory configured to store the dataset;
“a risk estimation module configured to estimate a risk of re-identification for the dataset or a subset of the database; and
“a risk monitoring module configured to regularly monitor whether the risk of re-identification for at least one of the updated dataset, the subset of the database or the updates is below a predetermined dataset risk threshold;
“wherein the system is configured to:
“if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, notify the user.
“Optionally the user interface comprises a graphical user interface (GUI).
“Optionally, the graphical user interface comprises one or more of a dataset owner view, an expert view and a reviewer view.”
There is additional summary information. Please visit full patent to read further.
The claims supplied by the inventors are:
“1. A method for monitoring a risk of re-identification for a dataset de-identified from a source database containing information identifiable to individuals, the method comprising: providing a user interface (UI) configured to receive as input the dataset and updates to said dataset; providing as input to the UI the dataset; providing a computer device configured to estimate a risk of re-identification for the dataset or a subset of the database by automatically estimating an individual risk of re-identification for each record and determining how many records have an individual risk of re-identification above a pre-specified individual risk threshold; providing as input to the UI the updates to said dataset; wherein said computer device is further configured to regularly monitor whether the risk of re-identification for at least one of the updated dataset or the subset of the database or the updates is below a predetermined dataset risk threshold, and if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, to automatically notify a user.
“2. The method of claim 1, wherein the database comprises a plurality of database records and a plurality of database fields, wherein each database record has a plurality of associated field values, each associated field value being related to a database field; the dataset comprises a plurality of dataset records and a plurality of dataset fields, wherein each dataset record has a plurality of associated field values, each associated field value being related to a dataset field; and the plurality of dataset records is a subset of the database records and the plurality of dataset fields is a subset of the database fields.
“3. The method of claim 2, wherein each database record corresponds to an individual of a source population.
“4. (canceled)
“5. The method of claim 4, wherein estimating the individual risk of re-identification for each record comprises: selecting a subset of fields; and for each field in the subset, computing a population field statistical distribution; computing a combined statistical distribution of the subset of fields from the population field statistical distributions; and from said combined statistical distribution, computing the likely number of members of the source population that have the same field value as the record for each field in the subset of fields.
“6. The method of claim 5, wherein the fields in the subset of fields are selected such that all fields in the subset of fields are quasi-identifiers.
“7. The method of claim 6, wherein computing the population field statistical distribution comprises: selecting the source database or a second database external to the source database which relates to the source population; and deriving the population field statistical distribution from the selected database.
“8. The method of claim 1, wherein the method further comprises computing an internal statistical distribution of the dataset; and regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below the predetermined dataset risk threshold comprises: regularly monitoring the internal statistical distribution of the dataset; and if the internal statistical distribution varies beyond a predetermined accepted variation, re-computing the risk of re-identification for the dataset.
“9. The method of claim 1, wherein providing updates to the initial dataset comprises providing a set of de-identified records to be added to the dataset.
“10. The method of claim 9, wherein regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined dataset risk threshold comprises computing the risk of re-identification for the set of database records; and if the risk of re-identification for the set of database records is greater than the risk of re-identification for the dataset, re-computing the risk of re-identification for the updated dataset.
“11. The method of claim 9, wherein regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined dataset risk threshold comprises: each time a set of database records is added to the dataset, computing an internal statistical distribution of the set of database records; if the internal statistical distribution of the set of database records differs from the internal statistical distribution of the dataset beyond the predetermined accepted variation, re-computing the risk of re-identification for the updated dataset.
“12. The method of claim 1, wherein estimating the risk of re-identification comprises: for each source database, providing a list of risk-determination rules; and automatically computing the risk of re-identification of the database based on the list of risk-determination rules.
“13. The method of claim 1, wherein the method further comprises: if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, identifying the dataset as non-compliant; iteratively modifying the non-compliant dataset until the risk of re-identification for the modified dataset is below the predetermined dataset risk threshold in order to generate a compliant dataset; and providing the compliant dataset in the user interface.
“14. The method of claim 13, wherein the method further comprises providing as input to the user interface a set of modification rules based on the source database; and the non-compliant dataset is modified according to the modification rules.
“15. The method of claim 14, wherein generating a compliant dataset comprises identifying fields in the dataset which are contributing to the risk of re-identification and removing or modifying one or more of said fields.
“16. The method of claim 1, wherein regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined threshold comprises estimating the risk of re-identification for the updated dataset at scheduled intervals.
“17. The method of claim 1, wherein the method further comprises providing in the GUI an automatically generated outcome report of the monitoring of the risk of re-identification.
“18. A system for monitoring a risk of re-identification for a dataset de-identified from a source database containing information identifiable to individuals, the system comprising: a user interface (GUI) configured to receive as input the dataset and updates to said dataset; a memory configured to store the dataset; and a risk monitoring computer device configured to regularly monitor whether the risk of re-identification for at least one of the updated dataset or a subset of the database or the updates is below a predetermined dataset risk threshold, by automatically estimating an individual risk of re-identification for each record and determining how many records have an individual risk of re-identification above a pre-specified individual risk threshold; wherein the system is configured to if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, automatically notify the user.
“19. The system of claim 18, wherein the database comprises a plurality of database records and a plurality of database fields, wherein each database record has a plurality of associated field values, each associated field value being related to a database field; the dataset comprises a plurality of dataset records and a plurality of dataset fields, wherein each dataset record has a plurality of associated field values, each associated field value being related to a dataset field; and the plurality of dataset records is a subset of the database records and the plurality of dataset fields is a subset of the database fields.
“20. The system of claim 19, wherein the user interface comprises a graphical user interface (GUI); the updates to the database comprise one or more of: removing one or more records from the dataset records; adding one or more records to the dataset records; and removing, adding or modifying one or more dataset fields; and the graphical user interface comprises graphical elements to allow the user to modify one or more dataset fields; and the graphical user interface is configured to show the evolution of the risk of re-identification for the dataset in real-time.
“21. A method for determining whether a dataset de-identified from a source database containing information identifiable to individuals is compliant with one or more given regulations, the method comprising: storing a list of risk-determination and compliance rules in a memory; using a computer device, automatically computing the risk of re-identification of the dataset based on one or more of said rules stored in the memory; and using a computer device, automatically determining whether the dataset is compliant with the one or more regulations based on one or more of said rules.
“22. The method of claim 21, wherein the rules in the list of risk-determination and compliance rules are dependent on the one or more provided regulations.
“23. The method of claim 21, wherein the method further comprises: if the dataset is determined to be compliant with one or more of the provided regulations, automatically generating a certificate of compliance for said one or more regulations.
“24. The method of claim 21, wherein the method further comprises, if the dataset is determined to be non-compliant with one or more of the provided regulations, implementing one or more of the following steps: automatically determining a list of causes of non-compliance; and automatically determining a list of corrective steps to modify the dataset and make it compliant; automatically modifying the non-compliant dataset to provide a compliant dataset.”
There are additional claims. Please visit full patent to read further.
For additional information on this patent application, see: BAYLESS, Paul; BLACKPORT, John; GRAY, Jamie; MOFFATT, Colin; SYMMERS, Paul. Methods And Systems For Monitoring A Risk Of Re-Identification In A De-Identified Database. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)