Patent Issued for Methods and systems for monitoring a risk of re-identification in a de-identified database (USPTO 11741262): Mirador Analytics Limited

2023 SEP 20 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- A patent by the inventors Bayless, Paul (Burke, VA, US), Blackport, John (Redpath, GB), Gray, Jamie (Melrose, GB), Moffatt, Colin (Melrose, GB), Symmers, Paul (Edinburgh, GB), filed on July 19, 2021, was published online on August 29, 2023, according to news reporting originating from Alexandria, Virginia, by NewsRx correspondents.

Patent number 11741262 is assigned to Mirador Analytics Limited (Colchester, United Kingdom).

The following quote was obtained by the news editors from the background information supplied by the inventors: “A database, or dataset, is an organized collection of data, generally stored and accessed electronically from a computer system. Databases are often organized in tables where each row represents a database record and each column represents a database field. A record may correspond for example to an individual and a field may correspond for example to an attribute of a person, such as the person’s name, age, nationality and so on.

“With the advancement of big data analytics and data science, the number of data marketplaces and organizations selling or sharing databases has multiplied. Parallelly, the privacy of individuals whose information is contained in those databases has become an increasing concern. Various provisions, both national and international, have been introduced to ensure that organizations working with databases which contain personal data protect the privacy of individuals through sufficient levels of data de-identification.

“Organizations are generally required to de-identify databases before sharing them with third parties and/or the public. De-identification is the process of removing or obscuring fields that allow an individual to be identified. Typically, a dataset is de-identified by removing fields which comprise explicit personal information such as personal names or social security numbers. These are generally called “identifiers” or “direct identifiers”. However, a database may also comprise fields, referred to as “quasi-identifiers”, which are not direct identifiers but which in combination with other quasi-identifiers from the same or from other databases may lead to the identification of an individual. Examples of quasi-identifiers may be for example full zip codes, date of birth or death and so on. An attacker may manage to re-identify one or more records in a database where no direct identifiers are present by consulting public sources such as civil registries or Census databases and linking quasi-identifiers in the database to direct identifiers available in the public source.

“The risk of re-identification of a dataset, i.e. the risk that one or more records in the dataset may be re-identified and associated to a specific individual, is a big concern particularly for databases which contain healthcare data, such as databases managed by hospital systems, provider groups, insurance companies, analytics companies, and so on. Some regulations set a minimum standard for de-identification which such database owners must meet in order to ensure the risk of re-identification is kept at a minimum. In the US for example, the sharing of electronic medical records (EMR) is subject to the de-identification standard set forth by the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Despite the HIPAA Privacy Rule delineating several routes by which data can be rendered de-identified, most organizations rely on the “Safe Harbor” approach, which enumerates 18 identifiers that must be suppressed. However, this approach is often criticized as being too stringent because it may suppress attributes which are essential for epidemiological and population-based studies, thereby limiting the usefulness of the databases for research purposes.

“A dataset, such as the electronic medical records of a clinical trial, may be updating constantly, due to the removal, amending or addition of data. Therefore, it is likely that a user may need to re-check that the risk of re-identification is still compliant with the provisions quite frequently. This is problematic from a privacy perspective, since it relies on database owners repeatedly needing the service of a risk determination expert or relying on their own assessment to determine when the risk of a database needs to be re-assessed, which may result in an increased violation of the regulations in place. Even when a risk determination expert is regularly consulted, customers’ data will often drift toward higher risk levels between evaluations, thereby bringing the database compliance into question.

“For large datasets, even one assessment of risk is often quite expensive from a time, computational and economical point of view. Current methods to estimate a risk of re-identification are quite cumbersome and often involve a risk determination expert assessing each database individually with fairly limited automation. The risk determination expert may be required to go through various meetings and conversations with their customers, the database owners, before the assessment is possible. The process may be further slowed down by customers and experts often being in different time zones, customers providing inaccurate data and accompanying information, and so on.

“In conclusion, assessing the risk of re-identification can often turn out to be a time-consuming process which slows down the workflows of owners, researchers who need the database for their studies, and users in general; and the issues are even more problematic when multiple determinations are required over a short period of time. It would also be desirable to have systems and methods for automatically alerting the user when a new risk assessment is required; and it would be desirable to have systems and methods for assessing the risk of re-identification in a faster and fully or semi-automated way such that the time, costs and number of interactions between database owners/users and experts are minimized.

“The optimal de-identification strategy and the model used to estimate a risk of re-identification for a database may depend on the specific application. Different users may tolerate different levels of risk or they may prioritize certain attributes to be maintained in the database over others. In certain circumstances, a user may prefer to remove specific records which have a particularly high risk of being re-identified, rather than perturbing or removing a field for the entire database. Or, a user may prefer to sacrifice certain attributes or fields and remove them from the database altogether rather than stripping some records off the database. It would be desirable to have methods and systems which allow the users and the risk determination experts to take into account user needs and easily adapt the de-identification strategy and the risk model to each specific application.

“The risk of re-identification has traditionally been assessed by risk determination experts by measuring the level of violation of k-anonymity, i.e. by assessing how many records in the dataset have a k-value above a pre-determined threshold. A dataset is said to have k-anonymity if the information for each record in the dataset cannot be distinguished from at least k-1 other records in the dataset. Violation from k-anonymity is calculated as the percentage of records that have a k-value less than some threshold, e.g. 5. Generally, the accepted criteria for considering a dataset de-identified is having less than 1% of the records with k-value below 5. However, this approach presents some disadvantages. Firstly, it is based on a relative calculation in which the risk of each record is computed relative to other records in the dataset, and therefore it can only be applied to a dataset and not to individual records. Secondly, the k-anonymity approach implies that the risk of each record is affected by the size of the dataset. Thirdly, if some of the records are missing certain information, this affects calculation of the risk of the records which fall in the same k-anonymity group. Lastly, it does not allow for an easy understanding of how each variable contributes to the risk. Using k-anonymity to estimate re-identification risk can also result in an overestimation of the risk and in turn in unnecessary suppression of information contained in the database, thereby degrading the quality and utility of the dataset. High levels of privacy for individuals should be guaranteed in all databases comprising sensitive healthcare data while maximizing data utility to allow for innovation, efficiency, and development in healthcare. Therefore, de-identification criteria should be construed on the principle that the risk of re-identification should be kept small enough in order to ensure the privacy of individuals is protected whilst not removing useful data unnecessarily. It would be desirable to provide a method for estimating a risk of re-identification of a database which is not overly stringent and which takes into account the absolute risk of re-identification of a record.

“Lastly, database owners may need to document statistical analysis and rationale for any residual disclosure risk to prove compliance to multiple regulatory bodies, for example if similar data is used in different countries or made available to different types of recipients or for different applications. Definitions of deidentification and anonymization may differ for different industries, countries, or regions meaning a company must perform determinations to align with the differing definitions. Database owners need to perform an expert determination each time there are changes in the data or the surrounding environment. These determinations can take time, and the multiple iterations can further contribute to times and delays. Between assessments, a database which is being regularly updated may have reached an unacceptable level of risk.”
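
For readers unfamiliar with the k-anonymity measure described in the background above, the following is a minimal Python sketch of the violation calculation: the share of records whose k-value falls below a threshold (5 in the quoted text), compared against the 1% criterion. The function name, the field names (zip3, birth_year) and the sample records are hypothetical illustrations and do not come from the patent.

```python
from collections import Counter


def k_anonymity_violation(records, quasi_identifiers, k_threshold=5):
    """Return the fraction of records whose k-value is below k_threshold.

    A record's k-value is taken here as the number of records in the dataset
    (itself included) that share its combination of quasi-identifier values.
    """
    if not records:
        return 0.0
    # One key per record: the tuple of its quasi-identifier values.
    keys = [tuple(r.get(q) for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    violating = sum(1 for key in keys if group_sizes[key] < k_threshold)
    return violating / len(records)


# Hypothetical toy data: six records share one combination, two share another.
records = (
    [{"zip3": "221", "birth_year": 1980}] * 6
    + [{"zip3": "221", "birth_year": 1975}] * 2
)
violation = k_anonymity_violation(records, ["zip3", "birth_year"])

# Criterion quoted in the background: fewer than 1% of records with k below 5.
meets_criterion = violation < 0.01
```

In this toy dataset the two records in the smaller group have a k-value of 2, so a quarter of the records violate the threshold and the 1% criterion is not met. The sketch also shows the limitation the inventors point to: each record's k-value depends entirely on which other records happen to be in the dataset.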

In addition to the background information obtained for this patent, NewsRx journalists also obtained the inventors’ summary information for this patent: “It is an object of the disclosure to address one or more of the above-mentioned limitations.

“According to a first aspect of the disclosure there is provided: a method for monitoring a risk of re-identification for a dataset de-identified from a source database containing information identifiable to individuals, the method comprising: providing a user interface (UI) configured to receive as input the dataset and updates to said dataset; providing as input to the UI the dataset; estimating a risk of re-identification for the dataset or a subset of the database; providing as input to the UI the updates to said dataset; regularly monitoring whether the risk of re-identification for at least one of the updated dataset, the subset of the database and the updates is below a predetermined dataset risk threshold; and if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, notifying the user.

“Optionally, the database comprises a plurality of database records and a plurality of database fields, wherein each database record has a plurality of associated field values, each associated field value being related to a database field.

“Optionally, the dataset comprises a plurality of dataset records and a plurality of dataset fields, wherein each dataset record has a plurality of associated field values, each associated field value being related to a dataset field.

“Optionally, the plurality of dataset records is a subset of the database records and the plurality of dataset fields is a subset of the database fields.”
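
The monitoring flow outlined in the summary above (receive a dataset, receive updates, re-estimate the risk, and notify the user once a dataset risk threshold is reached or exceeded) can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the patented implementation: the RiskMonitor class, the toy_risk estimator and the "rare" flag are all hypothetical, and the risk estimator is passed in as a plain function because the summary does not fix one.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RiskMonitor:
    """Hypothetical monitor: re-checks dataset risk after every update."""
    estimate_risk: Callable[[List[dict]], float]  # assumed pluggable estimator
    risk_threshold: float                         # predetermined dataset risk threshold
    notify: Callable[[str], None]                 # how the user is alerted
    dataset: List[dict] = field(default_factory=list)

    def apply_update(self, new_records):
        """Add records to the dataset, then re-check the risk."""
        self.dataset.extend(new_records)
        return self.check()

    def check(self):
        """Estimate the risk; notify if it has reached or exceeded the threshold."""
        risk = self.estimate_risk(self.dataset)
        if risk >= self.risk_threshold:
            self.notify(f"risk {risk:.3f} reached threshold {self.risk_threshold:.3f}")
        return risk


def toy_risk(dataset):
    """Toy estimator (assumption): share of records flagged as rare."""
    if not dataset:
        return 0.0
    return sum(1 for r in dataset if r["rare"]) / len(dataset)


messages = []
monitor = RiskMonitor(toy_risk, risk_threshold=0.25, notify=messages.append)
first = monitor.apply_update([{"rare": False}] * 9 + [{"rare": True}])
second = monitor.apply_update([{"rare": True}] * 5)
```

Here the first batch leaves the toy risk under the threshold and no message is recorded, while the second batch pushes it over and triggers a single notification. A scheduled re-check, as mentioned later in the claims, could call check() on a timer instead of only on updates.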

The claims supplied by the inventors are:

“1. A computer implemented method for monitoring a risk of re-identification for a dataset de-identified from a source database containing information identifiable to individuals, the method comprising: providing a user interface (UI) configured to receive as input the dataset and updates to said dataset; providing as input to the UI the dataset; providing a computer device configured by a first computer program to estimate a risk of re-identification for the dataset or a subset of the database by automatically estimating an individual risk of re-identification for each record and determining how many records have an individual risk of re-identification above a pre-specified individual risk threshold; providing as input to the UI the updates to said dataset; wherein said computer device is further configured by a second computer program to regularly monitor whether the risk of re-identification for at least one of the updated dataset or the subset of the database or the updates is below a predetermined dataset risk threshold, and if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, to automatically estimate an individual risk of re-identification that comprises: selecting a subset of fields, for each field in the subset computing a population field statistical distribution; computing a combined statistical distribution of the subset of fields from the population field statistical distributions; and computing the likely number of members of the source population that have the same field value as the record for each field in the subset of fields, from the combined statistical distribution, and if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, identifying the dataset as non-compliant; iteratively modifying the non-compliant dataset until the risk of the re-identification for the modified dataset is below the predetermined dataset risk threshold in order to generate a compliant dataset and providing the compliant dataset in the user interface.

“2. The method of claim 1, wherein the database comprises a plurality of database records and a plurality of database fields, wherein each database record has a plurality of associated field values, each associated field value being related to a database field; the dataset comprises a plurality of dataset records and a plurality of dataset fields, wherein each dataset record has a plurality of associated field values, each associated field value being related to a dataset field; and the plurality of dataset records is a subset of the database records and the plurality of dataset fields is a subset of the database fields.

“3. The method of claim 2, wherein each database record corresponds to an individual of a source population.

“4. The method of claim 1, wherein the fields in the subset of fields are selected such that all fields in the subset of fields are quasi-identifiers.

“5. The method of claim 1, wherein computing the population field statistical distribution comprises: selecting the source database or a second database external to the source database which relates to the source population; and deriving the population field statistical distribution from the selected database.

“6. The method of claim 1, wherein the method further comprises computing an internal statistical distribution of the dataset; and regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below the predetermined dataset risk threshold comprises: regularly monitoring the internal statistical distribution of the dataset; and if the internal statistical distribution varies beyond a predetermined accepted variation, re-computing the risk of re-identification for the dataset.

“7. The method of claim 1, wherein providing updates to the initial dataset comprises providing a set of de-identified records to be added to the dataset.

“8. The method of claim 7, wherein regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined dataset risk threshold comprises computing the risk of re-identification for the set of database records; and if the risk of re-identification for the set of database records is greater than the risk of re-identification for the dataset, re-computing the risk of re-identification for the updated dataset.

“9. The method of claim 7, wherein regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined dataset risk threshold comprises: each time a set of database records is added to the dataset, computing an internal statistical distribution of the set of database records; if the internal statistical distribution of the set of database records differs from the internal statistical distribution of the dataset beyond the predetermined accepted variation, re-computing the risk of re-identification for the updated dataset.

“10. The method of claim 1, wherein estimating the risk of re-identification comprises: for each source database, providing a list of risk-determination rules; and automatically computing the risk of re-identification of the database based on the list of risk-determination rules.

“11. The method of claim 1, wherein the method further comprises providing as input to the user interface a set of modification rules based on the source database; and the non-compliant dataset is modified according to the modification rules.

“12. The method of claim 11, wherein generating a compliant dataset comprises identifying fields in the dataset which are contributing to the risk of re-identification and removing or modifying one or more of said fields.

“13. The method of claim 1, wherein regularly monitoring whether the risk of re-identification for at least one of the updated dataset or the updates is below a predetermined threshold comprises estimating the risk of re-identification for the updated dataset at scheduled intervals.

“14. The method of claim 1, wherein the method further comprises providing in the GUI an automatically generated outcome report of the monitoring of the risk of re-identification.

“15. A system for monitoring a risk of re-identification for a dataset de-identified from a source database containing information identifiable to individuals, the system comprising: a user interface (GUI) configured to receive as input the dataset and updates to said dataset; a memory configured to store the dataset; and a risk monitoring computer device configured to regularly monitor whether the risk of re-identification for at least one of the updated dataset or a subset of the database or the updates is below a predetermined dataset risk threshold, by automatically estimating an individual risk of re-identification for each record and determining how many records have an individual risk of re-identification above a pre-specified individual risk threshold; wherein the system is configured so that if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, automatically notify the user; wherein, if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, the user is automatically notified; wherein estimating the individual risk of re-identification for each record comprises: selecting a subset of fields, for each field in the subset computing a population field statistical distribution; computing a combined statistical distribution of the subset of fields from the population field statistical distributions; and computing the likely number of members of the source population that have the same field value as the record for each field in the subset of fields, from the combined statistical distribution; and if the risk of re-identification has reached or exceeded the predetermined dataset risk threshold, identifying the dataset as non-compliant; iteratively modifying the non-compliant dataset until the risk of the re-identification for the modified dataset is below the predetermined dataset risk threshold in order to generate a compliant dataset and providing the compliant dataset in the user interface.

“16. The system of claim 15, wherein the user interface comprises a graphical user interface (GUI); the updates to the database comprise one or more of: removing one or more records from the dataset records; adding one or more records to the dataset records; and removing, adding or modifying one or more dataset fields; and the graphical user interface comprises graphical elements to allow the user to modify one or more dataset fields; and the graphical user interface is configured to show the evolution of the risk of re-identification for the dataset in real-time.”
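
The individual-risk estimation recited in claims 1 and 15 (per-field population distributions, a combined distribution, and the likely number of population members sharing a record's field values) might be sketched in Python as below. Several choices here are assumptions made only for illustration and are not stated in the claims: the fields are treated as statistically independent so that the combined probability is the product of the per-field probabilities, the individual risk is taken as the reciprocal of the expected number of matches (capped at 1), and all names, distributions and thresholds are hypothetical.

```python
from math import prod


def expected_population_matches(record, field_distributions, population_size):
    """Likely number of population members sharing the record's field values.

    Assumption: fields are independent, so the combined probability is the
    product of the per-field population probabilities.
    """
    combined_probability = prod(
        field_distributions[f].get(record[f], 0.0) for f in field_distributions
    )
    return population_size * combined_probability


def individual_risk(record, field_distributions, population_size):
    """Assumed risk model: 1 / expected matches, capped at 1.0.

    A value absent from a population distribution gives zero expected matches,
    which this sketch treats as maximum risk.
    """
    matches = expected_population_matches(record, field_distributions, population_size)
    return 1.0 / max(matches, 1.0)


def dataset_risk(records, field_distributions, population_size, individual_threshold):
    """Fraction of records whose individual risk is above the individual threshold."""
    if not records:
        return 0.0
    high = sum(
        1 for r in records
        if individual_risk(r, field_distributions, population_size) > individual_threshold
    )
    return high / len(records)


# Hypothetical population statistics for two quasi-identifier fields.
distributions = {
    "zip3": {"221": 0.02, "220": 0.01},
    "birth_year": {1980: 0.015, 1975: 0.0005},
}
common = {"zip3": "221", "birth_year": 1980}
rare = {"zip3": "220", "birth_year": 1975}

common_risk = individual_risk(common, distributions, population_size=100_000)
rare_risk = individual_risk(rare, distributions, population_size=100_000)
overall = dataset_risk([common, rare], distributions, 100_000, individual_threshold=0.2)
```

With these made-up numbers, the common record is expected to match about 30 people in the population and gets a low risk, while the rare record is expected to match fewer than one person and gets the maximum risk. Unlike the k-anonymity measure, each record's risk here depends on population statistics rather than on the other records in the dataset, which is the "absolute risk" idea the background section asks for.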

For the URL and more information on this patent, see: Bayless, Paul. Methods and systems for monitoring a risk of re-identification in a de-identified database. U.S. Patent Number 11741262, filed July 19, 2021, and published online on August 29, 2023. Patent URL (for desktop use only): https://ppubs.uspto.gov/pubwebapp/external.html?q=(11741262)&db=USPAT&type=ids
