Patent Issued for Adaptive statistical data de-identification based on evolving data streams (USPTO 11151113): International Business Machines Corporation
2021 NOV 05 (NewsRx) -- Patent number 11151113 is assigned to International Business Machines Corporation.
The following quote was obtained by the news editors from the background information supplied by the inventors:
“1. Technical Field
“Present invention embodiments relate to data access, and more specifically, to dynamically adapting data de-identification in data streams.
“2. Discussion of the Related Art
“Data de-identification is a process of transforming values in datasets to protect personally identifiable information, where there is no reasonable basis to believe that the information remaining in the dataset can be used to re-identify individuals.
“Under the US Health Insurance Portability and Accountability Act of 1996 (HIPAA), acceptable ways for de-identifying a dataset pertaining to personal health information include using a Safe Harbor list, and using expert determination. Each of these approaches relies mainly on data de-identification rulesets to offer data protection. Other legal privacy frameworks adopt a similar approach by considering the application of a data de-identification ruleset to original data values of a dataset in order to protect personal data.
“The data de-identification rulesets are usually: constructed by following population density/population uniqueness criteria; based on distribution of the data (e.g., data involving citizens of a certain region); and based on a possibility of successful triangulation attacks against publicly available datasets (e.g., census data, yellow pages, deaths reported in obituaries, open data, etc.).
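To make the notion of a static de-identification ruleset concrete, the following is a minimal, hypothetical sketch (not code from the patent): each rule maps an attribute to a transformation that suppresses identifying detail. The function names and thresholds are illustrative, loosely modeled on HIPAA Safe Harbor conventions such as truncating ZIP codes and aggregating ages of 90 and over.

```python
def truncate_zip(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code."""
    return zip_code[:3] + "**"

def generalize_age(age: int) -> str:
    """Report exact ages only below 90; aggregate the rest."""
    return str(age) if age < 90 else "90+"

# A ruleset is simply a mapping from attribute name to transformation.
RULESET = {
    "zip": truncate_zip,
    "age": generalize_age,
}

def deidentify(record: dict) -> dict:
    """Apply every applicable rule in the ruleset to one record;
    attributes without a rule pass through unchanged."""
    return {k: RULESET.get(k, lambda v: v)(v) for k, v in record.items()}

record = {"zip": "10504", "age": 93, "diagnosis": "flu"}
print(deidentify(record))  # {'zip': '105**', 'age': '90+', 'diagnosis': 'flu'}
```

The patent's point is that a fixed mapping like this encodes assumptions (population density, external data availability) that can silently expire.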
“Each of these criteria that are used to derive the data de-identification rules for protecting a dataset are based on information that may change at any point in time, thereby leaving a previously sufficiently de-identified dataset (given a particular data de-identification ruleset) susceptible to new re-identification and sensitive information disclosure attacks.
“A reason that data de-identification rules may become insufficient to protect individuals’ privacy is that the data de-identification rules are static. In other words, the data de-identification rules are derived by privacy experts based on their knowledge of the then (i.e., at the point they examined the dataset) publicly available information, the then data contained in the dataset, and the then validity of various assumptions pertaining to the perceived power of attackers (i.e., considered background knowledge that attackers may have in order to breach data privacy). All of these assumptions, reasonable at the time the domain experts evaluated the privacy level of the dataset after applying their prescribed de-identification rules, can be invalidated at any later point in time, rendering the data vulnerable to new privacy attacks.
“Although expert determinations are accompanied by an expiration date (usually 2-3 years), the expiration date is calculated in terms of years and there is no guarantee that the prescribed data de-identification rules will not become outdated based on data and knowledge changing in the interim. Since many open data initiatives exist worldwide, this may lead to a plethora of datasets becoming available online and existing datasets being updated with newer information. Thus, it becomes increasingly easier for attackers to breach the privacy offered by static data de-identification rules in person-specific datasets.”
In addition to the background information obtained for this patent, NewsRx journalists also obtained the inventors’ summary information for this patent: “According to one embodiment of the present invention, a system dynamically changes a data de-identification ruleset applied to a dataset for de-identifying data and comprises at least one processor. The system periodically monitors a dataset derived from data that is de-identified according to a data de-identification ruleset under a set of conditions. The set of conditions for the data de-identification ruleset is evaluated with respect to the monitored data to determine applicability of the data de-identification. One or more rules of the data de-identification ruleset are dynamically changed in response to the evaluation indicating one or more conditions of the set of conditions for the initial data de-identification ruleset are no longer satisfied. An embodiment of the present invention may further dynamically change one or more rules of the data de-identification ruleset based on machine learning. Embodiments of the present invention may further include a method and computer program product for dynamically changing a ruleset applied to a dataset for de-identifying data in substantially the same manner described above.
“This provides a mechanism that continuously (e.g., in real time) or periodically evaluates the validity of assumptions made by a statistical expert, and adapts the data de-identification rules as necessary in order to maintain a high level of privacy protection. The data de-identification may be adapted based on machine learning to provide a cognitive and intelligent adaptation of the data de-identification. In other words, the mechanism maintains a re-identification risk below an acceptable threshold, usually based on the applicable privacy requirements and legal frameworks (e.g., HIPAA Safe Harbor, etc.).
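The monitor-evaluate-adapt cycle described above can be sketched schematically as follows. This is an illustrative reading of the summary, not the patented implementation; the condition function and risk threshold are hypothetical placeholders.

```python
from typing import Callable

# A condition returns True while the ruleset's assumptions remain valid.
Condition = Callable[[list], bool]

def evaluate_conditions(dataset: list, conditions: list) -> bool:
    """A ruleset stays applicable only while every condition holds."""
    return all(cond(dataset) for cond in conditions)

def monitoring_cycle(dataset, ruleset, conditions, adapt):
    """One periodic pass: re-check the assumptions, adapt rules if needed."""
    if evaluate_conditions(dataset, conditions):
        return ruleset               # assumptions still hold; keep the rules
    return adapt(ruleset, dataset)   # strengthen or replace the rules

# Example: one hypothetical condition bounding re-identification risk.
RISK_THRESHOLD = 0.05
def risk_below_threshold(dataset):
    risky = sum(1 for r in dataset if r.get("unique"))
    return (risky / max(len(dataset), 1)) < RISK_THRESHOLD

stream = [{"unique": False}] * 99 + [{"unique": True}]
print(evaluate_conditions(stream, [risk_below_threshold]))  # True: 1% < 5%
```

In practice the `adapt` step could itself be driven by machine learning, as the summary suggests, rather than by a fixed fallback.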
“An embodiment of the present invention may further dynamically change one or more rules of the data de-identification ruleset by replacing the data de-identification ruleset with a new data de-identification ruleset selected from among a group of data de-identification rulesets based on the evaluation, wherein conditions of the new data de-identification ruleset are satisfied. The set of applicable data de-identification rulesets and corresponding conditions for their validity may be prescribed by the domain/statistical expert as part of the original expert determination performed on the dataset. In this case, the domain/statistical expert provides a determination that corresponds to a set of data de-identification rulesets and the respective conditions for their validity, thereby capturing a variety of scenarios related to changes that may happen in the future (e.g., changes to the data distribution, changes to external datasets, etc.), causing changes to the determination that he or she would provide, and the corresponding mitigation strategies (i.e., data de-identification rulesets) for offering sufficient privacy protection of the data. This ensures that the data de-identification can be dynamically adapted based on the monitored data changes (e.g., in real time) in a manner that preserves privacy as the dataset changes over time.
“An embodiment of the present invention may further replace the data de-identification ruleset with the new data de-identification ruleset in response to a change in a threshold of a data de-identification rule in the data de-identification ruleset. This enables an appropriate data de-identification ruleset to be dynamically selected based on monitored data changes (e.g., in real time) to maintain privacy in view of dataset changes.
“An embodiment of the present invention may prevent release of the de-identified data in response to the evaluation indicating one or more conditions for each data de-identification ruleset of the group are not satisfied. This prevents release of potentially vulnerable data when data de-identification processes are insufficient to protect the changed data. The processing of de-identification may further be terminated (e.g., until a proper data de-identification ruleset can be identified, etc.) in order to preserve computing resources and efficiency.”
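The fallback strategy in the last two paragraphs, selecting a pre-approved ruleset whose validity conditions still hold, and withholding the data when none qualifies, can be sketched as follows. All names and the toy conditions are hypothetical illustrations, not the patent's implementation.

```python
class ReleaseBlocked(Exception):
    """Raised when no ruleset in the group can safely protect the data."""

def select_ruleset(dataset, ruleset_group):
    """ruleset_group: list of (ruleset, conditions) pairs, in order of
    preference as prescribed by the expert determination. Returns the
    first ruleset whose conditions all hold; otherwise blocks release."""
    for ruleset, conditions in ruleset_group:
        if all(cond(dataset) for cond in conditions):
            return ruleset
    raise ReleaseBlocked("no applicable ruleset; withholding de-identified data")

# Illustrative conditions: each ruleset valid only below a record-count bound.
group = [
    ({"zip": "truncate"}, [lambda d: len(d) < 10]),
    ({"zip": "drop"},     [lambda d: len(d) < 100]),
]
print(select_ruleset([{}] * 50, group))  # {'zip': 'drop'}
```

Raising an exception here mirrors the embodiment's behavior of terminating de-identification processing until a proper ruleset can be identified.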
The claims supplied by the inventors are:
“1. A method of dynamically changing a data de-identification ruleset applied to a dataset for de-identifying data to transform values of the dataset to protect sensitive information comprising: periodically monitoring a dataset including a plurality of attributes via a processor, wherein the dataset changes over time and is derived from data that is de-identified according to a data de-identification ruleset under a set of conditions, wherein the data de-identification ruleset indicates one or more manners of de-identifying data to suppress a first amount of information within each of one or more corresponding attributes of the dataset to protect personally identifiable information, and wherein the set of conditions includes a quantity of successful linkages satisfying a threshold and the successful linkages are between de-identified records of the monitored dataset and records from one or more external datasets with information re-identifying identities of individuals of the de-identified records; evaluating, via the processor, the set of conditions for the data de-identification ruleset with respect to the monitored dataset and determining that one or more conditions of the set of conditions for the data de-identification ruleset are no longer satisfied by the monitored dataset; and dynamically changing, via the processor, one or more rules of the data de-identification ruleset to apply at least one different manner of de-identifying data to at least one corresponding attribute of the dataset to suppress a second amount of information within each of the at least one corresponding attribute greater than the first amount of information and provide de-identified data satisfying each condition of the set of conditions, wherein dynamically changing one or more rules of the data de-identification ruleset further comprises: replacing the data de-identification ruleset with a new data de-identification ruleset from among a plurality of data de-identification rulesets in response to a change in a threshold of a data de-identification rule in the data de-identification ruleset, wherein the new data de-identification ruleset includes a modified version of the data de-identification ruleset; and preventing release of the monitored dataset when one or more conditions for each of the plurality of data de-identification rulesets are not satisfied.
“2. The method of claim 1, wherein evaluating the set of conditions comprises: performing triangulation attacks of the monitored dataset against the one or more external datasets to indicate a probability of successful re-identification; and one or more from a group of: retrieving statistics from publicly available and other external datasets to derive population density and population uniqueness criteria; evaluating correlations between attributes of the monitored dataset based on the monitored dataset or knowledge from external datasets that indicate indirect re-identification; and determining changes to data distribution that affect the data de-identification ruleset.
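Claim 1's linkage condition and claim 2's triangulation check can be illustrated with a simplified sketch: count de-identified records that match exactly one external record on shared quasi-identifiers (a "successful linkage") and test that count against a threshold. Real triangulation attacks are far more involved; the names, threshold, and matching rule here are illustrative assumptions.

```python
from collections import Counter

LINKAGE_THRESHOLD = 0  # hypothetical: no successful linkages tolerated

def successful_linkages(deidentified, external, quasi_ids):
    """Count de-identified records matched by exactly one external record
    on the given quasi-identifier attributes."""
    key = lambda r: tuple(r.get(q) for q in quasi_ids)
    ext_counts = Counter(key(r) for r in external)
    # A linkage succeeds when exactly one external record matches: the
    # attacker can then pin a single identity onto the de-identified row.
    return sum(1 for r in deidentified if ext_counts.get(key(r)) == 1)

deid = [{"zip": "105**", "age": "90+"}, {"zip": "104**", "age": "42"}]
ext  = [{"zip": "104**", "age": "42", "name": "Alice"}]
n = successful_linkages(deid, ext, ["zip", "age"])
print(n, n <= LINKAGE_THRESHOLD)  # 1 False -> condition violated
```

When the count exceeds the threshold, the ruleset's conditions are no longer satisfied, which is exactly the trigger for the dynamic rule change in claim 1.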
“3. The method of claim 1, further comprising: generating a first notification of the change to the one or more rules of the data de-identification ruleset to notify a data owner, and a second notification of transit to another data de-identification ruleset whose conditions are currently satisfied.
“4. The method of claim 1, wherein dynamically changing one or more rules of the data de-identification ruleset further comprises: changing one or more conditions of the data de-identification ruleset to enable each of the conditions to be satisfied.
“5. The method of claim 1, wherein dynamically changing one or more rules of the data de-identification ruleset further comprises: dynamically changing one or more rules of the data de-identification ruleset based on machine learning.
“6. A system for dynamically changing a data de-identification ruleset applied to a dataset for de-identifying data to transform values of the dataset to protect sensitive information comprising: at least one processor configured to: periodically monitor a dataset including a plurality of attributes, wherein the dataset changes over time and is derived from data that is de-identified according to a data de-identification ruleset under a set of conditions, wherein the data de-identification ruleset indicates one or more manners of de-identifying data to suppress a first amount of information within each of one or more corresponding attributes of the dataset to protect personally identifiable information, and wherein the set of conditions includes a quantity of successful linkages satisfying a threshold and the successful linkages are between de-identified records of the monitored dataset and records from one or more external datasets with information re-identifying identities of individuals of the de-identified records; evaluate the set of conditions for the data de-identification ruleset with respect to the monitored dataset and determine that one or more conditions of the set of conditions for the data de-identification ruleset are no longer satisfied by the monitored dataset; and dynamically change one or more rules of the data de-identification ruleset to apply at least one different manner of de-identifying data to at least one corresponding attribute of the dataset to suppress a second amount of information within the at least one corresponding attribute greater than the first amount of information and provide de-identified data satisfying each condition of the set of conditions, wherein dynamically changing one or more rules of the data de-identification ruleset further comprises: replacing the data de-identification ruleset with a new data de-identification ruleset from among a plurality of data de-identification rulesets in response to a change in a threshold of a data de-identification rule in the data de-identification ruleset, wherein the new data de-identification ruleset includes a modified version of the data de-identification ruleset; and preventing release of the monitored dataset when one or more conditions for each of the plurality of data de-identification rulesets are not satisfied.
“7. The system of claim 6, wherein evaluating the set of conditions comprises: performing triangulation attacks of the monitored dataset against the one or more external datasets to indicate a probability of re-identification; and one or more from a group of: retrieving statistics from publicly available datasets to derive population density and population uniqueness criteria; evaluating correlations between attributes of the monitored dataset based on the monitored dataset or knowledge from external datasets that indicate indirect re-identification; and determining changes to data distribution that affect the data de-identification ruleset.
“8. The system of claim 6, wherein the at least one processor is further configured to: generate a first notification of the change to the one or more rules of the data de-identification ruleset to notify a data owner, and a second notification of transit to another data de-identification ruleset whose conditions are currently satisfied.
“9. The system of claim 6, wherein dynamically changing one or more rules of the data de-identification ruleset further comprises: changing one or more conditions of the data de-identification ruleset to enable each of the conditions to be satisfied.
“10. The system of claim 6, wherein dynamically changing one or more rules of the data de-identification ruleset further comprises: dynamically changing one or more rules of the data de-identification ruleset based on machine learning.”
There are additional claims. Please visit full patent to read further.
For the URL and more information on this patent, see: Gkoulalas-Divanis, Aris. Adaptive statistical data de-identification based on evolving data streams.
(Our reports deliver fact-based news of research and discoveries from around the world.)