Patent Issued for Coordinated de-identification of a dataset across a network (USPTO 11093645)

Insurance Daily News

2021 SEP 08 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- A patent by the inventor Gkoulalas-Divanis, Aris (Waltham, MA, US), filed on June 20, 2019, was published online on August 17, 2021, according to news reporting originating from Alexandria, Virginia, by NewsRx correspondents.

Patent number 11093645 is assigned to International Business Machines Corporation (Armonk, New York, United States).

The following quote was obtained by the news editors from the background information supplied by the inventors:

“1. Technical Field

“Present invention embodiments relate to methods, systems and computer program products for receiving at a network device a dataset with masked direct identifiers from a client’s site and performing further data de-identification of the dataset to protect indirect (or quasi) identifiers and sensitive attributes. In particular, a server receives from a customer site a person-specific dataset with masked direct identifiers, discovers indirect/quasi identifiers and sensitive attributes within the dataset, and performs further compatible data de-identification techniques to protect the indirect identifiers and the sensitive attributes of the dataset.

“2. Discussion of the Related Art

“Data anonymization is a data sanitization process for protecting personally identifiable information in datasets, including both direct identifiers that can directly identify individuals such as, for example, full names of individuals, social security numbers, customer numbers, patient identifiers, phone numbers, credit card numbers, etc., as well as indirect identifiers, which are non-direct identifier attribute values in a dataset, a combination of which may be unique for some individuals and could be used to re-identify these individuals. For example, a five-digit zip code of a home address, a gender, and a date of birth of individuals are well-known quasi-identifiers because a combination of their values has been shown to be unique for a large number of United States residents.

“A third type of identifier in a dataset is sensitive attributes, which are non-direct, non-quasi-identifier attributes having values that are sensitive and should therefore not be linked to specific individuals. As an example, individuals may not want to be linked with disease, salary, or sensitive location information in a dataset (e.g., church, hospital, etc.). Preventing linkage of individuals to their sensitive attribute values blocks sensitive information disclosure attacks and goes beyond protection against subject re-identification. However, preventing sensitive information disclosure is usually part of data de-identification efforts.

“Personal data that have been “sufficiently anonymized” such as, for example, anonymized data that satisfies the Health Insurance Portability and Accountability Act (HIPAA) requirements in the United States or the General Data Protection Regulation (GDPR) in Europe, can be used for secondary purposes, such as for supporting various types of data analyses.

“Data owners are hesitant to allow highly sensitive personal data such as, for example, customers’ transactions, purchase records, healthcare information, etc., to leave their premises (even in encrypted form using state-of-the-art encryption algorithms) for uploading to a cloud platform for de-identification and additional processing to support business use cases, analytics and other uses. Before allowing highly sensitive personal data to leave their premises, data owners are increasingly using existing in-house solutions for performing data de-identification, which are limited to the support of data masking algorithms and in most cases are unable to adequately protect data to meet legal requirements.”
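To illustrate the gap the background describes, the following is a minimal sketch, not the patented method. It masks a direct identifier with a salted hash and then counts how many records are still unique on the zip code, gender, and date-of-birth combination. The record layout, the `mask_direct_identifier` helper, and the salt value are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import Counter

# Hypothetical records; the field names are assumptions for illustration only.
RECORDS = [
    {"name": "Alice Smith", "zip": "02451", "gender": "F", "dob": "1980-03-14"},
    {"name": "Bob Jones", "zip": "02451", "gender": "M", "dob": "1975-11-02"},
    {"name": "Carol White", "zip": "02451", "gender": "F", "dob": "1980-03-14"},
    {"name": "Dan Brown", "zip": "10504", "gender": "M", "dob": "1990-07-21"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def mask_direct_identifier(value: str, salt: str = "demo-salt") -> str:
    """Replace a direct identifier with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def mask_records(records):
    """Return copies of the records with the 'name' field masked."""
    return [{**r, "name": mask_direct_identifier(r["name"])} for r in records]


def count_unique_on(records, attributes):
    """Count records whose value combination on `attributes` occurs exactly once."""
    combos = Counter(tuple(r[a] for a in attributes) for r in records)
    return sum(1 for r in records if combos[tuple(r[a] for a in attributes)] == 1)


masked = mask_records(RECORDS)
unique_count = count_unique_on(masked, QUASI_IDENTIFIERS)
print(f"{unique_count} of {len(masked)} masked records are unique on {QUASI_IDENTIFIERS}")
```

In this toy data, masking the name does nothing for the two records whose quasi-identifier combination is unique, which is the kind of residual risk that masking-only tools leave unaddressed.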

In addition to the background information obtained for this patent, NewsRx journalists also obtained the inventors’ summary information for this patent: “According to a first aspect of embodiments of the invention, a method of de-identifying a dataset is provided. A network device receives information from a client device, wherein the information includes a list of at least one group of techniques selected from groups consisting of a group of data masking techniques and a group of pseudonymization techniques, associated configuration options that are supported by the client device and a description of a dataset to be de-identified. The network device determines a first technique from the at least one group of techniques and associated configuration options supported by the client device and the network device. The network device receives a dataset from the client device, wherein the dataset is produced at the client device by applying the determined first technique and the associated configuration options to corresponding attributes. A de-identification technique is applied to the dataset at the network device to produce a resulting set of de-identified data, wherein the data de-identification technique is coordinated with the first technique and configuration options to further de-identify the dataset.

“According to a second aspect of embodiments of the invention, a system for de-identifying data of a dataset is provided. The system includes at least one processor and at least one memory having instructions embodied therein such that the at least one processor is configured to perform: receiving information from a client device, wherein the information includes a list of at least one group of techniques selected from groups consisting of a group of data masking techniques and a group of data pseudonymization techniques, and associated configuration options that are supported by the client device and a description of a dataset to be de-identified; determining a first technique from the at least one group of techniques and configuration options that are supported by the client device and the system; receiving a dataset from the client device, wherein the dataset is produced at the client device by applying the determined first technique and the associated configuration options to corresponding data attributes; and applying a de-identification technique to the dataset to produce a resulting set of de-identified data, wherein the de-identification technique is coordinated with the first technique and the associated configuration options to de-identify the masked dataset.

“According to a third aspect of embodiments of the invention, a computer program product including at least one computer readable storage medium having computer readable program code embodied therewith for execution on at least one processor is provided. The computer readable program code is configured to be executed by the at least one processor to perform: receiving information from a client device, wherein the information includes a list of at least one group of techniques selected from groups consisting of a group of data masking techniques and a group of data pseudonymization techniques, and associated configuration options that are supported by the client device and a description of a dataset to be de-identified; determining a first technique from the at least one group of techniques, associated configuration options supported by the client device and a system including the at least one processor; receiving a dataset from the client device, wherein the dataset is produced at the client device by applying the determined first technique and the associated configuration options to corresponding data attributes; and applying a de-identification technique to the dataset to produce a resulting set of de-identified data, wherein the de-identification technique is coordinated with the first technique and the configuration options to de-identify the dataset.”
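The technique-selection step in the summary can be pictured as a capability intersection. The sketch below is a hypothetical illustration only: it assumes each side advertises its techniques as a mapping from technique name to a set of supported configuration options. The technique names, option names, and the `negotiate` function are invented here and do not come from the patent.

```python
from typing import Dict, Set

Capabilities = Dict[str, Set[str]]

# Hypothetical capability lists; none of these names appear in the patent.
CLIENT_CAPS: Capabilities = {
    "masking.redact": {"full", "partial"},
    "masking.hash": {"sha256"},
    "pseudonymization.token": {"format_preserving", "random"},
}
SERVER_CAPS: Capabilities = {
    "masking.hash": {"sha256", "sha512"},
    "pseudonymization.token": {"format_preserving"},
    "pseudonymization.lookup": {"table"},
}


def negotiate(client: Capabilities, server: Capabilities) -> Capabilities:
    """Keep techniques both sides support, limited to the options they share."""
    agreed: Capabilities = {}
    for technique in sorted(client.keys() & server.keys()):
        shared_options = client[technique] & server[technique]
        if shared_options:  # drop techniques with no common configuration
            agreed[technique] = shared_options
    return agreed


AGREED = negotiate(CLIENT_CAPS, SERVER_CAPS)
for technique, options in AGREED.items():
    print(technique, sorted(options))
```

The sketch covers only mutual support. The claims additionally require that the chosen techniques be compatible with the de-identification the network device applies afterward, which is not modeled here.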

The claims supplied by the inventors are:

“1. A method of de-identifying a dataset comprising: receiving information from a client device at a network device, wherein the information includes a list of at least one group of techniques supported by the client device and selected from groups consisting of a group of data masking techniques and a group of data pseudonymization techniques, configuration options associated with the at least one group of techniques supported by the client device, and a first data dictionary of a dataset to be de-identified, the first data dictionary including attribute names, attribute types and associated metadata including attribute descriptions of attributes of the dataset; mapping at the network device attributes of the first data dictionary to attributes of a second data dictionary included in the network device by matching attributes of the first data dictionary with attributes of the second data dictionary based on corresponding attribute names and attribute descriptions, the second data dictionary being different from the first data dictionary and including attribute names, attribute types and associated metadata including attribute descriptions of attributes that appear in each ingested data source, the second data dictionary further including a characterization of all direct identifiers from the each ingested data source leading to recognition of direct identifiers of the dataset; determining at the network device first techniques and associated configuration options mutually supported by the client device and the network device based on the at least one group of techniques, wherein the determined first techniques are compatible with de-identification techniques of the network device and selected from a group of data masking techniques and data pseudonymization techniques; sending the determined first techniques to the client device; receiving at the network device the dataset from the client device, wherein the dataset is produced at the client device by applying one or more of the determined first techniques and the associated configuration options to corresponding attributes; and applying a de-identification technique to the dataset at the network device to produce a resulting set of de-identified data, wherein the de-identification technique is compatible with the applied one or more first techniques and the associated configuration options to de-identify the dataset.

“2. The method of claim 1, wherein the network device resides within a cloud computing environment.

“3. The method of claim 1, wherein the attributes of the first data dictionary include one or more direct identifiers.

“4. The method of claim 3, wherein the applying the de-identification technique further comprises: identifying one or more sets of quasi-identifiers within the dataset; and applying the de-identification technique to the identified one or more sets of quasi-identifiers to produce the resulting set of de-identified data.

“5. The method of claim 4, wherein the identifying the one or more sets of quasi-identifiers comprises: analyzing values of attributes of each record to find unique combinations of the values; and identifying attributes of the unique combinations of the values as the one or more sets of quasi-identifiers.

“6. The method of claim 1, further comprising: applying further protection to the resulting set of de-identified data at the network device to improve a privacy level by extending the one or more first techniques applied at the client device using compatible techniques supported at the network device; identifying at least one sensitive attribute within the dataset; and applying the de-identification technique to the at least one identified sensitive attribute to produce the resulting set of de-identified data.”
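The quasi-identifier discovery described in claim 5 can be sketched as a search over attribute combinations. This is one naive reading, not the patent's implementation: it enumerates attribute subsets up to a size limit and reports the minimal subsets on which at least one record has a unique combination of values. The `find_quasi_identifier_sets` function, the `max_size` limit, and the sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows; attribute names are assumptions for illustration only.
ROWS = [
    {"zip": "02451", "gender": "F", "birth_year": 1980},
    {"zip": "02451", "gender": "F", "birth_year": 1975},
    {"zip": "02451", "gender": "M", "birth_year": 1980},
    {"zip": "02451", "gender": "M", "birth_year": 1975},
]


def has_unique_combination(rows, attributes):
    """True if some value combination on `attributes` occurs in exactly one row."""
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    return any(count == 1 for count in counts.values())


def find_quasi_identifier_sets(rows, attributes, max_size=3):
    """Return minimal attribute sets that single out at least one record."""
    found = []
    for size in range(1, max_size + 1):
        for candidate in combinations(attributes, size):
            # Skip supersets of sets already reported so only minimal sets remain.
            if any(set(previous) <= set(candidate) for previous in found):
                continue
            if has_unique_combination(rows, candidate):
                found.append(candidate)
    return found


QI_SETS = find_quasi_identifier_sets(ROWS, ["zip", "gender", "birth_year"])
print(QI_SETS)
```

No single attribute singles out a record in this sample, but gender and birth year together do. The exhaustive search grows combinatorially with the number of attributes, so a practical system would likely need pruning or sampling; the claims do not specify how the analysis is carried out.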

For the URL and more information on this patent, see: Gkoulalas-Divanis, Aris. Coordinated de-identification of a dataset across a network. U.S. Patent Number 11093645, filed June 20, 2019, and published online on August 17, 2021. Patent URL: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=11093645.PN.&OS=PN/11093645RS=PN/11093645

(Our reports deliver fact-based news of research and discoveries from around the world.)
