Patent Application Titled “System And Method For Intermediary Mapping And De-Identification Of Non-Standard Datasets” Published Online (USPTO 20240037123): Patent Application

2024 FEB 21 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- According to news reporting originating from Washington, D.C., by NewsRx journalists, a patent application by the inventors Bradley, George Wesley (Ottawa, CA); Di Valentino, David Nicholas Maurice (Ottawa, CA); Mian, Muhammad Oneeb Rehman (Ottawa, CA), filed on October 10, 2023, was made available online on February 1, 2024.

No assignee has been named for this patent application.

Reporters obtained the following quote from the background information supplied by the inventors: “Many jurisdictions now have data privacy laws and regulations to protect against disclosure of personal information, and organizations also wish to protect against the disclosure of confidential information. De-identification is a process by which personal information relating to a data subject and/or an individual is protected by various means (e.g., transformation, suppression, masking, synthesis, etc.). The de-identification can be rules-based, for example, the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor method, whereby select eighteen identifying variables are hidden or transformed. Moreover, Expert Determination and Safe Harbor are HIPAA methods. Moreover, an amount of de-identification required to meet a standard of statistical disclosure control is also influenced overall by the context in which data is being shared or released; public releases have a higher bar than data releases to a secure platform (e.g., a portal in which data access and retention are controlled and regulated) or sharing access in a controlled data environment. Disclosure control includes protecting identity, attribute, and inferential disclosure.

“A statistical disclosure risk measurement requires a series of steps including appropriately modeling a dataset, introspecting a data to find various types of identifiable/sensitive information or variables, finding one or more connections between different variables, and assigning appropriate risk algorithm settings for each variable based on a previous research and expertise. After an initial disclosure risk of a dataset is determined, a de-identification is performed to bring the disclosure risk below the acceptable threshold. Any deviation in these steps may over- or under-estimate the disclosure risk leading to an over-de-identification (and thereby reduced data utility) or leaking of a personally identifiable information, respectively.

“Datasets containing personal or confidential information come in all shapes and sizes. A dataset is any collection of data, including but not limited to structured data, unstructured data (including documents), or any combination or derivation thereof. In certain fields, data sharing needs or regulatory submission requirements have driven an establishment of data standards. In clinical trials sphere, for example, the two most commonly used standards are a Study Data Tabulation Model (SDTM), and an Analysis Data Model (ADaM). Many companies still possess or operate with non-standard datasets, as the non-standard datasets are historical datasets or due to existence of internal standards/policies that results in datasets deviating from or being extension of established standards. Studies have been conducted to assess a level of compliance to the established standards SDTM and ADaM. However, it has been observed that majority of datasets significantly deviated from the standards in their native format (as shown in FIG. 1A). Even where the datasets were considered highly compliant (>85%), there is a room for process improvement to ensure an accuracy of risk estimation, consistency of data transformations or synthesis, and a reduced effort, expertise, and training requirements. Moreover, FIG. 1(A) also illustrates datasets with medium compliance (60-85%), and datasets with low compliance (<60%).

“De-identification of datasets, and specifically non-standard datasets, to share or release data for transparency, innovation, service improvement, and other secondary uses has high level of effort and expertise requirements to process (as shown in FIG. 1B). Currently, analysts must manually introspect data to correctly model the data, perform advanced Extract-Transform-Load (ETL) processes as necessary, find identifiable/sensitive information, and possess detailed know-how (expertise) regarding connections between the identifiable variables and appropriate settings for each variable, to accurately measure disclosure risk and de-identify the data. However, the latter steps are especially resource-intensive and can take up to 5-10 days for experienced analysts to complete and quality control, given the variability of incoming non-standard datasets and a sheer volume of variables (as in the case of clinical trial datasets) (as shown in the FIG. 1B). Moreover, some datasets can contain up to 100 tables and 10000+ variables with many interconnections and indirect relationships.

“Conventionally, there are some processes or tools of data harmonization used, whereby not a mapping but a full conversion or transformation to a standard format is performed. An example of the data harmonization in practice is to convert or transform various clinical data sources into SDTM datasets, such as part of a data life cycle while collecting data from data spokes into a data hub.

“Moreover, de-identification solutions currently allow generic characterization of datasets and elements of the datasets. An example is that available de-identification software solutions currently allow a user to associate variables in the data to very generic variable types, such as public quasi-identifier or direct identifier. The generic variable types can be combined with a feature that can load or apply settings for a dataset from either another project setup or from variable settings stored in, for example, an Excel format. This can be akin to a data catalog process whereby an exhaustive list of variables and variable settings are stored for future retrieval; if an incoming data variable matches particulars of a variable already existing in the catalog, it is handled appropriately.

“However, previous solutions, systems and methods that have been developed to handle non-standard datasets have multiple drawbacks such that it may require specialized ETL processes to estimate disclosure risk and derive a de-identification strategy, requires detailed assessments of a potential correlation between variables, and heavy manual effort to align dependencies between correlated or indirectly connected variables and to perform the overall de-identification process on datasets. The estimation of disclosure risk to derive a de-identification strategy may cause over-estimation of risk and over-de-identification, or under-estimation of risk and potentially leaking sensitive information. Moreover, this workflow, as shown in the FIG. 1B, also requires specialized ETL processes to ingest data for disclosure risk estimation, and post-processing to ensure the derived de-identification strategy (including de-identification transformations or replacement through data synthesis) is fully applied to the entire non-standard dataset. Such processes are known to be difficult to productize in their entirety, and do not negate the need for expertise in certain areas such as variable connections and configuring risk/de-identification settings. There are many decision points in key areas of the process, resulting in higher requirements for quality control checks and multiple analysts working on the same dataset. Unless data harmonization to standard formats is part of a normal data life cycle for clients, it is unrealistic to expect them to perform transformation or conversion of their non-standard datasets to standard format just for the purposes of applying data privacy, and then converting back.

“Moreover, the current solutions do not provide enough granularity in variable types and other forms of data characterizations to accurately capture the disclosure risk and de-identification complexities of all types of data, such as clinical trials data. This necessitates the use of data catalogs, to track every instance of a variable and data characteristic seen previously and the associated settings for it. However, there is always the possibility that a given dataset may contain new variables or other data characteristics that are not captured by generic variable types or the data catalog, and requires an expansion of the data catalog. This limits scalability across multiple dimensions, including effort, time, and utility.

“Thus, there is a need for a system, a device, and a process to automate the conversion, or to map the data to the standard.”
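As an illustration only, the compliance bands quoted above (high above 85%, medium from 60% to 85%, low below 60%) can be sketched in a few lines of Python. The application does not say how compliance was measured; this sketch assumes it is the share of a dataset's column names that also appear in the target standard. The function name `compliance_band` and the example column names are hypothetical.

```python
def compliance_band(columns, standard_variables):
    """Return (percent_matched, band) for a dataset's column names.

    Assumption: compliance is the percentage of column names that appear
    in the standard, compared case-insensitively. The source does not
    define the metric; only the band thresholds come from the text.
    """
    if not columns:
        raise ValueError("dataset has no columns to assess")
    standard = {name.upper() for name in standard_variables}
    matched = sum(1 for name in columns if name.upper() in standard)
    percent = 100.0 * matched / len(columns)
    if percent > 85:
        band = "high"
    elif percent >= 60:
        band = "medium"
    else:
        band = "low"
    return percent, band


# Hypothetical dataset: three columns follow the standard, one does not.
percent, band = compliance_band(
    ["USUBJID", "AGE", "SEX", "dob_raw"],
    ["USUBJID", "AGE", "SEX", "RACE"],
)
```

A real assessment would likely also compare table structure, data types, and controlled terminology, none of which this sketch attempts.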

In addition to obtaining background information on this patent application, NewsRx editors also obtained the inventors’ summary information for this patent application: “Embodiments of the present invention provide an intermediary mapping and de-identification system for de-identification of one or more non-standard datasets to share or release data for transparency, innovation, service improvement, and other secondary uses. The intermediary mapping and de-identification system is configured to perform an intermediary mapping of the non-standard datasets to a known set of schema and variables (or standard) for which complex requirements can be pre-defined in an automated fashion.

“Embodiments in accordance with the present invention may provide a number of advantages depending on its particular configuration. First, embodiments of the present invention may provide a system and a method to perform an intermediary mapping to a standard schema model and variables, which allows a simple and an automated interpretation of a variable connection and disclosure risk metric settings handling, and de-identification. Further, embodiments of the present invention may provide a system and a method to streamline a quality control and an auditing of an entire de-identification workflow by reducing inter-analyst variability in an expertise application.

“Further, embodiments of the present invention may provide a system and a method to use a wrapper of intermediary mapping to apply data privacy to non-standard datasets (i.e., the non-standard dataset that is processed, maintains its format upon completion). Further, embodiments of the present invention may provide a system and a method which reduce an effort as currently, the de-identification process consists of many steps including data modeling, variable classification, variable risk settings, variable connection, and variable de-identification settings. Embodiments of the present invention may provide a system and a method to restrict an effort to the modeling and classification steps, whereby users map the schema and variables to a given standard. The remaining steps can be inferred from the mapping as per this process.

“Further, embodiments of the present invention may provide a system and a method for reducing a requirement of expertise and training as determining variable connections and settings is a highly technical aspect of a risk-based de-identification process, almost always requiring an experienced disclosure risk analyst doing these steps. Embodiments of the present invention may provide a system and a method that eliminates this expertise barrier by having the details encoded/preset for a given standard, restricting the expertise and training to be centered around how non-standard datasets map to a given standard only.

“Further, embodiments of the present invention may provide a flexible system as previous solutions have been centered around a specific domain for mapping or downstream uses, for example, clinical trials. Embodiments of the present invention may provide a system that allows adaptability of the system for any type of data, such as transactional data.

“Further, embodiments of the present invention may provide a scalable system as generic variable types and data catalog processes are not scalable when it comes to disclosure risk and control, as tweaks are almost always required based on incoming non-standard datasets. By mapping to a standard, downstream actions of the disclosure risk and control can be inferred. Thus, an overall solution becomes more scalable, since a large part of the de-identification process becomes static.

“Embodiments of the present invention may provide one or more new variable types for mapping and new determinations on advanced disclosure control settings required for each variable type. One advanced example of a shift in methods would be that instances of Medical History Start Date would presently be categorized generally as Date fields, which do not share prior estimates (i.e., frequency distributions) for a disclosure risk measurement. In an embodiment of the present invention, the above stated is mapped to a more granular medical_history_start_date variable type that does share prior estimates (e.g., the frequency distributions), thus providing more granularity and accuracy for disclosure risk assessment, and subsequent improved de-identification.

“Presently, correlations between fields are used to inform how to apply a de-identification strategy for the de-identification of the full dataset. The application of a de-identification may be performed manually or using custom scripting. In an embodiment of the present invention, correlations are akin to groupings of variables, which serve a dual purpose; in a more accurate disclosure-risk calculation, groupings may manifest as measurement groups, and in a more refined, automated de-identification process, groupings may serve the role of propagation de-id groups. Further, certain variable groupings are redesigned that existed before, as well new groupings are created. Furthermore, the disclosure control is performed over the entire dataset in a single pass, versus present approaches that may require specialized ETL processes to determine a de-identification strategy before applying this for the de-identification of the full dataset.

“These and other advantages will be apparent from the present application of the embodiments described herein.

“The preceding is a simplified summary to provide an understanding of some embodiments of the present invention. This summary is neither an extensive nor an exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.”
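The central idea of the summary, mapping each non-standard variable to a granular variable type whose settings are preset, might look roughly like the following Python sketch. It is not the inventors' implementation. The `medical_history_start_date` type and the fact that it shares prior estimates while a generic date field does not come from the summary; the `VariablePreset` fields, the classification labels, the group name, and the column names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariablePreset:
    classification: str      # assumed label, e.g. "quasi_identifier"
    shares_priors: bool      # whether prior frequency distributions are shared
    group: Optional[str]     # assumed grouping used for risk and propagation


# Hypothetical presets for a target standard, keyed by variable type.
STANDARD_PRESETS = {
    "date": VariablePreset("quasi_identifier", False, None),
    "medical_history_start_date": VariablePreset(
        "quasi_identifier", True, "medical_history"
    ),
    "subject_id": VariablePreset("direct_identifier", False, None),
}


def infer_settings(variable_mapping, presets=STANDARD_PRESETS):
    """Look up preset settings for each mapped column.

    variable_mapping: {column name in the non-standard dataset: variable type}
    Raises KeyError if a column is mapped to a type the standard lacks.
    """
    return {column: presets[vtype] for column, vtype in variable_mapping.items()}


# The analyst supplies only the mapping; the settings follow from it.
settings = infer_settings({
    "MHSTDT_RAW": "medical_history_start_date",
    "visit_dt": "date",
    "patient_no": "subject_id",
})
```

In this arrangement the expertise sits in the preset table, which is written once per standard, rather than in each analyst's per-dataset configuration.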

The claims supplied by the inventors are:

“1.-20. (canceled)

“21. An automated classification and interpretation device, comprising: one or more processors; and one or more non-transitory computer-readable storage media storing instructions which, when executed by the one or more processors, cause the one or more processors to: retrieve one or more datasets from a data source; select a target standard including schema and variables; map the retrieved one or more datasets to the schema and variables of the target standard; infer one or more characteristics of the mapped one or more datasets; determine a disclosure risk based on one of, the mapped one or more datasets, the one or more inferred characteristics, or a combination thereof; de-identify the retrieved one or more datasets using one of, the mapped one or more datasets, the inferred one or more characteristics, or a combination thereof; and convert the retrieved one or more datasets using one of, the mapped one or more datasets, the inferred one or more characteristics, or a combination thereof.

“22. The device of claim 21, wherein the target standard is a standard already available in a field, a custom standard, an ontology defined by an analyst for reuse for sets of a similar dataset, a subset of a standard, an extension of a standard, a combination of multiple standards, or combinations thereof.

“23. The device of claim 21, wherein the one or more datasets are mapped to the schema and variables of the target standard using one or more of a schema mapping and a variable mapping.

“24. The device of claim 21, wherein the one or more inferred characteristics of the mapped one or more datasets include one or more variable classifications, one or more variable connections, one or more groupings, one or more disclosure risk settings, one or more de-identification settings, or combinations thereof.

“25. The device of claim 21, wherein the instructions further cause the one or more processors to retrieve metadata from the data source.

“26. The device of claim 25, wherein the one or more datasets and the one or more metadata are mapped to the schema and variables of the target standard.

“27. The device of claim 26, wherein one or more variable classifications, one or more variable connections, one or more groupings, one or more disclosure risk settings, one or more de-identification settings, or combinations thereof are inferred from the mapped one or more datasets and the mapped metadata.

“28. The device of claim 26, wherein the disclosure risk is determined based on one of, the mapped one or more datasets, the mapped metadata, the one or more inferred characteristics of the mapped one or more datasets, one or more inferred characteristics of the mapped metadata, or a combination thereof.

“29. A computer system, comprising: a memory configured to store instructions; one or more processors configured to execute the instructions, causing the one or more processors to: retrieve one or more datasets from a data source; select a target standard including schema and variables; map the retrieved one or more datasets to the schema and variables of the target standard; infer one or more characteristics of the mapped one or more datasets; determine a disclosure risk based on one of, the mapped one or more datasets, the inferred one or more characteristics, or a combination thereof; de-identify the retrieved one or more datasets using one of, the mapped one or more datasets, the inferred one or more characteristics, or a combination thereof; and convert the retrieved one or more datasets using one of the mapped one or more datasets, the inferred one or more characteristics, or a combination thereof.

“30. The system of claim 29, wherein the target standard is a standard already available in a field, a custom standard, an ontology defined by an analyst for reuse for sets of a similar dataset, a subset of a standard, an extension of a standard, a combination of multiple standards, or combinations thereof.

“31. The system of claim 29, wherein the one or more datasets are mapped to the schema and variables of the target standard using one or more of a schema mapping and a variable mapping.

“32. The system of claim 29, wherein the one or more inferred characteristics of the mapped one or more datasets include one or more variable classifications, one or more variable connections, one or more groupings, one or more disclosure risk settings, one or more de-identification settings, or combinations thereof.

“33. The system of claim 29, wherein the instructions further cause the one or more processors to retrieve metadata from the data source.

“34. The system of claim 33, wherein the one or more datasets and the metadata are mapped to the schema and variables of the target standard.

“35. The system of claim 34, wherein one or more variable classifications, one or more variable connections, one or more groupings, one or more disclosure risk settings, one or more de-identification settings, or combinations thereof are inferred from the mapped one or more datasets and the mapped metadata.

“36. The system of claim 34, wherein the disclosure risk is determined based on one of, the mapped one or more datasets, the mapped metadata, the one or more inferred characteristics of the mapped one or more datasets, one or more inferred characteristics of the mapped metadata, or a combination thereof.

“37. A method comprising: retrieving one or more datasets and metadata from a data source; selecting a target standard, the target standard including schema and variables; mapping the retrieved one or more datasets and the metadata to the schema and variables of the target standard; inferring one or more characteristics based on the mapped one or more datasets and the mapped metadata; determining a disclosure risk based on one of, the mapped one or more datasets, the mapped metadata, the inferred one or more characteristics, or a combination thereof; de-identifying the retrieved one or more datasets using one of, the mapped one or more datasets, the mapped metadata, the inferred one or more characteristics, or a combination thereof; and converting the retrieved one or more datasets using one of, the mapped one or more datasets, the mapped metadata, the inferred one or more characteristics, or a combination thereof.

“38. The method of claim 21, wherein the target standard is a standard already available in a field, a custom standard, an ontology defined by an analyst for reuse for sets of a similar dataset, a subset of a standard, an extension of a standard, a combination of multiple standards, or combinations thereof.

“39. The method of claim 21, wherein the one or more datasets are mapped to the schema and variables of the target standard based on one or more of a schema mapping and a variable mapping.

“40. The method of claim 21, wherein the one or more inferred characteristics include one or more variable classifications, one or more variable connections, one or more groupings, one or more disclosure risk settings, one or more de-identification settings, or combinations thereof.”
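For readers who want a concrete picture of the claimed sequence (map, infer, determine disclosure risk, de-identify), the following Python sketch walks a toy dataset through those steps. Everything specific in it is an assumption: the claims do not name a risk metric or any transformations. The sketch substitutes a simple stand-in, the reciprocal of the smallest group of records sharing the same quasi-identifier values, and two illustrative transformations (suppression and truncating a date to its year). All names and data are hypothetical.

```python
from collections import Counter

# Hypothetical target standard: variable type -> (classification, transformation).
TARGET_STANDARD = {
    "subject_id": ("direct_identifier", "suppress"),
    "birth_date": ("quasi_identifier", "year_only"),
    "sex": ("quasi_identifier", "keep"),
    "lab_result": ("non_identifying", "keep"),
}


def infer_characteristics(variable_mapping, standard=TARGET_STANDARD):
    """Infer per-column settings from the mapping to the standard."""
    return {col: standard[vtype] for col, vtype in variable_mapping.items()}


def disclosure_risk(rows, characteristics):
    """Stand-in metric: 1 / size of the smallest quasi-identifier group."""
    quasi = [c for c, (cls, _) in characteristics.items()
             if cls == "quasi_identifier"]
    if not rows or not quasi:
        return 0.0
    groups = Counter(tuple(row[c] for c in quasi) for row in rows)
    return 1.0 / min(groups.values())


def de_identify(rows, characteristics):
    """Apply the inferred transformations; column names stay unchanged."""
    released = []
    for row in rows:
        new_row = dict(row)  # copy so the source rows are not modified
        for col, (_, transform) in characteristics.items():
            if transform == "suppress":
                new_row[col] = None
            elif transform == "year_only":
                new_row[col] = row[col][:4]  # assumes ISO "YYYY-MM-DD" strings
        released.append(new_row)
    return released


# Toy non-standard dataset and the analyst's mapping to the standard.
rows = [
    {"pt": "A1", "dob": "1980-03-02", "gender": "F", "hgb": 13.1},
    {"pt": "A2", "dob": "1980-11-19", "gender": "F", "hgb": 12.4},
    {"pt": "A3", "dob": "1975-06-30", "gender": "M", "hgb": 14.8},
    {"pt": "A4", "dob": "1975-01-12", "gender": "M", "hgb": 15.0},
]
mapping = {"pt": "subject_id", "dob": "birth_date",
           "gender": "sex", "hgb": "lab_result"}

characteristics = infer_characteristics(mapping)
risk_before = disclosure_risk(rows, characteristics)
released = de_identify(rows, characteristics)
risk_after = disclosure_risk(released, characteristics)
```

The released rows keep the original column names, which loosely mirrors the summary's point that the processed dataset maintains its format. The claimed conversion step is not modeled here.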

For more information, see this patent application: Bradley, George Wesley; Di Valentino, David Nicholas Maurice; Mian, Muhammad Oneeb Rehman. System And Method For Intermediary Mapping And De-Identification Of Non-Standard Datasets. U.S. Patent Application Number 20240037123, filed October 10, 2023 and posted February 1, 2024. Patent URL (for desktop use only): https://ppubs.uspto.gov/pubwebapp/external.html?q=(20240037123)&db=US-PGPUB&type=ids

(Our reports deliver fact-based news of research and discoveries from around the world.)
