Patent Application Titled “System And Method For Intermediary Mapping And De-Identification Of Non-Standard Datasets” Published Online (USPTO 20220129485): Patent Application
2022 MAY 16 (NewsRx) -- No assignee for this patent application has been made.
Reporters obtained the following quote from the background information supplied by the inventors: “Many jurisdictions now have data privacy laws and regulations to protect against disclosure of personal information, and organizations also wish to protect against the disclosure of confidential information. De-identification is a process by which personal information relating to a data subject and/or an individual is protected by various means (e.g., transformation, suppression, masking, synthesis, etc.). De-identification can be rules-based, for example the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor method, whereby eighteen select identifying variables are hidden or transformed; Expert Determination and Safe Harbor are the two de-identification methods recognized under HIPAA. Moreover, the amount of de-identification required to meet a standard of statistical disclosure control is also influenced by the context in which data is being shared or released; public releases have a higher bar than data releases to a secure platform (e.g., a portal in which data access and retention are controlled and regulated) or sharing access in a controlled data environment. Disclosure control includes protecting against identity, attribute, and inferential disclosure.
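As a rough illustration of the rules-based de-identification the background describes, the sketch below suppresses or generalizes a few Safe Harbor-style identifiers. The field names and rules here are invented for illustration and are not taken from the application:

```python
# Minimal sketch of rules-based (HIPAA Safe Harbor-style) de-identification.
# Field names and rules are illustrative only, not from the application.

SAFE_HARBOR_RULES = {
    "name": lambda v: None,                        # suppress direct identifiers
    "ssn": lambda v: None,
    "phone": lambda v: None,
    "zip": lambda v: v[:3] + "00" if v else None,  # generalize ZIP to 3 digits
    "birth_date": lambda v: v[:4] if v else None,  # keep year only
}

def deidentify_record(record: dict) -> dict:
    """Apply each rule to its field; pass non-identifying fields through."""
    return {
        field: SAFE_HARBOR_RULES.get(field, lambda v: v)(value)
        for field, value in record.items()
    }

record = {"name": "Jane Doe", "zip": "90210", "birth_date": "1984-06-01",
          "diagnosis": "J10.1"}
print(deidentify_record(record))
# {'name': None, 'zip': '90200', 'birth_date': '1984', 'diagnosis': 'J10.1'}
```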
“A statistical disclosure risk measurement requires a series of steps, including appropriately modeling a dataset, introspecting the data to find various types of identifiable/sensitive information or variables, finding connections between different variables, and assigning appropriate risk algorithm settings for each variable based on previous research and expertise. After the initial disclosure risk of a dataset is determined, de-identification is performed to bring the disclosure risk below an acceptable threshold. Any deviation in these steps may over- or under-estimate the disclosure risk, leading to over-de-identification (and thereby reduced data utility) or to leaking personally identifiable information, respectively.
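A simplified rendering of that measure-then-de-identify loop might look like the following sketch, which uses equivalence-class counts (a k-anonymity-style measure) as a stand-in risk model; the application does not specify a particular metric, so the metric, threshold, and data here are all assumptions:

```python
from collections import Counter

def disclosure_risk(records, quasi_identifiers):
    """Estimate risk as the maximum per-record re-identification probability:
    1 / (size of the record's equivalence class on the quasi-identifiers)."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return max(1.0 / n for n in classes.values())

def deidentify_until_safe(records, quasi_identifiers, generalize, threshold=0.2):
    """Generalize quasi-identifiers until measured risk is below the threshold."""
    for _ in range(10):  # cap iterations; a real tool tunes steps per variable
        if disclosure_risk(records, quasi_identifiers) <= threshold:
            break
        records = [generalize(r) for r in records]
    return records

# Toy usage: coarsen birth year to decade until each class has >= 5 records.
people = [{"birth": str(y)} for y in (1980, 1981, 1982, 1983, 1984)]
coarsen = lambda r: {**r, "birth": r["birth"][:3] + "0"}
print(deidentify_until_safe(people, ["birth"], coarsen))
```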
“Datasets containing personal or confidential information come in all shapes and sizes. A dataset is any collection of data, including but not limited to structured data, unstructured data (including documents), or any combination or derivation thereof. In certain fields, data sharing needs or regulatory submission requirements have driven the establishment of data standards. In the clinical trials sphere, for example, the two most commonly used standards are the Study Data Tabulation Model (SDTM) and the Analysis Data Model (ADaM). Many companies still possess or operate with non-standard datasets, either because these are historical datasets or because internal standards/policies result in datasets deviating from, or being extensions of, established standards. Studies have been conducted to assess the level of compliance with the established SDTM and ADaM standards. However, it has been observed that the majority of datasets significantly deviate from the standards in their native format (as shown in FIG. 1A). Even where datasets were considered highly compliant (>85%), there is room for process improvement to ensure accuracy of risk estimation, consistency of data transformations or synthesis, and reduced effort, expertise, and training requirements. FIG. 1A also illustrates datasets with medium compliance (60-85%) and datasets with low compliance (<60%).
“De-identification of datasets, and specifically of non-standard datasets, to share or release data for transparency, innovation, service improvement, and other secondary uses carries high effort and expertise requirements (as shown in FIG. 1B). Currently, analysts must manually introspect data to correctly model it, perform advanced Extract-Transform-Load (ETL) processes as necessary, find identifiable/sensitive information, and possess detailed know-how (expertise) regarding connections between the identifiable variables and appropriate settings for each variable, in order to accurately measure disclosure risk and de-identify the data. The latter steps are especially resource-intensive and can take experienced analysts 5-10 days to complete and quality control, given the variability of incoming non-standard datasets and the sheer volume of variables, as in the case of clinical trial datasets (as shown in FIG. 1B). Moreover, some datasets can contain up to 100 tables and 10,000+ variables with many interconnections and indirect relationships.
“Conventionally, some data harmonization processes or tools are used, whereby not a mapping but a full conversion or transformation to a standard format is performed. An example of data harmonization in practice is converting or transforming various clinical data sources into SDTM datasets, such as part of a data life cycle while collecting data from data spokes into a data hub.
“Moreover, current de-identification solutions allow only generic characterization of datasets and their elements. For example, available de-identification software solutions currently allow a user to associate variables in the data with very generic variable types, such as public quasi-identifier or direct identifier. The generic variable types can be combined with a feature that loads or applies settings for a dataset either from another project setup or from variable settings stored in, for example, an Excel format. This is akin to a data catalog process, whereby an exhaustive list of variables and variable settings is stored for future retrieval; if an incoming data variable matches the particulars of a variable already in the catalog, it is handled appropriately.
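That catalog-style retrieval could be approximated as in the sketch below; the variable names and stored settings are hypothetical, not drawn from any particular product:

```python
# Hypothetical data-catalog lookup: variable settings stored for retrieval.
CATALOG = {
    "patient_zip": {"type": "public_quasi_identifier", "transform": "generalize"},
    "subject_id":  {"type": "direct_identifier",       "transform": "pseudonymize"},
}

def settings_for(variable_name: str) -> dict:
    """Return stored settings if the variable was seen before, else flag it
    for manual review (the scalability gap the application points out)."""
    try:
        return CATALOG[variable_name]
    except KeyError:
        return {"type": "unknown", "transform": None, "needs_review": True}

print(settings_for("patient_zip"))   # found in the catalog
print(settings_for("new_field_x"))   # unseen variable: requires expansion
```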
“However, previous solutions, systems, and methods developed to handle non-standard datasets have multiple drawbacks: they may require specialized ETL processes to estimate disclosure risk and derive a de-identification strategy, require detailed assessments of potential correlations between variables, and demand heavy manual effort to align dependencies between correlated or indirectly connected variables and to perform the overall de-identification process on datasets. The estimation of disclosure risk to derive a de-identification strategy may cause over-estimation of risk and over-de-identification, or under-estimation of risk and potential leaking of sensitive information. Moreover, this workflow, as shown in FIG. 1B, also requires specialized ETL processes to ingest data for disclosure risk estimation, and post-processing to ensure the derived de-identification strategy (including de-identification transformations or replacement through data synthesis) is fully applied to the entire non-standard dataset. Such processes are known to be difficult to productize in their entirety, and do not negate the need for expertise in certain areas such as variable connections and configuring risk/de-identification settings. There are many decision points in key areas of the process, resulting in higher requirements for quality control checks and multiple analysts working on the same dataset. Unless data harmonization to standard formats is part of a normal data life cycle for clients, it is unrealistic to expect them to transform or convert their non-standard datasets to a standard format just for the purposes of applying data privacy, and then convert back.
“Moreover, current solutions do not provide enough granularity in variable types and other forms of data characterization to accurately capture the disclosure risk and de-identification complexities of all types of data, such as clinical trials data. This necessitates the use of data catalogs to track every instance of a variable and data characteristic seen previously, along with the associated settings. However, there is always the possibility that a given dataset contains new variables or other data characteristics that are not captured by generic variable types or the data catalog, requiring an expansion of the catalog. This limits scalability across multiple dimensions, including effort, time, and utility.
“Thus, there is a need for a system, a device, and a process to automate the conversion, or to map the data to the standard.”
In addition to obtaining background information on this patent application, NewsRx editors also obtained the inventors’ summary information for this patent application: “Embodiments of the present invention provide an intermediary mapping and de-identification system for de-identification of one or more non-standard datasets to share or release data for transparency, innovation, service improvement, and other secondary uses. The intermediary mapping and de-identification system is configured to perform an intermediary mapping of the non-standard datasets to a known set of schema and variables (or standard) for which complex requirements can be pre-defined in an automated fashion.
“Embodiments in accordance with the present invention may provide a number of advantages depending on the particular configuration. First, embodiments of the present invention may provide a system and a method to perform an intermediary mapping to a standard schema model and variables, which allows simple and automated interpretation of variable connections, handling of disclosure risk metric settings, and de-identification. Further, embodiments of the present invention may provide a system and a method to streamline quality control and auditing of the entire de-identification workflow by reducing inter-analyst variability in the application of expertise.
“Further, embodiments of the present invention may provide a system and a method that use a wrapper of intermediary mapping to apply data privacy to non-standard datasets (i.e., the non-standard dataset that is processed maintains its format upon completion). Further, embodiments of the present invention may provide a system and a method that reduce effort: currently, the de-identification process consists of many steps, including data modeling, variable classification, variable risk settings, variable connection, and variable de-identification settings. Embodiments of the present invention may provide a system and a method that restrict the effort to the modeling and classification steps, whereby users map the schema and variables to a given standard. The remaining steps can be inferred from the mapping as per this process.
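A minimal sketch of that division of labor, assuming invented column names, variable types, and presets, might look like the following: the user supplies only the mapping, and the classification and de-identification settings follow from the standard.

```python
# Minimal sketch, assuming invented variable types and presets: once a
# non-standard column is mapped to a standard type, its classification and
# settings are inferred rather than configured by hand.
STANDARD_PRESETS = {
    "subject_id": {"class": "direct_identifier", "deid": "pseudonymize"},
    "visit_date": {"class": "quasi_identifier",  "deid": "date_shift"},
}

# The only manual step left: map non-standard columns to standard types.
USER_MAPPING = {"PT_NUM": "subject_id", "VST_DT_RAW": "visit_date"}

def infer_settings(mapping: dict) -> dict:
    """Derive per-column risk/de-identification settings from the mapping."""
    return {column: STANDARD_PRESETS[std_type]
            for column, std_type in mapping.items()}

print(infer_settings(USER_MAPPING))
```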
“Further, embodiments of the present invention may provide a system and a method for reducing expertise and training requirements, as determining variable connections and settings is a highly technical aspect of a risk-based de-identification process, almost always requiring an experienced disclosure risk analyst to perform these steps. Embodiments of the present invention may provide a system and a method that eliminate this expertise barrier by having the details encoded/preset for a given standard, restricting the expertise and training to be centered only around how non-standard datasets map to a given standard.
“Further, embodiments of the present invention may provide a flexible system, as previous solutions have been centered around a specific domain for mapping or downstream uses, for example clinical trials. Embodiments of the present invention may provide a system that is adaptable to any type of data, such as transactional data.
“Further, embodiments of the present invention may provide a scalable system: generic variable types and data catalog processes are not scalable when it comes to disclosure risk and control, since tweaks are almost always required based on incoming non-standard datasets. By mapping to a standard, downstream actions of disclosure risk and control can be inferred. Thus, the overall solution becomes more scalable, since a large part of the de-identification process becomes static.
“Embodiments of the present invention may provide one or more new variable types for mapping and new determinations on the advanced disclosure control settings required for each variable type. One advanced example of a shift in methods: instances of Medical History Start Date would presently be categorized generically as Date fields, which do not share prior estimates (i.e., frequency distributions) for disclosure risk measurement. In an embodiment of the present invention, such a field is instead mapped to a more granular medical_history_start_date variable type that does share prior estimates (e.g., the frequency distributions), thus providing more granularity and accuracy for disclosure risk assessment and subsequently improved de-identification.
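The benefit of the more granular type can be sketched as follows: columns mapped to the same granular variable type pool their observations into one shared frequency distribution, whereas generic Date fields would each be measured in isolation. The data and pooling rule below are illustrative assumptions, not the application's method:

```python
from collections import Counter

# Hypothetical: columns mapped to the same granular variable type pool their
# values into one shared frequency distribution ("prior estimates").
def shared_prior(columns_by_type, variable_type):
    pooled = Counter()
    for values in columns_by_type[variable_type]:
        pooled.update(values)
    total = sum(pooled.values())
    return {value: count / total for value, count in pooled.items()}

columns_by_type = {
    "medical_history_start_date": [
        ["1998", "2001", "2001"],   # e.g. one table's column (illustrative)
        ["2001", "2005"],           # e.g. another table's column (illustrative)
    ]
}
print(shared_prior(columns_by_type, "medical_history_start_date"))
# {'1998': 0.2, '2001': 0.6, '2005': 0.2}
```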
“Presently, correlations between fields are used to inform how to apply a de-identification strategy for the de-identification of the full dataset, and the application of the de-identification may be performed manually or using custom scripting. In an embodiment of the present invention, correlations are akin to groupings of variables, which serve a dual purpose: for a more accurate disclosure-risk calculation, groupings may manifest as measurement groups, and for a more refined, automated de-identification process, groupings may serve the role of propagation de-id groups. Further, certain variable groupings that existed before are redesigned, and new groupings are created. Furthermore, disclosure control is performed over the entire dataset in a single pass, versus present approaches that may require specialized ETL processes to determine a de-identification strategy before applying it for the de-identification of the full dataset.
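The propagation idea could be rendered roughly as in the sketch below, where a transformation chosen for one member of a grouping is applied to every connected variable in a single pass over the tables; the group name, table layout, and transform are all invented for illustration:

```python
# Hypothetical propagation groups: variables known (from the standard) to be
# connected, so one chosen transformation is applied to all members at once.
PROPAGATION_GROUPS = {
    "visit_dates": ["VISIT_DT", "LAB_DT", "AE_START_DT"],
}

def propagate(tables: dict, group: list, transform) -> dict:
    """Single pass: apply the same transform to every group member, wherever
    it appears, keeping connected variables consistent."""
    for name, rows in tables.items():
        for row in rows:
            for var in group:
                if var in row:
                    row[var] = transform(row[var])
    return tables

shift_year = lambda d: str(int(d[:4]) + 1) + d[4:]  # toy date shift
tables = {"visits": [{"VISIT_DT": "2001-03-02"}],
          "labs":   [{"LAB_DT": "2001-03-05"}]}
print(propagate(tables, PROPAGATION_GROUPS["visit_dates"], shift_year))
# {'visits': [{'VISIT_DT': '2002-03-02'}], 'labs': [{'LAB_DT': '2002-03-05'}]}
```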
“These and other advantages will be apparent from the present application of the embodiments described herein.
“The preceding is a simplified summary to provide an understanding of some embodiments of the present invention. This summary is neither an extensive nor an exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.”
The claims supplied by the inventors are:
“1. A computing device configured to operate as a computer-implemented automated classification and interpretation tool, comprising: one or more processors; and one or more non-transitory computer-readable storage media storing instructions which, when executed by the one or more processors, cause the computing device to: retrieve one or more datasets and one or more metadata from a data source; select a target standard, wherein the standard is one of, a standard already available in a field, a custom standard, an ontology defined by an analyst for reuse for sets of a similar dataset, a subset of a standard, an extension of a standard, a combination of multiple standards, or a combination thereof; map the retrieved one or more datasets and the one or more metadata to the target standard, wherein the one or more datasets and the one or more metadata are mapped to the target standard using one of, a schema mapping, a variable mapping, or a combination thereof; infer one or more variable classifications, one or more variable connections, one or more groupings, one or more disclosure risk settings, one or more de-identification settings, or a combination thereof using the mapped one or more datasets and the mapped one or more metadata; perform a disclosure risk assessment using one of, the mapped one or more datasets, the mapped one or more metadata, the inferred one or more variable classifications, the inferred one or more variable connections, the inferred one or more groupings, the inferred one or more disclosure risk settings, the inferred one or more de-identification settings, or a combination thereof; perform a de-identification and a de-identification propagation using one of, the mapped one or more datasets, the mapped one or more metadata, the inferred one or more variable classifications, the inferred one or more variable connections, the inferred one or more groupings, the inferred one or more disclosure risk settings, the inferred one or more de-identification settings, or a combination thereof; and perform a conversion using one of, the mapped one or more datasets, the mapped one or more metadata, the inferred one or more variable classifications, the inferred one or more variable connections, the inferred one or more groupings, the inferred one or more disclosure risk settings, the inferred one or more de-identification settings, the inferred one or more conversion settings, or a combination thereof.
“2. The computing device of claim 1, wherein the schema mapping is performed using one or more tables and/or one or more domain type lists, wherein the one or more tables and/or the one or more domain type lists comprise one of, a customized list of tables, one or more domain types based on the standard, an extension table, one or more domain types, or a combination thereof.
“3. The computing device of claim 1, wherein the variable mapping is performed using one or more variable type lists, wherein the one or more variable type lists comprise one of, a customized list of variables based on the standard, one or more extension variable types informed by the standard, one or more extension variable types informed by a disclosure control expert, or a combination thereof.
“4. The computing device of claim 1, wherein the mapping of the one or more datasets and the one or more metadata are automated and/or facilitated by using a ruleset.
“5. The computing device of claim 1, wherein the one or more variable classifications, the one or more variable connections, the one or more groupings, the one or more disclosure risk settings, the one or more de-identification settings, the one or more conversion settings, or a combination thereof are inferred by using at least one of, a ruleset, a variable type container, or a combination thereof, and wherein the one or more variable connections, the one or more groupings, the one or more disclosure risk settings, the one or more de-identification settings, the one or more conversion settings, or the combination thereof are stored.
“6. The computing device of claim 1, wherein the datasets are non-standard datasets and are structured or non-structured non-standard datasets.
“7. The computing device of claim 1, wherein the datasets are standard datasets and are structured or non-structured standard datasets.
“8. A system comprising: a target selection module that selects a target standard for mapping a retrieved dataset from a data source, wherein the dataset is a standard dataset or a non-standard dataset; a schema mapping module that maps one or more tables from the retrieved dataset to one or more specific domains, wherein the schema mapping module performs schema mapping using one or more tables and/or one or more domain type lists, wherein claims and/or transactions from the one or more tables are considered for a disclosure risk assessment; and a central processor coupled to the target selection module and the schema mapping module, wherein the central processor performs a de-identification of the mapped dataset, and wherein the central processor performs a conversion of the mapped dataset.
“9. The system of claim 8, wherein the conversion of the mapped datasets includes mapped one or more metadata and inferred one or more variable classifications.
“10. The system of claim 8, wherein the mapped and converted datasets are stored.
“11. The system of claim 8, wherein the mapped and converted datasets are outputted to a communication network.
“12. The system of claim 8, wherein conversion rules or settings are retrieved to enable the conversion of the mapped dataset.
“13. The system of claim 8, wherein the retrieved non-standard dataset is mapped to standard variables of the target standard.
“14. The system of claim 8, wherein the central processor determines whether the non-standard dataset can be mapped to standard variables of the target standard.
“15. A method comprising: retrieving one or more datasets and one or more metadata from a data source by a mapping platform; selecting a target standard by the mapping platform, wherein the standard is one of, a standard already available in a field, a custom standard, an ontology defined by an analyst for reuse for sets of a similar dataset, a subset of a standard, an extension of a standard, a combination of multiple standards, or a combination thereof; mapping the retrieved one or more datasets and the one or more metadata to the target standard by the mapping platform, wherein the one or more datasets and the one or more metadata are mapped to the target standard using one of, a schema mapping, a variable mapping, or a combination thereof; inferring one or more variable classifications, one or more variable connections, one or more groupings, one or more disclosure risk settings, one or more de-identification settings, or a combination thereof using the mapped one or more datasets and the mapped one or more metadata by the mapping platform; performing a disclosure risk assessment by a central processor using one of, the mapped one or more datasets, the mapped one or more metadata, the inferred one or more variable classifications, the inferred one or more variable connections, the inferred one or more groupings, the inferred one or more disclosure risk settings, the inferred one or more de-identification settings, or a combination thereof; performing a de-identification and a de-identification propagation by the central processor using one of, the mapped one or more datasets, the mapped one or more metadata, the inferred one or more variable classifications, the inferred one or more variable connections, the inferred one or more groupings, the inferred one or more disclosure risk settings, the inferred one or more de-identification settings, or a combination thereof; and performing a conversion by the central processor using one of, the mapped one or more datasets, the mapped one or more metadata, the inferred one or more variable classifications, the inferred one or more variable connections, the inferred one or more groupings, the inferred one or more disclosure risk settings, the inferred one or more de-identification settings, the inferred one or more conversion settings, or a combination thereof.
“16. The method of claim 15, wherein the central processor determines whether the de-identification of the converted mapped dataset or datasets is required.
“17. The method of claim 15, further comprising: a ruleset engine that performs the inference based on the schema and/or the variable mapping or mappings.
“18. The method of claim 15, wherein the de-identification includes data transformation, data masking and/or data synthesis.
“19. The method of claim 15, further comprising: outputting and/or storing the mapped and converted and/or de-identified datasets.
“20. The method of claim 15, further comprising: determining whether the conversion of mapped datasets is required.”
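Read together, independent claims 1 and 15 recite the same pipeline shape: retrieve datasets and metadata, map them to a target standard, infer settings from the mapping, assess disclosure risk, de-identify with propagation, and convert. A toy sketch of that flow appears below; every function, name, and value in it is a placeholder rather than the claimed implementation:

```python
# Toy end-to-end pipeline mirroring the steps recited in claims 1 and 15;
# each function here is a simplified placeholder, not the claimed method.

def retrieve(source: dict):
    return source["datasets"], source["metadata"]

def map_to_standard(metadata: list, standard: dict) -> dict:
    # Schema/variable mapping: look each column up in the target standard.
    return {col: standard.get(col, "unmapped") for col in metadata}

def infer(mapping: dict) -> dict:
    # Classifications, connections, groupings, and settings would all be
    # derived from the mapping; only a classification stub is shown here.
    return {col: ("needs_review" if t == "unmapped" else t)
            for col, t in mapping.items()}

def pipeline(source: dict, standard: dict):
    datasets, metadata = retrieve(source)          # retrieve datasets/metadata
    mapping = map_to_standard(metadata, standard)  # select standard and map
    inferred = infer(mapping)                      # infer settings
    # Risk assessment, de-identification with propagation, and conversion
    # back to the native format would follow, consuming `inferred`.
    return datasets, inferred

source = {"datasets": [{"PT_ID": "001"}], "metadata": ["PT_ID"]}
standard = {"PT_ID": "subject_id"}
print(pipeline(source, standard))
```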
For more information, see this patent application: Bradley,
(Our reports deliver fact-based news of research and discoveries from around the world.)