“Digital Watermarking Without Significant Information Loss In Anonymized Datasets” in Patent Application Approval Process (USPTO 20200250338)
2020 AUG 21 (NewsRx) -- By a
This patent application is assigned to
The following quote was obtained by the news editors from the background information supplied by the inventors: “This invention relates to a computer-implemented process of altering original data in a dataset, in which data anonymisation and digital watermarking are applied.
“Many organisations hold highly valuable datasets, which enable a wide number of secondary uses. Health data enable medical research; consumer banking and retail purchase data enable fraud analysis, market analysis and economic modelling; telecoms data enable a vast array of behavioural analyses, and there are countless other examples. However, since such datasets often contain highly private data about individuals, great care must be taken to ensure that this private information is protected.
“Two mechanisms to protect individual privacy are to remove or obscure the private data in a dataset through anonymisation techniques, and to control and track the distribution of a data set by inserting a digital watermark into the data.
“Anonymisation techniques reduce the risk of an adversary identifying one or more individuals in a dataset. Watermarking enables detection and attribution of unauthorised distribution or publishing of a sensitive dataset, and so enables deterrent mechanisms against such unauthorised behaviour.
“Anonymisation of a dataset reduces the risk of re-identification but may not completely eliminate all possibility of doing so, and so it is prudent to also apply digital watermarking. Embedding a unique fingerprint into a dataset enables that dataset to be associated with an audit trail of who authorized data access to which user, and for what purpose. This information may be encoded within the watermark or stored in a registry, with the key to that record encoded within the watermark.
“The formats used for many documents, media files or computer programs often contain metadata or redundant data within which a watermark can be encoded without damaging the integrity of the file. By contrast, raw datasets (for example a tabular file, an extract from a relational or semi structured non-relational database, or the result of an interactive database query) typically do not contain such metadata or redundant data, and so digital watermarking typically requires manipulating or perturbing the data itself, leading to an undesirable loss of information and utility of the dataset.
“However, privacy preserving techniques such as tokenisation, generalisation, data blurring and insertion of synthetic records do themselves perturb the raw data, in order to remove or generalise the private data. By extending and specialising these anonymisation techniques to incorporate watermark generation, as taught by the invention, digital watermarks may be embedded in an anonymised dataset without further information loss beyond that incurred by anonymisation.”
In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “The invention is a computer-implemented process of altering original data in a dataset, comprising the step of anonymising the original data, and including a digital watermark in the anonymised data.
“Optional features of the invention include one or more of the following:
“Anonymising the original data incurs information loss, and the further step of including the digital watermark does not add significant further information loss. The digital watermark may operate on a probabilistic basis (we give specific examples below).
“The watermark may be included in original raw data that has been anonymised, as opposed to metadata or redundant data. The original data may be a tabular file, a relational or a non-relational database, or the results of interactive database queries.
“Anonymising the data is achieved using one or more techniques that perturb the original data, such as tokenisation, generalisation, data blurring, synthetic record insertion, record removal or re-ordering.
“When tokenisation is used, watermarking is incorporated by extending the tokenisation to generate or select replacement values according to a key, values containing a hidden pattern, or values that are a function of other fields in the same record; or to use some unique token values in each data release which only ever appear in that release and so uniquely identify it.
“When generalization is used, watermarking is incorporated through the choice of how to generate the replacement for a raw value; or the distribution of the populations of the unique groups; or the choices of group boundaries; or selecting group members to create patterns of data in other variables in the dataset for those individuals within a group.
“When data blurring is used, watermarking is incorporated by perturbing the data in such a way as to include a pattern within the perturbed values, or to generate the offset values according to an algorithm or secret key.
“When insertion of synthetic records is used, watermarking is incorporated by generating the data in these synthetic records according to a pattern or digital key.
“When removal of records is used, watermarking is incorporated by the choice and recording of which records to include and suppress.
“The watermark to be encoded into the anonymised data may be a number or other ID (collectively a ‘number’) which is stored in a watermark registry or is a number that is related or mapped to another number which is stored in the watermark registry. The number stored in the watermark registry may be: a random number; an e-mail address; a unique ID associated with a person; a unique text string; or any data string mapped or related to the foregoing. The watermark to be encoded into the anonymised data may be a random decimal number, which is stored in a watermark registry. The length of the number may be determined by the number and size of the file’s available watermark carriers.
“Each watermark carrier may use its assigned digits as the probability of performing some mutation to each value it processes. Reprocessing of the resultant output file and observing how often the mutation occurs allows deduction of the probability with which it was applied, and hence enables reconstruction of the watermark.
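This probabilistic carrier can be illustrated with a minimal Python sketch (an editorial illustration, not the patented implementation: the token list, the choice of upper-casing as the “mutation”, and the fixed seed are all assumptions):

```python
import random

def embed_digit(tokens, digit, rng=None):
    # The carrier's assigned digit (0-9) is used as the probability
    # digit/10 of mutating each value it processes; the mutation here
    # is simply upper-casing the replacement token.
    rng = rng or random.Random(1)
    p = digit / 10
    return [t.upper() if rng.random() < p else t.lower() for t in tokens]

def recover_digit(tokens):
    # Reprocess the output file and observe how often the mutation
    # occurred, which deduces the probability with which it was applied.
    frac = sum(t.isupper() for t in tokens) / len(tokens)
    return round(frac * 10)
```

With enough values the observed mutation frequency converges on the applied probability, so the digit survives sampling or reordering of individual rows.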
“An audit trail of who authorized data access to which user, and for what purpose, is encoded within the watermark and/or stored in the registry, with the key to that record encoded within the watermark.
“For each watermarked data release, the registry may also store details including one or more of: the source data location, schema and description, the policy and techniques applied to create the anonymised copy, the level of sensitivity of the source and anonymised data, the name and contact details of the user or group of users approved to use the anonymised data, the name and contact details of the approver, and the purpose and duration for which the data is to be used.
“The watermark may be encoded into each row of a file, so that the removal or modification or addition of individual rows in the output has negligible effect on the ability to reconstruct the watermark.
“Mutations may be applied to each cell or row individually, without any knowledge of what mutations will be applied to other rows, to allow the watermark to be applied to the data in a distributed, streaming fashion.
“The watermark may be encoded by altering the frequency distribution of bits or digits in the anonymised data (we give specific examples below).
“Components of the watermark may be encoded at the row level. Where the watermark requires row removal, then the watermark digits define a band of the hash number space and if the watermark component is an N digit decimal number D with a first digit that is >0, then this range is given by [H.sub.L+((H.sub.U-H.sub.L)*(D-1)*10.sup.-N), H.sub.L+((H.sub.U-H.sub.L)*D*10.sup.-N)), where H.sub.L and H.sub.U are the lower and upper bounds of the hash number space.
“The digits can then be reconstructed from the output file by hashing each row and building up a histogram of hash frequency, where each bin has width 10.sup.-N of the hash number space so that the bin that contains no values reveals the digits for this watermark carrier.
“Where the watermark requires row addition, then N watermark digits define a slice of the hash space and synthetic data is then generated that hashes to a number within this range. The digits can be reconstructed from the output file by hashing each row and building up a histogram of hash frequency so that a bin that is overrepresented in this histogram reveals the digits for this watermark carrier.
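The row-removal carrier can be sketched in Python (an illustration rather than the patent's implementation: the md5-based row hash, string serialisation of rows, and N=2 are assumptions, with H.sub.L=0 and H.sub.U=1):

```python
import hashlib

N = 2  # the watermark component D is an N-digit number, first digit > 0

def row_hash(row):
    # Deterministically map a serialised row into the unit hash space [0, 1).
    return int(hashlib.md5(row.encode("utf-8")).hexdigest(), 16) / 16**32

def embed_by_removal(rows, D):
    # Suppress every row whose hash falls in the band
    # [(D-1)*10^-N, D*10^-N) of the hash number space.
    lo, hi = (D - 1) / 10**N, D / 10**N
    return [r for r in rows if not (lo <= row_hash(r) < hi)]

def recover_component(rows):
    # Histogram the row hashes into 10^N bins of width 10^-N; with
    # enough rows the only empty bin is the suppressed band.
    counts = [0] * 10**N
    for r in rows:
        counts[min(int(row_hash(r) * 10**N), 10**N - 1)] += 1
    for i, c in enumerate(counts):
        if c == 0:
            return i + 1  # empty bin index i reveals component D = i + 1
    return None
```

The row-addition variant is the mirror image: synthetic rows are generated so that they hash into the chosen band, and the reader looks for the overrepresented bin rather than the empty one.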
“The watermark carriers may be at the cell level and depend on whether the cell data type is for numeric values or tokenised values. Where the cell data type is for numeric values, N*M digits of the watermark are used to mutate the N least significant bits of each value, with a precision of M, and for each of the N watermark digits, the digit is divided by 10.sup.M to derive the probability that this bit will be set in one of the values, and values in the cell are then mutated, setting this bit with the required probability. When reading the file back, the process is to stream through the values and derive the probability of zero for each of the N carrier bits, to a precision of M, to reveal the N*M original digits. N is chosen depending on the range of the numeric values to constrain value distortion to an acceptable range, and M should be chosen based on the data volume.
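The numeric cell carrier can likewise be sketched (a simplified illustration: here `components` holds the N groups of M watermark digits, and the seeded generator is an assumption):

```python
import random

def embed_numeric(values, components, M, rng=None):
    # components: N integers of up to M digits each; bit i of every value
    # is set with probability components[i] / 10^M (and cleared otherwise).
    rng = rng or random.Random(7)
    out = []
    for v in values:
        for i, comp in enumerate(components):
            bit = 1 if rng.random() < comp / 10**M else 0
            v = (v & ~(1 << i)) | (bit << i)
        out.append(v)
    return out

def recover_components(values, N, M):
    # Stream through the values and estimate, to a precision of M digits,
    # the probability that each of the N carrier bits is set.
    comps = []
    for i in range(N):
        frac = sum((v >> i) & 1 for v in values) / len(values)
        comps.append(round(frac * 10**M))
    return comps
```

Mutating the N least-significant bits distorts each value by at most 2^N-1, which is why N is bounded by the acceptable distortion, while M is bounded by data volume: the frequency estimate only gains a digit of precision with roughly a hundredfold more rows.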
“Where the cell data type is for tokenised values, tokenised cell values are generated consistent with some regular expression and analysis of this regular expression gives a lexicographically ordered list of all possible output tokens. The watermark component may be an N digit decimal number D with a first digit that is >0, which is used to exclude any output tokens that have an ordinal that is divisible by D, and to reconstruct the watermark digits, the process is to create a histogram with a bin for each number from 10.sup.N-1 to (10.sup.N)-1 and for each token the process is to increment the bin count for all of the token ordinal’s factors, so that the lowest ordinal bin with a zero count or small count reveals the watermark digits. N is set with regard to the number of unique input values that require tokenisation, so that a greater volume of data requires a larger value of N.
“There is no requirement to exactly reconstitute the watermark but merely to be able to perform a fuzzy match of a calculated value to the distinct possibilities recorded in a watermark registry that stores watermarks. Using the digital watermark enables detection and attribution of unauthorized distribution or publishing of data.
“The processes described above may also include non-destructive watermarking techniques, such as reordering records or data fields within the dataset.
“The processes described above are implemented using one or more computer processors.
“At least some of the original data defines private medical or health data; private banking or financial data; private communications data; human resources or payroll data; retail or e-commerce data; or government records, including records relating to one or more of: taxation; health insurance; mortgages; pensions; benefits; education; health.
“Another aspect of the invention is a computing device or computing system programmed to implement the process defined above.”
The claims supplied by the inventors are:
“1. A computer-implemented process of altering original data in a dataset, comprising the steps of (a) anonymising the original data, in which anonymizing the original data is achieved using a non-hashing algorithm; and (b) including a digital watermark in the anonymised data to generate an altered dataset, the digital watermark being taken from a source that is extrinsic to the dataset, and (c) providing the altered dataset.
“2. The process of claim 1, in which the step of anonymising the original data incurs information loss, and the further step of including the digital watermark does not add significant further information loss.
“3. The process of claim 1, in which the digital watermark operates on a probabilistic basis.
“4. The process of claim 1 in which the watermark is included in original raw data that has been anonymised, as opposed to metadata or redundant data, and the original raw data is data such as a tabular file, a relational or a non-relational database, or the results of interactive database queries.
“5. (canceled)
“6. The process of claim 1 in which anonymising the data is achieved using one or more techniques that perturb the original data, such as tokenization, and/or generalization and/or data blurring and/or insertion of synthetic records; and/or reordering records or data fields within the dataset.
“7. The process of claim 6 in which the technique that perturbs the original data is tokenization, and the watermarking is incorporated by extending this tokenisation to generate or select replacement values according to a key or containing a hidden pattern; a function of other fields in the same record; or to use some unique token values in each data release which only ever appear in that release and so uniquely identify it.
“8. (canceled)
“9. The process of claim 6 in which the technique that perturbs the original data is generalization, and the watermarking is incorporated through the choice of how to generate the replacement for a raw value; or the distribution of the populations of the unique groups; or the choices of group boundaries; or selecting group members to create patterns of data in other variables in the dataset for those individuals within a group.
“10. (canceled)
“11. The process of claim 6 in which the technique that perturbs the original data is data blurring, and the watermarking is incorporated by perturbing the data in such a way as to include a pattern within the perturbed values, or to generate the offset values according to an algorithm or secret key.
“12. (canceled)
“13. The process of claim 6 in which the technique that perturbs the original data is insertion of synthetic records, and watermarking is incorporated by generating the data in these synthetic records according to a pattern or digital key.
“14. (canceled)
“15. The process of claim 6 in which the technique that perturbs the original data is the removal of records, and watermarking is incorporated by the choice and recording of which records to include and suppress.
“16. (canceled)
“17. The process of claim 6 in which the technique that perturbs the original data is reordering records or data fields within the dataset.
“18. The process of claim 1 in which the watermark to be encoded into the anonymised data is a number or other ID (collectively a ‘number’) which is stored in a watermark registry or is a number that is related or mapped to another number which is stored in the watermark registry, and in which the number stored in the watermark registry is: a random number; a random decimal number; a non-random number; an e-mail address; a unique ID associated with a person; a unique text string; or any data string mapped or related to the foregoing.
“19-20. (canceled)
“21. The process of claim 18 in which the length of the number is determined by the number and size of the file’s available watermark carrier, in which a watermark carrier is the application of one or more of the following perturbation techniques: tokenization; generalization; data blurring; insertion of synthetic records; reordering records or data fields within the dataset.
“22. The process of claim 18 in which each watermark carrier uses its assigned digits as the probability of performing some mutation to each value it processes.
“23. The process of claim 18 in which reprocessing of the resultant output file and observing how often the mutation occurs allows deduction of the probability with which it was applied, and hence enables reconstruction of the watermark.
“24. The process of claim 18 in which an audit trail of who authorized data access to which user, and for what purpose is encoded within the watermark or stored in the registry, with the key to that record encoded within the watermark.
“25. The process of claim 18 in which, for each watermarked data release, the registry also stores details including one or more of: the source data location, schema and description, the policy and techniques applied to create the anonymised copy, the level of sensitivity of the source and anonymised data, the name and contact details of the user or group of users approved to use the anonymised data, the name and contact details of the approver, and the purpose and duration for which the data is to be used.
“26. The process of claim 1 in which the watermark is encoded at the row level of a file, so that the removal or modification or addition of individual rows in the output has negligible effect on the ability to reconstruct the watermark.
“27. The process of claim 1 in which mutations are applied to each cell or row of a file individually, without any knowledge of what mutations will be applied to other rows, to allow the watermark to be applied to the data in a distributed, streaming fashion.
“28. The process of claim 1 in which the watermark is encoded by altering the frequency distribution of bits or digits in the anonymised data.
“29. The process of claim 15 in which the watermark requires row removal, then the watermark digits define a band of the hash number space and if the watermark component is an N digit decimal number D with a first digit that is >0, then this range is given by [H.sub.L+((H.sub.U-H.sub.L)*(D-1)*10.sup.-N), H.sub.L+((H.sub.U-H.sub.L)*D*10.sup.-N)), where H.sub.L and H.sub.U are the lower and upper bounds of the hash number space; and in which the digits can then be reconstructed from the output file by hashing each row and building up a histogram of hash frequency, where each bin has width 10.sup.-N of the hash number space so that the bin that contains no values reveals the digits for this watermark carrier.
“30-31. (canceled)
“32. The process of claim 1 in which, where the watermark requires row addition, then N watermark digits define a slice of the hash space and synthetic data is then generated that hashes to a number within this range and in which the digits can be reconstructed from the output file by hashing each row and building up a histogram of hash frequency so that a bin that is overrepresented in this histogram reveals the digits for this watermark carrier.
“33. (canceled)
“34. The process of claim 1 in which the watermark carriers are at the cell level and depend on whether the cell data type is for numeric values or tokenised values, and in which, where the cell data type is for numeric values, N*M digits of the watermark are used to mutate the N least significant bits of each value, with a precision of M, and for each of the N watermark digits, the digit is divided by 10.sup.M to derive the probability that this bit will be set in one of the values, and values in the cell are then mutated, setting this bit with the required probability; and when reading the file back, the process is to stream through the values and derive the probability of zero for each of the N carrier bits, to a precision of M, to reveal the N*M original digits.
“35-36. (canceled)
“37. The process of claim 34 in which N is chosen depending on the range of the numeric values to constrain value distortion to an acceptable range.
“38. The process of claim 34, in which, where the cell data type is for tokenised values, tokenised cell values are generated consistent with some regular expression and analysis of this regular expression gives a lexicographically ordered list of all possible output tokens.
“39. The process of claim 34 in which the watermark component is an N digit decimal number D with a first digit that is >0, which is used to exclude any output tokens that have an ordinal that is divisible by D, and to reconstruct the watermark digits, the process is to create a histogram with a bin for each number from 10.sup.N-1 to (10.sup.N)-1 and for each token the process is to increment the bin count for all of the token ordinal’s factors, so that the lowest ordinal bin with a zero count or small count reveals the watermark digits.
“40. The process of claim 38 in which N is set with regard to the number of unique input values that require tokenisation, so that a greater volume of data requires a larger value of N.
“41. The process of claim 1 in which there is no requirement to exactly reconstitute the watermark but merely to be able to perform a fuzzy match of a calculated value to the distinct possibilities recorded in a watermark registry that stores watermarks.
“42. The process of claim 1 in which at least some of the original data defines private medical or health data; or private banking or financial data; or private communications data; or human resources or payroll data; or retail or e-commerce data; or government records, including records relating to one or more of: taxation; health insurance; mortgages; pensions; benefits; education; health.
“43-50. (canceled)
“51. A computing device or computing system programmed to alter original data in a dataset, the device or system being configured to: (a) anonymise the original data by using a non-hashing algorithm; (b) include a digital watermark in the anonymized data to generate an altered dataset, the digital watermark being taken from a source that is extrinsic to the dataset; and © provide the altered dataset.
“52. The process of claim 34 in which M is chosen based on the data volume.”
For the URL and more information on this patent application, see: MCFALL, Jason; MELLOR, Paul. Digital Watermarking Without Significant Information Loss In Anonymized Datasets. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)