Patent Issued for System and method for detecting and transferring sensitive data by inferring context in unstructured domains (USPTO 11810379): United Services Automobile Association
2023 NOV 24 (NewsRx) -- By a
The patent’s assignee for patent number 11810379 is
News editors obtained the following quote from the background information supplied by the inventors: “Sensitive Data Management (SDM) has become an important area of data management, with the near constant reporting of data breaches and inadvertent exposure of personal and financial information. The methods and technologies being used to counter these critical issues have not kept pace with the explosion of data housed and maintained in countless datacenters and cloud providers. The major problem with existing technologies is the over dependence on rudimentary pattern matching for identification of sensitive data. Relying exclusively on patterns alone creates the issue of very high false positive readings. This can lead to data saturation, where the volume of potential findings becomes so large and polluted with suspect false readings that the data in effect becomes useless. To avoid too many false positives, many systems use overly narrow patterns, thereby allowing true positives to pass undetected.
“There is a need in the art for a system and method that addresses the shortcomings discussed above.”
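The false-positive problem the inventors describe can be illustrated with a minimal sketch. The pattern, the sample sentences, and the names `SSN_LIKE` and `scan` are hypothetical and are not taken from the patent; the point is only that a pattern on its own cannot tell an identifier from any other string of the same shape.

```python
import re

# Hypothetical pattern: digits grouped 3-2-4, the shape of a US Social Security number.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan(text: str) -> list[str]:
    """Return every substring that matches the pattern, with no context check."""
    return SSN_LIKE.findall(text)


documents = [
    "Applicant SSN: 123-45-6789 on file.",
    "Replacement part number 987-65-4321 ships Friday.",
]

# Both strings are reported, although only the first is plausibly sensitive.
hits = [match for doc in documents for match in scan(doc)]
print(hits)  # ['123-45-6789', '987-65-4321']
```

Narrowing the pattern, for example by requiring the literal prefix "SSN:", would drop the part number but would also miss a real number written without that prefix, which is the trade-off the background describes.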
As a supplement to the background information on this patent, NewsRx correspondents also obtained the inventors’ summary information for this patent: “In one aspect, a method of detecting sensitive data in a set of documents includes a step of scanning the set of documents, using a pattern matching module, for data elements that have a predetermined data pattern. The method also includes steps of identifying a matching data element in at least one document of the set of documents, using a part of speech module to determine a part of speech label for the matching data element, and determining a part of speech score, where a value of the part of speech score depends on the part of speech label. The method also includes steps of determining a sensitive data score using at least the part of speech score, retrieving a threshold score, identifying the matching data element as comprising sensitive data if the sensitive data score is greater than or equal to the threshold score, and taking a mitigating action when the matching data element comprises sensitive data.
“In another aspect, a method of detecting sensitive data in a set of documents includes steps of scanning the set of documents, using a pattern matching module, for data elements that have a predetermined data pattern, identifying a matching data element in at least one document of the set of documents, and retrieving at least one key term, where the at least one key term is a term associated with documents including sensitive data. The method also includes steps of using a vector space model to represent the matching data element as a first vector and to represent the at least one key term as a second vector, calculating a difference between the first vector and the second vector, determining a vector similarity score using the calculated difference between the first vector and the second vector, determining a sensitive data score using at least the vector similarity score, retrieving a threshold score, identifying the matching data element as comprising sensitive data if the sensitive data score is greater than or equal to the threshold score, and taking a mitigating action when the matching data element comprises sensitive data.
“In another aspect, a method of detecting sensitive data in a set of documents includes steps of scanning the set of documents, using a pattern matching module, for data elements that have a predetermined data pattern, identifying a matching data element in at least one document of the set of documents, using a part of speech module to determine a part of speech label for the matching data element, determining a part of speech score, wherein a value of the part of speech score depends on the part of speech label, and retrieving at least one key term, where the at least one key term is a term associated with documents including sensitive data. The method also includes steps of using a vector space model to represent the matching data element as a first vector and to represent the at least one key term as a second vector, calculating a difference between the first vector and the second vector, determining a vector similarity score using the calculated difference between the first vector and the second vector, determining a sensitive data score using the part of speech score and the vector similarity score, retrieving a threshold score, identifying the matching data element as comprising sensitive data if the sensitive data score is greater than or equal to the threshold score, and taking a mitigating action when the matching data element comprises sensitive data.
“Other systems, methods, features, and advantages of the disclosure will be, or will become, apparent to one of ordinary skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description and this summary, be within the scope of the disclosure, and be protected by the following claims.”
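The scoring logic in the summary, a part of speech score combined with a vector similarity score and compared against a threshold, might look roughly like the sketch below. The numeric values, the `Match` class, and the function names are assumptions made for illustration. The summary does not say how the two scores are combined; summing them is one option, and the sketch assumes the part of speech label and the vector similarity score have already been produced by other components.

```python
from dataclasses import dataclass

# Hypothetical values chosen for illustration; the patent text quoted here gives none.
POS_SCORE_BY_LABEL = {"NOUN": 0.5}  # any other label scores 0.0
THRESHOLD = 0.75


@dataclass
class Match:
    text: str
    pos_label: str            # assumed output of a part-of-speech tagger
    vector_similarity: float  # assumed output of a vector space model


def part_of_speech_score(label: str) -> float:
    """Score depends only on the label; unknown labels contribute nothing."""
    return POS_SCORE_BY_LABEL.get(label, 0.0)


def sensitive_data_score(match: Match) -> float:
    """Combine the two scores by summing them (an assumed combination rule)."""
    return part_of_speech_score(match.pos_label) + match.vector_similarity


def is_sensitive(match: Match, threshold: float = THRESHOLD) -> bool:
    """Flag the match when its score is greater than or equal to the threshold."""
    return sensitive_data_score(match) >= threshold


matches = [
    Match("123-45-6789", "NOUN", 0.25),
    Match("987-65-4321", "VERB", 0.25),
]

# Stand-in for a mitigating action: collect the matches that would be acted on.
flagged = [m.text for m in matches if is_sensitive(m)]
print(flagged)  # ['123-45-6789']
```

In this toy run the two matches have the same similarity value, so the part of speech score alone decides which one reaches the threshold.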
The claims supplied by the inventors are:
“1. A computer implemented method for transferring sensitive data, comprising executing on a processor the steps of: receiving documents stored in a first data storage system; scanning the documents and identifying sensitive data within a document by: identifying a matching data element in the document; retrieving at least one key term, wherein the at least one key term is a term associated with documents including sensitive data; using a vector space model to represent the matching data element as a first vector and to represent the at least one key term as a second vector; calculating a difference between the first vector and the second vector; determining a vector similarity score using the calculated difference between the first vector and the second vector; determining a sensitive data score using at least the vector similarity score; retrieving a threshold score; identifying the matching data element as comprising sensitive data by determining that the sensitive data score is greater than or equal to the threshold score; transferring the document with the matching data element to a second data storage system that is at a lower risk for a data breach than the first data storage system; wherein calculating the difference between the first vector and the second vector includes taking a cosine distance between the first vector and the second vector; wherein the vector similarity score is equal to the cosine distance times a normalizing factor; and wherein the normalizing factor has a value of one half.
“2. The method according to claim 1, wherein the method includes calculating a part of speech score for the matching data element.
“3. The method according to claim 2, wherein determining the sensitive data score includes using the vector similarity score and the part of speech score.
“4. The method according to claim 3, wherein determining the sensitive data score includes summing the vector similarity score and the part of speech score.
“5. The method according to claim 2, wherein the method further includes determining a part of speech label for the matching data element, wherein the part of speech score is greater than zero when the part of speech label is associated with the subject of a sentence, and wherein the part of speech score is zero when the part of speech label is not associated with the subject of a sentence.
“6. The method according to claim 2, wherein the method further includes determining a part of speech label for the matching data element, wherein the part of speech score is greater than zero when the part of speech label is a noun, and wherein the part of speech score is zero when the part of speech label is not a noun.
“7. The method according to claim 1, wherein the method further includes alerting an owner of the document including the matching data element that the document includes sensitive data.
“8. The method according to claim 1, wherein the method further includes encrypting the document.
“9. The method according to claim 1, wherein the vector space model includes a plurality of vectors corresponding to a plurality of key terms, and wherein the vector similarity score uses the difference between the first vector and each of the plurality of vectors.
“10. The method according to claim 1, wherein the first vector and the second vector are determined by calculating a set of term frequency-inverse document frequency weights for a set of documents including the document.
“11. A system, comprising: a processor; and a non-transitory computer readable medium storing instructions executable by the processor to: receive documents stored in a first data storage system; scan the documents and identify sensitive data within a document by: identifying a matching data element in the document; retrieving at least one key term, wherein the at least one key term is a term associated with documents including sensitive data; using a vector space model to represent the matching data element as a first vector and to represent the at least one key term as a second vector; calculating a difference between the first vector and the second vector; determining a vector similarity score using the calculated difference between the first vector and the second vector; determining a sensitive data score using at least the vector similarity score; retrieving a threshold score; identifying the matching data element as comprising sensitive data by determining that the sensitive data score is greater than or equal to the threshold score; transfer the document with the matching data element to a second data storage system that is at a lower risk for a data breach than the first data storage system; wherein calculating the difference between the first vector and the second vector includes taking a cosine distance between the first vector and the second vector; wherein the vector similarity score is equal to the cosine distance times a normalizing factor; and wherein the normalizing factor has a value of one half.
“12. The system according to claim 11, wherein the instructions are further executable to calculate a part of speech score for the matching data element.
“13. The system according to claim 12, wherein the instructions are further executable to determine the sensitive data score using the vector similarity score and the part of speech score.
“14. The system according to claim 13, wherein the instructions are further executable to determine the sensitive data score by summing the vector similarity score and the part of speech score.
“15. The system according to claim 12, wherein the instructions are further executable to determine a part of speech label for the matching data element, wherein the part of speech score is greater than zero when the part of speech label is associated with the subject of a sentence, and wherein the part of speech score is zero when the part of speech label is not associated with the subject of a sentence.
“16. The system according to claim 12, wherein the instructions are further executable to determine a part of speech label for the matching data element, wherein the part of speech score is greater than zero when the part of speech label is a noun, and wherein the part of speech score is zero when the part of speech label is not a noun.
“17. The system according to claim 11, wherein the instructions are further executable to alert an owner of the document including the matching data element that the document includes sensitive data.
“18. The system according to claim 11, wherein the instructions are further executable to encrypt the document.
“19. The system according to claim 11, wherein the vector space model includes a plurality of vectors corresponding to a plurality of key terms, and wherein the vector similarity score uses the difference between the first vector and each of the plurality of vectors.
“20. The system according to claim 11, wherein the instructions are further executable to determine the first vector and the second vector by calculating a set of term frequency-inverse document frequency weights for a set of documents including the document.”
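The vector arithmetic recited in claims 1, 10, and 11 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: cosine distance is taken to be one minus cosine similarity (a common definition that the claims quoted here do not spell out), the TF-IDF weighting is the plain term-frequency times log-inverse-document-frequency form, and the token lists and function names are hypothetical.

```python
import math
from collections import Counter

NORMALIZING_FACTOR = 0.5  # "one half", as recited in claims 1 and 11


def tf_idf_vectors(texts: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by term frequency times log(N / document frequency)."""
    n = len(texts)
    doc_freq = Counter(term for tokens in texts for term in set(tokens))
    vectors = []
    for tokens in texts:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a: dict[str, float], b: dict[str, float]) -> float:
    """One minus cosine similarity (assumed definition of cosine distance)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as sharing nothing
    return 1.0 - dot / (norm_a * norm_b)


def vector_similarity_score(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine distance times the normalizing factor."""
    return cosine_distance(a, b) * NORMALIZING_FACTOR


# Hypothetical token lists: context of a matching element, a key-term context,
# and an unrelated document.
corpus = [
    ["ssn", "123-45-6789", "applicant"],
    ["ssn", "social", "security"],
    ["part", "number", "ships"],
]
element_vec, key_term_vec, unrelated_vec = tf_idf_vectors(corpus)

print(vector_similarity_score(element_vec, key_term_vec))
print(vector_similarity_score(element_vec, unrelated_vec))  # 0.5
```

Under this definition a larger value means the two vectors have less in common, and vectors with no shared terms score exactly one half. The claims quoted here do not say how that ordering relates to the threshold comparison, so the sketch stops at the score.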
For additional information on this patent, see: Vickers,
(Our reports deliver fact-based news of research and discoveries from around the world.)