Researchers Submit Patent Application, “Mapping Of Personally-Identifiable Information To A Person Based On Traversal Of A Graph”, for Approval (USPTO 20220083601): Box Inc.
2022 APR 05 (NewsRx) -- By a
The patent’s assignee is
News editors obtained the following quote from the background information supplied by the inventors: “Cloud-based content management services and systems have impacted the way personal and enterprise computer-readable content objects (e.g., files, documents, spreadsheets, images, programming code files, etc.) are stored, and have also impacted the way such personal and enterprise content objects are shared and managed. Content management systems provide the ability to securely share large volumes of content objects among trusted users (e.g., collaborators) on a variety of user devices, such as mobile phones, tablets, laptop computers, desktop computers, and/or other devices. Modern content management systems can host many thousands or, in some cases, millions of files for a particular enterprise that are shared by hundreds or thousands of users. To further promote collaboration over the users and content objects, content management systems often provide various user communication tools, such as instant messaging or “chat” services. These communications may also be saved to create additional content objects managed by the systems.
“The foregoing content objects managed by the content management systems may include personally identifiable information (PII). The PII may be included in some content objects (e.g., social security numbers in tax forms) or may be extemporaneously embedded in other content objects (e.g., a contact phone number entered in a chat conversation). In many cases, neither the person nor even the candidate persons that are potentially associated with the PII in the content objects are necessarily known a priori. For example, the person associated with an instance of PII in a content object may or may not be a user of the content management system that manages the content object. Even with this as a backdrop, stewards of large volumes of electronic or computer-readable content objects (e.g., content management systems) must comply with the various laws, regulations, guidelines, and other types of governance that have been established to monitor and control the use and dissemination of personally identifiable information (PII) contained in the content objects.
“In the United States, for example, the federal statute known as the Security Rule of the Health Insurance Portability and Accountability Act (HIPAA) was established to protect a patient’s PII while still allowing digital health ecosystem participants access to needed protected health information (PHI). As another example, the California Consumer Privacy Act (CCPA) is a state statute intended to enhance privacy rights and consumer protection to
“Unfortunately, there are no known techniques for identifying and controlling personally identifiable information embedded in large volumes of content objects. While certain approaches exist for identifying instances of PII in content objects, such approaches are limited in their ability to correlate that PII to specific people. Specifically, when the context surrounding an instance of PII does not explicitly identify a person associated with the PII, existing approaches are deficient in determining, with an acceptable level of confidence, who owns or is associated with the PII. What is needed are ways to confidently and securely associate a particular instance of PII to a particular person. Furthermore, what is needed are techniques that address ongoing management of personally identifiable information that is embedded across arbitrary corpora of content objects.”
As a supplement to the background information on this patent application, NewsRx correspondents also obtained the inventor’s summary information for this patent application: “This summary is provided to introduce a selection of concepts that are further described elsewhere in the written description and in the figures. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter. Moreover, the individual embodiments of this disclosure each have several innovative aspects, no single one of which is solely responsible for any particular desirable attribute or end result.
“Disclosed herein are various techniques for determining that a particular set of personally identifiable information (PII) belongs to a subject entity, even when the PII is not explicitly or directly associated with the subject entity’s name. Various of the disclosed techniques serve to identify aliases that are deemed to, at least potentially, refer to the subject entity. The PII of the aliases that are deemed to be aliases of the subject entity can thus be deemed to be PII of the subject entity.
“Identification of aliases of a subject entity can be carried out by processing a corpus of content objects to (1) identify a first set of personally identifiable information associated with a name or alias and (2) to identify a second set of personally identifiable information associated with another name or alias. The first set of personally identifiable information associated with the name or alias is codified in a first portion of a graph. Similarly, the second set of personally identifiable information is codified in a second portion of the graph. Upon a determination that the identified names and/or aliases refer to the same person, then the first portion of the graph and the second portion of the graph are deemed to be associated with each other. Since those two portions of the graph refer to the same person (e.g., the subject entity), then the graph can be queried and traversed so as to recognize that both the first set of personally identifiable information as well as the second set of personally identifiable information belong to the same person.
“As can be seen, the second set of personally identifiable information can be deemed to be PII of the subject entity, even though the information that is used to form the second portion of the graph does not explicitly identify by name the person associated with the PII. In some embodiments, the determination that a first identified name or alias and a second identified alias refer to the same person can be made on the basis that the first identified name or alias and the second identified alias share PII in common. For example, given the phrase, “Johnathan Smith has a social security number of 123-45-6789”, and given the phrase, “John’s social security number is 123-45-6789”, then “Johnathan Smith” and that occurrence of the alias “John’s” can be deemed to refer to the same person.
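The shared-PII idea in the summary can be illustrated with a minimal Python sketch. This is an editorial illustration, not code from the patent application: the regular expressions, the node naming scheme, and the helper names `build_graph` and `connected` are all assumptions, and the alias extractor is deliberately naive. Each alias occurrence and each PII value becomes a node, and two aliases are treated as the same person when a traversal connects them through a common PII node.

```python
import re
from collections import defaultdict

# Assumed pattern for US social security numbers (illustrative only).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Naive, hypothetical alias extractor: the leading run of capitalized words.
ALIAS = re.compile(r"^([A-Z][a-z]+(?: [A-Z][a-z]+)*)")


def build_graph(phrases):
    """Link each alias occurrence to the PII values found in the same phrase."""
    graph = defaultdict(set)
    for position, phrase in enumerate(phrases):
        match = ALIAS.match(phrase)
        if not match:
            continue
        # Alias nodes are per occurrence, so two unrelated "John"s stay separate.
        alias = ("alias", position, match.group(1))
        for value in SSN.findall(phrase):
            pii = ("ssn", value)
            graph[alias].add(pii)
            graph[pii].add(alias)
    return graph


def connected(graph, start):
    """Return every node reachable from start (depth-first traversal)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


phrases = [
    "Johnathan Smith has a social security number of 123-45-6789",
    "John’s social security number is 123-45-6789",
    "Mary Jones has a social security number of 987-65-4321",
]
graph = build_graph(phrases)
same_person = connected(graph, ("alias", 0, "Johnathan Smith"))
```

Here `same_person` contains the node for that occurrence of "John" because both phrases share the same social security number, while the "Mary Jones" node falls in a separate component.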
“In some embodiments, the determination that a name and an alias refer to the same person can be made on the basis of linguistic analysis (e.g., by identifying and analyzing pronominal anaphoric references) to determine that the alias is referring to the same person who is identified by name.
“Some of the techniques used in the disclosed systems, methods, and computer program products for mapping personally-identifiable information to a person using natural language coreference resolution rely on natural language processing techniques that advance the relevant technologies over legacy approaches.
“The disclosed embodiments modify and improve over legacy approaches. In particular, the herein-disclosed techniques provide technical solutions that address the technical problems attendant to managing personally identifiable information pertaining to known and unknown persons that is embedded across arbitrary corpora of content objects. Such technical solutions involve specific implementations (i.e., data organization, data communication paths, module-to-module interrelationships, etc.) that relate to the software arts.
“The embodiments disclosed herein improve the way a computer stores and retrieves data through implementation of specific graph-oriented data structures. These specific graph-oriented data structures overcome challenges with processing large corpora of content objects. To illustrate, to identify a named person and that named person’s alias that may be disparately distributed over a large corpus of content objects (e.g., terabytes or more) would require a virtual or real memory space to contain all of the content of the content objects, or would require making multiple passes over the corpora of content objects. In contrast, when the specific graph-oriented data structures as disclosed herein are employed, the amount of virtual or real memory needed is reduced by an order of magnitude.
“Still further, when forming these aforementioned graph-oriented data structures, an index can be constructed such that graph nodes that correspond to all names, and/or suspected names and all aliases and/or suspected aliases, and/or all PII entries are identified directly in the index. As such, and based on a given name or alias that is used to query over the index, the graph can then be accessed directly, starting from a particular node as determined from the results of the query over the index. As such, the only portions of the graph that need to be traversed are those portions that are directly or indirectly connected to the identified node. This results in a decrease in the amount of memory needed, a decrease in the amount of computer processor cycles needed and, in some cases results in a decrease in network bandwidth demanded when processing large graph-oriented data structures.
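The index-then-traverse pattern described in the preceding paragraph might look roughly like the following Python sketch. It is an editorial illustration under stated assumptions, not the applicant's implementation: the class `EntityGraph`, its methods, and the sample data are hypothetical. A text index maps each name, alias, or PII entry to its node, and a query visits only the nodes connected to the indexed starting point.

```python
from collections import deque


class EntityGraph:
    """Undirected graph of name/alias/PII nodes plus a text index (hypothetical)."""

    def __init__(self):
        self.labels = []   # node id -> (kind, text)
        self.adj = []      # node id -> set of neighbouring node ids
        self.index = {}    # lowercased text -> set of node ids

    def add_node(self, kind, text):
        node_id = len(self.labels)
        self.labels.append((kind, text))
        self.adj.append(set())
        self.index.setdefault(text.lower(), set()).add(node_id)
        return node_id

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def reachable(self, query):
        """Look the query up in the index, then visit only connected nodes."""
        starts = self.index.get(query.lower(), set())
        seen = set(starts)
        queue = deque(starts)
        while queue:
            node = queue.popleft()
            for neighbour in self.adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen

    def pii_for(self, query):
        """Return the sorted PII values connected to the queried name or alias."""
        found = (self.labels[n] for n in self.reachable(query))
        return sorted(text for kind, text in found if kind == "pii")


graph = EntityGraph()
smith = graph.add_node("name", "Johnathan Smith")
john = graph.add_node("alias", "John")
ssn = graph.add_node("pii", "123-45-6789")
phone = graph.add_node("pii", "555-123-4567")
other = graph.add_node("name", "Mary Jones")
other_ssn = graph.add_node("pii", "987-65-4321")
graph.add_edge(smith, ssn)
graph.add_edge(john, ssn)
graph.add_edge(john, phone)
graph.add_edge(other, other_ssn)
```

Calling `graph.pii_for("Johnathan Smith")` in this sketch touches four of the six nodes; the "Mary Jones" component is never visited, which is the kind of saving the paragraph attributes to starting the traversal from an indexed node.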
“The ordered combination of steps of the embodiments serves in the context of practical applications that perform steps for mapping persons to respective PII discovered over a corpus of content objects by automatically correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information. These techniques for mapping persons to respective PII discovered over a corpus of content objects by correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information overcome long-standing yet heretofore unsolved technological problems associated with managing personally identifiable information that is embedded across arbitrary corpora of computer-accessible content objects.
“The herein-disclosed embodiments are technological solutions pertaining to technological problems that arise in the hardware and software arts that underlie data storage facilities. Aspects of the present disclosure achieve performance and other improvements in peripheral technical fields including, but not limited to computer-implemented governance of data privacy.
“Some embodiments include a sequence of instructions that are stored on a non-transitory computer readable medium. Such a sequence of instructions, when stored in memory and executed by one or more processors, cause the one or more processors to perform a set of acts for mapping persons to respective PII discovered over a corpus of content objects by correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information.
“Some embodiments include the aforementioned sequence of instructions that are stored in a memory, which memory is interfaced to one or more processors such that the one or more processors can execute the sequence of instructions to cause the one or more processors to implement acts for mapping persons to respective PII discovered over a corpus of content objects by correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information.
“In various embodiments, any combinations of any of the above can be combined to perform any variation of acts for coreference-resolved mapping of persons to personally identifiable information, and many such combinations of aspects of the above elements are contemplated.
“Further details of aspects, objectives and advantages of the technological embodiments are described herein, and in the figures and claims.”
The claims supplied by the inventors are:
“1. A method for associating personally identifiable information to an entity from a plurality of content objects, the method comprising: performing an analysis of the plurality of content objects; forming, based on the analysis, an entity graph by: identifying, from a first content object, first PII and its first associated entity; identifying, from a second content object, second PII and its second associated entity; and processing the entity graph to determine that the first associated entity and the second associated entity refer to the same entity based on the first PII and the second PII being in common between the first associated entity and the second associated entity.
“2. The method of claim 1, wherein determining that the first and second entities refer to the same person is carried out by accessing a database or directory that contains entries for known persons.
“3. The method of claim 1, further comprising: querying the entity graph to identify at least one of, the personally identifiable information, or at least one of the first entity or the second entity, or at least one of the plurality of content objects that contains the personally identifiable information.
“4. The method of claim 3, wherein at least one result that is produced by querying the entity graph is presented to facilitate compliance with at least one of, at least one PII detection rule, or at least one PII preference.
“5. The method of claim 1, wherein the same item of personally identifiable information is identified, based at least in part on at least one of, a set of PII detection rules, or a classification model.
“6. The method of claim 1, further comprising performing linguistic analysis to determine that the first associated entity and the second associated entity refer to the same person.
“7. The method of claim 6, wherein the linguistic analysis comprises natural language processing of words and phrases that are in proximity to the personally identifiable information.
“8. The method of claim 7, wherein the linguistic analysis comprises decomposing passages to identify compound nominative clauses, possessive constructions, subject-attribute relationships in words and phrases that are in proximity to the personally identifiable information.
“9. The method of claim 7, wherein at least one of, a likelihood value, a score, or a source of context is assigned to either the first associated entity or the second associated entity.
“10. The method of claim 1, wherein a first amount of memory needed to contain the entity graph is smaller than a second amount of memory needed to contain the plurality of content objects.
“11. A non-transitory computer readable medium having stored thereon a sequence of instructions which, when stored in memory and executed by one or more processors causes the one or more processors to perform a set of acts for associating personally identifiable information to an entity from a plurality of content objects, the set of acts comprising: performing an analysis of the plurality of content objects; forming, based on the analysis, an entity graph by: identifying, from a first content object, first PII and its first associated entity; identifying, from a second content object, second PII and its second associated entity; and processing the entity graph to determine that the first associated entity and the second associated entity refer to the same entity based on the first PII and the second PII being in common between the first associated entity and the second associated entity.
“12. The non-transitory computer readable medium of claim 11, wherein determining that the first and second entities refer to the same person is carried out by accessing a database or directory that contains entries for known persons.
“13. The non-transitory computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of: querying the entity graph to identify at least one of, the personally identifiable information, or at least one of the first entity or the second entity, or at least one of the plurality of content objects that contains the personally identifiable information.
“14. The non-transitory computer readable medium of claim 13, wherein at least one result that is produced by querying the entity graph is presented to facilitate compliance with at least one of, at least one PII detection rule, or at least one PII preference.
“15. The non-transitory computer readable medium of claim 11, wherein the same item of personally identifiable information is identified, based at least in part on at least one of, a set of PII detection rules, or a classification model.
“16. The non-transitory computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of performing linguistic analysis to determine that the first associated entity and the second associated entity refer to the same person.
“17. The non-transitory computer readable medium of claim 16, wherein the linguistic analysis comprises natural language processing of words and phrases that are in proximity to the personally identifiable information.
“18. The non-transitory computer readable medium of claim 17, wherein the linguistic analysis comprises decomposing passages to identify compound nominative clauses, possessive constructions, subject-attribute relationships in words and phrases that are in proximity to the personally identifiable information.
“19. A system for associating personally identifiable information to an entity from a plurality of content objects, the system comprising: a storage medium having stored thereon a sequence of instructions; and one or more processors that execute the sequence of instructions to cause the one or more processors to perform a set of acts, the set of acts comprising, performing an analysis of the plurality of content objects; forming, based on the analysis, an entity graph by: identifying, from a first content object, first PII and its first associated entity; identifying, from a second content object, second PII and its second associated entity; and processing the entity graph to determine that the first associated entity and the second associated entity refer to the same entity based on the first PII and the second PII being in common between the first associated entity and the second associated entity.
“20. The system of claim 19, wherein determining that the first and second entities refer to the same person is carried out by accessing a database or directory that contains entries for known persons.”
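Claims 5 and 9 refer to PII detection rules and to a likelihood value, score, or source of context. The Python sketch below is one possible reading of those terms, offered as an editorial illustration only: the rule table `RULES`, the score values, the `PiiHit` record, and the function `detect_pii` are assumptions and do not come from the patent application.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiHit:
    kind: str      # which rule matched
    value: str     # the matched text
    score: float   # assumed confidence attached to the rule
    source: str    # name of the content object the hit came from


# Hypothetical detection rules: (kind, pattern, assumed score).
RULES = [
    ("ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.9),
    ("phone", re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.6),
]


def detect_pii(content_objects):
    """Apply every rule to every content object and record scored hits."""
    hits = []
    for name, text in content_objects.items():
        for kind, pattern, score in RULES:
            for match in pattern.finditer(text):
                hits.append(PiiHit(kind, match.group(), score, name))
    return hits


documents = {
    "tax_form.txt": "Social security number: 123-45-6789",
    "chat_log.txt": "You can reach me at 555-123-4567 after lunch",
}
hits = detect_pii(documents)
```

Each hit carries the content object it was found in, so a downstream graph builder could attach both the score and the source of context to the corresponding node.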
For additional information on this patent application, see: Ojha, Alok. Mapping Of Personally-Identifiable Information To A Person Based On Traversal Of A Graph. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)