Patent Application Titled “Mapping Of Personally-Identifiable Information To A Person Based On Natural Language Coreference Resolution” Published Online (USPTO 20220083604): Box Inc.
2022 MAR 31 (NewsRx) -- By a
The assignee for this patent application is
Reporters obtained the following quote from the background information supplied by the inventors: “Cloud-based content management services and systems have impacted the way personal and enterprise computer-readable content objects (e.g., files, documents, spreadsheets, images, programming code files, etc.) are stored, and have also impacted the way such personal and enterprise content objects are shared and managed. Content management systems provide the ability to securely share large volumes of content objects among trusted users (e.g., collaborators) on a variety of user devices, such as mobile phones, tablets, laptop computers, desktop computers, and/or other devices. Modern content management systems can host many thousands or, in some cases, millions of files for a particular enterprise that are shared by hundreds or thousands of users. To further promote collaboration over the users and content objects, content management systems often provide various user communication tools, such as instant messaging or “chat” services. These communications may also be saved to create additional content objects managed by the systems.
“The foregoing content objects managed by the content management systems may include personally identifiable information (PII). The PII may be included in some content objects (e.g., social security numbers in tax forms) or may be extemporaneously embedded in other content objects (e.g., a contact phone number entered in a chat conversation). In many cases, neither the person nor even the candidate persons that are potentially associated with the PII in the content objects are necessarily known a priori. For example, the person associated with an instance of PII in a content object may or may not be a user of the content management system that manages the content object. Even with this as a backdrop, stewards of large volumes of electronic or computer-readable content objects (e.g., content management systems) must comply with the various laws, regulations, guidelines, and other types of governance that have been established to monitor and control the use and dissemination of personally identifiable information (PII) contained in the content objects.
“In the United States, for example, the federal statutes known as the Security Rule of the Health Insurance Portability and Accountability Act (HIPAA) was established to protect a patient’s PII while still allowing digital health ecosystem participants access to needed protected health information (PHI). As another example, the California Consumer Privacy Act (CCPA) is a state statute intended to enhance privacy rights and consumer protection to
“Unfortunately, there are no known techniques for identifying and controlling personally identifiable information embedded in large volumes of content objects. While certain approaches exist for identifying instances of PII in content objects, such approaches are limited in their ability to correlate that PII to specific people. Specifically, when the context surrounding an instance of PII does not explicitly identify a person associated with the PII, existing approaches are deficient in determining, with an acceptable level of confidence, who owns or is associated with the PII. What is needed are ways to confidently and securely associate a particular instance of PII to a particular person. Furthermore, what is needed are techniques that address ongoing management of personally identifiable information that is embedded across arbitrary corpora of content objects.”
In addition to obtaining background information on this patent application, NewsRx editors also obtained the inventor’s summary information for this patent application: “This summary is provided to introduce a selection of concepts that are further described elsewhere in the written description and in the figures. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter. Moreover, the individual embodiments of this disclosure each have several innovative aspects, no single one of which is solely responsible for any particular desirable attribute or end result.
“Disclosed herein are various techniques for determining that a particular set of personally identifiable information (PII) belongs to a subject entity, even when the PII is not explicitly or directly associated with the subject entity’s name. Various of the disclosed techniques serve to identify aliases that are deemed to, at least potentially, refer to the subject entity. The PII of the aliases that are deemed to be aliases of the subject entity can thus be deemed to be PII of the subject entity.
“Identification of aliases of a subject entity can be carried out by processing a corpus of content objects to (1) identify a first set of personally identifiable information associated with a name or alias and (2) to identify a second set of personally identifiable information associated with another name or alias. The first set of personally identifiable information associated with the name or alias is codified in a first portion of a graph. Similarly, the second set of personally identifiable information is codified in a second portion of the graph. Upon a determination that the identified names and/or aliases refer to the same person, then the first portion of the graph and the second portion of the graph are deemed to be associated with each other. Since those two portions of the graph refer to the same person (e.g., the subject entity), then the graph can be queried and traversed so as to recognize that both the first set of personally identifiable information as well as the second set of personally identifiable information belong to the same person.
“As can be seen, the second set of personally identifiable information can be deemed to be PII of the subject entity, even though the information that is used to form the second portion of the graph does not explicitly identify by name the person associated with the PII. In some embodiments, the determination that a first identified name or alias and a second identified alias refer to the same person can be made on the basis that the first identified name or alias and the second identified alias share PII in common. For example, given the phrase, “Johnathan Smith has a social security number of 123-45-6789”, and given the phrase, “John’s social security number is 123-45-6789”, then “Johnathan Smith” and that occurrence of the alias “John’s” can be deemed to refer to the same person.
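The shared-PII idea described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `PiiGraph` class, its method names, and the phone number and email address are hypothetical, and only the two names and the social security number come from the quoted example. The sketch assumes that two aliases attached to the same PII node refer to the same person.

```python
from collections import defaultdict
from itertools import combinations


class PiiGraph:
    """Toy undirected graph of entity (name or alias) nodes and PII nodes."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def add_pii(self, alias, kind, value):
        # Connect an entity node to a PII node.
        self.add_edge(("entity", alias), ("pii", kind, value))

    def link_aliases_sharing_pii(self):
        # Assumption: entity nodes attached to the same PII node
        # are deemed to refer to the same person, so join them.
        for node, neighbors in list(self.edges.items()):
            if node[0] != "pii":
                continue
            entities = sorted(n for n in neighbors if n[0] == "entity")
            for a, b in combinations(entities, 2):
                self.add_edge(a, b)

    def pii_for(self, alias):
        # Walk entity-to-entity edges and collect every PII node reached.
        start = ("entity", alias)
        seen, stack, found = {start}, [start], set()
        while stack:
            node = stack.pop()
            for nxt in self.edges[node]:
                if nxt[0] == "pii":
                    found.add(nxt[1:])
                elif nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return found


graph = PiiGraph()
graph.add_pii("Johnathan Smith", "ssn", "123-45-6789")
graph.add_pii("Johnathan Smith", "phone", "555-0100")      # made-up value
graph.add_pii("John", "ssn", "123-45-6789")
graph.add_pii("John", "email", "john@example.com")         # made-up value
graph.link_aliases_sharing_pii()
print(graph.pii_for("Johnathan Smith"))
```

In this sketch, querying by the full name also returns the email address that was only ever recorded against the alias, because the two portions of the graph are joined by the added edge.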
“In some embodiments, the determination that a name and an alias refer to the same person can be made on the basis of linguistic analysis (e.g., by identifying and analyzing pronominal anaphoric references) to determine that the alias is referring to the same person who is identified by name.
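To give a rough sense of what resolving a pronominal reference involves, the Python sketch below uses a deliberately naive nearest-antecedent heuristic: each pronoun is mapped to the most recently mentioned known name. This is a stand-in for illustration only; the application does not say which linguistic analysis it uses, and real coreference resolution is considerably more involved. The function name, the pronoun list, and the sample sentences are assumptions.

```python
import re

# Hypothetical, deliberately small pronoun list (longer forms first).
PRONOUNS = ("their", "them", "they", "hers", "her", "him", "his", "she", "he")


def resolve_pronouns(sentences, known_names):
    """Map each pronoun to the most recently mentioned known name.

    Returns a list of (sentence_index, pronoun_as_written, name) tuples.
    """
    names = sorted(known_names, key=len, reverse=True)
    if not names:
        return []
    mention_re = re.compile(
        r"\b(?:(?P<name>" + "|".join(map(re.escape, names)) + r")"
        r"|(?P<pronoun>" + "|".join(PRONOUNS) + r"))\b",
        re.IGNORECASE,
    )
    resolved = []
    antecedent = None
    for index, sentence in enumerate(sentences):
        for match in mention_re.finditer(sentence):
            if match.group("name"):
                # A name mention becomes the current antecedent.
                antecedent = match.group("name")
            elif antecedent is not None:
                resolved.append((index, match.group("pronoun"), antecedent))
    return resolved


sentences = [
    "Johnathan Smith joined the call.",
    "His social security number is 123-45-6789.",
]
print(resolve_pronouns(sentences, ["Johnathan Smith"]))
```

The heuristic ignores gender, number, and syntax, so it would mis-resolve pronouns whenever several people are mentioned close together.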
“Some of the techniques used in the disclosed systems, methods, and computer program products for mapping personally-identifiable information to a person using natural language coreference resolution rely on natural language processing techniques that advance the relevant technologies over legacy approaches.
“The disclosed embodiments modify and improve over legacy approaches. In particular, the herein-disclosed techniques provide technical solutions that address the technical problems attendant to managing personally identifiable information pertaining to known and unknown persons that is embedded across arbitrary corpora of content objects. Such technical solutions involve specific implementations (i.e., data organization, data communication paths, module-to-module interrelationships, etc.) that relate to the software arts.
“The embodiments disclosed herein improve the way a computer stores and retrieves data through implementation of specific graph-oriented data structures. These specific graph-oriented data structures overcome challenges with processing large corpora of content objects. To illustrate, to identify a named person and that named person’s alias that may be disparately distributed over a large corpus of content objects (e.g., terabytes or more) would require a virtual or real memory space to contain all of the content of the content objects, or would require making multiple passes over the corpora of content objects. In contrast, when the specific graph-oriented data structures as disclosed herein are employed, the amount of virtual or real memory needed is reduced by an order of magnitude.
“Still further, when forming these aforementioned graph-oriented data structures, an index can be constructed such that graph nodes that correspond to all names, and/or suspected names and all aliases and/or suspected aliases, and/or all PII entries are identified directly in the index. As such, and based on a given name or alias that is used to query over the index, the graph can then be accessed directly, starting from a particular node as determined from the results of the query over the index. As such, the only portions of the graph that need to be traversed are those portions that are directly or indirectly connected to the identified node. This results in a decrease in the amount of memory needed, a decrease in the amount of computer processor cycles needed and, in some cases results in a decrease in network bandwidth demanded when processing large graph-oriented data structures.
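The index-then-traverse pattern described above might look like the following Python sketch. The node numbering, the adjacency layout, the function names, and the second (unrelated) person are assumptions made for illustration; the application does not specify a storage format.

```python
from collections import deque


def build_index(nodes):
    """Map each node label (name, alias, or PII value) to its node id."""
    return {label.lower(): node_id for node_id, label in nodes.items()}


def connected_nodes(adjacency, index, query):
    """Breadth-first walk that starts at the node the index returns.

    Only nodes directly or indirectly connected to that start node are
    visited; the rest of the graph is never touched.
    """
    start = index.get(query.lower())
    if start is None:
        return set()
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited


# Hypothetical graph: nodes 0-2 describe one person, nodes 3-4 another.
nodes = {0: "Johnathan Smith", 1: "John", 2: "123-45-6789",
         3: "Jane Doe", 4: "555-0100"}
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
index = build_index(nodes)
print(sorted(connected_nodes(adjacency, index, "john")))
```

Looking up "john" in the index yields node 1, and the walk from there reaches only nodes 0, 1, and 2; the nodes for the unrelated person are not visited.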
“The ordered combination of steps of the embodiments serve in the context of practical applications that perform steps for mapping persons to respective PII discovered over a corpus of content objects by automatically correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information. These techniques for mapping persons to respective PII discovered over a corpus of content objects by correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information overcome long standing yet heretofore unsolved technological problems associated with managing personally identifiable information that is embedded across arbitrary corpora of computer-accessible content objects.
“The herein-disclosed embodiments are technological solutions pertaining to technological problems that arise in the hardware and software arts that underlie data storage facilities. Aspects of the present disclosure achieve performance and other improvements in peripheral technical fields including, but not limited to computer-implemented governance of data privacy.
“Some embodiments include a sequence of instructions that are stored on a non-transitory computer readable medium. Such a sequence of instructions, when stored in memory and executed by one or more processors, cause the one or more processors to perform a set of acts for mapping persons to respective PII discovered over a corpus of content objects by correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information.
“Some embodiments include the aforementioned sequence of instructions that are stored in a memory, which memory is interfaced to one or more processors such that the one or more processors can execute the sequence of instructions to cause the one or more processors to implement acts for mapping persons to respective PII discovered over a corpus of content objects by correlating the PII to coreferenced entities contained in the context associated with the personally identifiable information.
“In various embodiments, any combinations of any of the above can be combined to perform any variation of acts for coreference-resolved mapping of persons to personally identifiable information, and many such combinations of aspects of the above elements are contemplated.
“Further details of aspects, objectives and advantages of the technological embodiments are described herein, and in the figures and claims.”
The claims supplied by the inventors are:
“1. A method for associating personally identifiable (PII) information to an entity and the entity’s aliases, the method comprising: processing one or more content objects to identify a first set of personally identifiable information associated with a first alias; further processing the one or more content objects to identify a second set of personally identifiable information associated with a second alias; determining that the first alias and the second alias refer to the same person; forming a first portion of a graph that connects first PII nodes that represent the first set of personally identifiable information to a first entity node that represents the first alias; forming a second portion of a graph that connects second PII nodes that represent the second set of personally identifiable information to a second entity node that represents the second alias; adding one or more edges that connect the first portion of the graph to the second portion of the graph; and processing queries to find personally identifiable information of the entity by traversing over at least one of the one or more edges that connect the first portion of the graph to the second portion of the graph.
“2. The method of claim 1, further comprising: querying the graph, with a person of interest’s name, wherein results of the query comprises (1) a first item of personally identifiable information of the person, and (2) a different item of personally identifiable information of the alias.
“3. The method of claim 1, further comprising: querying the graph, with a person of interest’s alias, wherein results of the query comprises (1) a first item of personally identifiable information corresponding to the person of interest’s alias, and (2) a different item of personally identifiable information of the person of interest.
“4. The method of claim 1, further comprising querying the graph to determine a subject content object that includes at least one of, (1) a first item of the first set of personally identifiable information of the person, and (2) a second item of the second set of personally identifiable information of the alias.
“5. The method of claim 4, further comprising modifying the subject content object to eliminate occurrences of (1) the first item of the first set of personally identifiable information of the person, and (2) the second item of the second set of personally identifiable information of the alias.
“6. The method of claim 1, wherein at least one item of the first set of personally identifiable information is identified, based at least in part on a PII detection rule.
“7. The method of claim 1, wherein at least one item of the first set of personally identifiable information is identified, based at least in part on identification of pronominal anaphoric references.
“8. The method of claim 7, wherein the pronominal anaphoric references are taken from context of a PII instance.
“9. The method of claim 1 wherein a database is accessed to determine that the first alias and the second alias refer to the same person.
“10. The method of claim 9 wherein the database is accessed using a lightweight directory access protocol (LDAP).
“11. A non-transitory computer readable medium having stored thereon a sequence of instructions which, when stored in memory and executed by one or more processors causes the one or more processors to perform a set of acts for associating personally identifiable (PII) information to an entity and the entity’s aliases, the set of acts comprising: processing one or more content objects to identify a first set of personally identifiable information associated with a first alias; further processing the one or more content objects to identify a second set of personally identifiable information associated with a second alias; determining that the first alias and the second alias refer to the same person; forming a first portion of a graph that connects first PII nodes that represent the first set of personally identifiable information to a first entity node that represents the first alias; forming a second portion of a graph that connects second PII nodes that represent the second set of personally identifiable information to a second entity node that represents the second alias; adding one or more edges that connect the first portion of the graph to the second portion of the graph; and processing queries to find personally identifiable information of the entity by traversing over at least one of the one or more edges that connect the first portion of the graph to the second portion of the graph.
“12. The non-transitory computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of: querying the graph, with a person of interest’s name, wherein results of the query comprises (1) a first item of personally identifiable information of the person, and (2) a different item of personally identifiable information of the alias.
“13. The non-transitory computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of: querying the graph, with a person of interest’s alias, wherein results of the query comprises (1) a first item of personally identifiable information corresponding to the person of interest’s alias, and (2) a different item of personally identifiable information of the person of interest.
“14. The non-transitory computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of querying the graph to determine a subject content object that includes at least one of, (1) a first item of the first set of personally identifiable information of the person, and (2) a second item of the second set of personally identifiable information of the alias.
“15. The non-transitory computer readable medium of claim 14, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of modifying the subject content object to eliminate occurrences of (1) the first item of the first set of personally identifiable information of the person, and (2) the second item of the second set of personally identifiable information of the alias.
“16. The non-transitory computer readable medium of claim 11, wherein at least one item of the first set of personally identifiable information is identified, based at least in part on a PII detection rule.
“17. The non-transitory computer readable medium of claim 11, wherein at least one item of the first set of personally identifiable information is identified, based at least in part on identification of pronominal anaphoric references.
“18. The non-transitory computer readable medium of claim 17, wherein the pronominal anaphoric references are taken from context of a PII instance.
“19. A system for associating personally identifiable (PII) information to an entity and the entity’s aliases, the system comprising: a storage medium having stored thereon a sequence of instructions; and one or more processors that execute the sequence of instructions to cause the one or more processors to perform a set of acts, the set of acts comprising, processing one or more content objects to identify a first set of personally identifiable information associated with a first alias; further processing the one or more content objects to identify a second set of personally identifiable information associated with a second alias; determining that the first alias and the second alias refer to the same person; forming a first portion of a graph that connects first PII nodes that represent the first set of personally identifiable information to a first entity node that represents the first alias; forming a second portion of a graph that connects second PII nodes that represent the second set of personally identifiable information to a second entity node that represents the second alias; adding one or more edges that connect the first portion of the graph to the second portion of the graph; and processing queries to find personally identifiable information of the entity by traversing over at least one of the one or more edges that connect the first portion of the graph to the second portion of the graph.
“20. The system of claim 19, further comprising: querying the graph, with a person of interest’s name, wherein results of the query comprises (1) a first item of personally identifiable information of the person, and (2) a different item of personally identifiable information of the alias.”
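Two of the dependent claims above mention a PII detection rule and the modification of a content object to eliminate occurrences of PII. The Python sketch below shows one plausible reading of each, under stated assumptions: the rule is a regular expression for the social security number format used in the application's example, and elimination is done by substituting a mask string. The names `SSN_RULE`, `detect_pii`, and `redact`, and the mask text, are hypothetical and do not come from the application.

```python
import re

# Hypothetical detection rule: three digits, two digits, four digits,
# separated by hyphens (the format used in the application's example).
SSN_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detect_pii(text):
    """Return every substring of the text that matches the rule."""
    return SSN_RULE.findall(text)


def redact(text, pii_values, mask="[REDACTED]"):
    """Replace each listed PII value in the text with a mask string."""
    # Longest values first, so a value that contains another is masked whole.
    for value in sorted(pii_values, key=len, reverse=True):
        text = text.replace(value, mask)
    return text


document = "John's social security number is 123-45-6789."
found = detect_pii(document)
print(found)
print(redact(document, found))
```

A production rule set would need many more patterns and validation steps than this single expression.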
For more information, see this patent application: Ojha, Alok. Mapping Of Personally-Identifiable Information To A Person Based On Natural Language Coreference Resolution. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)