March 31, 2020 Newswires

“Machine-Learning Based Detection And Classification Of Personally Identifiable Information” in Patent Application Approval Process (USPTO 20200081978)

Insurance Daily News

2020 MAR 31 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- A patent application by the inventors Ahmed, Mohamed N. (Loudoun County, VA); Toor, Andeep S. (Chantilly, VA), filed on September 7, 2018, was made available online on March 12, 2020, according to news reporting originating from Washington, D.C., by NewsRx correspondents.

This patent application is assigned to International Business Machines Corporation (Armonk, New York, United States).

The following quote was obtained by the news editors from the background information supplied by the inventors: “Personally identifiable information (PII) is information that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context. Corporations and agencies are often under an obligation to protect content containing PII to prevent exposure of the PII to unauthorized parties. Because of the significant reputational and financial consequences of failing to protect content containing PII, corporations and governmental agencies have made it a major goal to identify and protect such content. Privacy expectations arise from a number of laws in different jurisdictions such as the Health Insurance Portability and Accountability Act (HIPAA) and Payment Card Industry (PCI) data security standards. One of the most challenging aspects related to identifying and protecting PII is how to deal with ‘unstructured’ content. Unstructured content refers to information that does not have a pre-defined data model or is not organized in a pre-defined manner. Examples of unstructured content may include documents or files on file shares, personal computing devices, and content management systems. These documents and files may be generated within or outside of an organization using many applications, can be converted to multiple file formats (e.g., Portable Document Format (PDF)), and seemingly have unlimited form and content. By contrast, structured data such as data stored in databases and support systems often have defined fields in tables that have defined relationships with each other. For example, to protect social security numbers in a database, access to the field for social security numbers is controlled. With unstructured documents, the detection of PII is more challenging.”
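The contrast the inventors draw between structured and unstructured content can be illustrated with a minimal Python sketch. It is not taken from the patent application: the record layout, the field names, the sample note, and the regular expression are all hypothetical, chosen only to show why a known field is easy to protect while free text is not.

```python
import re

# Hypothetical structured record: the sensitive field is known by name,
# so protecting it only requires controlling access to that field.
record = {"name": "Jane Doe", "ssn": "123-45-6789", "city": "Armonk"}
SENSITIVE_FIELDS = {"ssn"}  # assumed list of protected columns


def redact_record(rec):
    """Mask every field listed in SENSITIVE_FIELDS."""
    return {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in rec.items()}


# Naive pattern for unstructured text: 3 digits, 2 digits, 4 digits,
# separated by dashes. This is an illustrative assumption, not a
# complete or recommended detector.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text):
    """Return substrings that look like a dash-separated social security number."""
    return SSN_PATTERN.findall(text)


note = "Call Jane, her social is 123-45-6789, or try 123 45 6789 as written on the form."

print(redact_record(record))
print(find_ssn_like(note))  # finds only the dash-separated variant
```

The fixed pattern finds the dash-separated number in the note but misses the space-separated variant, which is one small example of why rule-based detection struggles with content of "seemingly unlimited form."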

In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “The illustrative embodiments provide a method, system, and computer program product. An embodiment of a method for detection and classification of personally identifiable information includes identifying a document with a known author, and extracting a first set of features of the document using natural language processing. The embodiment further includes extracting a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network, and classifying the first set of features and the second set of features using a classifier to produce classified extracted features. The embodiment further includes labelling personally identifiable information in the document based upon the classified extracted features.

“In another embodiment, the document is an unstructured document. In another embodiment, the first set of features includes text-based features. In another embodiment, the natural language processing includes one or more of n-grams, word embedding, part of speech, and dictionary-based natural language processing procedures. In another embodiment, the second set of features includes user-specific features.

“In another embodiment, the classifier includes a deep neural network classifier. In another embodiment, the classifier includes a maximum entropy classifier.

“Another embodiment further includes training the classifier based upon the classified extracted features. Another embodiment further includes receiving feedback associated with the classified extracted features, and modifying the training of the classifier based upon the feedback. In another embodiment, the feedback is received from a subject matter expert. In another embodiment, extracting the second set of features is based upon a user-specific model.

“In another embodiment, the user-specific model is trained based upon past results and user provided documents including labelled personally identifiable information.

“An embodiment includes a computer usable program product. The computer usable program product includes one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices.

“An embodiment includes a computer system. The computer system includes one or more processors, one or more computer-readable memories, and one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories.”
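The summary names n-grams and dictionary-based procedures among the natural language processing steps used to extract the first, text-based set of features. A minimal Python sketch of those two steps follows. The tokenizer (lowercasing and whitespace splitting), the tiny name dictionary, and the function names are assumptions made for illustration and do not come from the patent application. The recurrent neural network that produces the second, user-specific feature set is not sketched here.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical dictionary of first names; a real system would presumably
# use far larger, curated word lists.
NAME_DICTIONARY = {"jane", "mohamed", "andeep"}


def text_features(text):
    """Extract simple text-based features from a document string."""
    tokens = text.lower().split()  # assumed tokenizer: lowercase + whitespace
    return {
        "tokens": tokens,
        "bigrams": word_ngrams(tokens, 2),
        # One flag per token: does it appear in the name dictionary?
        "in_dictionary": [t in NAME_DICTIONARY for t in tokens],
    }


features = text_features("Contact Jane today")
print(features["bigrams"])
print(features["in_dictionary"])
```

In a fuller pipeline, features like these would be combined with the user-specific features before being passed to the classifier.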

The claims supplied by the inventors are:

“1. A method for detection and classification of personally identifiable information, the method comprising: identifying a document with a known author; extracting a first set of features of the document using natural language processing; extracting a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network; classifying the first set of features and the second set of features using a classifier to produce classified extracted features, wherein the classifier employs a maximum entropy model to estimate a probability of a certain class occurring, and wherein the maximum entropy model uses a number of items extracted from a first word and from other words within a window of the first word; and labelling personally identifiable information in the document based upon the classified extracted features.

“2. The method of claim 1, wherein the document is an unstructured document.

“3. The method of claim 1, wherein the first set of features includes text-based features.

“4. The method of claim 1, wherein the natural language processing includes one or more of n-grams, word embedding, part of speech, and dictionary-based natural language processing procedures.

“5. The method of claim 1, wherein the second set of features includes user-specific features.

“6. The method of claim 1, wherein the classifier includes a deep neural network classifier.

“7. (canceled)

“8. The method of claim 1, further comprising training the classifier based upon the classified extracted features.

“9. The method of claim 8, further comprising: receiving feedback associated with the classified extracted features; and modifying the training of the classifier based upon the feedback.

“10. The method of claim 9, wherein the feedback is received from a user.

“11. The method of claim 1, wherein extracting the second set of features is based upon a user-specific model.

“12. The method of claim 11, wherein the user-specific model is trained based upon past results and user provided documents including labelled personally identifiable information.

“13. A computer usable program product comprising one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices, the stored program instructions comprising: program instructions to identify a document with a known author; program instructions to extract a first set of features of the document using natural language processing; program instructions to extract a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network; program instructions to classify the first set of features and the second set of features using a classifier to produce classified extracted features, wherein the classifier employs a maximum entropy model to estimate a probability of a certain class occurring, and wherein the maximum entropy model uses a number of items extracted from a first word and from other words within a window of the first word; and program instructions to label personally identifiable information in the document based upon the classified extracted features.

“14. The computer usable program product of claim 13, wherein the document is an unstructured document.

“15. The computer usable program product of claim 13, wherein the first set of features includes text-based features.

“16. The computer usable program product of claim 13, wherein the natural language processing includes one or more of n-grams, word embedding, part of speech, and dictionary-based natural language processing procedures.

“17. The computer usable program product of claim 13, wherein the second set of features includes user-specific features.

“18. The computer usable program product of claim 13, wherein the stored program instructions are stored in a computer readable storage device in a data processing system, and wherein the stored program instructions are transferred over a network from a remote data processing system.

“19. The computer usable program product of claim 13, wherein the stored program instructions are stored in a computer readable storage device in a server data processing system, and wherein the stored program instructions are downloaded over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system.

“20. A computer system comprising one or more processors, one or more computer-readable memories, and one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, the stored program instructions comprising: program instructions to identify a document with a known author; program instructions to extract a first set of features of the document using natural language processing; program instructions to extract a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network; program instructions to classify the first set of features and the second set of features using a classifier to produce classified extracted features, wherein the classifier employs a maximum entropy model to estimate a probability of a certain class occurring, and wherein the maximum entropy model uses a number of items extracted from a first word and from other words within a window of the first word; and program instructions to label personally identifiable information in the document based upon the classified extracted features.”
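The independent claims describe a classifier that employs a maximum entropy model to estimate the probability of a class, using items extracted from a word and from other words within a window around it. The following Python sketch shows one way such a model could look. It is an illustration under stated assumptions, not the inventors' implementation: the two labels, the window size of one, the token-shape feature, the toy training sentences, and the plain stochastic gradient training loop are all hypothetical choices.

```python
import math
from collections import defaultdict

LABELS = ("PII", "O")  # assumed label set: PII vs. everything else


def shape(word):
    """Collapse a token to a coarse shape, e.g. '123-45-6789' -> 'd-d-d'."""
    out = []
    for ch in word:
        c = "d" if ch.isdigit() else "a" if ch.isalpha() else ch
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def window_features(tokens, i, window=1):
    """Items taken from the word at position i and from words within the window."""
    feats = ["shape=" + shape(tokens[i])]
    for off in range(-window, window + 1):
        j = i + off
        if j < 0:
            word = "<START>"
        elif j >= len(tokens):
            word = "<END>"
        else:
            word = tokens[j]
        feats.append(f"w{off:+d}={word}")
    return feats


def probabilities(weights, feats):
    """Maximum entropy (softmax) estimate of P(label | features)."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in LABELS}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def train(sentences, epochs=100, lr=0.1):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for tokens, labels in sentences:
            for i, gold in enumerate(labels):
                feats = window_features(tokens, i)
                probs = probabilities(weights, feats)
                for y in LABELS:
                    grad = (1.0 if y == gold else 0.0) - probs[y]
                    for f in feats:
                        weights[(f, y)] += lr * grad
    return weights


def label_tokens(weights, tokens):
    """Assign each token its most probable label."""
    result = []
    for i in range(len(tokens)):
        probs = probabilities(weights, window_features(tokens, i))
        result.append(max(probs, key=probs.get))
    return result


# Toy, hand-labelled training data (hypothetical).
TRAINING = [
    (["my", "ssn", "is", "123-45-6789"], ["O", "O", "O", "PII"]),
    (["call", "me", "at", "noon"], ["O", "O", "O", "O"]),
    (["her", "ssn", "is", "987-65-4321"], ["O", "O", "O", "PII"]),
    (["lunch", "is", "at", "noon"], ["O", "O", "O", "O"]),
]

weights = train(TRAINING)
predicted = label_tokens(weights, ["his", "ssn", "is", "111-22-3333"])
print(predicted)
```

Because the model sums weights for features of the neighbouring words as well as the word itself, a number it has never seen can still be labelled from its shape and its context. The feedback step in claim 9 could, under the same assumptions, be approximated by appending corrected sentences to the training data and calling `train` again.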

For the URL and more information on this patent application, see: Ahmed, Mohamed N.; Toor, Andeep S. Machine-Learning Based Detection And Classification Of Personally Identifiable Information. Filed September 7, 2018 and posted March 12, 2020. Patent URL: http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220200081978%22.PGNR.&OS=DN/20200081978&RS=DN/20200081978

(Our reports deliver fact-based news of research and discoveries from around the world.)
