“Machine-Learning Based Detection And Classification Of Personally Identifiable Information” in Patent Application Approval Process (USPTO 20200081978)
2020 MAR 31 (NewsRx) -- By a
This patent application is assigned to
The following quote was obtained by the news editors from the background information supplied by the inventors: “Personally identifiable information (PII) is information that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context. Corporations and agencies are often under an obligation to protect content containing PII to prevent exposure of the PII to unauthorized parties. Because of the significant reputational and financial consequences of failing to protect content containing PII, corporations and governmental agencies have made it a major goal to identify and protect such content. Privacy expectations arise from a number of laws in different jurisdictions such as the Health Insurance Portability and Accountability Act (HIPAA) and Payment Card Industry (PCI) data security standards. One of the most challenging aspects related to identifying and protecting PII is how to deal with ‘unstructured’ content. Unstructured content refers to information that does not have a pre-defined data model or is not organized in a pre-defined manner. Examples of unstructured content may include, for example, documents or files on file shares, personal computing devices, and content management systems. These documents and files may be generated within or outside of an organization using many applications, can be converted to multiple file formats (e.g., Portable Document Format (PDF)), and seemingly have unlimited form and content. By contrast, structured data such as data stored in databases and support systems often have defined fields in tables that have defined relationships with each other. For example, to protect social security numbers in a database, access to the field for social security numbers is controlled. With unstructured documents, the detection of PII is more challenging.”
In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “The illustrative embodiments provide a method, system, and computer program product. An embodiment of a method for detection and classification of personally identifiable information includes identifying a document with a known author, and extracting a first set of features of the document using natural language processing. The embodiment further includes extracting a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network, and classifying the first set of features and the second set of features using a classifier to produce classified extracted features. The embodiment further includes labelling personally identifiable information in the document based upon the classified extracted features.
“In another embodiment, the document is an unstructured document. In another embodiment, the first set of features includes text-based features. In another embodiment, the natural language processing includes one or more of n-grams, word embedding, part of speech, and dictionary-based natural language processing procedures. In another embodiment, the second set of features includes user-specific features.
“In another embodiment, the classifier includes a deep neural network classifier. In another embodiment, the classifier includes a maximum entropy classifier.
“Another embodiment further includes training the classifier based upon the classified extracted features. Another embodiment further includes receiving feedback associated with the classified extracted features, and modifying the training of the classifier based upon the feedback. In another embodiment, the feedback is received from a subject matter expert. In another embodiment, extracting the second set of features is based upon a user-specific model.
“In another embodiment, the user-specific model is trained based upon past results and user provided documents including labelled personally identifiable information.
“An embodiment includes a computer usable program product. The computer usable program product includes one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices.
“An embodiment includes a computer system. The computer system includes one or more processors, one or more computer-readable memories, and one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories.”
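As a rough illustration of the "first set of features" the summary describes (text-based features drawn from n-grams and dictionary-based procedures), the following is a minimal sketch and not the inventors' implementation. The cue-word dictionary `PII_CUE_WORDS`, the function names, and the feature-key format are all hypothetical choices made for this example.

```python
import re
from typing import Dict, List

# Hypothetical dictionary of cue words; not taken from the patent application.
PII_CUE_WORDS = {"ssn", "phone", "email", "address", "dob"}


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous word n-grams of length n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def text_features(text: str) -> Dict[str, int]:
    """Count unigram and bigram features, plus dictionary hits, for one text."""
    tokens = re.findall(r"\w+", text.lower())
    feats: Dict[str, int] = {}
    for n in (1, 2):
        for gram in word_ngrams(tokens, n):
            key = f"{n}gram={gram}"
            feats[key] = feats.get(key, 0) + 1
    # Dictionary-based feature: how many tokens appear in the cue-word list.
    feats["cue_word_count"] = sum(1 for t in tokens if t in PII_CUE_WORDS)
    return feats
```

In a fuller pipeline, a feature dictionary of this kind would be only one input; the summary also calls for word embeddings, part-of-speech features, and author-specific features from a recurrent neural network, none of which are sketched here.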
The claims supplied by the inventors are:
“1. A method for detection and classification of personally identifiable information, the method comprising: identifying a document with a known author; extracting a first set of features of the document using natural language processing; extracting a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network; classifying the first set of features and the second set of features using a classifier to produce classified extracted features, wherein the classifier employs a maximum entropy model to estimate a probability of a certain class occurring, and wherein the maximum entropy model uses a number of items extracted from a first word and from other words within a window of the first word; and labelling personally identifiable information in the document based upon the classified extracted features.
“2. The method of claim 1, wherein the document is an unstructured document.
“3. The method of claim 1, wherein the first set of features includes text-based features.
“4. The method of claim 1, wherein the natural language processing includes one or more of n-grams, word embedding, part of speech, and dictionary-based natural language processing procedures.
“5. The method of claim 1, wherein the second set of features includes user-specific features.
“6. The method of claim 1, wherein the classifier includes a deep neural network classifier.
“7. (canceled)
“8. The method of claim 1, further comprising training the classifier based upon the classified extracted features.
“9. The method of claim 8, further comprising: receiving feedback associated with the classified extracted features; and modifying the training of the classifier based upon the feedback.
“10. The method of claim 9, wherein the feedback is received from a user.
“11. The method of claim 1, wherein extracting the second set of features is based upon a user-specific model.
“12. The method of claim 11, wherein the user-specific model is trained based upon past results and user provided documents including labelled personally identifiable information.
“13. A computer usable program product comprising one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices, the stored program instructions comprising: program instructions to identify a document with a known author; program instructions to extract a first set of features of the document using natural language processing; program instructions to extract a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network; program instructions to classify the first set of features and the second set of features using a classifier to produce classified extracted features, wherein the classifier employs a maximum entropy model to estimate a probability of a certain class occurring, and wherein the maximum entropy model uses a number of items extracted from a first word and from other words within a window of the first word; and program instructions to label personally identifiable information in the document based upon the classified extracted features.
“14. The computer usable program product of claim 13, wherein the document is an unstructured document.
“15. The computer usable program product of claim 13, wherein the first set of features includes text-based features.
“16. The computer usable program product of claim 13, wherein the natural language processing includes one or more of n-grams, word embedding, part of speech, and dictionary-based natural language processing procedures.
“17. The computer usable program product of claim 13, wherein the second set of features includes user-specific features.
“18. The computer usable program product of claim 13, wherein the stored program instructions are stored in a computer readable storage device in a data processing system, and wherein the stored program instructions are transferred over a network from a remote data processing system.
“19. The computer usable program product of claim 13, wherein the stored program instructions are stored in a computer readable storage device in a server data processing system, and wherein the stored program instructions are downloaded over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system.
“20. A computer system comprising one or more processors, one or more computer-readable memories, and one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, the stored program instructions comprising: program instructions to identify a document with a known author; program instructions to extract a first set of features of the document using natural language processing; program instructions to extract a second set of features of the document based upon one or more past documents for the known author using a recurrent neural network; program instructions to classify the first set of features and the second set of features using a classifier to produce classified extracted features, wherein the classifier employs a maximum entropy model to estimate a probability of a certain class occurring, and wherein the maximum entropy model uses a number of items extracted from a first word and from other words within a window of the first word; and program instructions to label personally identifiable information in the document based upon the classified extracted features.”
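The independent claims refer to a maximum entropy model that estimates the probability of a class from items extracted from a word and from other words within a window around it. A minimal sketch of that general technique follows, assuming a maximum entropy model in its common form as a softmax over summed feature weights. The feature names, the window size, the class labels, and the weights are hypothetical and are not drawn from the application; no training step is shown.

```python
import math
from typing import Dict, List, Tuple


def window_features(tokens: List[str], i: int, window: int = 2) -> List[str]:
    """Extract features from token i and from its neighbors within the window."""
    feats = [f"w0={tokens[i].lower()}", f"w0_is_digit={tokens[i].isdigit()}"]
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats.append(f"w{offset:+d}={tokens[j].lower()}")
    return feats


def maxent_probs(
    feats: List[str],
    weights: Dict[Tuple[str, str], float],
    classes: List[str],
) -> Dict[str, float]:
    """Return class probabilities as a softmax over summed feature weights."""
    scores = {c: sum(weights.get((f, c), 0.0) for f in feats) for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical hand-set weights; a real model would learn these from labelled data.
example_weights = {("w-2=ssn", "PII"): 2.0, ("w0_is_digit=True", "PII"): 1.0}
example_tokens = ["my", "ssn", "is", "123456789"]
example_probs = maxent_probs(
    window_features(example_tokens, 3), example_weights, ["PII", "OTHER"]
)
```

Here the digit string is scored using its own surface form together with the nearby word "ssn", which is the kind of context a window-based feature set is meant to capture.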
For the URL and more information on this patent application, see: Ahmed, Mohamed N.; Toor, Andeep S. Machine-Learning Based Detection And Classification Of Personally Identifiable Information. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)


