“Extraction Of Information And Smart Annotation Of Relevant Information Within Complex Documents” in Patent Application Approval Process (USPTO 20190251182)
2019 SEP 02 (NewsRx) -- By a
This patent application is assigned to
The following quote was obtained by the news editors from the background information supplied by the inventors: “Present invention embodiments relate to extracting data from documents, and more specifically, to identifying the presence or absence of a key term or concept, which may be a unique key term or concept, in a complex document, extracting information related to the key term or concept, and/or comparing the extracted information.
“With the advent of sophisticated document generation and processing programs, it has become routine to generate complex documents comprising tables, graphs, and unstructured text. Such documents may be hundreds or even thousands of pages in length. Often, such documents are modified or merged with other documents, at least on an annual basis, making it difficult to compare information between various versions of documents.
“Additionally, in certain industries, such as the health insurance industry, documents may describe personalized or customized plan information specific to an organization. While each plan may vary from company to company, common concepts (e.g., deductible, co-pay, formulary, etc.) will be present in each document. In some cases, the same style of document, e.g., based on a common or similar template, may be provided, wherein each document is customized to the needs of a particular group. However, it is difficult to parse through such large, complex documents to find a key item or concept or an answer to a question about the content of the complex document.
“For example, Health Insurance Benefit Coverage Summary Plan Descriptions (SPDs), which describe medical, dental, vision and other health benefit coverage, are often more than several hundred pages long. SPDs are structured to comply with regulatory requirements laid down by government and other statutory bodies. Because these documents are required to comply with federal law regulations, SPDs contain many similarities, but are tailored to individual health plans for the respective organization for which a health care plan is provided, which is the underlying cause for the significant variations in the content.
“Such documents are difficult to understand by most people, and participants in the health plan often have difficulty locating information that is key to their question within such large, complex document(s), particularly when participants are in need of care for themselves or someone in their family. Additionally, it is difficult to identify specific changes in coverage from year to year as the same terminology may not be consistently applied.
“Further, it is difficult to apply business analytics (e.g., to benefit health care plan design and benchmarking) to unstructured text data with complex terminology, and it is difficult for employees to understand the terminology to determine which aspects of medical care are covered, and which are not.
“In particular, unstructured data provides particular challenges with regard to: usability, volume, variability, and quality. Regarding content format variability, documents may have information represented in multiple formats (e.g., unstructured text, diagrams, tables, figures, charts, etc.) and consumption of this type of data for decision making is challenging.
“Regarding volume, unstructured data is growing at a rate of approximately 62% per year, further complicating collection and extraction of data. Regarding variability, such documents often have a wide range of styles, formats, and codes with similar intents. Regarding quality, such documents frequently originate from different sources and have a high level of ambiguity in natural language. Accordingly, managing such documents is difficult, time consuming, and complex. Health benefits is one such type of complex document, other complex documents include but are not limited to insurance documents (e.g., home or auto), policy documents (e.g., employment, government), legal documents, etc.”
In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “According to embodiments of the present invention, methods and systems are provided for extraction of information from complex documents comprising unstructured data to create a structured data repository. Such techniques may include using Natural Language Processing (NLP) in combination with machine learning and cognitive systems to identify relevant data.
“According to embodiments of the invention, information may be extracted from a plurality of complex documents, and the extracted information may be mapped to a semantic representation. Information may be extracted from text or non-text elements in the plurality of complex documents. Extracted information may include text extraction, symbol extraction, numerical extraction, and so forth, with such information extracted from any suitable location in the complex document, including text, tables, lists, charts, graphs, etc. Natural language processing and machine learning may be utilized to extract one or more entities from the semantic representation. Structured data comprising the extracted entities is generated from the semantic representation and corresponding attributes. In response to receiving a user query, a subset of the structured data corresponding to the query is returned, wherein the subset of structured data may be arranged to correlate entities with corresponding attributes for each complex document. Information, unless otherwise indicated, generally refers to unstructured text, which may include but is not limited to symbols, numbers, and alphanumeric characters, etc. associated with free text as well as other types of formatted data including but not limited to tables, charts, graphs, lists, etc.
“It is to be understood that the Summary is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure. Other features of the present disclosure will become easily comprehensible through the description below.”
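As an illustration only, the flow described in the summary (extract information, map it to a semantic representation, extract entities, generate structured data, answer a query) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the inventors' implementation: the names `Fragment`, `extract_fragments`, `extract_entities`, and `query`, the sample documents, and the concept list are all hypothetical, and a simple regular expression stands in for the natural language processing and machine learning steps.

```python
import re
from dataclasses import dataclass

# Hypothetical key concepts; a regex stands in for the NLP/ML entity extractor.
CONCEPTS = ("deductible", "co-pay")
AMOUNT = re.compile(r"\$\d[\d,]*")


@dataclass
class Fragment:
    """Assumed semantic representation: extracted text plus where it was found."""
    doc_id: str
    location: str  # "text" or "table"
    text: str


def extract_fragments(doc_id, document):
    """Extract information from the text and table elements of one document."""
    fragments = [Fragment(doc_id, "text", s) for s in document.get("text", [])]
    for row in document.get("tables", []):
        fragments.append(Fragment(doc_id, "table", " ".join(row)))
    return fragments


def extract_entities(fragments):
    """Build structured records (entity plus attribute) from the fragments."""
    records = []
    for frag in fragments:
        lowered = frag.text.lower()
        match = AMOUNT.search(frag.text)
        for concept in CONCEPTS:
            if concept in lowered and match:
                records.append({
                    "doc_id": frag.doc_id,
                    "entity": concept,
                    "value": match.group(),
                    "location": frag.location,
                })
    return records


def query(records, entity):
    """Return the subset of structured data matching the queried entity."""
    return [r for r in records if r["entity"] == entity]


# Invented sample documents, each mixing free text and a table row.
documents = {
    "plan_a": {"text": ["The annual deductible is $500 per member."],
               "tables": [["Co-pay", "$25"]]},
    "plan_b": {"text": ["Each visit requires a co-pay of $40."],
               "tables": [["Deductible", "$1,000"]]},
}

records = []
for doc_id, document in documents.items():
    records.extend(extract_entities(extract_fragments(doc_id, document)))

deductibles = query(records, "deductible")
print(deductibles)
```

Here the same concept is found in free text in one document and in a table in the other, and each record keeps the location it came from.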
The claims supplied by the inventors are:
“1. A method, in a cognitive data processing system comprising at least one processor and at least one memory, the at least one memory comprising instructions executed by the at least one processor to cause the at least one processor to implement an intelligent annotation and data extraction system for extracting and processing information across a plurality of complex documents to provide an answer to a user query comprising: extracting information from the plurality of complex documents; mapping the extracted information to a semantic representation; utilizing a natural language process and a machine learning process to extract one or more entities from the semantic representation; generating structured data comprising the extracted entities from the semantic representation along with corresponding attributes; and providing a subset of the structured data corresponding to the query, wherein the subset of structured data may be arranged to correlate entities with corresponding attributes for one or more complex document.
“2. The method of claim 1, wherein the extracted entities are compared to determine differences with regard to context and meaning as presented in the respective complex document.
“3. The method of claim 1, further comprising: extracting, for a given type of information, the information presented as a text element in at least one complex document and as a non-text element in at least one complex document.
“4. The method of claim 1, wherein the semantic representation includes the extracted information and a context indicating where in the complex document the extracted information is located.
“5. The method of claim 1, wherein the extracted information includes information extracted from both unformatted portions and formatted portions of the plurality of complex documents.
“6. The method of claim 5, wherein the formatted portions include a table, a list, a picture, a spreadsheet, a graph, or a chart.
“7. The method of claim 5, wherein the unformatted portions include text, numbers, or symbols.
“8. The method of claim 1, wherein the complex documents are summary plan descriptions.
“9. The method of claim 1, wherein an annotated data set is provided to the machine learning process and the machine learning process generates a machine learning model to extract entities based on the provided annotated data set.
“10. The method of claim 1, wherein an annotated data set is provided to the machine learning process to generate and train a machine learning model, and wherein the trained machine learning model is utilized to automatically annotate a received unannotated complex document.
“11. The method of claim 1, comprising: receiving a query requesting a type of information common to each of the plurality of complex documents; and returning the requested information in a readable format allowing side-by-side comparison of the extracted information for each of the plurality of complex documents.
“12. A cognitive data processing system for intelligent annotation and data extraction across a plurality of complex documents comprising: at least one processor configured to: extract information from the plurality of complex documents; map the extracted information to a semantic representation; utilize a natural language process and a machine learning process to extract one or more entities from the semantic representation; generate structured data comprising the extracted entities from the semantic representation along with corresponding attributes; and provide a subset of the structured data corresponding to the query, wherein the subset of structured data may be arranged to correlate entities with corresponding attributes for one or more complex document.
“13. The system of claim 12, wherein the extracted entities are compared to determine differences with regard to context and meaning as presented in the respective complex document.
“14. The system of claim 12, wherein the processor is further configured to extract a type of information presented as a text element in at least one complex document and as a non-text element in at least one other complex document.
“15. The system of claim 12, wherein the semantic representation includes the extracted information and a context indicating where in the complex document the extracted information is located.
“16. The system of claim 12, wherein the extracted information includes information extracted from both unformatted portions and formatted portions of the plurality of complex documents, wherein the formatted portions include a table, a list, a picture, a spreadsheet, a graph, or a chart, and the unformatted portions include text, numbers, or symbols.
“17. The system of claim 12, wherein an annotated data set is provided to the machine learning process and the machine learning process generates a machine learning model to extract entities based on the provided annotated data set.
“18. The system of claim 12, wherein an annotated data set is provided to the machine learning process to generate and train a machine learning model, and wherein the trained machine learning model is utilized to automatically annotate a received unannotated complex document.
“19. The system of claim 12, wherein the processor is further configured to: receive a query requesting a type of information common to each of the plurality of complex documents; and return the requested information in a readable format allowing side-by-side comparison of the extracted information for each of the plurality of complex documents.
“20. A computer program product for intelligent annotation and data extraction across a plurality of complex documents, the computer program product comprising a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by at least one processor to cause the at least one processor to: extract information from the plurality of complex documents; map the extracted information to a semantic representation; utilize a natural language process and a machine learning process to extract one or more entities from the semantic representation; generate structured data comprising the extracted entities from the semantic representation along with corresponding attributes; and provide a subset of the structured data corresponding to the query, wherein the subset of structured data may be arranged to correlate entities with corresponding attributes for one or more complex document.”
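The side-by-side comparison described in claims 11 and 19, together with the difference check in claims 2 and 13, can be illustrated with a short Python sketch. It assumes structured records have already been produced by an extraction step. The record layout, the document identifiers, and the helper names `side_by_side`, `changed`, and `format_row` are hypothetical and are not taken from the patent application.

```python
# Hypothetical structured records, as an extraction step might produce them.
records = [
    {"doc_id": "plan_2018", "entity": "deductible", "value": "$500"},
    {"doc_id": "plan_2019", "entity": "deductible", "value": "$750"},
    {"doc_id": "plan_2018", "entity": "co-pay", "value": "$25"},
    {"doc_id": "plan_2019", "entity": "co-pay", "value": "$25"},
]


def side_by_side(records, entity):
    """Map each document to its value for the requested entity."""
    return {r["doc_id"]: r["value"] for r in records if r["entity"] == entity}


def changed(comparison):
    """True when the documents do not all report the same value."""
    return len(set(comparison.values())) > 1


def format_row(entity, comparison):
    """Render one readable row with a cell per document."""
    cells = [f"{doc}: {value}" for doc, value in sorted(comparison.items())]
    return f"{entity:<12}" + " | ".join(cells)


for entity in ("deductible", "co-pay"):
    row = side_by_side(records, entity)
    print(format_row(entity, row), "(changed)" if changed(row) else "")
```

Comparing values as raw strings is a simplification. The claims refer to differences in context and meaning, which would need more than an equality check.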
For more information on this patent application, see: Ray, Ritwik; Angelopoulos, Marie; Roberts, Frederick; Gagen, Christopher; Gabrani, Maria. Extraction Of Information And Smart Annotation Of Relevant Information Within Complex Documents. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)

