2021 NOV 24 (NewsRx) -- By a
The assignee for this patent, patent number 11163837, is
Reporters obtained the following quote from the background information supplied by the inventors:
“1. Technical Field
“Present invention embodiments relate to extracting data from documents, and more specifically, to identifying the presence or absence of a key term or concept, which may be a unique key term or concept, in a complex document, extracting information related to the key term or concept, and/or comparing the extracted information.
“2. Discussion of the Related Art
“With the advent of sophisticated document generation and processing programs, it has become routine to generate complex documents comprising tables, graphs, and unstructured text. Such documents may be hundreds or even thousands of pages in length. Often, such documents are modified or merged with other documents, at least on an annual basis, making it difficult to compare information between various versions of documents.
“Additionally, in certain industries, such as the health insurance industry, documents may describe personalized or customized plan information specific to an organization. While each plan may vary from company to company, common concepts (e.g., deductible, co-pay, formulary, etc.) will be present in each document. In some cases, the same style of document, e.g., based on a common or similar template, may be provided, wherein each document is customized to the needs of a particular group. However, it is difficult to parse through such large, complex documents to find a key item or concept or an answer to a question about the content of the complex document.
“For example, Health Insurance Benefit Coverage Summary Plan Descriptions (SPDs), which describe medical, dental, vision and other health benefit coverage, are often more than several hundred pages long. SPDs are structured to comply with regulatory requirements laid down by government and other statutory bodies. Because these documents are required to comply with federal law regulations, SPDs contain many similarities, but are tailored to individual health plans for the respective organization for which a health care plan is provided, which is the underlying cause for the significant variations in the content.
“Such documents are difficult to understand by most people, and participants in the health plan often have difficulty locating information that is key to their question within such large, complex document(s), particularly when participants are in need of care for themselves or someone in their family. Additionally, it is difficult to identify specific changes in coverage from year to year as the same terminology may not be consistently applied.
“Further, it is difficult to apply business analytics (e.g., to benefit health care plan design and benchmarking) to unstructured text data with complex terminology, and it is difficult for employees to understand the terminology to determine which aspects of medical care are covered, and which are not.
“In particular, unstructured data provides particular challenges with regard to: usability, volume, variability, and quality. Regarding content format variability, documents may have information represented in multiple formats (e.g., unstructured text, diagrams, tables, figures, charts, etc.) and consumption of this type of data for decision making is challenging.
“Regarding volume, unstructured data is growing at a rate of approximately 62% per year, further complicating collection and extraction of data. Regarding variability, such documents often have a wide range of styles, formats, and codes with similar intents. Regarding quality, such documents frequently originate from different sources and have a high level of ambiguity in natural language. Accordingly, managing such documents is difficult, time consuming, and complex. Health benefits is one such type of complex document, other complex documents include but are not limited to insurance documents (e.g., home or auto), policy documents (e.g., employment, government), legal documents, etc.”
In addition to obtaining background information on this patent, NewsRx editors also obtained the inventors’ summary information for this patent: “According to embodiments of the present invention, methods and systems are provided for extraction of information from complex documents comprising unstructured data to create a structured data repository. Such techniques may include using Natural Language Processing (NLP) in combination with machine learning and cognitive systems to identify relevant data.
“According to embodiments of the invention, information may be extracted from a plurality of complex documents, and the extracted information may be mapped to a semantic representation. Information may be extracted from text or non-text elements in the plurality of complex documents. Extracted information may include text extraction, symbol extraction, numerical extraction, and so forth, with such information extracted from any suitable location in the complex document, including text, tables, lists, charts, graphs, etc. Natural language processing and machine learning may be utilized to extract one or more entities from the semantic representation. Structured data comprising the extracted entities is generated from the semantic representation and corresponding attributes. In response to receiving a user query, a subset of the structured data corresponding to the query is returned, wherein the subset of structured data may be arranged to correlate entities with corresponding attributes for each complex document. Information, unless otherwise indicated, generally refers to unstructured text, which may include but is not limited to symbols, numbers, and alphanumeric characters, etc. associated with free text as well as other types of formatted data including but not limited to tables, charts, graphs, lists, etc.
“It is to be understood that the Summary is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure. Other features of the present disclosure will become easily comprehensible through the description below.”
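As an illustration only, the flow outlined in the summary (extract information from text and tables, map it to a semantic representation, extract entities, build structured data, answer a query) can be sketched in a few lines of Python. This is not the patented implementation. The patent describes trained text and table structure models and a supervised machine learning process, whereas the sketch below substitutes simple regular expressions for those models. Every name in it (`SemanticItem`, `Entity`, `PATTERNS`, `to_semantic_items`, `extract_entities`, `query`) and the sample plan data are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class SemanticItem:
    """One piece of extracted information plus where it was found."""
    doc_id: str
    location: str  # e.g. "text[0]" or "table[0].row[1]"
    text: str


@dataclass
class Entity:
    """A structured record: an entity, its attribute value, and its source."""
    doc_id: str
    entity: str
    value: str
    location: str


# ASSUMPTION: regular expressions stand in for the trained NLP/ML models
# described in the patent; they are illustrative, not the patented method.
PATTERNS = {
    "copay": re.compile(r"co-?pay\D*(\$\d[\d,]*)", re.IGNORECASE),
    "deductible": re.compile(r"deductible\D*(\$\d[\d,]*)", re.IGNORECASE),
}


def to_semantic_items(doc_id, document):
    """Map text paragraphs and table rows to a common representation."""
    items = []
    for i, paragraph in enumerate(document.get("text", [])):
        items.append(SemanticItem(doc_id, f"text[{i}]", paragraph))
    for t, table in enumerate(document.get("tables", [])):
        for r, row in enumerate(table):
            items.append(SemanticItem(doc_id, f"table[{t}].row[{r}]", " ".join(row)))
    return items


def extract_entities(items):
    """Extract entities and their values from the semantic representation."""
    found = []
    for item in items:
        for name, pattern in PATTERNS.items():
            match = pattern.search(item.text)
            if match:
                found.append(Entity(item.doc_id, name, match.group(1), item.location))
    return found


def query(entities, entity_name):
    """Return the subset of structured data for one entity type."""
    return [e for e in entities if e.entity == entity_name]


# Hypothetical sample documents: the same concepts appear as text in one
# plan and as a table cell in the other.
docs = {
    "plan_a": {
        "text": ["The annual deductible is $500 per member."],
        "tables": [[["Service", "Cost"], ["Office visit copay", "$25"]]],
    },
    "plan_b": {
        "text": ["Members pay a co-pay of $40 for each office visit."],
        "tables": [[["Deductible", "$1,000"]]],
    },
}

structured = []
for doc_id, doc in docs.items():
    structured.extend(extract_entities(to_semantic_items(doc_id, doc)))

copays = query(structured, "copay")
for e in copays:
    print(e.doc_id, e.entity, e.value, e.location)
```

Keeping the location alongside each value mirrors the summary's point that information may come from any part of a document, whether free text or a table, while still ending up in one structured repository.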
The claims supplied by the inventors are:
“1. A method, in a cognitive data processing system comprising at least one processor and at least one memory, the at least one memory comprising instructions executed by the at least one processor to cause the at least one processor to implement an intelligent annotation and data extraction system for extracting and processing information across a plurality of complex documents to provide an answer to a user query comprising: extracting information from the plurality of complex documents, wherein a text structure model, trained with a first set of labeled data, extracts information from text of the plurality of complex documents, wherein a table structure model, trained with a second set of labeled data, extracts information from tables of the plurality of complex documents, wherein the first set of labeled data includes text from one or more Summary Plan Description documents, wherein each Summary Plan Description document describes specific health care benefit coverage for plan members, and wherein the second set of labeled data includes tables from one or more Summary Plan Description documents, wherein the Summary Plan Description documents of the first and second sets of labeled data include annotations of exclusions to the health care benefit coverage, and wherein the annotations of exclusions are determined according to an extracted exclusion knowledge base; mapping the extracted information to a semantic representation; utilizing a natural language process and a machine learning process to extract one or more entities from the semantic representation, wherein the machine learning process is trained with a third set of labeled data and is trained using supervised machine learning, and wherein the one or more entities include one or more of: a plan sponsor entity, a plan entity, a conditions/exclusions entity, a prior authorizations entity, an eligibility entity, a copay entity, a coinsurance entity, and an out of pocket limits entity; generating structured data comprising the extracted entities from the semantic representation along with corresponding attributes; and providing a subset of the structured data corresponding to the query, wherein the subset of structured data may be arranged to correlate entities with corresponding attributes for one or more complex documents.
“2. The method of claim 1, wherein the extracted entities are compared to determine differences with regard to context and meaning as presented in the respective complex document.
“3. The method of claim 1, further comprising: extracting, for a given type of information, the information presented as a text element in at least one complex document and as a non-text element in at least one complex document.
“4. The method of claim 1, wherein the semantic representation includes the extracted information and a context indicating where in the complex document the extracted information is located.
“5. The method of claim 1, wherein the extracted information includes information extracted from both unformatted portions and formatted portions of the plurality of complex documents.
“6. The method of claim 5, wherein the formatted portions include a table, a list, a picture, a spreadsheet, a graph, or a chart.
“7. The method of claim 5, wherein the unformatted portions include text, numbers, or symbols.
“8. The method of claim 1, wherein the complex documents are summary plan descriptions.
“9. The method of claim 1, wherein the third set includes an annotated data set that is provided to the machine learning process and the machine learning process generates a machine learning model to extract entities based on the provided annotated data set.
“10. The method of claim 1, wherein an annotated data set is provided to the machine learning process to generate and train a machine learning model, and wherein the trained machine learning model is utilized to automatically annotate a received unannotated complex document.
“11. The method of claim 1, comprising: receiving a query requesting a type of information common to each of the plurality of complex documents; and returning the requested information in a readable format allowing side-by-side comparison of the extracted information for each of the plurality of complex documents.”
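The side-by-side comparison described in claim 11, together with the difference detection of claim 2, can be pictured with a short Python sketch. It assumes the structured data is already available as simple records and compares attribute values only. The claims refer to differences in context and meaning, which this sketch does not attempt. The function names (`side_by_side`, `differing`, `render`) and the sample records are hypothetical and do not come from the patent.

```python
def side_by_side(records, doc_ids):
    """Pivot records into {entity: {doc_id: value}} with None for gaps."""
    table = {}
    for rec in records:
        row = table.setdefault(rec["entity"], {d: None for d in doc_ids})
        row[rec["doc_id"]] = rec["value"]
    return table


def differing(table):
    """List entities whose values are not identical across documents."""
    return sorted(name for name, row in table.items() if len(set(row.values())) > 1)


def render(table, doc_ids):
    """Format the pivoted table as aligned plain-text columns."""
    header = ["entity"] + list(doc_ids)
    rows = [[name] + [table[name][d] or "-" for d in doc_ids] for name in sorted(table)]
    widths = [max(len(cell) for cell in column) for column in zip(header, *rows)]
    lines = []
    for line in [header] + rows:
        lines.append("  ".join(cell.ljust(w) for cell, w in zip(line, widths)).rstrip())
    return "\n".join(lines)


# Hypothetical structured records for two plan years.
doc_ids = ["plan_2020", "plan_2021"]
records = [
    {"doc_id": "plan_2020", "entity": "copay", "value": "$25"},
    {"doc_id": "plan_2021", "entity": "copay", "value": "$40"},
    {"doc_id": "plan_2020", "entity": "deductible", "value": "$500"},
    {"doc_id": "plan_2021", "entity": "deductible", "value": "$500"},
    {"doc_id": "plan_2021", "entity": "coinsurance", "value": "20%"},
]

table = side_by_side(records, doc_ids)
print(render(table, doc_ids))
print("Changed or missing:", differing(table))
```

An entity that is absent from one document shows up as a gap in its column and is also reported as a difference, which corresponds to the year-to-year coverage changes the background section says are hard to spot by hand.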
For more information, see this patent: Angelopoulos, Marie. Extraction of information and smart annotation of relevant information within complex documents.