Patent Issued for Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof (USPTO 11836584): Swiss Reinsurance Company Ltd.
2023 DEC 27 (NewsRx) --
The patent’s inventor is Mueller, Felix (Waedenswil, CH).
This patent was filed on
From the background information supplied by the inventors, news correspondents obtained the following quote: “The vast majority of projects and/or proprietary data nowadays are based on structured data. However, it is estimated that eighty percent of an organization’s data is unstructured, or only semi-structured. A significant portion of such unstructured and semi-structured data is in the form of documents. The industry and/or organizations try applying analytics tools for handling their structured data in efforts of easing access, data processing and data management. However, this does not mean that such unstructured and semi-structured documents no longer exist. They have been, and will continue to be, an important aspect of an organization’s data inventory. Further, semi-structured and unstructured documents are often voluminous. Such documents can consist of hundreds or even thousands of individual papers. For example, a risk-transfer underwriting document, a risk-transfer claim document or a purchaser’s mortgage document can be stored as a single 500-page or even larger document comprising individual papers, such as, for the latter case, the purchaser’s income tax returns, credit reports, appraisers’ reports, and so forth, bundled into a single mortgage document. Each purchaser is associated with a different mortgage document. Thus, the size and volume of documents can be very large. Documents may be stored across various storage systems and/or devices and may be accessed by multiple departments and individuals. Documents may include different types of information and may comprise various formats. They may be used in many applications, such as mortgages and lending, healthcare, environmental applications, and the like; moreover, they draw their information from multiple sources, such as social networks, server logs, information from banking transactions, web content, GPS trails, financial or stock market data, etc.
“More than data accumulation within organizational structures, recent years have further been characterized by a tremendous growth in natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents and social media data, such as blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand for powerful data processing tools and engines to help individuals manage and analyze vast amounts of structured, semi-structured, and unstructured data as pure text data, effectively and efficiently. Unlike data generated by a computer system, sensors or measuring devices, these text data are often generated by humans for humans, without an intermediary. In particular, such text data that are generated by humans for humans accumulate into an important and valuable data source for exploring human opinions and preferences or for analyzing or triggering other human-driven factors, in addition to many other types of knowledge that people encode in text form. Also, since these text data are written for consumption by humans, humans play a critical role in any prior-art text data application system; a text management and analysis system must therefore typically involve the human element in the text analysis loop.
“According to the prior art, existing tools and engines supporting text management and analysis can be divided into two categories. The first category includes search engines and search engine toolkits, which are especially suitable for building a search engine application but tend to have limited support for text analysis/mining functions. Examples include Lucene, Terrier, and Indri/Lemur. The second category is text mining or general data mining and machine learning toolkits, which tend to selectively support some text analysis functions but generally do not support a search capability. Nonetheless, combining and seamlessly integrating search engine capabilities with sophisticated text analysis capabilities is necessary for two reasons: While the raw data may be large for any particular problem, as discussed above, the relevant data are often a relatively small subset of it. Search engines are an essential tool for quickly discovering a small subset of relevant text data in a large text collection. On the other hand, search engines should also help analysts interpret any patterns that are discovered within the data by allowing them to examine any relevant original text data in order to make sense of any discovered pattern. A possible solution should therefore emphasize a tight integration of search capabilities (or text access capabilities in general) with text analysis functions, thus providing full support for building a powerful text analysis engine and tool.
“Further, in the prior art, there already exist different classifiers, such as MeTA (cf. meta-toolkit.org), Mallet (cf. mallet.cs.umass.edu) or Torch (cf. torch.ch), whereas the latter is, technically speaking, an example of a deep learning framework rather than a classifier as such, which, however, can be used to build an appropriate classifier. Typically, these classifiers allow for the application of one or more standard classifiers, either alone or in combination, as for example MeTA with Naive Bayes, SVM, and/or kNN (using BM25 as a ranking structure), Mallet with Naive Bayes, Max Entropy, Decision Tree, and/or Winnow trainer, or Torch using deep learning approaches.
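For readers unfamiliar with these toolkits, the kind of standard classifier they provide can be sketched concretely. Below is a minimal multinomial Naive Bayes text classifier in plain Python with Laplace smoothing; the class name and the toy training data are this article's own illustrative assumptions, not code from MeTA, Mallet, or Torch:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # number of documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()

    def train(self, documents):
        """documents: iterable of (text, label) pairs."""
        for text, label in documents:
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest posterior log-probability."""
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.class_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.class_counts[label] / total_docs)  # log prior
            for w in words:
                # Laplace-smoothed log likelihood of each word under this class
                score += math.log(
                    (self.word_counts[label][w] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

On a toy corpus of document pages labeled by form type, training on a handful of (text, label) pairs and calling `classify` suffices to route a new page to the most probable class.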
“Prior art systems and engines, able to analyze large data sets in various ways and implementing big data approaches, normally depend heavily on the expertise of the engineer who has considered the data set and its expected structure. The larger the number of features of a data set, sometimes called fields or attributes of a record, the greater the number of possibilities for analyzing combinations of features and feature values. Accordingly, there is a demand for modalities that facilitate automatically analyzing large data sets quickly and effectively. Examples of uses for the disclosed inventive classification system are, inter alia, the analysis of complex life or non-life insurance or risk-transfer submission documents, identifying fraudulent registrations, identifying the purchasing likelihood of contacts, and identifying feature dependencies to enhance an existing classification application. The disclosed classification technology consistently and significantly outperforms the known prior art systems, such as MeTA, Mallet, or Torch and their implemented classification structures, such as Naive Bayes, SVM, and/or kNN, Max Entropy, Decision Tree, Winnow trainer, and/or deep learning approaches.”
There is additional summary information. Please visit full patent to read further.
Supplementing the background information on this patent, NewsRx reporters also obtained the inventor’s summary information for this patent: “It is one object of the present invention to provide an automated or at least semi-automated labeling and classification engine and system without the above-discussed drawbacks. The automated or semi-automated engine and system should be able to generate test data without significant human interference. The system shall have the technical capability to apply a particular decision based on which a particular form can be easily and automatically added to the whole training set. On the other hand, the system should be able to automatically refer similar pages to a selected page or form if it is not clear how a page or data element should be classified. The system should also provide structures for automatically detecting inconsistently or wrongly labeled data, and it should provide appropriate warnings. Instead of having to label thousands of training pages, which are mostly identical and thus do not add much value, the system should be able to detect and flag gaps in the training data so that these gaps can be closed easily. Once all training and testing data have been labeled, the system should also provide automated features for detecting and flagging where inconsistent labels have been assigned for the same forms. Finally, the system should allow for an automated rating, specifically regarding the quality of the OCR text or classification. This prevents bad data from being added to the training set. In summary, it is an object of the present invention to provide a new labeling, classification and metalearning system, wherein the number of cases that are classified correctly may be used to arrive at an estimated accuracy of the operation of the system. The aim is to provide a highly accurate, automated or semi-automated system and engine that is easy to implement and achieves a novel level of efficiency when dealing with large and multiple data sets.
“According to the present invention, these objects are achieved, particularly, with the features of the independent claims. In addition, further advantageous embodiments can be derived from the dependent claims and the related descriptions.”
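The inconsistent-label detection described in the summary, flagging cases where the same form received different labels, can be illustrated with a short sketch. Fingerprinting "same forms" by whitespace- and case-normalized text is this article's simplifying assumption; the patent does not specify the matching mechanism:

```python
import hashlib
from collections import defaultdict

def find_inconsistent_labels(labeled_pages):
    """Flag pages whose (near-)identical content received different labels.

    labeled_pages: iterable of (page_text, label) pairs.
    Returns {fingerprint: labels} for every group with conflicting labels.
    """
    labels_by_fingerprint = defaultdict(set)
    for text, label in labeled_pages:
        # Normalize case and whitespace, then hash to get a content fingerprint.
        normalized = " ".join(text.lower().split())
        fingerprint = hashlib.sha256(normalized.encode()).hexdigest()
        labels_by_fingerprint[fingerprint].add(label)
    # Keep only the fingerprints where more than one distinct label was assigned.
    return {fp: labels for fp, labels in labels_by_fingerprint.items()
            if len(labels) > 1}
```

A reviewer can then surface each conflicting group with its competing labels as the "appropriate warnings" the summary calls for.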
There is additional summary information. Please visit full patent to read further.
The claims supplied by the inventors are:
“1. A semi- or fully automated, integrated learning and labeling and classification learning system with closed, self-sustaining pattern recognition, labeling and classification operation, comprising: circuitry configured to implement a machine learning classifier; select unclassified data sets and convert the unclassified data sets into an assembly of graphic and text data forming compound data sets to be classified, wherein, by generated feature vectors of training data sets, the machine learning classifier is trained for improving the classification operation of the automated labeling and classification system generically during training as a measure of classification performance, if the automated labeling and classification system is applied to unlabeled and unclassified data sets, and wherein unclassified data sets are classified by applying the machine learning classifier of the automated labeling and classification system to the compound data set of the unclassified data sets; generate training data sets, wherein for each data set of selected test data sets, a feature vector is generated comprising a plurality of labeled features associated with different selected test data sets; generate a two-dimensional confusion matrix based on the feature vector of the selected test data sets, wherein a first dimension of the two-dimensional confusion matrix comprises pre-processed labeled features of the feature vectors of the selected test data sets and a second dimension of the two-dimensional confusion matrix comprises classified and verified features of the feature vectors of the selected test data sets by applying the machine learning classifier to the selected test data sets; and in case an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, assign the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets, and generate additional training data sets based on the confusion matrix, which are added to the training data sets for filling in gaps in the training data sets and improving the measurable performance of the automated labeling and classification system, wherein the circuitry is configured to ignore a given page of a data set if the given page comprises non-relevant text compared to average pages, and the label of a previous page is assigned during inference.
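The confusion-matrix feedback loop at the core of claim 1 can be sketched roughly as follows: one dimension holds the pre-labeled (reference) classes, the other the classifier's predictions, and mismatches are routed back into the training set. This is a minimal illustration with hypothetical function names; the patent's actual feature vectors and routing logic are more elaborate:

```python
from collections import defaultdict

def build_confusion_matrix(true_labels, predicted_labels):
    """Two-dimensional confusion matrix: matrix[true][predicted] = count."""
    matrix = defaultdict(lambda: defaultdict(int))
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def misclassified_indices(true_labels, predicted_labels):
    """Indices of test items whose prediction disagrees with the reference
    label; these are the candidates to assign back into the training set."""
    return [i for i, (t, p) in enumerate(zip(true_labels, predicted_labels))
            if t != p]
```

Off-diagonal cells of the matrix point to class pairs the classifier confuses, which is exactly where additional training data would fill gaps.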
“2. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to apply distribution scaling to the data sets scaling word counts so that pages with a small number of words are not underrepresented.
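One common way to realize the scaling of claim 2 is L1 normalization of each page's word counts, so that a 10-word page and a 1,000-word page contribute on a comparable scale. The specific formula is an illustrative assumption; the claim does not name one:

```python
def scale_word_counts(word_counts):
    """L1-normalize a page's word counts to relative frequencies, so that
    pages with few words are not underrepresented against long pages."""
    total = sum(word_counts.values())
    if total == 0:
        return dict(word_counts)  # empty page: nothing to scale
    return {w: c / total for w, c in word_counts.items()}
```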
“3. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to boost a probability of words that are unique for a certain class as compared to other words that occur relatively frequently in other classes.
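The boosting of claim 3 resembles inverse-document-frequency weighting computed over classes rather than documents: a word confined to one class gets a larger weight than a word spread across all classes. This is an analogy chosen for illustration, not the patent's formula:

```python
import math

def class_uniqueness_boost(word, class_word_counts):
    """Weight a word by how concentrated it is in few classes.

    class_word_counts: {class_name: {word: count}}.
    Returns a larger value the fewer classes contain the word.
    """
    n_classes = len(class_word_counts)
    n_with_word = sum(1 for counts in class_word_counts.values()
                      if counts.get(word, 0) > 0)
    if n_with_word == 0:
        return 0.0  # unseen word carries no boost
    return math.log(1 + n_classes / n_with_word)
```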
“4. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to filter out data sets with spikes as representing unlikely scenarios.
“5. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to accept a selection of defined features to be ignored by the machine learning classifier.
“6. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to convert the selected unclassified data sets to an assembly of graphic and text data forming a compound data set to be classified, and to pre-process the unclassified data sets by optical character recognition converting images of typed, handwritten or printed text into machine-encoded text.
“7. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to convert the selected unclassified data sets to an assembly of graphic and text data forming a compound data set to be classified, to pre-process and store the graphic data as raster graphics images in tagged image file format, and to store the text data in plain text format or rich text format.
“8. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that each feature vector comprises a plurality of invariant features associated with a specific data set or an area of interest of a data set.
“9. The automated learning, labeling and classification system according to claim 8, wherein the circuitry is configured such that the invariant features of the graphic data of the compound data set of the specific data set comprise scale invariant, rotation invariant, and position invariant features.
“10. The automated learning, labeling and classification system according to claim 8, wherein the circuitry is configured such that the area of interest comprises a representation of at least a portion of a subject object within image or graphic data of the compound data set of the specific data set, the representation comprising at least one of an object axis, an object base point, or an object tip point, and wherein the invariant features comprise at least one of a normalized object length, a normalized object width, a normalized distance from an object base point to a center of a portion of the image or graphic data, an object or portion radius, a number of detected distinguishable parts of the portion or the object, a number of detected features pointing in the same direction, a number of features pointing in the opposite direction of a specified feature, or a number of detected features perpendicular to a specified feature.
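The normalized, scale-invariant features of claim 10 can be illustrated by dividing raw pixel measurements by the image diagonal, so the same object yields the same feature values at any resolution. The normalization constant is this article's assumption; the claim names the features but not their formulas:

```python
import math

def invariant_features(object_length, object_width, base_point, image_size):
    """Scale-invariant geometric features of a detected object.

    object_length, object_width: raw measurements in pixels.
    base_point: (x, y) of the object base point; image_size: (width, height).
    All features are normalized by the image diagonal, so rescaling the
    image leaves them unchanged.
    """
    w, h = image_size
    diagonal = math.hypot(w, h)
    center = (w / 2, h / 2)
    return {
        "normalized_length": object_length / diagonal,
        "normalized_width": object_width / diagonal,
        "normalized_base_to_center": math.hypot(base_point[0] - center[0],
                                                base_point[1] - center[1]) / diagonal,
    }
```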
“11. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that the pre-processed labeled features of the feature vectors of the selected test data sets comprise manually labeled pre-processed features of the feature vectors of the selected test data sets as a verified gold standard.
“12. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured, in case that an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, to assign the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets if comparable training data sets are triggered within the training data sets based on the confusion matrix, and to create a new labeling feature of recognizable feature vector if no comparable training data sets are triggered within the training data sets.
“13. A fully or partially automated, integrated learning, labeling and classification learning method for closed, self-sustaining pattern recognition, labeling and classification operation, comprising: implementing, by circuitry of an automated labeling and classification system, a machine learning classifier; selecting, by the circuitry, unclassified data sets and converting the unclassified data sets into an assembly of graphic and text data forming a compound data set to be classified, wherein, by feature vectors of training data sets, the machine learning classifier is trained for generically improving the classification operation of the automated labeling and classification system during training as a measure of classification performance if the automated labeling and classification system is applied to unclassified data sets, and classifying unclassified data sets by applying the machine learning classifier of the automated labeling and classification system to the compound data set; generating, by the circuitry, training data sets, wherein for each data set of selected test data sets, a feature vector is generated comprising a plurality of labeled features associated with different selected test data sets; generating, by the circuitry, a two-dimensional confusion matrix based on the feature vector of the selected test data sets, wherein a first dimension of the two-dimensional confusion matrix comprises pre-processed labeled features of the feature vectors of the selected test data sets, and a second dimension of the two-dimensional confusion matrix comprises classified and verified features of the feature vectors of the selected test data sets by applying the machine learning classifier to the selected test data sets; and in case that an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, assigning, by the circuitry, the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets, and generating additional training data sets, based on the confusion matrix by the automated labeling and classification system, which are added to the training data sets, thereby filling in gaps in the training data sets and improving the measurable performance of the automated labeling and classification system, wherein the method further comprises ignoring a given page of a data set if the given page comprises non-relevant text compared to average pages, and the label of a previous page is assigned during inference.
“14. The fully or partially automated, integrated learning, labeling and classification method for closed, self-sustaining pattern recognition, labeling and classification operation according to claim 13, further comprising, in case that an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, assigning, based on the confusion matrix, the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets if comparable training data sets are triggered within the training data sets, and creating a new labeling feature of recognizable feature vector if no comparable training data sets are triggered within the training data sets.”
There are additional claims. Please visit full patent to read further.
For the URL and additional information on this patent, see: Mueller, Felix. Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof.
(Our reports deliver fact-based news of research and discoveries from around the world.)