Patent Issued for Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof (USPTO 11836584): Swiss Reinsurance Company Ltd.

2023 DEC 27 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- Swiss Reinsurance Company Ltd. (Zurich, Switzerland) has been issued patent number 11836584, according to news reporting originating out of Alexandria, Virginia, by NewsRx editors.

The patent’s inventor is Mueller, Felix (Waedenswil, CH).

This patent was filed on September 22, 2020 and was published online on December 5, 2023.

From the background information supplied by the inventors, news correspondents obtained the following quote: “The vast majority of projects and/or proprietary data nowadays are based on structured data. However, it is estimated that eighty percent of an organization’s data is unstructured, or only semi-structured. A significant portion of such unstructured and semi-structured data is in the form of documents. The industry and/or organizations try applying analytics tools for handling their structured data in efforts of easing access, data processing and data management. However, this does not mean that such unstructured and semi-structured documents no longer exist. They have been, and will continue to be, an important aspect of an organization’s data inventory. Further, semi-structured and unstructured documents are often voluminous. Such documents can consist of hundreds or even thousands of individual papers. For example, a risk-transfer underwriting document, a risk-transfer claim document or a purchaser’s mortgage document can be stored as a single 500-page or even larger document comprising individual papers, such as, for the latter case, the purchaser’s income tax returns, credit reports, appraisers reports, and so forth, bundled into a single mortgage document. Each purchaser is associated with a different mortgage document. Thus, the size and volume of documents can be very large. Documents may be stored across various storage systems and/or devices and may be accessed by multiple departments and individuals. Documents may include different types of information and may comprise various formats. They may be used in many applications, as e.g. mortgages and lending, healthcare, environmental, and the like; moreover, they are draw their information from multiple sources, like social networks, server logs, and information from banking transactions, web content, GPS trails, financial or stock market data, etc.

“More than data accumulation within organizational structures, the recent years have further been characterized by a tremendous growth in natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents and social media data, such as blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand for powerful data processing tools and engines to help individuals manage and analyze vast amounts of structured, semi-structured, and unstructured data as pure text data, effectively and efficiently. Unlike data generated by a computer system, sensors or measuring devices, these text data are often generated by humans for humans without intermediary instance. In particular, such text data that are generated by humans for humans accumulate into an important and valuable data source for exploring human opinions and preferences or for analyzing or triggering other human-driven factors, in addition to many other types of knowledge that people encode in text form. Also, since these text data are written for consumption by humans, humans play a critical role in any prior art text data application system; a text management and analysis system must therefore typically involve the human element in the text analysis loop.

“According to the prior art, existing tools and engines supporting text management and analysis can be divided into two categories. The first category includes search engines and search engine toolkits, which are especially suitable for building a search engine application but tend to have limited support for text analysis/mining functions. Examples include Lucene, Terrier, and Indri/Lemur. The second category is text mining or general data mining and machine learning toolkits, which tend to selectively support some text analysis functions but generally do not support a search capability. Nonetheless, combining and seamlessly integrating search engine capabilities with sophisticated text analysis capabilities is necessary for two reasons: While the raw data may be large for any particular problem, as discussed above, it is often a relatively small subset of the relevant data. Search engines are an essential tool for quickly discovering a small subset of relevant text data in a large text collection. On the other hand, however, we need search engines that will help analysts interpret any patterns that are discovered within the data by, allowing them to examine any relevant original text data in order to make sense of any discovered pattern. A possible solution should therefore emphasize a tightly controlled integration of search capabilities (or text access capabilities in general) with text analysis functions, thus facilitating a fully supported function for building a powerful text analysis engine and tool.

“Further, in the prior art, there already exist different classifiers, as e.g. MeTA (cf. meta-toolkit.org), Mallet (cf. mallet.cs.umass.edu) or Torch (cf. torch.ch), whereas the latter is, technically speaking, an example for a deep learning system, not a classifier as such, which, however, can be used to build an appropriate classifier. Typically, these classifiers allow for the application of one or more standard classifiers, either alone or in combination, as for example MeTA with Naive Bayes, SVM, and/or kNN (using BM25 as a ranking structure), Mallet with Naive Bayes, Max Entropy, Decision Tree, and/or Winnow trainer, or Torch using deep learning approaches (Conv Net for image and text processing). Under the standard classifiers, Naive Bayes Classifier (NBC) is one of the most widely used standard classifiers in the field of the machine-learning systems. Often, NBC is the first classifying structure that tried in the context of a new field. However, one of the inherent weakness of its technical assumption, that features are independent. Bayesian networks bypass this assumption by encoding feature dependence into the structure of the network. This normally works well in classification applications with substantial dependence among certain sets of features, and such dependence is either known or learnable from a sufficiently large training set. In the latter case, one may use one of the various structure-learning algorithms; however, they are are not always operable, because the structure-learning problem is very complex. Undirected graphical models are another option for bypassing this assumption. Like Bayesian networks, they also require domain knowledge or a structure-learning procedure. Domain knowledge can be manually coded into the structure; but this only works if domain knowledge is available and, even then, manual coding is usually very laborious. Alternatively, structure-learning algorithms require extensive training data and present a technically and computationally complex problem; in addition, there is the concern of over-fitting a model due to the use of a limited data set, thereby creating a model that predicts structures that are less likely to work well on data not seen during training. Undirected graphical models typically also operate poorly if either the domain knowledge is wrong or if the instance of the structure-learning problem is either intrinsically difficult or there is insufficient training data.

“Prior art systems and engines, able to analyze large data sets in various ways and implementing big data approaches, normally depend heavily on the expertise of the engineer, who has considered the data set and its expected structure. The larger the number of features of a data set, sometimes called fields or attributes of a record, the greater the number of possibilities for analyzing combinations of features and feature values. Accordingly, there is a demand for modalities that facilitate automatically analyzing large data sets quickly and effectively. Examples of uses for the disclosed inventive classification system are, inter alia, the analysis of complex life or non-life insurance or risk-transfer submission documents, means for identifying fraudulent registrations, for identifying the purchasing likelihood of contacts, and identifying feature dependencies to enhance an existing classification application. The disclosed classification technology consistently and significantly outperforms the known prior art systems, such as MeTA, Mallet, or Torch and their implemented classification structures, such as Naive Bayes, SVM, and/or kNN, Max Entropy, Decision Tree, Winnow trainer, and/or deep learning approaches.”

There is additional summary information. Please visit full patent to read further.

Supplementing the background information on this patent, NewsRx reporters also obtained the inventor’s summary information for this patent: “It is one object of the present invention to provide an automated or at least semi-automated labeling and classification engine and system without the above-discussed drawbacks. The automated or semi-automated engine and system should be able to generate test data without significant human interference. The system shall have the technical capability to apply a particular decision based on which a particular form can be easily and automatically added to the whole training set. On the other hand, the system should be able to automatically refer similar pages to a selected page or form, if it is not clear how a page or data element should be classified. The system should also provide structures for automatically detecting inconsistently or wrongly labeled data, and it should provide appropriate warnings. Instead of having to label thousands of training pages, which are mostly identical and thus not adding much value, the system should be able to detect and flag gaps in the training data so that these gaps can be closed easily. Once all training and testing data have been labeled, the system should also provide automated features for detecting and flagging where inconsistent labels have been assigned for the same forms. Finally, the system should allow for an automated rating, specifically regarding the good or bad quality of an OCR text quality or classification. This allows for preventing that bad data are added to the training set. In summary, it is an object of the present invention to provide a new labeling, classification and metalearning system, wherein the number of cases that is classified correctly may be used to arrive at an estimated accuracy of the operation of the system. The aim is to provide a highly accurate, and thereby automated or semi-automated system, and engine that is easy to implement and achieves a novel level of efficiency when dealing with large and multiple data set.

“According to the present invention, these objects are achieved, particularly, with the features of the independent claims. In addition, further advantageous embodiments can be derived from the dependent claims and the related descriptions.”

There is additional summary information. Please visit full patent to read further.

The claims supplied by the inventors are:

“1. A semi- or fully automated, integrated learning and labeling and classification learning system with closed, self-sustaining pattern recognition, labeling and classification operation, comprising: circuitry configured to implement a machine learning classifier; select unclassified data sets and convert the unclassified data sets into an assembly of graphic and text data forming compound data sets to be classified, wherein, by generated feature vectors of training data sets, the machine learning classifier is trained for improving the classification operation of the automated labeling and classification system generically during training as a measure of classification performance, if the automated labeling and classification system is applied to unlabeled and unclassified data sets, and wherein unclassified data sets are classified by applying the machine learning classifier of the automated labeling and classification system to the compound data set of the unclassified data sets; generate training data sets, wherein for each data set of selected test data sets, a feature vector is generated comprising a plurality of labeled features associated with different selected test data sets; generate a two-dimensional confusion matrix based on the feature vector of the selected test data sets, wherein a first dimension of the two-dimensional confusion matrix comprises pre-processed labeled features of the feature vectors of the selected test data sets and a second dimension of the two-dimensional confusion matrix comprises classified and verified features of the feature vectors of the selected test data sets by applying the machine learning classifier to the selected test data sets; and in case an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, assign the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets, and generate additional training data sets based on the confusion matrix, which are added to the training data sets for filling in gaps in the training data sets and improving the measurable performance of the automated labeling and classification system, wherein the circuitry is configured to ignore a given page of a data set if the given page comprises non-relevant text compared to average pages, and the label of a previous page is assigned during inference.

“2. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to apply distribution scaling to the data sets scaling word counts so that pages with a small number of words are not underrepresented.

“3. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to boost a probability of words that are unique for a certain class as compared to other words that occur relatively frequently in other classes.

“4. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to filter out data sets with spikes as representing unlikely scenarios.

“5. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to accept a selection of defined features to be ignored by the machine learning classifier.

“6. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to convert the selected unclassified data sets to an assembly of graphic and text data forming a compound data set to be classified, and to pre-process the unclassified data sets by optical character recognition converting images of typed, handwritten or printed text into machine-encoded text.

“7. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to convert the selected unclassified data sets to an assembly of graphic and text data forming a compound data set to be classified, to pre-process and store the graphic data as raster graphics images in tagged image file format, and to store the text data in plain text format or rich text format.

“8. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that each feature vector comprises a plurality of invariant features associated with a specific data set or an area of interest of a data set.

“9. The automated learning, labeling and classification system according to claim 8, wherein the circuitry is configured such that the invariant features of the graphic data of the compound data set of the specific data set comprise scale invariant, rotation invariant, and position invariant features.

“10. The automated learning, labeling and classification system according to claim 8, wherein the circuitry is configured such that the area of interest comprises a representation of at least a portion of a subject object within image or graphic data of the compound data set of the specific data set, the representation comprising at least one of an object axis, an object base point, or an object tip point, and wherein the invariant features comprise at least one of a normalized object length, a normalized object width, a normalized distance from an object base point to a center of a portion of the image or graphic data, an object or portion radius, a number of detected distinguishable parts of the portion or the object, a number of detected features pointing in the same direction, a number of features pointing in the opposite direction of a specified feature, or a number of detected features perpendicular to a specified feature.

“11. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that the pre-processed labeled features of the feature vectors of the selected test data sets comprise manually labeled pre-processed features of the feature vectors of the selected test data sets as a verified gold standard.

“12. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured, in case that an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, to assign the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets if comparable training data sets are triggered within the training data sets based on the confusion matrix, and to create a new labeling feature of recognizable feature vector if no comparable training data sets are triggered within the training data sets.

“13. A fully or partially automated, integrated learning, labeling and classification learning method for closed, self-sustaining pattern recognition, labeling and classification operation, comprising: implementing, by circuitry of an automated labeling and classification system, a machine learning classifier; selecting, by the circuitry, unclassified data sets and converting the unclassified data sets into an assembly of graphic and text data forming a compound data set to be classified, wherein, by feature vectors of training data sets, the machine learning classifier is trained for generically improving the classification operation of the automated labeling and classification system during training as a measure of classification performance if the automated labeling and classification system is applied to unclassified data sets, and to classifiying unclassified data sets by applying the machine learning classifier of the automated labeling and classification system to the compound data set; generating, by the circuitry, training data sets, wherein for each data set of selected test data sets, a feature vector is generated comprising a plurality of labeled features associated with different selected test data sets; generating, by the circuitry, a two-dimensional confusion matrix based on the feature vector of the selected test data sets, wherein a first dimension of the two-dimensional confusion matrix comprises pre-processed labeled features of the feature vectors of the selected test data sets, and a second dimension of the two-dimensional confusion matrix comprises classified and verified features of the feature vectors of the selected test data sets by applying the machine learning classifier to the selected test data sets; and in case that an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, assigning, by the circuitry, the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets, and generating additional training data sets, based on the confusion matrix by the automated labeling and classification system, which are added to the training data sets, thereby filling in gaps in the training data sets and improving the measurable performance of the automated labeling and classification system, wherein the method further comprises ignoring a given page of a data set if the given page comprises non-relevant text compared to average pages, and the label of a previous page is assigned during inference.

“14. The fully or partially automated, integrated learning, labeling and classification method for closed, self-sustaining pattern recognition, labeling and classification operation according to claim 13, further comprising, in case that an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, assigning, based on the confusion matrix, the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets if comparable training data sets are triggered within the training data sets, and creating a new labeling feature of recognizable feature vector if no comparable training data sets are triggered within the training data sets.”

There are additional claims. Please visit full patent to read further.

For the URL and additional information on this patent, see: Mueller, Felix. Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof. U.S. Patent Number 11836584, filed September 22, 2020, and published online on December 5, 2023. Patent URL (for desktop use only): https://ppubs.uspto.gov/pubwebapp/external.html?q=(11836584)&db=USPAT&type=ids

(Our reports deliver fact-based news of research and discoveries from around the world.)

Patent Issued for Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof (USPTO 11836584): Swiss Reinsurance Company Ltd.

Advisor News

Annuity News

Health/Employee Benefits News

Life Insurance News

Patent Issued for Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof (USPTO 11836584): Swiss Reinsurance Company Ltd.

Advisor News

Annuity News

Health/Employee Benefits News

Life Insurance News

Sign in with your Insider Pro Account