Patent Issued for Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof (USPTO 11475072): Swiss Reinsurance Company Ltd.
2022 NOV 03 (NewsRx) --
The patent’s inventor is Mueller, Felix (Waedenswil, CH).
This patent was filed on
From the background information supplied by the inventors, news correspondents obtained the following quote: “The vast majority of projects and/or proprietary data nowadays are based on structured data. However, it is estimated that eighty percent of an organization’s data is unstructured, or only semi-structured. A significant portion of such unstructured and semi-structured data is in the form of documents. The industry and/or organizations try applying analytics tools for handling their structured data in an effort to ease access, data processing and data management. However, this does not mean that such unstructured and semi-structured documents no longer exist. They have been, and will continue to be, an important aspect of an organization’s data inventory. Further, semi-structured and unstructured documents are often voluminous. Such documents can consist of hundreds or even thousands of individual papers. For example, a risk-transfer underwriting document, a risk-transfer claim document or a purchaser’s mortgage document can be stored as a single 500-page or even larger document comprising individual papers, such as, for the latter case, the purchaser’s income tax returns, credit reports, appraiser’s reports, and so forth, bundled into a single mortgage document. Each purchaser is associated with a different mortgage document. Thus, the size and volume of documents can be very large. Documents may be stored across various storage systems and/or devices and may be accessed by multiple departments and individuals. Documents may include different types of information and may comprise various formats. They may be used in many applications, e.g. mortgages and lending, healthcare, environmental, and the like; moreover, they draw their information from multiple sources, like social networks, server logs, and information from banking transactions, web content, GPS trails, financial or stock market data, etc.
“More than data accumulation within organizational structures, the recent years have further been characterized by a tremendous growth in natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents and social media data, such as blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand for powerful data processing tools and engines to help individuals manage and analyze vast amounts of structured, semi-structured, and unstructured data as pure text data effectively and efficiently. Unlike data generated by a computer system, sensors or measuring devices, these text data are often generated by humans for humans, without any intermediary instance. In particular, such text data that are generated by humans for humans accumulate into an important and valuable data source for exploring human opinions and preferences or for analyzing or triggering other human-driven factors, in addition to many other types of knowledge that people encode in text form. Also, since these text data are written for consumption by humans, humans play a critical role in any prior art text data application system; a text management and analysis system must therefore typically involve the human element in the text analysis loop.
“According to the prior art, existing tools and engines supporting text management and analysis can be divided into two categories. The first category includes search engines and search engine toolkits, which are especially suitable for building a search engine application but tend to have limited support for text analysis/mining functions. Examples include Lucene, Terrier, and Indri/Lemur. The second category is text mining or general data mining and machine learning toolkits, which tend to selectively support some text analysis functions but generally do not support a search capability. Nonetheless, combining and seamlessly integrating search engine capabilities with sophisticated text analysis capabilities is necessary for two reasons. First, while the raw data may be large, as discussed above, the data relevant to any particular problem is often a relatively small subset; search engines are an essential tool for quickly discovering that small subset of relevant text data in a large text collection. Second, search engines help analysts interpret any patterns that are discovered within the data by allowing them to examine the relevant original text data in order to make sense of a discovered pattern. A possible solution should therefore emphasize a tight integration of search capabilities (or text access capabilities in general) with text analysis functions, thus providing full support for building a powerful text analysis engine and tool.
“Further, in the prior art, there already exist different classifiers, as e.g. MeTA (cf. meta-toolkit.org), Mallet (cf. mallet.cs.umass.edu) or Torch (cf. torch.ch), whereas the latter is, technically speaking, an example of a deep learning system, not a classifier as such, which, however, can be used to build an appropriate classifier. Typically, these toolkits allow for the application of one or more standard classifiers, either alone or in combination, as for example MeTA with Naive Bayes, SVM, and/or kNN (using BM25 as a ranking structure), Mallet with Naive Bayes, Max Entropy, Decision Tree, and/or Winnow trainer, or Torch using deep learning approaches.
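As a generic illustration of the kind of standard classifier these toolkits build (not code from any of the named toolkits or from the patent; the function names and toy insurance-document labels are invented), a minimal multinomial Naive Bayes text classifier can be sketched in plain Python:

```python
from collections import Counter
import math

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()   # documents seen per class
    word_counts = {}           # per-class token frequency tables
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the most probable class label under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # smoothed denominator
        logp = math.log(n_docs / total_docs)       # class prior
        for token in tokens:
            logp += math.log((counts[token] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Toolkits such as MeTA or Mallet wrap this kind of model behind a configuration-driven pipeline rather than exposing it this directly.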
“Prior art systems and engines, able to analyze large data sets in various ways and implementing big data approaches, normally depend heavily on the expertise of the engineer who has considered the data set and its expected structure. The larger the number of features of a data set, sometimes called fields or attributes of a record, the greater the number of possibilities for analyzing combinations of features and feature values. Accordingly, there is a demand for modalities that facilitate automatically analyzing large data sets quickly and effectively. Examples of uses for the disclosed inventive classification system are, inter alia, the analysis of complex life or non-life insurance or risk-transfer submission documents, the identification of fraudulent registrations, the identification of the purchasing likelihood of contacts, and the identification of feature dependencies to enhance an existing classification application. The disclosed classification technology consistently and significantly outperforms the known prior art systems, such as MeTA, Mallet, or Torch, and their implemented classification structures, such as Naive Bayes, SVM, and/or kNN, Max Entropy, Decision Tree, Winnow trainer, and/or deep learning approaches.”
There is additional summary information. Please visit full patent to read further.
Supplementing the background information on this patent, NewsRx reporters also obtained the inventor’s summary information for this patent: “It is one object of the present invention to provide an automated or at least semi-automated labeling and classification engine and system without the above-discussed drawbacks. The automated or semi-automated engine and system should be able to generate test data without significant human interference. The system shall have the technical capability to apply a particular decision based on which a particular form can be easily and automatically added to the whole training set. On the other hand, the system should be able to automatically refer similar pages to a selected page or form if it is not clear how a page or data element should be classified. The system should also provide structures for automatically detecting inconsistently or wrongly labeled data, and it should provide appropriate warnings. Instead of having to label thousands of training pages, which are mostly identical and thus do not add much value, the system should be able to detect and flag gaps in the training data so that these gaps can be closed easily. Once all training and testing data have been labeled, the system should also provide automated features for detecting and flagging where inconsistent labels have been assigned to the same forms. Finally, the system should allow for an automated rating of OCR text quality or classification quality. This prevents bad data from being added to the training set. In summary, it is an object of the present invention to provide a new labeling, classification and meta learning system, wherein the number of cases that are classified correctly may be used to arrive at an estimated accuracy of the operation of the system.
“The aim is to provide a highly accurate automated or semi-automated system and engine that is easy to implement and achieves a novel level of efficiency when dealing with large and multiple data sets.
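The estimated accuracy and the inconsistent-label flagging the inventor describes can be sketched in a few lines of plain Python; this is an editorial illustration with invented helper names, not the patented implementation:

```python
def accuracy(gold, predicted):
    """Fraction of cases classified correctly -- the estimated accuracy the summary refers to."""
    pairs = list(zip(gold, predicted))
    return sum(g == p for g, p in pairs) / len(pairs)

def inconsistent_labels(labeled_pages):
    """Flag forms that received more than one distinct label across the labeled set."""
    seen = {}
    for form_id, label in labeled_pages:
        seen.setdefault(form_id, set()).add(label)
    return {form_id for form_id, labels in seen.items() if len(labels) > 1}
```

In a real labeling pipeline the flagged forms would be surfaced as warnings for a human reviewer rather than silently corrected.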
“According to the present invention, these objects are achieved, particularly, with the features of the independent claims. In addition, further advantageous embodiments can be derived from the dependent claims and the related descriptions.”
There is additional summary information. Please visit full patent to read further.
The claims supplied by the inventors are:
“1. An automated, integrated learning and labeling and classification learning system with closed, self-sustaining pattern recognition, labeling and classification operation, comprising: circuitry configured to implement a machine learning classifier, the machine learning classifier comprising a non-probabilistic, binary, linear, support vector machines classifier and/or a non-parametric k-Nearest Neighbors classifier, and/or an exponential, probabilistic, max entropy classifier, and/or decision tree classifier based on a finite set of values, and/or Balanced Winnow classifier, and/or deep learning classifiers using multiple processing layers composed of multiple linear and non-linear transformations; select unclassified data sets and convert the unclassified data sets into an assembly of graphic and text data forming compound data sets to be classified, wherein, by generated feature vectors of training data sets, the machine learning classifier is trained for improving the classification operation of the automated system during training with respect to a measure of the classification performance, in case of applying the automated system to unlabeled and unclassified data sets, and wherein unclassified data sets are classified by applying the machine learning classifier of the automated system to the compound data set of the unclassified data sets, the machine learning classifier comprising at least a population of separate rule sets, such that a learning operation recombines and reproduces a best of the rule sets, or the machine learning classifier comprising a single set of rules in a defined population, such that the learning operation selects best classifiers within the single set of rules; generate training data sets, wherein for each data set of selected test data sets, a feature vector is generated comprising a plurality of labeled features associated with the different selected test data sets; generate a two-dimensional confusion matrix based on the feature vector of the test data sets, wherein a first dimension of the two-dimensional confusion matrix comprises pre-processed labeled features of the feature vectors of the test data sets and a second dimension of the two-dimensional confusion matrix comprises classified and verified features of the feature vectors of the test data sets by applying the machine learning classifier to the test data sets; in case an inconsistently or wrongly classified test data set and/or feature of a test data set is detected, assign the inconsistently or wrongly classified test data set and/or feature of the test data set to the training data sets, and generate additional training data sets based on the confusion matrix, which are added to the training data sets for filling in gaps in the training data sets and improving the measurable performance of the automated system, wherein additional training data sets are generated to fill the gaps in the training data sets in response to comparable training data sets being triggered within the training data sets, and wherein a new labeling feature of a recognizable feature vector is created in response to no comparable training data sets being triggered within the training data sets; and trigger data sets triggered for spikes in the data sets, wherein data sets with spikes are filtered out representing unlikely data sets by providing a correcting action of spike features for the machine learning classifier by filtering out the unlikely data sets to improve overall performance of the automated system by selecting the spike features to be ignored by the machine learning classifier.
“2. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that the machine learning classifier comprises at least a scalable Naive Bayes classifier based on a linear number of parameters in the number of features and predictors, respectively.
“3. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that the machine learning classifier comprises a non-probabilistic, binary, linear, support vector machines classifier and a non-parametric k-Nearest Neighbors classifier, and an exponential, probabilistic, max entropy classifier, and decision tree classifier based on a finite set of values, and Balanced Winnow classifier, and deep learning classifiers using multiple processing layers composed of multiple linear and non-linear transformations.
“4. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that the machine learning classifier applies unigrams and bigrams, and/or a combination of unigrams and bigrams or n-grams to the machine learning classifier.
“5. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to apply distribution scaling to the data sets scaling word counts so that pages with a small number of words are not underrepresented.
“6. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to boost a probability of words that are unique for a certain class as compared to other words that occur relatively frequently in other classes.
“7. The automated learning and labeling and classification system according to claim 1, wherein the circuitry is configured to ignore a given page of a data set if the given page comprises only little or non-relevant text compared to average pages, and the label of the previous page is assigned during inference.
“8. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to accept a selection of defined features to be ignored by the machine learning classifier.
“9. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to have a predefined threshold value for a performance strength-based and/or accuracy-based classification of the operation performance.
“10. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to convert the selected unclassified data sets to an assembly of graphic and text data forming a compound data set to be classified, and to pre-process the unclassified data sets by optical character recognition converting images of typed, handwritten or printed text into machine-encoded text.
“11. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured to convert the selected unclassified data sets to an assembly of graphic and text data forming a compound data set to be classified, to pre-process and store the graphic data as raster graphics images in tagged image file format, and to store the text data in plain text format or rich text format.
“12. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that each feature vector comprises a plurality of invariant features associated with a specific data set or an area of interest of a data set.
“13. The automated learning, labeling and classification system according to claim 12, wherein the circuitry is configured such that the invariant features of the graphic data of the compound data set of the specific data set comprise scale invariant, rotation invariant, and position invariant features.
“14. The automated learning, labeling and classification system according to claim 12, wherein the circuitry is configured such that the area of interest comprises a representation of at least a portion of a subject object within the image or graphic data of the compound data set of the specific data set, the representation comprising at least one of an object axis, an object base point, or an object tip point, and wherein the invariant features comprise at least one of a normalized object length, a normalized object width, a normalized distance from an object base point to a center of a portion of the image or graphic data, an object or portion radius, a number of detected distinguishable parts of the portion or the object, a number of detected features pointing in the same direction, a number of features pointing in the opposite direction of a specified feature, or a number of detected features perpendicular to a specified feature.
“15. The automated learning, labeling and classification system according to claim 1, wherein the circuitry is configured such that the pre-processed labeled features of the feature vectors of the test data sets comprise manually labeled pre-processed features of the feature vectors of the test data sets as a verified gold standard.
“16. The automated learning, labeling and classification system according to claim 1, wherein the spike features selected to be ignored by the machine learning classifier correspond to non-relevant textual information that is not relevant compared to other textual information.”
There are additional claims. Please visit full patent to read further.
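Claim 1’s two-dimensional confusion matrix, with pre-labeled gold features along one axis and classifier output along the other, and its feedback of misclassified test items into the training set, can be sketched as follows (an editorial illustration with invented helper names, not the claimed circuitry):

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Two-dimensional matrix: first axis = pre-labeled (gold) class, second axis = classifier output."""
    return Counter(zip(gold, predicted))

def harvest_for_training(items, gold, predicted):
    """Collect misclassified test items so they can be moved into the training set to fill gaps."""
    return [item for item, g, p in zip(items, gold, predicted) if g != p]
```

Off-diagonal cells of the matrix point at exactly the label pairs where the training data has gaps.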
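The combination of unigrams and bigrams in claim 4 corresponds to the standard n-gram feature construction; a minimal sketch (function names invented for this illustration):

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences of a page's token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def uni_bi_features(tokens):
    """Combine unigram and bigram features into a single feature list for the classifier."""
    return ngrams(tokens, 1) + ngrams(tokens, 2)
```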
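The distribution scaling of claim 5, normalizing word counts to a common page length so that sparse pages are not underrepresented, might look like this (the target length and names are assumptions of this sketch, not taken from the patent):

```python
from collections import Counter

def scaled_counts(tokens, target_length=100):
    """Rescale raw word counts to a common page length so that pages with
    few words contribute comparable feature mass to dense pages."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty page contributes no features
    factor = target_length / total
    return {word: count * factor for word, count in counts.items()}
```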
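One common way to boost words that are unique to a class, in the spirit of claim 6, is a smoothed log-ratio of in-class to out-of-class frequency; this is a generic technique sketched for illustration, not necessarily the patented mechanism:

```python
import math
from collections import Counter

def uniqueness_boost(class_counts, label, word, alpha=1.0):
    """Smoothed log-ratio of a word's frequency inside `label` versus all other
    classes; words unique to the class receive a large positive weight."""
    in_class = class_counts[label][word]
    in_others = sum(counts[word] for other, counts in class_counts.items() if other != label)
    return math.log((in_class + alpha) / (in_others + alpha))
```

Words frequent everywhere score near zero, while class-exclusive words dominate the decision.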
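The scale-invariant measurements of claims 13 and 14 can be obtained by normalizing raw pixel measurements against a fixed reference such as the image diagonal, so the same object yields the same features regardless of scan resolution; the choice of reference and the names here are assumptions of this sketch:

```python
import math

def invariant_features(length, width, base_to_center, image_w, image_h):
    """Divide raw measurements by the image diagonal so the resulting
    features do not change when the page image is rescaled."""
    diagonal = math.hypot(image_w, image_h)
    return {
        "norm_length": length / diagonal,
        "norm_width": width / diagonal,
        "norm_base_to_center": base_to_center / diagonal,
    }
```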
For the URL and additional information on this patent, see: Mueller, Felix. Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof.
(Our reports deliver fact-based news of research and discoveries from around the world.)