November 18, 2019 Newswires
Patent Issued for Systems And Methods For Integration And Analysis Of Data Records (USPTO 10,467,201)

Insurance Daily News

2019 NOV 18 (NewsRx) -- By a News Reporter-Staff News Editor at Insurance Daily News -- Massachusetts Mutual Life Insurance Company (Springfield, Massachusetts, United States) has been issued patent number 10,467,201, according to news reporting originating out of Alexandria, Virginia, by NewsRx editors.

The patent’s inventors are Merritt, Sears (Groton, MA); Neale, Thom (Bedford, MA).

This patent was filed on December 15, 2016 and was published online on November 18, 2019.

From the background information supplied by the inventors, news correspondents obtained the following quote: “As the processing power of computers allow for greater computer functionality and the Internet technology era allows for interconnectivity between computing systems, more records of data are generated, stored, maintained, and queried every day. As a result, the size and number of datasets and databases available continues to grow and expand exponentially. The datasets and records within these databases may be generated in a variety of ways, from a variety of related or unrelated sources. Furthermore, the datasets and records may be generated at different times, and stored in different formats and locations. As a result, problems occur when users try to query large seemingly unrelated datasets because the relationships between the data records stored in the unrelated databases and records may not be obvious since the records may be stored in different formats or related to different entities and subject matters that do not share common identifiers.

“Conventionally, querying different datasets has been accomplished using a ‘brute force’ method of analyzing all datasets and databases. Existing and conventional methods fail to provide fast and efficient analysis due to a high volume of data existing on different networks and computing infrastructures. Managing and organizing such data on different platforms is difficult due to number, size, content, or relationships of the data within a database. Furthermore, existing and conventional methods consume a large amount of computing power, which is not ideal.”

Supplementing the background information on this patent, NewsRx reporters also obtained the inventors’ summary information for this patent: “For the aforementioned reasons, there is a need for a more efficient and faster system and method for processing large databases, which would determine relationships between different data points and datasets in a more efficient manner than is possible with human intervention or conventional computer data-driven analysis. There is a need for methods and systems to determine relationships between large, seemingly unrelated datasets such that the end result is a new dataset that enables the originally unconnected datasets to be queried as though they contain foreign key references that all point back to a single unified set of entities. These features allow performing large tasks, such as time-consuming analysis and/or querying of different datasets, in a more efficient manner using less computing power than other approaches. The methods and systems disclosed may be implemented using a modular set of sub-systems and instructions that each perform a step of the computation required to scale up the linking of billions of pairs of records. The sub-systems may include preprocessing, blocking, classification and graphical analysis. Each of these sub-systems may comprise one or more sub-steps to complete the sub-system. For example, preprocessing may include converting the datasets to the same type of character encoding and applying a schema normalization step.
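The preprocessing sub-system described in the quote above could be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: records are decoded to a single text encoding and re-keyed onto a "master schema" so later stages can access attributes uniformly. All field names and aliases here are hypothetical.

```python
# Hypothetical master schema: canonical field -> known aliases across sources.
MASTER_SCHEMA = {
    "surname": {"last_name", "lname", "family_name"},
    "given_name": {"first_name", "fname"},
    "postal_code": {"zip", "zip_code", "postcode"},
}

def normalize_record(raw: dict) -> dict:
    """Re-key a raw record onto the master schema so downstream blocking
    can access attributes uniformly, regardless of source column naming."""
    normalized = {}
    for canonical, aliases in MASTER_SCHEMA.items():
        for key, value in raw.items():
            if key == canonical or key in aliases:
                # Decode bytes to a single text encoding (UTF-8 here).
                if isinstance(value, bytes):
                    value = value.decode("utf-8")
                normalized[canonical] = value.strip().lower()
    return normalized

# Two records from differently-named schemas normalize to the same shape.
record_a = {"lname": "Smith ", "fname": "Ann", "zip": "01103"}
record_b = {"family_name": b"Smith", "given_name": "Ann", "postcode": "01103"}
print(normalize_record(record_a) == normalize_record(record_b))  # True
```

After this step, a blocking function can read `surname` or `postal_code` from any record without caring which dataset it came from.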

“The methods and systems of the present disclosure may improve the performance and speed of current database systems by determining probabilistic relationships between sets of apparently disparate data that might exist in multiple databases and datasets. In this manner, the methods and systems of the present disclosure may improve the performance of current database management systems because they enable the searching of multiple apparently unrelated databases using a single series of queries. Determining the probability of a relationship between various data in various datasets allows for improved speed and performance of the relationships established utilizing the present disclosure.
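One minimal way to picture the "probabilistic relationship" idea above: score how likely two field values refer to the same entity with a similarity measure, and treat a score above a threshold as a match. The token-set Jaccard measure used here is an assumed stand-in for illustration; the patent does not specify a particular similarity function.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Match when the similarity score clears a pre-defined threshold."""
    return jaccard(a, b) >= threshold

print(jaccard("Ann B Smith", "Ann Smith"))   # ~0.667
print(is_match("Ann B Smith", "Ann Smith"))  # True
print(is_match("Ann Smith", "Bob Jones"))    # False
```

The threshold plays the role of the "second pre-defined criteria" mentioned in the embodiments below: a tunable cut-off between related and unrelated records.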

“In one embodiment, during the preprocessing step, a server executes instructions to convert every input record into the format required by the final database tables produced by the method as a whole. To accomplish this, the records must undergo a schema normalization step in which semantically equivalent fields are identified so that a blocking function can uniformly access attributes from each record without regard to idiosyncrasies of column naming in each dataset. After schema normalization, blocking serves to reduce the problem from a pairwise comparison of all records in all input datasets to one requiring a much smaller number of pairwise comparisons. Once the records have been grouped into blocks, the records within each block may be compared on a pairwise basis to determine which are related. The systems and methods described herein then analyze the classified block pairs as a graph structure in order to cluster the linked records into groups representing distinct entities. The connected components of the graph structure are then computed, and each is assigned a unique ID. The connected component IDs may then be joined with the original input datasets and function as a unique entity ID that all input records will reference. The entities can then be used to join records across datasets based on the underlying linked entity table created during the linkage process.
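The graphical-analysis step described above can be sketched with a standard union-find pass: treat classified matched pairs as edges of an undirected graph, compute connected components, and assign each component a unique entity ID that all member records reference. The record keys below are hypothetical; this is a generic connected-components sketch, not the patented code.

```python
def connected_components(edges):
    """Union-find over matched record pairs; returns {record: entity_id}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    # Assign a dense unique ID to each component root.
    ids, labels = {}, {}
    for node in parent:
        root = find(node)
        labels[node] = ids.setdefault(root, len(ids))
    return labels

# Matched pairs from the classification step (hypothetical record keys
# of the form "dataset:record").
matches = [("ds1:r1", "ds2:r7"), ("ds2:r7", "ds3:r4"), ("ds1:r9", "ds2:r2")]
labels = connected_components(matches)
print(labels["ds1:r1"] == labels["ds3:r4"])  # True: same linked entity
print(labels["ds1:r1"] == labels["ds1:r9"])  # False: distinct entities
```

Joining these component IDs back onto the original datasets yields the unified entity table that lets previously unrelated datasets be queried together.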

“In another embodiment, a system comprises a first and second database configured to store a plurality of data points, and a server. The server is configured to convert the plurality of data points stored in the first database associated with a first format and the second database associated with a second format, to a common format in a single dataset wherein the plurality of data points are arranged in a plurality of fields. The server is further configured to, upon converting the plurality of data points to the common format, assign a unique id to each field of the plurality of data points. The server is further configured to normalize the plurality of data points arranged in the plurality of fields by applying a master schema, wherein the master schema identifies each semantically equivalent field. The server is further configured to group, based on a blocking key, to produce a hashable value for the converted plurality of data points into one or more groups upon the converted plurality of data points satisfying a first pre-defined criteria, wherein the blocking key utilizes a by-key aggregation technique. The server is further configured to match the one or more groups with each other based on a relationship corresponding to the hashable value associated with each group satisfying a second pre-defined criteria. The server is further configured to generate a graph comprising the classification data comprising a set of vertices, wherein each vertex is represented as a unique long integer and a set of edges in an undirected graph, wherein an edge associated with each vertex corresponds to each matched group. The server is further configured to store the relationships between the grouped data with the unique id in a third database.

“In another embodiment, a computer-implemented method comprises converting, by a server, a plurality of data points stored in a first database associated with a first format and a second database associated with a second format, to a common format in a single dataset wherein the plurality of data points are arranged in a plurality of fields. The computer-implemented method further comprises, upon converting the plurality of data points to the common format, assigning, by the server, a unique id to each field of the plurality of data points. The computer-implemented method further comprises normalizing, by the server, the plurality of data points arranged in the plurality of fields by applying a master schema, wherein the master schema identifies each semantically equivalent field. The computer-implemented method further comprises grouping, by the server, based on a blocking key, to produce a hashable value for the converted plurality of data points into one or more groups upon the converted plurality of data points satisfying a first pre-defined criteria, wherein the blocking key utilizes a by-key aggregation technique. The computer-implemented method further comprises matching, by the server, the one or more groups with each other based on a relationship corresponding to the hashable value associated with each group satisfying a second pre-defined criteria. The computer-implemented method further comprises generating, by the server, a graph comprising the classification data comprising a set of vertices, wherein each vertex is represented as a unique long integer and a set of edges in an undirected graph, wherein an edge associated with each vertex corresponds to each matched group. The computer-implemented method further comprises storing, by the server, the relationships between the grouped data with the unique id in a third database.
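The blocking step in the embodiments above, grouping records by a hashable key and keeping only groups under a pre-defined size, can be sketched as follows. The specific key function (surname initial plus postal code) is a hypothetical choice for illustration; the patent does not disclose its actual blocking key.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> tuple:
    """Hashable key that sends likely-related records to the same block."""
    return (record["surname"][:1], record["postal_code"])

def block(records, max_group_size=100):
    """Group records by blocking key; discard oversized blocks
    (the pre-defined group-size criteria) to keep pairwise
    comparison within each block tractable."""
    groups = defaultdict(list)
    for rec in records:
        groups[blocking_key(rec)].append(rec)
    return {k: g for k, g in groups.items() if len(g) < max_group_size}

records = [
    {"surname": "smith", "postal_code": "01103"},
    {"surname": "smyth", "postal_code": "01103"},
    {"surname": "jones", "postal_code": "01103"},
]
blocks = block(records)
# Only pairs within a block are ever compared: 1 candidate pair here,
# instead of the 3 a full pairwise comparison would require.
pairs = [p for g in blocks.values() for p in combinations(g, 2)]
print(len(pairs))  # 1
```

This is what reduces "billions of pairs of records" to a tractable number of comparisons: candidate pairs are generated only inside blocks, never across them.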

“Numerous other aspects, features and benefits of the present disclosure may be made apparent from the following detailed description taken together with the drawing figures.”

The claims supplied by the inventors are:

“What is claimed is:

“1. A system comprising: a first and second database configured to store a plurality of data points; and a server configured to: convert the plurality of data points stored in the first database associated with a first format and the second database associated with a second format, to a common format in a single dataset wherein the plurality of data points are arranged in a plurality of fields; upon converting the plurality of data points to the common format, assign a unique id to each field of the plurality of data points; normalize the plurality of data points arranged in the plurality of fields by applying a master schema, wherein the master schema identifies each semantically equivalent field; group, based on a blocking key, to produce a hashable value for the plurality of data points into one or more groups upon the plurality of data points satisfying a first pre-defined criteria; select a subset of groups from the one or more groups having a group size that is less than a pre-defined group size; classify the plurality of data points by matching each pair of data points within each group from the subset of groups with each other based on a relationship corresponding to the hashable value associated with each data point satisfying a second pre-defined criteria; and generate a graph comprising classification data comprising a set of vertices, wherein each vertex is represented as a unique long integer and a set of edges in an undirected graph, and wherein an edge associated with each vertex corresponds to each matched pair of the data points, and wherein the graph presents each cluster of data points that are determined to be related to each other based on matched pairs of the data points as a single entity having a distinct identification number.

“2. The system of claim 1, wherein the common format is a single text encoding format.

“3. The system of claim 1, wherein the plurality of fields in the first and second databases are mapped to a common set of fields selected from a predetermined set of fields.

“4. The system of claim 1, wherein the matching is based on a characteristic of the data points.

“5. The system of claim 1, wherein the matching is based on a probability of a relationship between the data points.

“6. The system of claim 1, wherein a set of matching instructions causes the server to generate a pair of the data points from the plurality of data points in each group.

“7. The system of claim 6, wherein the set of matching instructions determines a probabilistic relationship between pair of the data points.

“8. The system of claim 1, wherein the second pre-defined criteria is a probability of a relationship between the data points in each group.

“9. The system of claim 1, wherein the normalization of the plurality of fields is mapped to a common set of fields selected from a library of fields stored in a database.

“10. A computer-implemented method comprising: converting, by a server, a plurality of data points stored in a first database associated with a first format and a second database associated with a second format, to a common format in a single dataset wherein the plurality of data points are arranged in a plurality of fields; upon converting the plurality of data points to the common format, assigning, by the server, a unique id to each field of the plurality of data points; normalizing, by the server, the plurality of data points arranged in the plurality of fields by applying a master schema, wherein the master schema identifies each semantically equivalent field; grouping, by the server, based on a blocking key, to produce a hashable value for the plurality of data points into one or more groups upon the plurality of data points satisfying a first pre-defined criteria; selecting, by the server, a subset of groups from the one or more groups having a group size that is less than a pre-defined group size; classifying, by the server, the plurality of data points by matching each data point within each group from the subset of groups with each other based on a relationship corresponding to the hashable value associated with each data point satisfying a second pre-defined criteria; and generating, by the server, a graph comprising classification data comprising a set of vertices, wherein each vertex is represented as a unique long integer and a set of edges in an undirected graph, wherein an edge associated with each vertex corresponds to each matched pair of the data points, and wherein the graph presents each cluster of data points that are determined to be related to each other based on matched pairs of the data points as a single entity having a distinct identification number.

“11. The computer-implemented method of claim 10, wherein the common format is a single text encoding format.

“12. The computer-implemented method of claim 10, wherein the plurality of fields in the first and second databases are mapped to a common set of fields selected from a predetermined set of fields.

“13. The computer-implemented method of claim 10, wherein the matching is based on a characteristic of the data points.

“14. The computer-implemented method of claim 10, wherein the matching is based on a probability of a relationship between the data points.

“15. The computer-implemented method of claim 10, wherein a set of matching instructions causes the server to generate a pair of the data points from the plurality of data points in each group.

“16. The computer-implemented method of claim 15, wherein the set of matching instructions determines a probabilistic relationship between pair of the data points.

“17. The computer-implemented method of claim 10, wherein the second pre-defined criteria is a probability of a relationship between the data points in each group.

“18. The computer-implemented method of claim 10, wherein the normalization of the plurality of fields is mapped to a common set of fields selected from a library of fields stored in a database.”

For the URL and additional information on this patent, see: Merritt, Sears; Neale, Thom. Systems And Methods For Integration And Analysis Of Data Records. U.S. Patent Number 10,467,201, filed December 15, 2016, and published online on November 18, 2019. Patent URL: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=10,467,201.PN.&OS=PN/10,467,201RS=PN/10,467,201

(Our reports deliver fact-based news of research and discoveries from around the world.)
