Researchers Submit Patent Application, “System and Method for Generating Synthetic Test Data”, for Approval (USPTO 20230251959): Patent Application
2023 AUG 25 (NewsRx) -- By a
No assignee has been named for this patent application.
News editors obtained the following quote from the background information supplied by the inventors: “With advancement of technology, the use of software applications for processing data associated with healthcare has increased rapidly. Generally, software applications used in the field of healthcare are configured to at least record, maintain and process data associated with members/patients, service providers, third party payer services, and the like, for activities including, but not limited to, enrollment of members for a service or a healthcare insurance plan, billing, insurance claim assessment and processing, and transactions including payments etc. The third party payer services referred to herein may include, but are not limited to, health care insurance companies or health care payers. Further, a service provider may include, but is not limited to, any person, such as a doctor, pharmacist, etc., or an institution, such as a hospital, clinic, or medical equipment provider.
“The healthcare specific software applications require testing during various stages of development and post-development similar to any other data processing software application to determine if different features and/or configurations within the application are performing as expected. For example, a third party payer service, such as a healthcare payer, may deploy a software application including features, such as database management for enrollment of members. As the healthcare payer's unique business needs constantly change, the software application prior to deployment requires large volumes of test cases with many variations of test data and unique scenarios to ensure that there is no gap in the test-case coverage and the software works as expected. As a result, the required test data should include member data records with variables and volume to confidently validate the features of the software application to avoid rework or re-adjudication by quickly identifying defects in the features. However, unlike any other data processing application, the test data required for testing many of the healthcare specific applications should be realistic, voluminous and free from protected health information (PHI). As a result, the production data from a source cannot be used directly by the testing teams for application testing due to confidentiality concerns. Moreover, the task of importing large volumes of data can be cumbersome and often requires high processing as well as interfacing speeds.
“In order to overcome the above problems, the test data required for testing healthcare applications that require data records associated with one or more entities is often generated manually or through complex manipulation of production data from existing sources. However, manual creation and assembling of large volumes of data is time consuming. Further, the process of creating large volumes of data manually often leads to modifications in the data, resulting in creation of unusable data. Additionally, if the data from an existing source is manipulated to create new data, the new data is often not masked resulting in a security risk.
“In light of the above drawbacks, there is a need for a system and a method that can generate synthetic test data in real-time for testing data processing applications. In particular, there is a need for a system and a method that can generate synthetic test data comprising a plurality of data records associated with an entity, such as a member/patient for testing healthcare data processing applications. Further, there is a need for a system and a method that can generate healthcare test data comprising large volumes of realistic synthetic data records encompassing various attributes of the entity without using confidential data, such as Protected Health Information (PHI). Furthermore, there is a need for a system and a method that can generate optimized combinations between various attribute values with positive and negative test data combinations to complete test cases and improve test-case coverage. Yet further, there is a need for a system and a method that can generate on-demand, versatile, scalable, and secure synthetic data agnostic to healthcare applications. Yet further, there is a need for a system and a method that significantly reduces the time required for generation of test data. Yet further, there is a need for a system and a method, which is economical, secure and relatively accurate.”
As a supplement to the background information on this patent application, NewsRx correspondents also obtained the inventors’ summary information for this patent application: “In accordance with various embodiments of the present invention, a method for generating synthetic test data is provided. The method is implemented by a processor executing program instructions stored in a memory. The method comprises generating a data structure, where the data structure is populated with one or more predefined segments based on a selected operating field, each of the one or more predefined segments further comprising one or more customizable sub-segments. The method further comprises evaluating most probable and optimized combinations between the one or more customizable sub-segments of the one or more predefined segments. Further, the method comprises generating synthetic test data comprising a plurality of data records based on the generated data structure and the evaluated combinations. Each of the plurality of data records are populated with the one or more customizable sub-segments of the data structure, where the one or more customizable sub-segments are arranged within each of the plurality of data records and populated with data values based on one or more parameters associated with said one or more customizable sub-segments.
“In accordance with various embodiments of the present invention, a system for generating synthetic test data is provided. The system comprises a memory storing program instructions, a processor configured to execute program instructions stored in the memory, and a synthetic data generation engine executed by the processor. The system is configured to generate a data structure, where the data structure is populated with one or more predefined segments based on a selected operating field, each of the one or more predefined segments further comprise one or more customizable sub-segments. Further, the system is configured to evaluate most probable and optimized combinations between the one or more customizable sub-segments of the one or more predefined segments. Yet further, the system is configured to generate synthetic test data comprising a plurality of data records based on the generated data structure and the evaluated combinations. Each of the plurality of data records are populated with the one or more customizable sub-segments of the data structure, where the one or more customizable sub-segments are arranged within each of the plurality of data records and populated with data values based on one or more parameters associated with said one or more customizable sub-segments.
“In accordance with various embodiments of the present invention, a computer program product is provided. The computer program product comprises a non-transitory computer-readable medium having computer-readable program code stored thereon, the computer-readable program code comprising instructions that, when executed by a processor, cause the processor to generate a data structure, where the data structure is populated with one or more predefined segments based on a selected operating field, each of the one or more predefined segments further comprise one or more customizable sub-segments. Further, most probable and optimized combinations between the one or more customizable sub-segments of the one or more predefined segments are evaluated. Yet further synthetic test data comprising a plurality of data records is generated based on the generated data structure and the evaluated combinations. Each of the plurality of data records are populated with the one or more customizable sub-segments of the data structure, where the one or more customizable sub-segments are arranged within each of the plurality of data records and populated with data values based on one or more parameters associated with said one or more customizable sub-segments.”
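For readers who want a concrete picture of the summary, the following Python sketch shows one way a structure of predefined segments and customizable sub-segments could drive record generation. It is an illustration only, not the inventors' implementation: the class names, the `SEGMENTS_BY_FIELD` mapping table, the sample values, and the `preset_values` and `required` fields are all assumptions made here for the example, and the combination-evaluation step is left out.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SubSegment:
    """A customizable sub-segment (attribute). Field names are hypothetical."""
    name: str
    preset_values: list          # assumed stand-in for a "preset data value" parameter
    required: bool = True        # assumed stand-in for a "usage" parameter


@dataclass
class Segment:
    """A predefined segment that owns its customizable sub-segments."""
    name: str
    sub_segments: list = field(default_factory=list)


# Hypothetical mapping table from an operating field to its predefined segments.
SEGMENTS_BY_FIELD = {
    "healthcare insurance": [
        Segment("member", [SubSegment("subscriber_id", ["S001", "S002", "S003"])]),
        Segment("demographic", [
            SubSegment("gender", ["F", "M", "U"]),
            SubSegment("language", ["EN", "ES"], required=False),
        ]),
        Segment("coverage", [SubSegment("plan", ["HMO", "PPO"])]),
    ],
}


def build_structure(operating_field):
    """Return the segment list (a shallow two-level tree) for the selected field."""
    return SEGMENTS_BY_FIELD[operating_field]


def generate_records(structure, count, seed=0):
    """Populate `count` synthetic records by walking segments and sub-segments."""
    rng = random.Random(seed)
    records = []
    for _ in range(count):
        record = {}
        for segment in structure:
            for sub in segment.sub_segments:
                # Non-required sub-segments are omitted roughly half the time.
                if not sub.required and rng.random() < 0.5:
                    continue
                record[f"{segment.name}.{sub.name}"] = rng.choice(sub.preset_values)
        records.append(record)
    return records
```

Calling `generate_records(build_structure("healthcare insurance"), 5)` yields five dictionaries keyed by `segment.sub_segment`. Because every value is drawn from preset lists, no production data or PHI enters the output.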
The claims supplied by the inventors are:
“1. A method for generating synthetic test data, wherein the method is implemented by a processor executing program instructions stored in a memory, the method comprising: generating, by the processor, a data structure, wherein the data structure is populated with one or more predefined segments based on a selected operating field, each of the one or more predefined segments further comprising one or more customizable sub-segments; evaluating, by the processor, most probable and optimized combinations between the one or more customizable sub-segments of the one or more predefined segments; and generating, by the processor, synthetic test data comprising a plurality of data records based on the generated data structure and the evaluated combinations, each of the plurality of data records populated with the one or more customizable sub-segments of the data structure, wherein the one or more customizable sub-segments are arranged within each of the plurality of data records and populated with data values based on one or more parameters associated with said one or more customizable sub-segments.
“2. The method as claimed in claim 1, wherein the selected operating field is healthcare insurance, further wherein the plurality of data records are associated with respective members of a healthcare payer, and a file type of the plurality of data records is 834.
“3. The method as claimed in claim 1, wherein the selected operating field is healthcare insurance, and the one or more customizable sub-segments of the data structure are selected based on a practice area associated with healthcare insurance.
“4. The method as claimed in claim 3, wherein the practice area is selected from Medicaid, Medicare, Dental, Vision,
“5. The method as claimed in claim 1, wherein the data structure is a tree data structure.
“6. The method as claimed in claim 1, wherein the one or more predefined segments are representative of data fields required in each of the plurality of data records for the selected operating field, and the one or more customizable sub-segments are representative of attributes of corresponding predefined segment.
“7. The method as claimed in claim 1, wherein a predetermined repetition value representative of number of times a predefined segment and associated customizable sub-segments are required in each of the plurality of data records is defined for each of the predefined segments as per user requirement.
“8. The method as claimed in claim 1, wherein the generated data structure populated with the one or more predefined segments and the one or more customizable sub-segments is graphically represented for modification, deletion and addition of new segments and sub-segments as per user requirement.
“9. The method as claimed in claim 1, wherein the one or more predefined segments are populated based on the selected operating field using a mapping table comprising the one or more predefined segments mapped with respective operating fields; or the one or more predefined segments are populated based on the selected operating field using machine learning or data analytics information associated with pre-generated data records of the selected operating field.
“10. The method as claimed in claim 2, wherein the one or more predefined segments are selected from a group comprising member details specific to a healthcare payer, demographic details of the member, the coverage details of the member or a combination thereof.
“11. The method as claimed in claim 10, wherein the member detail segment comprises sub-segments specific to a payer, including subscriber identifier, policy number, supplemental identifier, and member level dates; the demographic segment comprises sub-segments including member name, Date of Birth (DOB), Gender, address, language, responsible person details, employer, school, and custodial parent; and the coverage detail segment comprises sub-segment attributing health insurance arrangement, including Health coverage, plan, disability information, cost, provider information, Coordination of Benefits (COB) and details of the care for a member.
“12. The method as claimed in claim 1, wherein evaluating most probable and optimized combinations between the one or more customizable sub-segments of the one or more predefined segments comprises: populating data values for selected one or more customizable sub-segments of the one or more predefined segments based on the one or more parameters associated with said one or more customizable sub-segments; deriving a Cartesian product of the data values of the selected one or more customizable sub-segments; and evaluating most probable and optimized combinations between the data values of the one or more sub-segments from the derived Cartesian product using orthogonal array testing techniques.
“13. The method as claimed in claim 1, wherein the one or more parameters define characteristics of respective sub-segments for representation on a data record, further wherein the one or more parameters are selected from predefined parameters or user-customizable parameters or a combination thereof, further wherein the values of the one or more predefined parameters and the user-customizable parameters is populated based on user inputs.
“14. The method as claimed in claim 13, wherein the predefined parameters comprise Position of Sub-Segment (POS), sub-segment ID, name of the sub-segment, usage, repetition, max repetition, syntax, ID, and loop, wherein the parameter Position of Sub-Segment (POS) is representative of position of a sub-segment within a selected file format of the data record, the parameter sub-segment ID is indicative of an identifier of a sub-segment, the parameter usage defines if a sub-segment is required or situational, the parameter repetition defines the number of times a sub-segment may be repeated, the parameter max repetition is indicative of maximum number of repetitions allowed for a sub-segment, the parameter syntax defines the representation format of a sub-segment, and the parameter loop defines the number of times a segment and its associated sub-segments may be repeated.
“15. The method as claimed in claim 13, wherein the user-customizable parameters comprise datatype of a sub-segment, usage indicating if a sub-segment is mandatory or optional, minimum size, maximum size, duplicability, identical-ability, specific format pattern, influenced, and preset data value, further wherein, the datatype comprises a list of auto-generated datatype fields and the realistic datatype fields, the user-customizable parameter minimum size defines the minimum number of characters for data value of a sub-segment, the parameter maximum size defines the maximum number of characters for data value of a sub-segment, the parameter duplicability defines if data value associated with a sub-segment may be duplicated for other data records, the parameter identical-ability enables a sub-segment to mimic another selected sub-segment, the parameter specific format pattern defines a specific pattern for a sub-segment structure, the parameter influenced enables linking of a sub-segment with other sub-segments and the parameter preset data value enables configuring of a set of values for a sub-segment.
“16. The method as claimed in claim 2, wherein the synthetic test data is generated in real-time, further wherein the plurality of data records are customized based on record specific variables including date, gender, address, and age.
“17. The method as claimed in claim 2, wherein one or more dependent data records representative of dependents of respective members of the healthcare payer are defined by linking one or more member data records with other member data records using a unique family Link ID, said family Link ID generated based on a selection of family relationship code or an extended family relationship code or a combination thereof, wherein the family relationship code is selected from spouse and children, and the extended family relationship code is selected from a group comprising father, mother, grandfather, grandmother, aunty, and uncle.
“18. A system for generating synthetic test data, the system comprising: a memory storing program instructions; a processor configured to execute program instructions stored in the memory; and a synthetic data generation engine executed by the processor, and configured to: generate a data structure, wherein the data structure is populated with one or more predefined segments based on a selected operating field, each of the one or more predefined segments further comprise one or more customizable sub-segments; evaluate most probable and optimized combinations between the one or more customizable sub-segments of the one or more predefined segments; and generate synthetic test data comprising a plurality of data records based on the generated data structure and the evaluated combinations, each of the plurality of data records populated with the one or more customizable sub-segments of the data structure, wherein the one or more customizable sub-segments are arranged within each of the plurality of data records and populated with data values based on one or more parameters associated with said one or more customizable sub-segments.
“19. The system as claimed in claim 18, wherein the selected operating field is healthcare insurance, and the plurality of data records are associated with respective members of a healthcare payer, and a file type of the plurality of data records is 834.
“20. The system as claimed in claim 19, wherein the one or more customizable sub-segments of the data structure are selected based on a practice area associated with the operating field, further wherein the practice area is selected from Medicaid, Medicare, Dental, Vision,
There are additional claims. Please visit full patent to read further.
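Claim 12 describes deriving a Cartesian product of sub-segment values and then reducing it with orthogonal array testing techniques. The Python sketch below illustrates the general idea under stated assumptions: it substitutes a simple greedy all-pairs selection for a true orthogonal array, and the function names and sample values are hypothetical rather than taken from the application.

```python
from itertools import combinations, product


def pairs_of(row):
    """Every ((column, value), (column, value)) pair covered by one combination."""
    return {((i, row[i]), (j, row[j])) for i, j in combinations(range(len(row)), 2)}


def pairwise_subset(values_by_sub_segment):
    """Return the full Cartesian product and a smaller set covering all value pairs.

    Greedy all-pairs selection is used here as a stand-in for orthogonal array
    testing; it is not claimed to be the method in the application.
    """
    names = list(values_by_sub_segment)
    # Cartesian product of the data values of the selected sub-segments.
    full = list(product(*(values_by_sub_segment[n] for n in names)))
    uncovered = set().union(*(pairs_of(row) for row in full))
    chosen = []
    while uncovered:
        # Pick the combination that covers the most still-uncovered pairs.
        best = max(full, key=lambda row: len(pairs_of(row) & uncovered))
        chosen.append(dict(zip(names, best)))
        uncovered -= pairs_of(best)
    return full, chosen
```

With three sub-segments holding 3, 2, and 2 values, the full product has 12 combinations, while the selected subset still exercises every pair of values at least once with fewer records whenever the greedy pass finds a smaller cover.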
For additional information on this patent application, see: Dussault,
(Our reports deliver fact-based news of research and discoveries from around the world.)