Researchers Submit Patent Application, “Overcoming Data Missingness For Improving Predictions”, for Approval (USPTO 20230044574): Janssen Research & Development LLC
2023 FEB 27 (NewsRx) -- By a
The patent’s assignee is
News editors obtained the following quote from the background information supplied by the inventors: “Predictive models have long been used for generating health-based predictions. In the modern era, increasingly sophisticated machine learning models have been designed to process and gain clinical insights from large patient datasets and increase the accuracy of health-based predictions. Further, modern data collection methods have allowed insurance companies, doctors’ offices, pharmacies, and other providers to vastly increase the amount and quality of patient datasets that can be adapted for use in training and deploying health-based predictive models.
“Generally, comprehensive histories of subjects are helpful for generating accurate health-based predictions for the subjects. However, not all available patient datasets will have the comprehensive patient history that can be informative for generating health-based predictions for patients.”
As a supplement to the background information on this patent application, NewsRx correspondents also obtained the inventors’ summary information for this patent application: “There may be significant portions of data missing from certain datasets. While a closed healthcare dataset (also referred to herein as a “closed dataset” or a “closed claims dataset”) compiled by an insurance provider may comprise comprehensive patient history data, an open healthcare dataset (also referred to herein as an “open dataset” or an “open claims dataset”) compiled by a third-party entity, e.g., from at least one of a healthcare clearinghouse, doctor’s office records, pharmacy records, or other patient data, may be incomplete. For example, certain treatments or therapies may be blocked in open datasets (e.g., blocked by the therapeutic manufacturer or distributor). Thus, the missing data in these datasets can severely hamper the ability to generate predictions for subjects (e.g., predictions as to whether subjects are eligible to receive a therapy).
“Disclosed herein are methods, non-transitory computer readable medium, and computer system for training and deploying predictive models for generating health-based predictions. As an example, predictive models disclosed herein are useful for determining patient eligibility, such as patient eligibility for chimeric antigen receptor T-cell (CAR-T) therapy. Generally, predictive models are trained on datasets that leverage closed claims datasets with comprehensive patient data. The closed dataset is used to derive the ground truth label in a training example. In various embodiments, the closed dataset is linked to a corresponding open claims dataset such that the paired closed dataset and open claims dataset can be used to train a predictive model. In various embodiments, the closed dataset is modified to simulate data missingness of a target open claims dataset. These modified datasets can be generated by selectively dropping certain datapoints and/or features from closed datasets that include comprehensive patient data. Therefore, trained predictive models can be deployed to analyze open claims datasets to predict patient eligibility. In some embodiments, a superset may be generated from a combination of data from an open dataset and a closed dataset and used to derive the ground truth label in a training example or the predictive features used to train a predictive model. For example, a superset may be generated from a union of data points from an open dataset and a closed dataset.
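As an illustration of the two dataset manipulations described above, the following Python sketch drops blocked features and a share of data points from a closed dataset, and builds a superset as a union of data points. It is a minimal sketch and not the inventors' implementation: the function names, the `claim_id` key, the uniform random row dropping, and the choice to let closed records win on a key collision are all assumptions made for the example.

```python
import random


def simulate_missingness(closed_rows, blocked_features=(), drop_rate=0.0, seed=0):
    """Return a degraded copy of closed_rows (hypothetical helper).

    blocked_features: feature names removed from every row.
    drop_rate: probability that a whole row (data point) is dropped.
    """
    rng = random.Random(seed)
    kept = []
    for row in closed_rows:
        if rng.random() < drop_rate:
            continue  # drop the whole data point
        kept.append({k: v for k, v in row.items() if k not in blocked_features})
    return kept


def build_superset(open_rows, closed_rows, key="claim_id"):
    """Union of data points from both datasets, de-duplicated on `key`.

    Assumption: when both datasets hold the same claim, the closed record
    is kept because it is expected to be the more complete one.
    """
    merged = {row[key]: dict(row) for row in open_rows}
    for row in closed_rows:
        merged[row[key]] = dict(row)
    return list(merged.values())
```

In practice the rows and features to drop would be chosen to mimic a specific target open claims dataset rather than sampled uniformly.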
“In a method disclosed herein, a first dataset for one or more subjects is obtained, where the first dataset is obtained from a first source of first datasets that are missing data in comparison to second datasets from a second source. A machine learning model is applied to the obtained first dataset for at least one of the one or more subjects to generate a healthcare outcome, where the machine learning model is trained using training data comprising: a training incomplete dataset that shares one or more features of the obtained first dataset; and a ground truth label derived from a training complete dataset that comprises more features and/or data points than the training incomplete dataset. An action is taken regarding at least one of the one or more subjects based at least on the outcome.
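The pairing described above, with features taken from the incomplete dataset and the label taken from the complete one, can be sketched in Python as follows. The field name `prior_lines_of_therapy`, the default threshold of three prior lines, and the encoding of a missing feature as `0` are assumptions chosen for illustration; the application does not prescribe them.

```python
def eligibility_label(complete_record, min_prior_lines=3):
    """Hypothetical ground-truth rule: 1 if the complete record shows at
    least `min_prior_lines` prior lines of therapy, else 0."""
    return int(complete_record["prior_lines_of_therapy"] >= min_prior_lines)


def make_training_examples(incomplete_by_patient, complete_by_patient,
                           shared_features, label_fn=eligibility_label):
    """Build (feature_vector, label) pairs.

    Features come only from the incomplete view of each patient; the label
    is derived from the complete view of the same patient.
    """
    examples = []
    for pid, complete in complete_by_patient.items():
        incomplete = incomplete_by_patient.get(pid)
        if incomplete is None:
            continue  # no incomplete counterpart for this patient
        # Assumption: a feature absent from the incomplete view is encoded as 0.
        x = [incomplete.get(name, 0) for name in shared_features]
        examples.append((x, label_fn(complete)))
    return examples
```

The resulting pairs could then be passed to any supervised learner; the model only ever sees inputs that resemble the open data it will be deployed on.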
“In various embodiments, the training incomplete dataset may be derived from the training complete dataset by dropping one or more data points from the training complete dataset, which may comprise dropping each of the data points of a feature from the training complete dataset. The training incomplete dataset may further comprise patient-level data that is generated by transforming claim-level data. In various embodiments, dropping one or more data points from the training complete dataset may comprise: defining a first patient cohort in a target open claims dataset; defining a second patient cohort in the training complete dataset; generating a first distribution from the first patient cohort in the target open claims dataset; generating a second distribution from the second patient cohort in the training complete dataset at one of a patient level, product level, or pharmacy level; comparing the first distribution to the second distribution; and based on the comparison, selectively removing data points from the training complete dataset to align the second distribution of the training complete dataset with the first distribution of the target open claims dataset.
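The first steps of this procedure, turning claim-level data into patient-level counts and comparing the two cohort distributions, might look like the Python sketch below. The `patient_id` field and the use of total variation distance as the comparison are assumptions; the application does not name a specific comparison metric.

```python
from collections import Counter


def claims_per_patient(claims, cohort):
    """Transform claim-level rows into patient-level counts.

    Cohort members without any claim are kept with a count of 0.
    """
    counts = {pid: 0 for pid in cohort}
    for claim in claims:
        pid = claim["patient_id"]
        if pid in counts:
            counts[pid] += 1
    return counts


def distribution(counts):
    """Fraction of patients at each claim count."""
    n = len(counts)
    hist = Counter(counts.values())
    return {k: hist[k] / n for k in sorted(hist)}


def total_variation(p, q):
    """Total variation distance between two discrete distributions
    (an assumed choice of comparison, 0.0 means identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A large distance between the target open cohort and the closed cohort would indicate that more data points need to be removed from the closed data before the two align.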
“In various embodiments, defining a first patient cohort or defining a second patient cohort comprises identifying subjects in the target dataset or the training complete dataset that meet one or more criteria. The one or more criteria may comprise any of one or more diagnoses within a first period of time, a provided therapy within a second period of time, total time of enrollment, or one or more diagnoses within a first period of time and a provided therapy within a second period of time.
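A cohort filter along these lines could be sketched in Python as shown below. The record layout, the placeholder diagnosis code `DX_MM`, and the decision to require all three criteria at once are assumptions for the example; the text above allows any of the criteria to be used.

```python
from datetime import date


def in_window(day, window):
    """True if `day` falls inside the inclusive (start, end) window."""
    start, end = window
    return start <= day <= end


def define_cohort(patients, dx_codes, dx_window, therapies, tx_window,
                  min_enrollment_days=0):
    """Return IDs of patients with a qualifying diagnosis in dx_window,
    a qualifying therapy in tx_window, and enough total enrollment.

    `patients` maps patient ID to a dict with hypothetical fields:
    "diagnoses" and "therapies" (lists of (name, date)) and "enrollment_days".
    """
    cohort = set()
    for pid, rec in patients.items():
        has_dx = any(code in dx_codes and in_window(day, dx_window)
                     for code, day in rec["diagnoses"])
        has_tx = any(name in therapies and in_window(day, tx_window)
                     for name, day in rec["therapies"])
        if has_dx and has_tx and rec["enrollment_days"] >= min_enrollment_days:
            cohort.add(pid)
    return cohort
```

Applying the same criteria to both the target open dataset and the training complete dataset keeps the two cohorts comparable before their distributions are computed.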
“In various embodiments, the first distribution and the second distribution may represent a number of claims of each type per patient in the first patient cohort and the second patient cohort, respectively. In various embodiments, selectively removing data points from the training complete dataset comprises removing one or more claims from one or more patients in the second patient cohort such that a percentage of patients in the target dataset with no claims aligns with a percentage of patients in the training complete dataset with no claims. In various embodiments, the first distribution and the second distribution represent number of claims for a healthcare event of interest (e.g., filling a script for a specific prescription drug, or having a specific medical procedure) per patient in the first patient cohort and the second patient cohort, respectively. In various embodiments, selectively removing data points from the training complete dataset comprises removing one or more claims from one or more patients in the dataset of the second patient cohort to generate a modified dataset such that the first distribution aligns with a modified second distribution of the modified dataset.
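One simple way to align the share of patients with no claims is sketched below in Python. It is a sketch under stated assumptions: patients are picked uniformly at random and all of their claims are removed, whereas the application only says that claims are removed selectively. The function name and the data layout are hypothetical.

```python
import random


def align_zero_claim_rate(claims_by_patient, target_zero_rate, seed=0):
    """Return a copy of claims_by_patient in which enough patients have had
    all claims removed for the zero-claim share to reach target_zero_rate.

    claims_by_patient maps patient ID (str) to a list of claims.
    If the share is already at or above the target, nothing is removed.
    """
    if not 0.0 <= target_zero_rate <= 1.0:
        raise ValueError("target_zero_rate must be between 0 and 1")
    rng = random.Random(seed)
    result = {pid: list(claims) for pid, claims in claims_by_patient.items()}
    already_zero = sum(1 for claims in result.values() if not claims)
    needed = round(target_zero_rate * len(result)) - already_zero
    if needed <= 0:
        return result
    with_claims = sorted(pid for pid, claims in result.items() if claims)
    # Assumption: patients to blank out are chosen uniformly at random.
    for pid in rng.sample(with_claims, needed):
        result[pid] = []
    return result
```

The same idea extends to a full distribution: instead of matching only the zero-claim share, claims would be removed until the count histogram of the modified dataset matches the target histogram.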
“In various embodiments, the first distribution and the second distribution represent a number of claims across pharmacies. In various embodiments, selectively removing data points from the training complete dataset comprises removing one or more claims from one or more patients in the second patient cohort to generate a modified second distribution of the modified second patient cohort. In various embodiments, data points are selectively removed from the training complete dataset such that a percentage of pharmacies of the first distribution with no claims aligns with a percentage of pharmacies of the modified second distribution with no claims. In various embodiments, data points are selectively removed from the training complete dataset such that the first distribution aligns with a modified second distribution of the modified second patient cohort.
“In various embodiments, the training incomplete dataset is previously matched to the training complete dataset. In various embodiments, the training incomplete dataset is previously matched to the training complete dataset by identifying that both datasets correspond to a common patient. In various embodiments, the matched training incomplete dataset and the training complete dataset are used to generate additional training incomplete datasets or training complete supersets.
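Matching on a common patient can be sketched as a keyed join. In the Python sketch below, the `patient_token` field is a hypothetical shared identifier assumed to exist in both datasets, with at most one record per patient on the closed side.

```python
def match_by_patient(open_records, closed_records, key="patient_token"):
    """Pair each open record with the closed record of the same patient.

    Records whose patient appears in only one dataset are left out.
    """
    closed_index = {rec[key]: rec for rec in closed_records}
    return [(rec, closed_index[rec[key]])
            for rec in open_records
            if rec[key] in closed_index]
```

Each matched pair gives an incomplete view and a complete view of the same patient, which is the raw material for further training incomplete datasets or supersets.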
“In various embodiments, features of the obtained first dataset may comprise one or more of a number of prior lines of therapies provided to the subject, whether one or more types of therapies were provided to the subject, enrollment data, demographics, diagnoses, procedures, provider data, clinical utilization, prescription medications, expenditures, and timing of previously listed medical events. In various embodiments, the one or more features shared between the training incomplete dataset and the obtained first dataset comprise one or more of procedures and prescription medications. In various embodiments, the missing data of the first dataset comprise features of any one of diagnoses, procedures, provider data, clinical utilization, prescription drug claims, expenditures. At least a portion of the missing data of the first dataset is due to blocked claims, e.g., as requested by a drug manufacturer or distributor. In various embodiments, the training complete dataset comprises at least data points for a feature of enrollment data that is not included in the training incomplete dataset.
“In various embodiments, the diagnoses comprise diagnoses for refractory or relapsed multiple myeloma. In various embodiments, the one or more types of therapies may comprise a proteasome inhibitor, immunomodulatory agent, or anti-CD38 monoclonal antibody therapy. In various embodiments, the proteasome inhibitor comprises one of bortezomib, carfilzomib, or ixazomib. In various embodiments, the immunomodulatory agent comprises one of lenalidomide, thalidomide, or pomalidomide. In various embodiments, the anti-CD38 monoclonal antibody therapy may comprise one of daratumumab, daratumumab and hyaluronidase-fihj, or isatuximab-irfc.
“In various embodiments, the number of prior lines of therapies is a threshold number of zero or more prior lines of therapy. In various embodiments, the number of prior lines of therapies is a threshold number of one or more prior lines of therapy. In various embodiments, the number of prior lines of therapies is a threshold number of two or more prior lines of therapy. In various embodiments, the number of prior lines of therapies is a threshold number of three or more prior lines of therapy. In various embodiments, the number of prior lines of therapies is a threshold number of four or more prior lines of therapy.
“In various embodiments, the first dataset and the second dataset are healthcare datasets, e.g., datasets comprising healthcare claims, events, payments, or other patient-related data.”
There is additional summary information. Please visit the full patent to read further.
The claims supplied by the inventors are:
“1. A method comprising: obtaining or having obtained a first dataset for one or more subjects, wherein the first dataset is obtained from a first source of datasets that are missing data in comparison to second datasets from a second source; feeding the obtained first dataset for at least one of the one or more subjects into a machine learning model configured to generate an outcome, wherein the machine learning model is trained using training data comprising: a training incomplete dataset that shares one or more features of the obtained first dataset; and a ground truth label derived from a training complete dataset that comprises more features or more data points than the training incomplete dataset; and taking an action with respect to at least one of the one or more subjects based at least on the outcome.
“2. The method of claim 1, wherein each of the training incomplete dataset and the training complete dataset comprise healthcare-related patient-level data that is generated by transforming healthcare-related claim-level data.
“3. The method of claim 1, wherein the training incomplete dataset is derived from the training complete dataset by dropping one or more data points from the training complete dataset.
“4. The method of claim 3, wherein dropping one or more data points from the training complete dataset comprises: defining a first patient cohort in a target dataset; defining a second patient cohort in the training complete dataset; generating a first distribution from the first patient cohort in the target dataset; generating a second distribution from the second patient cohort in the training complete dataset at one of a patient level, a product level, or a pharmacy level; comparing the first distribution to the second distribution; and based on the comparison, selectively removing data points from the training complete dataset to align the second distribution of the training complete dataset with the first distribution of the target dataset.
“5. The method of claim 4, wherein defining a first patient cohort or defining a second patient cohort comprises identifying subjects in the target dataset or the training complete dataset that meet one or more criteria comprising at least one of: one or more diagnoses within a first period of time; a provided therapy within a second period of time; a total time of enrollment; or one or more diagnoses within a first period of time and a provided therapy within a second period of time.
“6. The method of claim 4, wherein the first distribution and the second distribution represent a number of claims per patient in the first patient cohort and the second patient cohort, respectively.
“7. The method of claim 6, wherein selectively removing data points from the training complete dataset comprises removing one or more claims from one or more patients in the second patient cohort such that a percentage of patients in the target dataset with no claims aligns with a percentage of patients in the training complete dataset with no claims.
“8. The method of claim 6, wherein the first distribution and the second distribution represent a number of claims for a healthcare event of interest per patient in the first patient cohort and the second patient cohort, respectively.
“9. The method of claim 8, wherein selectively removing data points from the training complete dataset comprises removing one or more claims from one or more patients in the second patient cohort to generate a modified dataset such that the first distribution aligns with a modified second distribution of the modified dataset.
“10. The method of claim 4, wherein the first distribution and the second distribution represent a number of claims across pharmacies.
“11. The method of claim 10, wherein selectively removing data points from the training complete dataset comprises removing one or more claims from one or more patients in the second patient cohort to generate a modified second distribution of the second patient cohort.
“12. The method of claim 11, wherein the data points are selectively removed from the training complete dataset such that a percentage of pharmacies of the first distribution with no claims aligns with a percentage of pharmacies of the modified second distribution with no claims.
“13. The method of claim 11, wherein the data points are selectively removed from the training complete dataset such that the first distribution aligns with a modified second distribution of the second patient cohort.
“14. The method of claim 1, wherein the training incomplete dataset is matched to the training complete dataset.
“15. The method of claim 14, wherein the training incomplete dataset is matched to the training complete dataset to generate additional training incomplete datasets or training complete supersets.
“16. The method of claim 1, wherein features of the obtained first dataset comprise one or more of: a number of prior lines of therapies provided to the subject, an indication of whether one or more types of therapies were provided to the subject, enrollment data, a demographic, a diagnosis, a procedure, provider data, clinical utilization data, a prescription medication, an expenditure, and a timing of a medical event.
“17. The method of claim 16, wherein the one or more features shared between the training incomplete dataset and the obtained first dataset comprise one or more of procedures and prescription medications.
“18. The method of claim 16, wherein the missing data of the first dataset comprises features of at least one of: a diagnosis, a procedure, provider data, a clinical utilization, a prescription drug claim, or an expenditure.
“19. The method of claim 16, wherein the number of prior lines of therapies is at least one of the following: a threshold number of zero or more prior lines of therapy; a threshold number of one or more prior lines of therapy; a threshold number of two or more prior lines of therapy; a threshold number of three or more prior lines of therapy; or a threshold number of four or more prior lines of therapy.
“20. The method of claim 1, wherein the training complete dataset comprises at least one data point for a feature of enrollment data that is not included in the training incomplete dataset.
“21. The method of claim 1, wherein the first dataset and the second dataset are healthcare claims datasets.
“22. The method of claim 1, wherein the first dataset is an open dataset.
“23. The method of claim 1, wherein the first dataset is obtained from at least one of a clearinghouse, a pharmacy, or a software platform other than a health insurance provider software platform.
“24. The method of claim 1, wherein the second dataset is one of a closed dataset or a superset comprising data from an open dataset and a closed dataset.
“25. The method of claim 1, wherein the second source comprises one or more health insurance providers.
“26. The method of claim 1, wherein the training incomplete dataset is derived by pooling data from a plurality of closed datasets, or from a superset comprising data from an open dataset and a closed dataset.
“27. The method of claim 1, wherein the outcome is a prediction of whether a patient is eligible for a CAR-T therapy or other therapy for relapsed/refractory multiple myeloma (RRMM).
“28. The method of claim 1, wherein the ground truth label comprises an indication of whether a patient is eligible to receive one or more therapies.
“29. A non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform one or more operations for: obtaining a first dataset for one or more subjects, wherein the first dataset is obtained from a first source of datasets that are missing data in comparison to second datasets from a second source; feeding the obtained first dataset for at least one of the one or more subjects into a machine learning model configured to generate an outcome, wherein the machine learning model is trained using training data comprising: a training incomplete dataset that shares one or more features of the obtained first dataset; and a ground truth label derived from a training complete dataset that comprises more features and/or data points than the training incomplete dataset; and taking an action with respect to at least one of the one or more subjects based at least on the outcome.”
For additional information on this patent application, see: Harper,
(Our reports deliver fact-based news of research and discoveries from around the world.)


