Synthetic Data Can Be Key To Insurers’ Profitability

By Tobi Hann

Data is the lifeblood of the insurance industry. For decades, teams of actuaries took months to analyze the information at their disposal and to complete the computations required to assess the financial risks associated with people, property and potential events. Later, the availability of computers accelerated the process.

Today, advanced data science, machine learning applications and AI make it possible to analyze existing data more effectively than ever before. The good news is that it has never been easier to uncover previously invisible insights, new revenue-generating opportunities or unknown risks.

But there is also a downside. U.S. insurers compete in a crowded marketplace marked by razor-thin margins and rising risk. In this environment, insurance leaders know that data, and the science needed to unlock the truths and probabilities hidden within it, are crucial.

Most strategic, business and operational decisions are now based on consumer and policyholder information that machine learning applications use to create profitable pricing strategies, develop new products, increase customer satisfaction and mitigate risks. But there’s a problem: As any data scientist will attest, algorithms are only as good as the data used to create, train and test them.

Your Existing Data Is Expensive And Risky. It Also May Be Flawed.

By default, insurers still must base most of their business-critical decisions on data about people. In a perfect world for underwriters, data scientists would be able to obtain, consolidate, parse and analyze data on policyholders and prospects whenever and however they wished. But the world is anything but perfect.

Data on consumers and policyholders is exceptionally valuable. But it’s also problematic, expensive, risky and often flawed. Consider a few characteristics of even the best consumer and policyholder data set.

Data is difficult to obtain. People first must agree to share their personal data. This is often a significant barrier, as many consumers are increasingly uncomfortable with their personal information being used. The alternative, working with whatever data is already on hand or with publicly available data sets, is expensive and limiting.

Personal data typically must be anonymized for legal reasons. To avoid the risk of litigation, data scientists in the insurance industry typically must anonymize data before they can use it, so that no information can be tied back to a specific individual. This is a time-consuming, difficult and costly process that often takes months to complete. By their nature, data anonymization techniques are also destructive, stripping much of the data’s utility and value when properly executed.
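The loss of utility can be seen in a minimal Python sketch of generalization, one common anonymization step. The record fields (`name`, `age`, `zip_code`, `premium`), the ten-year age bands and the three-digit ZIP prefix are all hypothetical choices made for illustration; a real anonymization process involves far more than this, and this step alone does not guarantee that records cannot be re-identified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str       # direct identifier (hypothetical field)
    age: int        # quasi-identifier
    zip_code: str   # quasi-identifier
    premium: float  # the attribute an analyst actually wants to model


def anonymize(record: Record) -> dict:
    """Drop the direct identifier and coarsen the quasi-identifiers."""
    decade = (record.age // 10) * 10
    return {
        "age_band": f"{decade}-{decade + 9}",      # exact age is lost
        "zip_prefix": record.zip_code[:3] + "**",  # neighborhood is lost
        "premium": record.premium,
    }


# Two made-up policyholders.
records = [
    Record("A. Example", 34, "94110", 820.0),
    Record("B. Example", 38, "94117", 1140.0),
]
anonymized = [anonymize(r) for r in records]
```

After generalization, the two records share the same age band and ZIP prefix. That is what protects the individuals, but it also means a pricing model can no longer learn how a 34-year-old differs from a 38-year-old, or one neighborhood from another.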

All data must be handled legally and protected to ensure compliance. The sharing of consumers’ or policyholders’ information across departments, with third-party databases or across borders often is forbidden for legal reasons. For example, to avoid the costly fines associated with the Health Insurance Portability and Accountability Act, insurers must take all necessary steps regarding the use, handling and storage of any health-related personal information. There is also a move to enact consumer data protection laws, with statutes already in place in California, Colorado and Virginia.

Existing data on real people is often lacking. Today’s highly competitive insurance landscape requires machine learning applications that can effectively act on real-world insights, and data scientists need sufficiently large, clean data sets to create and test them. Existing data, however, is often prone to bias; for example, it often skews toward policyholders.

AI-Generated Synthetic Data Is A Proven Alternative

Synthetic data is created with artificial intelligence to reflect the key aspects of existing, real data. Unlike anonymized organic data, AI-generated synthetic data is not limited by sample sizes or the inclusion of risky personal information from consumers or policyholders.

For insurance companies that want to provide data scientists with the rich, statistically robust information they want and need without the legal and compliance risks associated with actual personal information, synthetic data offers several compelling advantages.

Software testing and development can be accelerated with ease. Insurance companies typically have hundreds of software applications in use, and new ones are constantly needed for brokers, employees and policyholders. The difficulty of obtaining data for comprehensive development and testing is often the single greatest reason for developers’ delays, frustration and buggy production applications. In contrast, synthetic data gives developers access to rich repositories they can use to accelerate and optimize the coding, testing and refinement of new apps. Entire databases can be synthesized, shared and tested with ease. Instead of using existing databases that include sensitive fields, a synthetic data set that retains the business rules, without the risk, can be quickly created from scratch with an AI engine.

A needed data set can be economically created and deployed almost immediately. A statistically representative data set can be created from even a small sample and spun up for a particular machine learning application in a fraction of the time it takes to build or acquire a similar collection of records on actual individuals. For example, for the pricing of a life insurance product, securing access to clients’ data requires a nondisclosure agreement and often takes up to six months. Even then, the data cannot be shared or retained internally. A comparable synthetic data set takes less than 24 hours to create and is ready to use.
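The idea of growing a statistically representative data set from a small sample can be illustrated with a deliberately simple stand-in. The Python sketch below fits a two-column Gaussian model to six made-up (age, annual premium) records and then draws 10,000 new rows from it. This is not the AI generator the article describes; a parametric Gaussian is swapped in only because it fits in a few lines, and the function names, fields and values are all hypothetical.

```python
import math
import random


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, rng):
    """Draw n new (x, y) rows that follow the fitted distribution."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = mean_y + sd_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical small sample: (age, annual premium) for six policyholders.
real_sample = [(25, 410.0), (32, 480.0), (41, 590.0),
               (48, 640.0), (56, 790.0), (63, 880.0)]

params = fit_gaussian_2d(real_sample)
synthetic = sample_synthetic(params, 10_000, random.Random(0))
```

The synthetic rows preserve the averages and the age-to-premium relationship of the sample without repeating any of the original records. A production generator would need to capture far richer structure, such as categorical fields, skewed distributions and business rules, which is where the AI-based approach comes in.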

Synthetic data can be parsed, used or augmented in any way. Data scientists are not limited in their use of synthetic data. For example, for the pricing of a home insurance product, it takes up to six months to gain access to policyholder and consumer data, but the resulting data set typically cannot be used with third-party weather databases (a key tool for identifying home-related risks) because of the potential for data re-identification when using addresses. Using synthetic data, the same data set can be created, combined with inexpensive weather databases and put into use in less than a day.

Data sets can be up-sampled for the accurate inclusion and prediction of rare events. Uncommon events, such as fraud or the incidence of rare diseases, can be particularly difficult for data scientists and software developers to model. Rare events often are drowned out in a sea of data points, resulting in inaccurate applications. The conventional solution, “up-sampling,” is also problematic because it requires duplicating the rare events within insurers’ data. This not only creates an imbalanced data set, but also relies on replicated personal information, amplifying privacy risks. In contrast, synthetic up-sampling makes it possible to create a perfectly balanced data set for even the rarest of occurrences that can then be used to train and refine applications.
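The difference between the two kinds of up-sampling can be sketched in Python. The first function repeats real rare records; the second creates new records by interpolating between pairs of them, a simplified, SMOTE-style technique (random pairs rather than nearest neighbors) that stands in here for the AI generator the article describes. The fraud fields, values and function names are hypothetical. Note that interpolated rows are still derived directly from individual real records, so this stand-in does not by itself deliver the privacy benefits attributed to AI-generated data.

```python
import random


def upsample_by_duplication(rare_rows, target, rng):
    """Naive up-sampling: repeat real records until `target` rows exist."""
    return [rng.choice(rare_rows) for _ in range(target)]


def upsample_by_interpolation(rare_rows, target, rng):
    """Create each new row on the line segment between two real rare rows."""
    rows = []
    for _ in range(target):
        a, b = rng.sample(rare_rows, 2)  # two distinct real records
        t = rng.random()                 # position along the segment
        rows.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return rows


# Hypothetical data: (claim amount, days since policy start).
legit = [(1200.0 + 10 * i, 400.0 + i) for i in range(1000)]
fraud = [(9800.0, 12.0), (10500.0, 9.0), (8700.0, 20.0), (11200.0, 5.0)]

rng = random.Random(42)
duplicated = upsample_by_duplication(fraud, len(legit), rng)
interpolated = upsample_by_interpolation(fraud, len(legit), rng)

# Balanced training set: label 0 for legitimate claims, 1 for fraud.
balanced = [(row, 0) for row in legit] + [(row, 1) for row in interpolated]
```

The duplicated set contains at most four distinct rows, each an exact copy of a real record, while the interpolated set contains up to 1,000 varied rows that stay within the range of the observed fraud cases.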

AI-generated synthetic data shields insurers from liability and compliance concerns. Because AI-generated synthetic data does not contain information on any real individuals, there is no danger of misusing personal data or failing to comply with existing rules or consumer data protection laws. The patterns and behaviors it captures, however, are still based on real individuals.

Most importantly, AI-generated synthetic data is highly accurate. Insurers also can parse, share and combine it with other sources of information at will. This empowers insurers to be far more nimble, developing and testing new pricing strategies, products and customer communications with unprecedented frequency and speed. The value and advantage of synthetic data have never been more real.

Tobi Hann is CEO of MOSTLY AI.