Can synthetic data replace sensitive information in AI training?

Oct 15, 2024
Lars Snellingen

Training AI models like GANs (Generative Adversarial Networks), LLMs (Large Language Models) and SLMs (Small Language Models) requires large amounts of high-quality data. However, acquiring sufficient real-world data can be difficult or even impossible due to privacy concerns, regulations, or limited availability. This raises an important question: can synthetic data replace sensitive information in AI training?

One of the current challenges in training generative AI models is obtaining sufficient data that is diverse, well-labeled, and of high quality, while also minimizing bias. This is particularly challenging for sensitive data, such as healthcare or financial information, which falls under regulations like GDPR. If the AI needs to be specialized for specific tasks, acquiring enough relevant data becomes even more difficult. Synthetic data, if generated and used properly, can offer a viable solution.

Techniques for generating synthetic data

One approach to generating synthetic data is through Rule-Based Systems, where domain experts create deterministic paths. For example, in healthcare, experts might use known patterns of patient symptoms to generate synthetic records. These systems can also be supplemented by existing high-quality datasets, if available. Although this is a rigorous process, it can yield great value and, when continuously maintained, provide a source of relevant and fresh synthetic data.
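To make this concrete, below is a minimal Python sketch of a rule-based generator for synthetic patient records. The conditions, symptoms, and age ranges are illustrative assumptions, not real clinical rules; in practice, domain experts would define and maintain the rule set.

```python
import random

# Illustrative rules a domain expert might define: each condition maps
# to typical symptoms and a plausible age range (all values invented).
RULES = {
    "influenza": {"symptoms": ["fever", "cough", "fatigue"], "age_range": (5, 90)},
    "migraine": {"symptoms": ["headache", "nausea", "light sensitivity"], "age_range": (15, 65)},
}

def generate_record(rng: random.Random) -> dict:
    """Generate one synthetic patient record from the rule set."""
    condition = rng.choice(list(RULES))
    rule = RULES[condition]
    low, high = rule["age_range"]
    # Sample a subset of the typical symptoms so records vary.
    n_symptoms = rng.randint(1, len(rule["symptoms"]))
    return {
        "age": rng.randint(low, high),
        "symptoms": rng.sample(rule["symptoms"], k=n_symptoms),
        "condition": condition,
    }

rng = random.Random(42)  # fixed seed so runs are reproducible
for record in (generate_record(rng) for _ in range(5)):
    print(record)
```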

Another technique is Agent-Based Modeling, which simulates autonomous agents whose behavior is defined by domain experts. For example, in finance, agents could simulate user transactions and interactions within a banking system. This technique is useful when you need to continuously simulate agents actively interacting with a system.
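As a rough sketch of this idea, the snippet below simulates simple customer agents that emit synthetic transactions over a series of time steps. The spending probabilities and amounts are invented assumptions standing in for expert-defined behavior.

```python
import random
from dataclasses import dataclass

@dataclass
class CustomerAgent:
    """An autonomous agent whose behavior a domain expert would calibrate."""
    agent_id: int
    balance: float
    spend_probability: float  # chance of making a purchase each step
    typical_amount: float     # mean purchase size (illustrative)

    def step(self, rng: random.Random) -> dict | None:
        """Possibly emit one synthetic transaction for this time step."""
        if rng.random() > self.spend_probability:
            return None
        amount = min(self.balance, rng.expovariate(1 / self.typical_amount))
        self.balance -= amount
        return {"agent": self.agent_id, "amount": round(amount, 2)}

rng = random.Random(7)
agents = [
    CustomerAgent(i, balance=1000.0,
                  spend_probability=rng.uniform(0.1, 0.5),
                  typical_amount=rng.uniform(20, 80))
    for i in range(3)
]

transactions = []
for step in range(30):  # simulate 30 time steps
    for agent in agents:
        tx = agent.step(rng)
        if tx is not None:
            transactions.append({"step": step, **tx})

print(f"Generated {len(transactions)} synthetic transactions")
```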

What about anonymizing the sensitive data?

While synthetic data offers an alternative, anonymizing sensitive data is another approach often considered. Using synthetic data derived from real data is possible, but it comes with drawbacks. First, consent is required before the data can be anonymized. Furthermore, there’s a risk of re-identification, where anonymized data could be traced back to individuals by combining it with other datasets.

Anonymizing sensitive data while ensuring quality, usefulness, and accuracy is a complex and costly process. Organizations must weigh this effort against the benefits of generating fully synthetic data instead, which reduces the risk of exposing sensitive data.

Data augmentation techniques, such as adding noise or transforming data, can also help mask real data. These techniques can likewise increase the size of datasets if needed.
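For numeric records, a simple augmentation step might look like the sketch below, which adds column-scaled Gaussian noise to produce perturbed copies of each row. The noise scale and toy dataset are illustrative assumptions.

```python
import numpy as np

def augment_with_noise(data: np.ndarray, noise_scale: float = 0.05,
                       copies: int = 3, seed: int = 0) -> np.ndarray:
    """Create noisy copies of numeric records to mask originals and grow the set.

    Noise is scaled per column by that column's standard deviation, so
    features with different units are perturbed proportionally.
    """
    rng = np.random.default_rng(seed)
    col_std = data.std(axis=0)
    noisy_copies = [data + rng.normal(0.0, noise_scale * col_std, size=data.shape)
                    for _ in range(copies)]
    return np.vstack(noisy_copies)

# Toy dataset: rows are records, columns are features (e.g. age, weight).
original = np.array([[52.0, 120.0], [34.0, 110.0], [61.0, 135.0]])
augmented = augment_with_noise(original)
print(original.shape, "->", augmented.shape)  # (3, 2) -> (9, 2)
```

Note that noise injection on its own is not a formal privacy guarantee; when masking is the primary goal, approaches with formal guarantees, such as differential privacy, are worth considering.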

Will generative AI models trained on synthetic data collapse on themselves?

There is a potential risk that generative AI models trained primarily on synthetic data could experience a “collapse”, where the model’s output loses diversity, misses edge cases, and becomes less accurate over time. This may occur if the synthetic data fails to fully capture the complexity of the intended scenarios. To prevent this, it is essential to continuously refine the synthetic data with domain experts and monitor the model’s performance to detect biases or failures. This helps ensure the model remains accurate and robust across a wide range of situations.
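One simple way to watch for fading diversity is to track a diversity metric, such as the Shannon entropy of the model’s outputs, across training generations. The sketch below illustrates this on hypothetical categorical outputs; the alert threshold is an arbitrary example.

```python
import math
from collections import Counter

def shannon_entropy(samples: list[str]) -> float:
    """Shannon entropy (in bits) of categorical outputs; lower means less diverse."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical model outputs from an early and a later training generation.
generation_1 = ["fraud", "refund", "transfer", "fraud", "deposit", "refund"]
generation_5 = ["fraud", "fraud", "fraud", "refund", "fraud", "fraud"]

e1 = shannon_entropy(generation_1)
e5 = shannon_entropy(generation_5)
print(f"gen 1 entropy: {e1:.2f} bits, gen 5 entropy: {e5:.2f} bits")
if e5 < 0.8 * e1:  # illustrative alert threshold
    print("Warning: output diversity is dropping; review the synthetic data.")
```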

The best approach: a hybrid solution

While synthetic data presents a promising alternative to using sensitive real-world information in AI training, it is not without its challenges. It can help address issues like data scarcity, privacy, and compliance with regulations such as GDPR, but it must be carefully generated and refined to avoid bias and ensure accuracy. Given the risks associated with using synthetic data alone, a hybrid approach that combines synthetic and real-world data is often the most reliable solution for organizations today.
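In practice, building a hybrid training set can be as simple as mixing real and synthetic examples at a controlled ratio, as sketched below. The 30% synthetic share is an illustrative assumption that should be tuned and validated against held-out real data.

```python
import random

def build_hybrid_dataset(real: list, synthetic: list,
                         synthetic_share: float = 0.3,
                         seed: int = 0) -> list:
    """Mix real and synthetic examples so synthetic data makes up
    roughly `synthetic_share` of the combined training set."""
    rng = random.Random(seed)
    n_synth = int(len(real) * synthetic_share / (1 - synthetic_share))
    n_synth = min(n_synth, len(synthetic))  # cap at what is available
    mixed = real + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed

real_data = [f"real_{i}" for i in range(70)]
synthetic_data = [f"synth_{i}" for i in range(100)]
train_set = build_hybrid_dataset(real_data, synthetic_data)
n_synth = sum(example.startswith("synth") for example in train_set)
print(f"{len(train_set)} examples, {n_synth} synthetic")
```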

Practical steps for organizations considering synthetic data

For organizations looking to implement synthetic data, conducting pilot tests to validate its effectiveness is recommended. Combining synthetic data with anonymized real data can further enhance model performance while mitigating risks. Maintaining a continuous feedback loop with domain experts can also help preserve the relevance and quality of synthetic data over time.

About the author

Managing Consultant | Norway
Lars Snellingen is the community lead of SogetiLabs Norway and one of Sogeti’s experts in technical testing, with deep knowledge of UI automation and API automation. He has been involved in, and fully responsible for, testing multiple system transitions from on-premises to cloud in the retail industry.
