Many believe that testing is not representative unless it uses production data. It’s hard to build confidence in a solution that people have never seen behave correctly with actual production data. And yet, testing with production data is often a bad idea.
Let’s first explore a major driver for testing with production data. It originates from the belief that many dangers are hiding in production and that we cannot simulate those conditions with synthetic data. Using actual production data, with all its quirks and edge cases, will reveal issues that we could not imagine. Right?
But that’s not a good reason.
Firstly, this implies that we would learn only at a very late stage that our solution is not designed to handle the variation of data that occurs in production. It’s good that our testing finds this, but it’s too late. Identifying those cases should happen very early in our development track. It is far better to explore our production data thoroughly up front and base our design on a solid understanding of it.
Secondly, how much production data would we need to test with to cover all these cases? After all, many of these special data variations occur only sporadically. Will we find them in one day of production data? One week? One month? Before we know it, we are testing with enormous data volumes, only to gain relative certainty that our solution handles all special cases.
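To put a rough number on this: if a particular data quirk appears in, say, one out of 50,000 records (a made-up rate, purely for illustration, as are the daily volumes), the chance of catching it in a production sample can be computed directly:

```python
# Probability of seeing at least one occurrence of a rare data variation
# in a sample of n production records, if it occurs independently with
# probability p per record. The rate and volumes below are illustrative
# assumptions, not measurements from any real system.
def chance_of_catching(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

p = 1 / 50_000
for label, n in [("one day (~10k records)", 10_000),
                 ("one week (~70k records)", 70_000),
                 ("one month (~300k records)", 300_000)]:
    print(f"{label}: {chance_of_catching(p, n):.0%}")
```

Under these assumptions, a day of data gives well under a 20% chance of containing the quirk even once; only after a month does the probability approach certainty. Rare variations simply demand huge samples.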
This brings us to another important point. Production data is yesterday’s data. It’s in the past: it tells us what transactions our system handled before. Of course, continuity matters, and we need regression testing. But the usage of our solution changes over time, and so does the solution itself (and the services we interface with). We need to be confident that our solution handles tomorrow’s data well. Testing with yesterday’s production data gives us false confidence, so we will need new or updated test data to also cover tomorrow’s data patterns.
There are also practical concerns to testing with production data:
- Size: the larger our test data sets, the more time test execution requires. We should aim for the smallest possible test data sets that still provide sufficient test coverage, especially as we aim for continuous (automated) testing with fast feedback (we prefer minutes over hours, hours over days)
- Control: if we use production data, we must really understand how much variation it actually covers. We must know our data, and that takes time to explore.
- Security & privacy concerns: although there are many excellent tooling solutions for these concerns, caution is warranted because the potential impact of a breach is significant
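The control point above can be made concrete: before trusting a production extract, profile it to see which variations it actually contains. A minimal sketch using only the standard library (the field names and records are invented for illustration):

```python
from collections import Counter

# Hypothetical production extract: a handful of transaction records.
records = [
    {"currency": "EUR", "status": "settled"},
    {"currency": "EUR", "status": "settled"},
    {"currency": "USD", "status": "settled"},
    {"currency": "EUR", "status": "reversed"},
]

def profile(records: list, field: str) -> Counter:
    """Count the distinct values of one field across all records."""
    return Counter(r.get(field) for r in records)

print(profile(records, "currency"))  # which currencies does this sample cover?
print(profile(records, "status"))    # which lifecycle states appear at all?
```

Even this tiny profile answers the control question: the extract contains no reversed USD transaction, so a test run on it says nothing about that combination.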
So why do we still use production data to test?
Often, it’s simply the easiest option: the data is there, so let’s use it. Creating synthetic data requires analysis and effort. It requires a good understanding of your data: how records connect, how they might change, what the edge cases could be…
On the other hand, these are exactly the reasons to use synthetic data. The additional effort pays for itself in test coverage, risk reduction and added confidence. Often, we can construct very small data sets that provide much higher coverage than weeks of production data. Small test data sets mean fast test execution, which increases our agility and speed of learning.
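As an illustration, a deliberately small synthetic set can enumerate each known edge case exactly once, instead of hoping it shows up somewhere in bulk data. Everything below (the field names, the validator under test) is a hypothetical example, not code from any real system:

```python
# A tiny synthetic data set: one record per known edge case, derived from
# analysis of the data model rather than copied from production.
synthetic_cases = [
    {"amount": "10.00", "note": "happy path"},
    {"amount": "0.00",  "note": "zero amount"},
    {"amount": "-5.00", "note": "negative amount"},
    {"amount": "",      "note": "missing amount"},
    {"amount": "1e99",  "note": "absurdly large amount"},
]

def is_valid_amount(raw: str) -> bool:
    """Hypothetical validator under test: accept positive numeric amounts only."""
    try:
        return float(raw) > 0
    except ValueError:
        return False

# Five records exercise five distinct behaviours; weeks of production data
# might never contain the negative or empty variants even once.
results = {c["note"]: is_valid_amount(c["amount"]) for c in synthetic_cases}
```

Five rows, milliseconds of execution, and every branch of the validator is exercised: that is the coverage-per-record trade-off in miniature.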
This also means that synthetic test data need to be good. They should not be based on a theoretical expectation of how our solution behaves, but on sound analysis of our data & information architecture, complemented by thorough exploration of actual production data. If that analysis is not done properly, we will lose confidence.
This brings us to a last point. Humans are human. We will not believe that something works based on a theoretical exercise (which is how synthetic data is sometimes perceived). We only believe it works once we have seen it working with data we recognize. Business users in particular need this confidence, and there is nothing wrong with that. We can complement our synthetic data with (anonymized) production data sets: especially during demonstrations and (user) acceptance testing, show the solution with production data. If it helps people sleep better, why not.
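When production data does appear in demos or acceptance tests, it can be pseudonymized first. A minimal sketch using a salted hash (the record layout and salt handling are simplified for illustration; real anonymization needs proper tooling and review):

```python
import hashlib

SALT = "demo-only-salt"  # in practice: a managed secret, never hard-coded

def pseudonymize(record: dict, sensitive_fields: set) -> dict:
    """Replace sensitive field values with a stable salted hash token."""
    out = {}
    for key, value in record.items():
        if key in sensitive_fields:
            digest = hashlib.sha256((SALT + str(value)).encode()).hexdigest()
            out[key] = digest[:12]  # short token; same input -> same token
        else:
            out[key] = value
    return out

customer = {"name": "Alice Example", "iban": "XX0012345678", "amount": "10.00"}
safe = pseudonymize(customer, {"name", "iban"})
```

Because the token is deterministic, the same customer stays recognizable as "the same row" across screens, which preserves the demo's credibility without exposing the real values.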
But let’s no longer rely on production data alone for actual testing when synthetic test data can help us maximize test coverage, speed up feedback loops and reduce security & privacy concerns.
Synthetic data need to be good. Don’t just use the obvious theoretical cases. Do your homework & analysis.