Synthetic Data Generation: Mitigate the risk of using production data in testing

Synthetic data is data that is artificially created to carry all the characteristics of production data minus the sensitive content. It is used for a wide range of use cases, including as test data for new products and tools, for model validation, and in AI model training.
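As a minimal illustration of the idea (a sketch using the open-source Python library Faker, not any specific vendor engine; the record schema is hypothetical), production-like records can be generated with no real personal data in them:

    # Sketch: generate production-like customer records with no real PII.
    # Uses the open-source Faker library; the schema below is hypothetical.
    from faker import Faker

    fake = Faker()
    Faker.seed(42)  # a fixed seed makes the data set reproducible and reusable

    def synthetic_customer() -> dict:
        """One realistic but entirely fabricated customer record."""
        return {
            "name": fake.name(),
            "email": fake.email(),
            "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
            "iban": fake.iban(),
        }

    customers = [synthetic_customer() for _ in range(1000)]  # scale as needed

Each record looks and behaves like production data, but none of it belongs to a real customer.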

Why do banks and financial service providers need synthetic data?

Complex legacy systems, strict regulations, growing security concerns, and the inability to move, share and scale data all hold back data-centricity and innovation in the financial services sector. Banks need to leverage flexible and, above all, compliant synthetic data to cover a variety of internal business functions.

Test data is essential to building, deploying and testing functional software. However, test data provisioning has become a bottleneck that threatens efficiency and productivity gains: testers typically spend up to 50% of their time looking for data, and as much as 20% of the total software development lifecycle is spent waiting for it.

As an example, take a bank that wants to release a new mobile banking application for which no data is available yet, and whose product team is cut off from core production. Leveraging realistic synthetic test data enables such a team to deliver new product features, to speed up testing and deployment by identifying and fixing issues earlier, and ultimately to ship a more user-friendly product.

Synthetic data generation is not only more efficient in terms of time, quality and cost; it often proves easier and more secure than fully masking production data.

QA and testing challenges

  • Product development is increasingly data-intensive, while data access is more restricted.
  • Large volumes of quality production data are inaccessible due to privacy policies and limiting legacy systems.
  • Using traditional data masking tools to anonymize production data for testing endangers privacy and compromises data integrity. Masking or subsetting production data, or creating data manually from scratch, is time-consuming, and is further complicated by data being stored inconsistently across different versions of spreadsheets. Masking production data does not guarantee compliance either, as the real danger remains human error: 58% of data breaches are staff-related.
  • Data sharing with third-party vendors further complicates access issues. Organizations want to outsource testing and/or development without exposing production data to unauthorized users.
  • Data provisioning can take weeks or even months, requiring the involvement of various departments.
  • Developers often end up manually populating environments that fail to provide the scale and complexity necessary for releasing customer-centric products.

In the world of Agile and DevOps, testing is done within sprints, so test data has to be provisioned to teams continuously and refreshed during each sprint. Instead of masking production data, automated synthetic test data generation should be used to create all the data needed for testing systematically and quickly.

Testers and developers need access to ‘fit for purpose’ data that can cover 100% of test cases, delivered to the right place, at the right time.

The real issue with using production data in non-production environments is quality. Much production data is very similar, collected from common or "business as usual" transactions, and it is sanitized to exclude the bad data that would break systems. Testing based on it therefore tends to neglect non-functional and negative testing.

Synthetic data can include future scenarios that have never occurred before, as well as "bad data", outliers and unexpected results. This enables effective negative testing: it provides a systematic way to generate extreme data cases and 'bad path' scenarios for maximum test coverage.
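To make this concrete, here is a small, hypothetical sketch of what systematically generated negative-test data for a single payment-amount field might look like; the field, ranges and values are purely illustrative:

    # Sketch: boundary, outlier and malformed values for a payment "amount"
    # field, so tests cover 'bad paths' as well as happy paths.
    BOUNDARY_AMOUNTS = [0.00, 0.01, 999_999_999.99]          # edges of the assumed valid range
    OUTLIER_AMOUNTS  = [-1.00, 10_000_000_000.00]            # outside any plausible range
    BAD_AMOUNTS      = ["", "abc", None, "1,000.00", "NaN"]  # malformed or wrong type

    def negative_test_cases():
        """Yield (value, expected outcome) pairs for systematic negative testing."""
        for value in BOUNDARY_AMOUNTS:
            yield value, "accepted"
        for value in OUTLIER_AMOUNTS + BAD_AMOUNTS:
            yield value, "rejected with a clear validation error"

    for value, expected in negative_test_cases():
        print(f"amount={value!r}: expect {expected}")

Synthetic generation makes it cheap to enumerate such cases exhaustively, where sanitized production data would contain almost none of them.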

It is also an ideal solution when developing new features or applications for which the available data is insufficient, or missing altogether, to run the test scenarios needed to assess the quality of your application.

Synthetic data generation can also help resolve infrastructure, storage and system constraints. Copying production data is a prohibitively slow and expensive process: some organizations find themselves with as many as 20 copies of a single database, incurring high hardware, license and support costs.

ConnectIQ for AI-generated Synthetic Data:

  • Don’t waste time, money and effort searching for data manually!
  • Accelerate data provisioning times by 95%
  • Deliver zero-byte virtual data copies in just minutes!
Through a powerful synthetic data generation engine, ConnectIQ provides testers with data before testing starts: realistic data tailored to their specific testing and development needs. Any generated data can be stored as a reusable asset and used on demand.

ConnectIQ uses AI to rapidly generate large sets of synthetic test data, eliminating the risk of a data breach by creating production-like data without the sensitive content. These test data sets are reusable, include outliers, boundary conditions and erroneous data, and can be shared with outsourced testers or uploaded for application testing in the cloud as safely and easily as on-premises.

It also enhances existing subsets of production data with rich, sophisticated sets of synthetic data, reducing infrastructure costs by covering all required combinations with a minimal set of test data.
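The combination-covering idea can be sketched with pairwise (all-pairs) test design. The greedy routine below is purely illustrative, with hypothetical parameters, and is not ConnectIQ's actual algorithm:

    # Sketch: greedy all-pairs (pairwise) row selection. Every pair of
    # parameter values ends up covered by far fewer rows than the full
    # Cartesian product. Illustrative only, with hypothetical parameters.
    from itertools import combinations, product

    params = {
        "account_type": ["retail", "corporate", "private"],
        "currency":     ["EUR", "USD", "GBP"],
        "channel":      ["mobile", "web", "branch"],
    }
    names = list(params)
    candidates = [dict(zip(names, values)) for values in product(*params.values())]

    # Every cross-parameter value pair that must appear in some row.
    uncovered = {((a, va), (b, vb))
                 for a, b in combinations(names, 2)
                 for va in params[a] for vb in params[b]}

    def new_pairs(row):
        """Pairs this row would cover that are still uncovered."""
        pairs = {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}
        return pairs & uncovered

    rows = []
    while uncovered:                 # greedily take the most useful row each pass
        best = max(candidates, key=lambda row: len(new_pairs(row)))
        rows.append(best)
        uncovered -= new_pairs(best)

    print(f"{len(rows)} rows cover every value pair; the full product has {len(candidates)}")

For the three 3-value parameters above, the greedy pass covers every pair with far fewer rows than the 27 full combinations, and the gap widens rapidly as parameters are added.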

