Financial institutions are under pressure to do two things at once: innovate with artificial intelligence (AI) and protect sensitive commercial and client data. Synthetic data is increasingly seen as the solution. It allows firms to train models and test systems without using real customer or market data. On the surface, this may appear to reduce risk, making it attractive for credit scoring, fraud detection, regulatory testing and a range of financial modelling applications.

But our research on synthetic data governance in finance discovered a more complicated reality. Synthetic data does not remove risk; it redistributes it. Instead of sitting in the data itself, risk shifts into the systems that generate the data and the assumptions built into them. If you overlook this shift, you may end up relying on systems you do not fully understand.

The first issue is that synthetic data adds another layer of complexity to already complex systems. Financial models are rarely transparent, and introducing synthetic data means there are now two systems at work: one generating the data and another using it to make decisions. This makes outcomes harder to explain and much harder to validate. When a model produces an unexpected result, it is no longer clear whether the issue lies in the model itself or in the way the data was generated.

Treating synthetic data as a simple input adds a layer of opacity that can easily go unnoticed, because important choices are made during
the data generation process. For example, when generating synthetic mortgage data, model developers must decide how much variation or ‘noise’ to introduce. Too much can distort relationships between income, repayment history, and default risk, making borrower behaviour less realistic. Too little may leave the synthetic data too close to the original records, increasing privacy risks. As a result, it becomes harder to trace whether problematic decisions stem from the original data, the process of generating synthetic data, or the model itself. Over time, this can weaken model governance and reduce the ability of firms to explain, challenge, and control their own systems.

This matters even more when synthetic data is used for validation or testing, as firms may believe they are stress-testing or checking their models when they are actually relying on data produced by another model with its own limitations. This can create a false sense of confidence. A model may appear robust simply because it performs well on synthetic data that reflects the same underlying logic.

The second issue is less obvious but more serious. Synthetic data
can create hidden connections between firms’ data infrastructures. If multiple organisations rely on similar synthetic data generation tools, they may begin to see risks and react in similar ways. This is how herding behaviour emerges. It does not require firms to coordinate. It only requires them to rely on similar data. Over time, this can amplify market movements and increase systemic risk. What looks like independent decision-making may in fact be driven by shared assumptions embedded in similar synthetic data.

There is also a strategic risk that many managers overlook: loss of control. The process of generating synthetic data involves a series of judgements and design choices that shape outcomes in ways that are not always visible. When firms rely on external providers, they begin to lose control of key decisions about how their data is produced. The assumptions that shape their models sit inside systems they do not own and may not fully understand. Over time, this creates a form of dependency. It weakens internal expertise and makes it harder to question or challenge model outputs. You may still own the model, but you no longer fully control, or fully understand, the data that feeds it.

Another common misconception is that synthetic data reduces bias. In reality, it reflects the data it is trained on. If the original data contains bias, synthetic data can reproduce it or even amplify it. Because the data appears artificial and ‘cleaner’, it may be trusted more than it should be, creating a false sense of security. The issue becomes even more complex when synthetic data is used to correct or rebalance biased datasets. While this can improve
Warwick Business School | wbs.ac.uk