Partner blog

Synthetic data – a miracle cure or a data protection headache?

1. What is synthetic data?

Synthetic data, a term lacking a precise legal definition, broadly refers to data artificially generated to resemble the characteristics of real data, including their structure and statistical distribution. A more nuanced definition specifies that synthetic data is generated through the utilization of a mathematical model or algorithm, aiming to generate data that is statistically realistic yet inherently 'artificial'.

The generation of synthetic data can take various forms, including its production from real datasets or its creation "from scratch" by leveraging knowledge and expertise gathered by data analysts on specific dependencies. It can also result from a combination of these approaches, incorporating both real data and expert knowledge to create synthetic datasets.

The primary objective of synthetic data is to preserve the characteristics and properties of real data tailored to a specific use case. Notably, the determination of which properties of the real data should be preserved hinges on the intended purpose of the data usage. For instance, distinct data qualities are required when assessing the storage capacities of an IT system compared to using the data for training an artificial intelligence (AI) model in cancer detection.

In certain applications, a close resemblance between synthetic data and real data may not be essential. For example, when synthetic data is used to train self-driving vehicles, risky situations may need to occur more frequently in the dataset than in real-life driving conditions. Hence, the use case plays a crucial role in shaping the approach to generating synthetic data.

2. Why is synthetic data useful?

The progress and evolution of technology, particularly in the realm of AI, hinge on the availability of extensive datasets. Synthetic data emerges as a crucial asset when real-life data is inaccessible or insufficient due to scarcity, lack of variability, or legal constraints such as the General Data Protection Regulation (GDPR), intellectual property rights or trade secret protection. Synthetic data also assumes a pivotal role in overcoming the labour-intensive and costly nature of labelling real-life data.

In practical terms, since the data is generated, it can lower the costs and resources involved in collecting the required data. Using "dummy" data for initial AI model training provides developers with a strategic advantage, yielding faster results before transitioning to real data. Numerous practical examples underscore the utility of synthetic data, particularly in training machine learning models and conducting data analysis. Amazon’s Alexa, for instance, reportedly undergoes training on synthetic data. To witness the generation of synthetic data first-hand, one can explore the Random Face Generator at https://this-person-does-not-exist.com/en.

Synthetic data contributes to enriching virtual reality (VR) and augmented reality (AR) experiences by creating realistic virtual environments. In cybersecurity, the simulation of diverse cyber threats using synthetic data is crucial for training and testing defence mechanisms. Meteorology leverages synthetic data to enhance weather forecasting models, simulating a spectrum of atmospheric conditions for more accurate predictions. In autonomous vehicle development, synthetic data is used for simulating diverse road conditions and obstacles, aiding in the training of algorithms.

One of the most promising applications of synthetic data lies in health research and innovation. It is being explored whether virtual, computer-generated patients can prove valuable in the development of medical drugs and devices, potentially providing a way to reduce reliance on human testing and shorten testing times.

In another notable instance, synthetic data was employed to address the underrepresentation of diverse skin types in existing datasets. Recognizing a bias towards predominantly light skin samples in data repositories, a more inclusive set of skin images was created using synthetic data. This initiative aimed to train detection models capable of effectively recognizing potentially malignant skin conditions, such as melanoma, across a spectrum of shades.

In essence, synthetic data stands not merely as a solution to data challenges but as a transformative force, reshaping technology across diverse applications. Its seamless integration into various fields reflects its pivotal role in advancing and revolutionizing the capabilities of artificial intelligence and data-driven technologies.

3. Does the GDPR apply to synthetic data?

The relationship between synthetic data and the GDPR is a subject of debate, with most researchers agreeing that synthetic data is not automatically “private” or placed outside of the realm of data protection laws. Legal considerations predominantly arise when creating synthetic data from real-life datasets containing personal data, as seen for example in medical datasets. In such cases, the process begins with collecting and preparing actual personal data for training AI models that generate synthetic data. From a GDPR perspective, creating synthetic data based on personal data requires processing of the latter.

This imposes several requirements on the developers. For example, they need to implement the GDPR principle of data minimization (Article 5.1c), by pseudonymizing the input data and removing direct identifiers from it. Another crucial principle is ensuring the integrity and confidentiality of input personal data (Article 5.1f), particularly by incorporating technical and organizational security measures (Article 32) to safeguard it from unlawful disclosure. As with any personal data processing, there is a need for a legal basis for using input personal data for synthetic data generation.

Opinion 05/2014 of the Article 29 Working Party on Anonymisation Techniques states that anonymisation, as an instance of further processing of personal data, can be compatible with the original purposes of the processing if the result is truly anonymous data. According to some authors, a similar argument can be made for synthetic data generation "provided that the data synthesis is carried out adequately and synthetic data is reliably produced", or, with a higher standard, provided that the synthetic data is anonymous (non-personal).

This leads to the immediate question of whether synthetic data is ‘personal data’ governed by data protection law. On the face of it, one may argue that since the data is purposely disrupted and changed (there is no one-to-one mapping from synthetic records back to the person), it is automatically non-personal. However, studies indicate that a sufficient level of anonymization is not achieved in all cases. Even if the data was generated from initially de-identified data (where direct identifiers, such as names, were removed), there remains a risk that an individual can be indirectly identified, either from the synthetic data itself or in combination with other available sources.

The potential risk becomes especially relevant in cases where a model is vulnerable to 'overfitting'. In such instances, the model excessively focuses on the details of the training data, essentially memorizing examples from that data and reproducing them in the synthetic data. Consequently, this phenomenon exposes a vulnerability in synthetic data, as it has “the capacity to leak information about the data it was derived from”, rendering it susceptible to privacy attacks.
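To make the memorization risk more concrete, one simple check is to count how many synthetic records reproduce a training record verbatim. The following is only a minimal sketch: the function names and the toy records are hypothetical, and records are assumed to be plain tuples. A copy rate of zero does not show that a dataset is anonymous, since near-copies and indirect identification are not covered by this check.

```python
def copied_records(synthetic_rows, training_rows):
    """Return synthetic rows that also appear verbatim in the training data."""
    training_set = set(training_rows)
    return [row for row in synthetic_rows if row in training_set]


def copy_rate(synthetic_rows, training_rows):
    """Fraction of synthetic rows that are exact copies of training rows."""
    if not synthetic_rows:
        return 0.0
    return len(copied_records(synthetic_rows, training_rows)) / len(synthetic_rows)


# Hypothetical toy records: (age, sex, systolic blood pressure)
training_rows = [(34, "F", 120), (51, "M", 135), (67, "F", 150)]
synthetic_rows = [(35, "F", 122), (51, "M", 135), (60, "M", 140), (67, "F", 150)]

leaked = copied_records(synthetic_rows, training_rows)
rate = copy_rate(synthetic_rows, training_rows)
print(leaked, rate)
```

In this toy example, two of the four synthetic records are exact copies of training records, which is the kind of outcome an overfitted generator could produce.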

As a result, conducting a thorough assessment of any synthetic data becomes imperative to ascertain its personal or non-personal status. Notably, the European Data Protection Supervisor (EDPS) has emphasized that this assessment should evaluate the extent to which data subjects can be identified in the synthetic data and the amount of new data about those subjects that would be revealed upon successful identification.

Nevertheless, such an assessment is not a straightforward process. From a legal perspective, the assessment of synthetic data under the GDPR is influenced by the ongoing debate on the limits of “personal data”. This topic is very complex (see the recent rulings in CJEU Case C-319/22 and General Court Case T-557/20), resulting in a lack of agreed standards and a potentially expansive definition of 'personal data'. Essentially, debates concerning the risk of identification within the GDPR definition of personal data often centre on determining whose perspective should decide if a piece of information qualifies as personal. Additionally, there is a need to establish a threshold for 'reasonable likelihood' as a measure to assess the risk of re-identification. Another persistent issue associated with synthetic data involves the potential deduction of sensitive information about an individual, even in cases where the identifiability test fails to yield a positive outcome.

Even if the synthetic data falls short of the anonymity threshold, replacing collected personal data with artificially generated data offers an additional layer of security for personal data. The AEPD and ICO consider synthetic data a privacy-enhancing technology (PET) which aims to weaken or break the connection between an individual in the original personal data and the derived data. Some researchers propose combining synthetic data with other PETs, such as differential privacy, to enhance privacy protection while retaining utility.
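As an illustration of the basic idea behind differential privacy, the sketch below releases a count with Laplace noise calibrated to a privacy parameter epsilon. It is a simplified example under stated assumptions: the function names and the toy data are hypothetical, and in synthetic data generation the noise would typically be applied inside the generator rather than to a single statistic as shown here.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise.

    Adding or removing one record changes a count by at most 1, so the
    noise scale is 1 / epsilon. A smaller epsilon means more noise.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [23, 35, 41, 58, 62, 67, 70]  # hypothetical toy data
released = noisy_count(ages, lambda age: age >= 60, epsilon=0.5, rng=rng)
print(released)
```

The released value fluctuates around the true count of 3. Lowering epsilon strengthens the privacy guarantee but makes the released figure less accurate, which mirrors the balance between privacy protection and utility mentioned above.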

4. Can synthetic data be regulated so its status is made clear?

The term “synthetic data” is making its way into EU regulations. In particular, recital 7 of the Data Governance Act states that “There are techniques enabling analyses on databases that contain personal data, such as anonymisation, differential privacy, generalisation, suppression and randomisation, the use of synthetic data or similar methods and other state-of-the-art privacy-preserving methods that could contribute to a more privacy-friendly processing of data”. While the Data Governance Act acknowledges the value of synthetic data as a PET, it does not offer a legal definition nor position regarding its status as personal or non-personal data.

As mentioned above, the current stance of certain data protection authorities and privacy professionals is that synthetic data must be evaluated within the framework of the GDPR, and the privacy implications of any synthetic dataset are highly contingent on the specific context. This perspective is seen as a potential barrier to advancing the use of synthetic data in research. Concerns have been raised regarding the complex legal requirements and GDPR compliance processes that must be adhered to, which could impede technological progress and hinder the widespread adoption of synthetic data. It may be tempting to suggest that the complexities related to the qualification of synthetic data could be easily resolved by the establishment of a legal definition enacted by European Union legislators. Such hope was kindled by the Artificial Intelligence Act (AI Act) proposal where in Article 54.1 (b) it is mentioned that:

“In the AI regulatory sandbox personal data lawfully collected for other purposes may be processed for the purposes of developing, testing and training of innovative AI systems in the sandbox under the following cumulative conditions:

a) (…)

b) the data processed are necessary for complying with one or more of the requirements referred to in Title III, Chapter 2 [namely those applicable to high-risk AI systems] where those requirements cannot be effectively fulfilled by processing anonymised, synthetic or other non-personal data;”

Attention has been paid to the part of the provision in which the categories of anonymised, synthetic or other non-personal data are mentioned together. As some argue, this wording suggests – by legal implication – that synthetic data is considered as a type of non-personal data. However, in our assessment, this conclusion appears somewhat premature.

The origin of synthetic data is an important factor in assessing whether it qualifies as personal data. When synthetic data is created from original personal data, a crucial trade-off emerges where utility and anonymity are inherently interconnected. The more utility a synthetic dataset provides, the lower its anonymity (meaning the higher the risk of reidentification), and vice versa. Therefore, striking a balance between absolute anonymity and utility preservation is a nuanced task when synthetic data is generated from real personal data, and it is unlikely that a unanimous consensus will emerge asserting that synthetic data is unequivocally non-personal in all instances. Conversely, synthetic data generated through assumptions, bypassing the direct processing of personal data, will not need to face these challenges.
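The trade-off can be illustrated with a toy experiment. In the sketch below, adding random noise to real records is used as a stand-in for a synthetic data generator, which is an assumption made purely for illustration and not a real synthesis method. Utility is approximated by how well the correlation between two columns is preserved, and anonymity by the average distance from each generated record to its closest real record. All names, data and noise levels are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def perturb(rows, sigma, rng):
    """Stand-in 'generator': add Gaussian noise to every value (illustration only)."""
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in rows]


def mean_distance_to_closest(generated_rows, real_rows):
    """Average distance from each generated row to its nearest real row."""
    total = sum(min(math.dist(g, r) for r in real_rows) for g in generated_rows)
    return total / len(generated_rows)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
# Hypothetical real data with a strong linear relationship between two columns
real_rows = [(float(x), 2.0 * x + rng.gauss(0, 1)) for x in range(50)]


def evaluate(sigma):
    """Return (utility proxy, anonymity proxy) for a given noise level."""
    generated = perturb(real_rows, sigma, rng)
    xs, ys = zip(*generated)
    return pearson(xs, ys), mean_distance_to_closest(generated, real_rows)


utility_low_noise, distance_low_noise = evaluate(0.1)
utility_high_noise, distance_high_noise = evaluate(20.0)
print(utility_low_noise, distance_low_noise)
print(utility_high_noise, distance_high_noise)
```

With little noise the correlation is preserved almost perfectly, but each generated record sits very close to a real one. With heavy noise the records move away from the real data, and the correlation that made the data useful weakens.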

In this regard, critical perspectives caution policymakers against assuming equal effectiveness across all forms of data synthesis. Experts also advise that context and practice will have a major influence on the risk of re-identification. They argue that Data Protection Authorities (DPAs) and the community should arrive at "appropriate standards and approaches to assessing identifiability of specific synthetic data generation methods, utilizing quantitative metrics as far as possible".

Time will tell whether these comments will be taken on board in the final version of the AI Act. In the AI Act amendments adopted on 14 June 2023 by the European Parliament, a reference to synthetic data was added in Article 10.5, which outlines conditions for processing special categories of data to detect negative biases in high-risk AI systems. One of the conditions is that “the bias detection and correction cannot be effectively fulfilled by processing synthetic or anonymised data”. Unlike Article 54.1(b), this addition does not imply that synthetic data is a category of “non-personal data”. Interestingly, the original proposal’s text in Article 54.1(b) remains unchanged. At the time of writing of this blog, the final text of the provisional agreement reached between the Council presidency and the European Parliament’s negotiators is yet to be disclosed, and therefore it remains to be seen how (and whether) the final text tackles the status of synthetic data.

5. What should I do if I am planning to generate or use synthetic data?

Some of the good practices related to synthetic data include:

Establishing a clear legal basis: If the starting point for generating the synthetic data involves personal data, the processing of this personal data must comply with the GDPR. Accordingly, organizations should carefully assess the legal justification for processing personal input data, ensuring that it falls under an appropriate legal basis.
Transparency and Accountability: Organizations must be transparent when collecting and processing personal data of individuals for the purpose of generating synthetic data. Furthermore, keeping detailed records of processing personal data for the purpose of synthetic data generation is crucial, demonstrating the organization's dedication to transparency and accountability.
Striking a balance: Similar to data anonymization, producing synthetic data requires finding a balance between utility and anonymity. If the synthetic data too closely resembles actual data, while being valuable for researchers, it can compromise the privacy of data subjects and remain within the realm of personal data. This could pose significant challenges in terms of data protection compliance. For example, ensuring data accuracy, addressing correction requests, and handling objections from individuals regarding their data would be challenging, if not impossible, because synthetic data is artificially generated and does not correspond to real-world information about specific individuals.
Privacy Assurance Assessments: In order to ensure that synthetic data does not qualify as personal data, it is crucial to conduct privacy assurance assessments. This involves evaluating the risk of re-identification, ensuring data minimization, and implementing appropriate safeguards to protect individual privacy. Ongoing research is exploring methods and metrics to assess the probability of re-identification of synthetic datasets.
Documentation and monitoring: As with any AI training, careful documentation of input data and the process of synthetic data creation is essential. Expert analysis and oversight, from both domain experts and data scientists, are important in the generation as well as evaluation stages of synthetic data. Organizations must ensure the level of data quality appropriate to the foreseen use case and incorporate the principle of data protection by design into the synthetic data generation life cycle.

It's important to acknowledge that since synthetic data is relatively new, the rules for its use and its legal implications in various domains are still unclear. Extra caution is advised in scenarios where the data is to be used for training and validation of AI models intended to be classified as medical devices. Notably, concerns have been raised about the use of synthetic data for clinical validation, underlining the absence of a foundation for it in the Medical Device Regulation. Standards for evaluating synthetic data quality, in terms of both completeness and accuracy, are still being refined and, as mentioned above, are highly context dependent. In this evolving landscape, organizations must tread cautiously.

The risks associated with synthetic data include the potential for inaccuracies arising from flawed input data or background information, as well as the risk of bias in data creation due to inadequately balanced input information. Additionally, concerns arise about users' ability to understand the underlying logic applied by machine learning in generating synthetic values, raising questions about the transparency and trustworthiness of the data. In this dynamic realm of synthetic data, where standards and risks are under ongoing scrutiny, ensuring compliance and responsible data management calls for careful consideration.

Authors: Nayana Murali, Magdalena Kogut-Czarkowska
