Training AI systems with user data is under fire. Public posts, photos and reactions are often used without explicit permission, which can mean people lose control over their personal data. This has prompted major privacy objections, and legislation has been tightened. But what are the alternatives? We list them.
The Dutch Data Protection Authority recently warned users of Instagram and Facebook that parent company Meta wants to train artificial intelligence (AI) on data from social media posts. Users who do not want that have until 27 May 2025 to object; if they do, Meta will not automatically use their data to train Meta AI.
In Europe there is frequent debate about compliance with the General Data Protection Regulation (GDPR). Some companies, such as Meta, use an opt-out model, in which users must actively object. This is seen as problematic because explicit permission (opt-in) is often required. It is also often unclear to users how their data is used or what the consequences are, and this lack of transparency feeds distrust.
1) Synthetic data
Instead of real user data, companies can use synthetic data: data generated by algorithms that mimic the patterns and properties of real data sets without containing personal or sensitive information. One common approach uses statistical models, which analyze the relationships and patterns in real data sets and use them to create similar data. A data set with demographic data, for example, can be simulated by sampling from statistical distributions.
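As an illustration, here is a minimal Python sketch of that statistical approach: it estimates simple distributions (mean and standard deviation, category frequencies) from a small "real" demographic sample and draws new synthetic records from them. The column names and sample values are hypothetical.

```python
# Minimal sketch: generate synthetic demographic records by sampling from
# distributions estimated on a (here: hypothetical) real data set.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" data: ages and regions of a handful of people.
real_ages = np.array([23, 35, 41, 29, 52, 38, 47, 31])
real_regions = np.array(["north", "south", "south", "east", "north",
                         "west", "south", "east"])

# Estimate the distributions: normal for age, category frequencies for region.
age_mean, age_std = real_ages.mean(), real_ages.std()
regions, counts = np.unique(real_regions, return_counts=True)
region_probs = counts / counts.sum()

# Draw synthetic records that follow the same patterns,
# without copying any individual from the real data.
n_synthetic = 5
synthetic_ages = rng.normal(age_mean, age_std, n_synthetic).round().astype(int)
synthetic_regions = rng.choice(regions, size=n_synthetic, p=region_probs)

for age, region in zip(synthetic_ages, synthetic_regions):
    print(f"age={age}, region={region}")
```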
Synthetic data is also often produced through simulation, in which scenarios are run to generate data that reflect certain circumstances; an example is simulating traffic flows to create data for city planning (see the sketch below). Open source tools are available, such as Blueen.ai and Substra, that generate synthetic data for specific applications such as healthcare and financial analyses. Synthetic data offers benefits such as improved privacy and accessibility, but it requires careful implementation to guarantee its quality and reliability. One party that works a lot with synthetic data is Statistics Netherlands (CBS).
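The simulation approach can be sketched with a toy traffic model: hourly vehicle counts at an intersection are drawn from a Poisson process with a rush-hour profile. All rates and multipliers below are hypothetical and only illustrate the idea.

```python
# Minimal sketch: simulate hourly traffic counts at an intersection to
# generate synthetic data for city planning. All rates are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical expected vehicles per hour: quiet at night, peaks at rush hour.
base_rate = 120
rush_hours = {8: 4.0, 9: 3.0, 17: 3.5, 18: 4.5}  # hour -> multiplier

for hour in range(24):
    rate = base_rate * rush_hours.get(hour, 1.0 if 7 <= hour <= 21 else 0.2)
    # Vehicle arrivals are modeled as a Poisson process.
    count = rng.poisson(rate)
    print(f"{hour:02d}:00  {count:4d} vehicles")
```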
2) Federated Learning
Another alternative to training AI systems on user data is ‘Federated Learning’, a decentralized and privacy-friendly form of machine learning. Instead of sending data to a central server, the machine learning model is brought to the data. Sensitive data thus remains local and is not shared, while only intermediate results, such as model updates, are exchanged. This approach protects the privacy of users and satisfies strict regulations such as the GDPR.
The AI model is trained locally on an organization's own data; only the anonymized intermediate results are shared with other organizations or with a central server. Sensitive data remains local and is not exposed to risks such as data breaches, which improves privacy protection. Federated Learning is used in sectors such as healthcare, where it helps analyze medical data without violating patients' privacy. In cancer research, for example, hospitals analyze data on treatment methods without sharing patient records. Google also uses Federated Learning in its Android platform, for personalized recommendations on mobile devices where the data stays on the device; an example is improving predictive text on smartphones. You can read more about Federated Learning on the websites of TNO and Active Collective.
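To make the mechanism concrete, here is a minimal sketch of federated averaging (FedAvg) on a simple linear model. Each "client" (for example, a hospital) trains a few gradient steps on its own private data; only the model weights travel to the server, which averages them. The data, model and hyperparameters are toy assumptions, not any vendor's actual implementation.

```python
# Minimal sketch of federated averaging (FedAvg) on a linear model.
# Each client trains on its own local data; only the model weights
# leave the client, never the raw data. Everything here is a toy setup.
import numpy as np

rng = np.random.default_rng(seed=0)

def local_train(weights, X, y, lr=0.01, steps=50):
    """Run a few gradient-descent steps on this client's local data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three clients (e.g. hospitals), each with private local data drawn
# from the same underlying relationship y = 3*x1 - 2*x2 + noise.
true_w = np.array([3.0, -2.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

# Federated rounds: broadcast the global model, train locally,
# then average the returned weights on the server.
global_w = np.zeros(2)
for round_ in range(10):
    local_weights = [local_train(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_weights, axis=0)

print("learned weights:", global_w)  # approaches [3, -2]
```

The key property shown here is that the raw arrays `X` and `y` never leave `local_train`; only the weight vectors are aggregated.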
3) Public data sets
In the Netherlands, public data sets are often used as an alternative to user data when training AI models. An example is the StatLine data set from Statistics Netherlands (CBS), an extensive collection of public data on topics such as demography, the economy and health. AI models trained on it can analyze trends such as population growth or economic developments. The Land Registry (Kadaster) also has an interesting data set in its large-scale topographical map (the Basic Registration of Addresses and Buildings); AI systems trained on that data are used for geographical analyses, such as planning infrastructure projects.
Health data, such as the RIVM's data on infectious diseases and environmental factors, are another public source that can be used to train AI models to predict health risks and develop preventive measures. Vektis offers a data set on healthcare costs by postcode; AI is used to analyze patterns in care use and to improve policy. These data sets offer a valuable source for AI development without using personal data.
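As a small sketch of working with such open data, the snippet below fetches a StatLine table and fits a simple trend line. It assumes the `cbsodata` package published by CBS (`pip install cbsodata`); the table id and column names are placeholders, so look up real ones at https://opendata.cbs.nl before running it.

```python
# Minimal sketch: fetch a public StatLine table from CBS open data and
# fit a simple trend model on it. The table id and column names are
# placeholders; real ones can be found at https://opendata.cbs.nl.
import numpy as np
import pandas as pd
import cbsodata

TABLE_ID = "00000NED"  # placeholder table id

df = pd.DataFrame(cbsodata.get_data(TABLE_ID))
print(df.columns)

# Hypothetical columns: a yearly period and a population count.
years = df["Perioden"].str[:4].astype(int).to_numpy()
values = df["Bevolking"].astype(float).to_numpy()

# Fit a linear trend: a very small "model" of population growth.
slope, intercept = np.polyfit(years, values, deg=1)
print(f"estimated growth per year: {slope:.0f}")
```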
Still using user data? Ask for explicit permission
Companies that nevertheless want to train AI systems on user data must comply with strict rules and request explicit permission. They should clearly inform users and ask for consent before any data is used, for example via pop-ups or settings in apps. Companies that work with explicit permission often do so both to comply with privacy legislation such as the GDPR and to build trust with their users.
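In code, the opt-in principle amounts to a hard gate in the data pipeline: a record only reaches the training set if its user explicitly consented. The record structure and consent flag below are hypothetical, just to illustrate the pattern.

```python
# Minimal sketch of an opt-in gate: only data from users who gave
# explicit permission is included in the training set. The record
# structure and consent flag are hypothetical.
from dataclasses import dataclass

@dataclass
class UserRecord:
    user_id: str
    text: str
    ai_training_consent: bool  # set only after an explicit opt-in

records = [
    UserRecord("u1", "public post A", ai_training_consent=True),
    UserRecord("u2", "public post B", ai_training_consent=False),
    UserRecord("u3", "public post C", ai_training_consent=True),
]

# The training pipeline never sees records without explicit consent.
training_set = [r.text for r in records if r.ai_training_consent]
print(training_set)  # ['public post A', 'public post C']
```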
Some companies already do this. Meta asks users in the European Union for explicit permission to use public posts, comments and chatbot interactions for training its AI models. This is done through notifications in the app and by e-mail, through which users can also actively object. Meta uses this training, among other things, to better understand and respond to European languages and cultures.
Google likewise requires explicit permission for the use of data in applications such as Google Assistant and personalized advertisements. Users can adjust their settings to determine which data is used; applications include improved speech recognition and personalized recommendations.