
Training Data Requirements

The AI Act specifically addresses requirements for training, validation, and test data used in high-risk AI systems. This warrants a dedicated topic covering data sourcing, quality standards, documentation requirements, and characteristics that must be maintained for AI model development.

Keywords: training data, training dataset, data quality, data documentation, data sourcing, data labeling, data annotation, dataset documentation

Overview

Legal Framework

The AI Act addresses training data requirements primarily through Recitals 108 and 109. Recital 108 establishes that providers of general-purpose AI models must implement a policy to comply with EU copyright law and publicly disclose a summary of the content used for training. The AI Office's monitoring role is limited to verifying the existence of this policy and summary, not conducting a detailed copyright assessment of each data source. Recital 109 clarifies that these obligations must be proportionate and commensurate to the type of model provider, explicitly excluding persons developing or using models for non-professional or scientific research purposes, though voluntary compliance is encouraged.

Practical Application

While the AI Act's recitals set the framework, the practical application of data quality principles is informed by established EU data protection jurisprudence. The Court of Justice of the EU, in cases like Peter Puškár, has consistently held that all processing of personal data must comply with core data quality principles. When training data contains personal data, the legitimate interest balancing test from Google Spain SL v. AEPD is relevant, requiring a careful assessment that weighs the controller's interests against the data subject's rights, considering factors like the sensitivity and age of the data. For copyright compliance, providers must document their sourcing policy and create a transparent summary, but the burden of detailed copyright verification is not imposed by the AI Office under this regulation.

Key Considerations

  • Develop a Copyright Compliance Policy: High-risk and general-purpose AI providers must establish and document a concrete policy for ensuring training data sourcing respects Union copyright law, accompanied by a publicly available summary of training content.
  • Apply Data Protection Principles to Personal Data: If training datasets contain personal data, ensure processing complies with data quality principles (lawfulness, fairness, purpose limitation) and conduct a documented legitimate interest assessment where applicable.
  • Assess Proportionality of Obligations: The requirements are not one-size-fits-all; evaluate whether your organization qualifies as a provider under the Act and tailor compliance efforts to be proportionate, noting the exemption for non-professional and scientific research activities.
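The policy and summary obligations above imply keeping an internal register of data sources alongside a public-facing summary. A minimal sketch of what such a register might look like is shown below; the schema (field names like `origin` and `lia_documented`) is hypothetical and not prescribed by the AI Act, and the split between internal fields and the disclosed summary is one possible design, not an official format.

```python
from dataclasses import dataclass
import json

@dataclass
class TrainingDataSource:
    """One entry in a provider's internal training-data register (hypothetical schema)."""
    name: str                      # human-readable dataset name
    origin: str                    # e.g. "licensed", "public web crawl", "user-submitted"
    license_terms: str             # license or legal basis for use under copyright law
    contains_personal_data: bool   # if True, GDPR data-quality principles apply
    lia_documented: bool           # legitimate interest assessment on file, if applicable

def public_summary(sources: list[TrainingDataSource]) -> str:
    """Render the publicly disclosable summary of training content.

    Internal legal assessments stay in the register; only high-level
    provenance information goes into the public artifact.
    """
    entries = [
        {"name": s.name, "origin": s.origin, "license_terms": s.license_terms}
        for s in sources
    ]
    return json.dumps({"training_content_summary": entries}, indent=2)

# Example register with a single (fictitious) licensed source.
register = [
    TrainingDataSource(
        name="NewsCorpus-2023",
        origin="licensed",
        license_terms="commercial text-and-data-mining license",
        contains_personal_data=True,
        lia_documented=True,
    ),
]
print(public_summary(register))
```

Keeping the personal-data flags in the internal record but out of the public summary mirrors the distinction drawn in Recital 108: the AI Office verifies that a policy and summary exist, while the detailed legal assessment of each source remains the provider's internal responsibility.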

Laws (10)

Case Law (7)

Subject:

Raad van State

By decision of 12 July 2022, the Minister of Justice and Security rejected a request by [wederpartij] for disclosure of information under the Open Government Act (Wet open overheid, "Woo"). [wederpartij] was and is a suspect in a criminal investigation with the code name 'Peen'. On 15 April 2022, [wederpartij] requested disclosure of documents under the Woo. [wederpartij] submitted two lists, consisting of a part 1 and a part 2, specifying all the information he wishes to obtain. [wederpartij] is seeking information that he cannot obtain in his criminal case. Insofar as relevant in these proceedings before the preliminary relief judge, the disclosure request concerns documents relating to the initiation of, and decision-making within, the investigations with the code names 13Yucca and 26Werl and the investigation 26Argus that arose from them. These investigations yielded data that were used in the Peen investigation, which concerns [wederpartij]. [wederpartij] has also requested disclosure of documents relating to the initiation of, and decision-making within, the Joint Investigation Team.

GC and Others v CNIL

C-136/17 (GC and Others)

Conditions for delisting sensitive data from search results.

Peter Nowak v Data Protection Commissioner

C-434/16 (Nowak)

Examination scripts constitute personal data of the candidate.

Peter Puškár v Finančné riaditeľstvo Slovenskej republiky and Kriminálny úrad finančnej správy

Puškár

Lawful basis (in general): Subject to the exceptions permitted under Article 13 of the Data Protection Directive, all processing of personal data must comply, first, with the principles relating to data quality set out in Article 6 of that directive and, second, with one of the criteria for making data processing legitimate listed in Article 7 of that directive (see Bara). The list of lawful bases in Article 7 is an exhaustive and restrictive list of the cases in which the processing of personal data can be regarded as lawful.

Google Spain SL v. AEPD (the DPA) & Mario Costeja González, 13 May 2014 ("Google v. Spain")

Google Spain

Legitimate interest balancing test: Legitimate interest requires balancing the interests of the controller and third parties against the interests of the data subject. In this particular case, having regard to the sensitivity for the data subject's private life of the information contained in the announcements and the fact that the initial publication occurred 16 years earlier, the data subject established that the links should be removed. (¶¶ 70–75, 80–81, 98)

Google Spain SL and Google Inc. v AEPD and Mario Costeja González

C-131/12 (Google Spain)

Established the right to be forgotten (delisting). Search engines are data controllers.

Digital Rights Ireland Ltd v Minister for Communications

C-293/12 (Digital Rights Ireland)

Invalidated Data Retention Directive as incompatible with fundamental rights.

Guidance (7)

Guidelines 8/2020 on the targeting of social media users

Guidelines on the targeting of social media users

Guidelines 05/2020 on consent under Regulation 2016/679

Guidelines on consent

Guidelines 05/2022 on the use of facial recognition technology in the area of law enforcement (Dutch-language version)

guidelines on the use of facial recognition in law enforcement

More and more law enforcement authorities apply or intend to apply facial recognition technology. The technology can be used to authenticate or identify a person and can be deployed on video (e.g. CCTV) or photographs, but also for other purposes, including looking up persons on police watch lists or tracking a person's movements in public spaces. Facial recognition technology is based...

Guidelines 4/2019 on Article 25 Data Protection by Design and by Default

guidelines on privacy by design and by default

Guidelines 02/2021 on virtual voice assistants

Guidelines on virtual voice assistants

A virtual voice assistant (VVA) is a service that understands voice commands and executes them, or mediates with other IT systems if needed. VVAs are currently available on most smartphones and tablets, traditional computers, and, in recent years, even standalone devices like smart speakers. VVAs act as an interface between users and their computing devices and online services such as search engines or online shops. Due to their role, VVAs have access to a huge amount of personal...

Guidelines 05/2022 on the use of facial recognition technology in the area of law enforcement

Guidelines on the use of facial recognition technology in the area of law enforcement

More and more law enforcement authorities (LEAs) apply or intend to apply facial recognition technology (FRT). It may be used to authenticate or to identify a person and can be applied to videos (e.g. CCTV) or photographs. It may be used for various purposes, including to search for persons in police watch lists or to monitor a person's movements in public spaces. FRT is built on the processing of biometric data and therefore encompasses the processing of special categories ...

Guidelines 02/2021 on virtual voice assistants (Dutch-language version)

guidelines on virtual voice assistants

A virtual voice assistant (VVA) is a service that understands and executes voice commands, or acts as an intermediary to other IT systems where needed. Today a VVA is available as an option on most smartphones, tablets and regular computers, and for several years now even on standalone devices such as smart speakers. A VVA functions as a link between the user and their device or an online service such as a search engine...

News (1)