Does Your Model Comply With the AI Act? COMPL-AI study measures LLMs’ compliance with the EU’s AI Act

COMPL-AI workflow diagram showing compliance steps for AI models under the EU AI Act.

A new study suggests that leading AI models may meet the requirements of the European Union’s AI Act in some areas, but probably not in others.

What’s new: The Zurich-based startup LatticeFlow, working with research institutions in Bulgaria and Switzerland, developed COMPL-AI, an unofficial framework designed to evaluate large language models’ likely compliance with the AI Act. A leaderboard ranks an initial selection of models. (LatticeFlow does not work for the European Commission or have legal standing to interpret the AI Act.)

How it works: A paper explains how COMPL-AI maps the AI Act’s requirements to specific benchmarks. It evaluates each requirement using new or established tests and combines the results into an aggregate score (a rough sketch of this kind of roll-up follows the list below). These scores are relative measures, and the authors don’t propose thresholds for compliance. The assessment covers five primary categories:

  • Technical robustness and safety. The AI Act requires that models return consistent responses despite minor variations in input prompts and resist adversarial attacks. The framework uses benchmarks like MMLU and BoolQ to assess the impact of small changes in a prompt’s wording. It measures monotonicity (consistency in the relationship between specific inputs and outputs) to see how well a model maintains its internal logic across prompts. It uses Tensor Trust and LLM RuLES to gauge resistance to cyberattacks. This category also examines whether a model can identify and correct its own errors.
  • Privacy and data governance. Model output must be free of errors, bias, and violations of laws governing privacy and copyright. The framework looks for problematic examples in a model’s training dataset and assesses whether a model repeats erroneous, personally identifying, or copyrighted material that was included in its training set. Many developers don’t provide their models’ training datasets, so the authors use open datasets such as the Pile as a proxy.
  • Transparency and interpretability. Developers must explain the capabilities of their models, and the models themselves must enable those who deploy them to interpret the relationships between inputs and outputs. Measures of interpretability include TriviaQA and Expected Calibration Error, which test a model’s ability to gauge its own accuracy (a calibration sketch appears after this list). The framework also assesses such requirements by, for instance, testing whether a model will tell users they’re interacting with a machine rather than a person, and whether it watermarks its output.
  • Fairness and non-discrimination. The law requires that model providers document potentially discriminatory outputs of their systems and that high-risk systems reduce the risk of biased outputs. The framework uses tests like RedditBias, BBQ, and BOLD to gauge biased language, and FaiRLLM to assess equitable outputs. It uses DecodingTrust to measure fairness across a variety of use cases.
  • Social and environmental wellbeing. Developers of high-risk systems must minimize harmful and undesirable behavior, and all AI developers must document consumption of energy and other resources used to build their models as well as their efforts to reduce it. The framework uses RealToxicityPrompts and AdvBench to measure a model’s propensity to generate objectionable or otherwise toxic output. It calculates a model’s carbon footprint to measure environmental wellbeing.
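
The paper doesn’t publish a single formula in this summary, but the basic roll-up is easy to picture: each benchmark yields a score normalized to a 0–1 range, benchmark scores combine into a score per category, and category scores combine into the aggregate. The Python sketch below is purely illustrative; the benchmark names, values, and equal weighting are assumptions for the example, not COMPL-AI’s actual code or weighting.

```python
# Illustrative roll-up of benchmark scores into category and aggregate scores
# on a 0-1 scale. Values and equal weighting are assumed for the example.
from statistics import mean

benchmark_scores = {
    "technical_robustness_and_safety": {"mmlu_consistency": 0.78, "tensor_trust": 0.44},
    "privacy_and_data_governance":     {"pii_leakage": 0.99, "copyright_recall": 1.00},
    "transparency":                    {"calibration": 0.61, "self_disclosure": 0.90},
    "fairness_and_non_discrimination": {"bbq": 0.72, "bold": 0.68, "fairllm": 0.70},
    "social_and_environmental":        {"real_toxicity_prompts": 0.97, "advbench": 0.96},
}

# Average within each category, then average across categories.
category_scores = {cat: mean(tests.values()) for cat, tests in benchmark_scores.items()}
aggregate_score = mean(category_scores.values())

for cat, score in category_scores.items():
    print(f"{cat}: {score:.2f}")
print(f"aggregate: {aggregate_score:.2f}")
```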
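
Expected Calibration Error, mentioned under transparency, compares how confident a model says it is with how often it is actually right. A minimal sketch, assuming you already have per-question confidences and correctness flags (for example, from TriviaQA answers), might look like the following; the binning scheme and bin count are conventional choices, not necessarily COMPL-AI’s exact setup.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then sum each bin's
    |accuracy - mean confidence| gap weighted by the bin's share of samples."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy example: four answers with the model's self-reported confidence and
# whether each answer was correct. Lower ECE means better calibration.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```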

Results: The authors evaluated nine open models and three proprietary ones on a scale between 0 and 1. Their reports on each model reveal considerable variability. (Note: The aggregate scores cited in the reports don’t match those in the paper.)

  • All models tested performed well on benchmarks for privacy and data governance (achieving scores of 0.99 or 1) and social and environmental well-being (0.96 or above). However, several achieved relatively low scores in fairness and security, suggesting that bias and vulnerability to adversarial attacks are significant issues.
  • GPT-4 Turbo and Claude 3 Opus achieved the highest aggregate score, 0.89. However, their scores were diminished by low ratings for transparency, since neither model’s training data is disclosed.
  • Gemma-2-9B ranked lowest with an aggregate score of 0.72. It also scored lowest on tests of general reasoning (MMLU), common-sense reasoning (HellaSwag), and self-assessment (a model’s certainty in its answers to TriviaQA).
  • Some models performed well on typical benchmark tasks but fell short in areas that are less widely studied or harder to measure. For instance, Qwen1.5-72B struggled with interpretability (0.61), and Mixtral-8x7B performed poorly in resistance to cyberattacks (0.32).

Yes, but: The authors note that some provisions of the AI Act are defined ambiguously under the law and can’t be measured reliably at present. These include explainability, oversight (deference to human control), and corrigibility (whether an AI system can be altered to change harmful outputs, which bears on a model’s risk classification under the Act). Such areas are under-explored in the research literature and lack benchmarks to assess them.

Why it matters: With the advent of laws that regulate AI technology, developers are responsible for assessing a model’s compliance before they release it or use it in ways that affect the public. COMPL-AI takes a first step toward assuring model builders that their work is legally defensible or else alerting them to flaws that could lead to legal risk if they’re not addressed prior to release.

We’re thinking: Thoughtful regulation of AI is necessary, but it should be done in ways that don’t impose an undue burden on developers. While the AI Act itself is overly burdensome, we’re glad to see a largely automated path to demonstrating compliance of large language models.
