Figure: Benchmark comparison of R1 1776 and DeepSeek-R1 across MMLU, DROP, MATH-500, and AIME 2024.

Large language models built by developers in China may, in some applications, be less useful outside that country because they avoid topics its government deems politically sensitive. A developer fine-tuned DeepSeek-R1 to widen its scope without degrading its overall performance.

What’s new: Perplexity released R1 1776, a version of DeepSeek-R1 that responds more freely than the original. The model weights are available to download under a commercially permissive MIT license.

How it works: The team modified DeepSeek-R1’s knowledge of certain topics by fine-tuning it on curated question-answer pairs.

  • Human experts identified around 300 topics that are censored in China.
  • The authors developed a multilingual classifier that spots text related to these topics.
  • They identified 40,000 prompts that the classifier flagged as sensitive with high confidence and discarded those that contained personally identifiable information.
  • For each prompt, they produced factual, chain-of-thought responses that mirrored DeepSeek-R1's typical reasoning processes.
  • They fine-tuned DeepSeek-R1 on the resulting prompt-response pairs (a rough sketch of the pipeline follows this list).
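
Perplexity's write-up includes no code, so the Python sketch below is only an illustration of how the steps above might fit together. The keyword classifier, regex PII check, 0.9 confidence threshold, and canned response generator are placeholder assumptions, not Perplexity's actual components.

```python
# Minimal sketch of the data-curation pipeline described above. All names,
# keywords, and thresholds are illustrative stand-ins, not Perplexity's code.
import re
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    response: str  # factual, chain-of-thought style answer

def classify_sensitive(prompt: str) -> float:
    """Stand-in for the multilingual topic classifier: returns a confidence
    score in [0, 1] that the prompt touches one of the censored topics."""
    sensitive_keywords = {"placeholder-topic-a", "placeholder-topic-b"}
    return 1.0 if any(k in prompt.lower() for k in sensitive_keywords) else 0.0

def contains_pii(prompt: str) -> bool:
    """Stand-in for a PII filter; here, a crude check for email addresses."""
    return re.search(r"\b\S+@\S+\.\S+\b", prompt) is not None

def generate_response(prompt: str) -> str:
    """Stand-in for producing a factual answer that mirrors DeepSeek-R1's
    chain-of-thought format (reasoning inside <think> tags)."""
    return f"<think>step-by-step reasoning about: {prompt}</think> factual answer"

def build_finetuning_set(prompts: list[str], threshold: float = 0.9) -> list[Example]:
    """Keep high-confidence sensitive prompts, drop PII, pair with responses."""
    dataset = []
    for p in prompts:
        if classify_sensitive(p) < threshold:  # keep only confident hits
            continue
        if contains_pii(p):                    # discard prompts with personal data
            continue
        dataset.append(Example(prompt=p, response=generate_response(p)))
    return dataset

# Example: only the flagged, PII-free prompt survives.
data = build_finetuning_set([
    "Tell me about placeholder-topic-a.",
    "Email me at user@example.com about placeholder-topic-b.",
    "What's the weather like?",
])
print([ex.prompt for ex in data])
```

In practice, the classifier would be a trained multilingual model and the responses would come from a strong reasoning model, but the filtering logic would look much like this.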

Results: The fine-tuned model responded factually to politically charged prompts, and the fine-tuning did not degrade its ability to generate high-quality output.

  • The authors fed their model 1,000 diverse prompts that covered frequently censored topics. An unspecified combination of human and AI judges rated the models’ responses on the degree to which they were (i) evasive and (ii) censored outright (a scoring sketch follows this list).
  • All of the fine-tuned model’s responses were rated uncensored, whereas the original model censored around 85 percent of sensitive queries. By comparison, DeepSeek-V3 censored roughly 73 percent, Claude-3.5-Sonnet around 5 percent, o3-mini about 1 percent, and GPT-4o 0 percent.
  • Evaluated on four language and math benchmarks (MMLU, DROP, MATH-500, and AIME 2024) and unspecified internal benchmarks, the fine-tuned and original models performed nearly identically. Their scores differed by a few tenths of a percentage point except on AIME 2024 (competitive high-school math problems), where the fine-tuned model achieved 79.8 percent compared to the original’s 80.96 percent.
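
To make the scoring concrete, here is a minimal sketch of how a censorship rate might be computed over the 1,000 prompts. The three labels and the keyword-matching judge are our assumptions for illustration; the actual evaluation used human and AI raters with an unspecified rubric.

```python
# Illustrative scoring loop for the censorship evaluation. The trivial keyword
# judge is a placeholder for the article's mix of human and AI raters.
from collections import Counter

def judge(response: str) -> str:
    """Hypothetical rater: labels a response 'censored', 'evasive', or
    'uncensored'."""
    refusals = ("i cannot discuss", "i can't help with")
    hedges = ("it's complicated", "let's talk about something else")
    text = response.lower()
    if not text.strip() or any(r in text for r in refusals):
        return "censored"
    if any(h in text for h in hedges):
        return "evasive"
    return "uncensored"

def rate(responses: list[str]) -> dict[str, float]:
    """Fraction of responses per label, e.g. {'censored': 0.33, ...}."""
    counts = Counter(judge(r) for r in responses)
    return {label: counts[label] / len(responses)
            for label in ("censored", "evasive", "uncensored")}

# Example: one refusal among three responses yields a 33 percent censorship rate.
print(rate([
    "Here is a factual overview of the topic...",
    "I cannot discuss that.",
    "The historical record shows...",
]))
```

A real judge would replace the keyword checks with a rubric-following model or a human rater; the aggregation into per-model rates stays the same.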

Behind the news: China was among the first countries to regulate AI. Its rules require AI developers to build models that uphold “Core Socialist Values” and produce true and reliable output; when these objectives conflict, the political goal tends to dominate. Large language models built by developers in China typically avoid contentious topics, but the newer DeepSeek models enforce these restrictions more strictly than older models like Qwen and Yi, using methods akin to Western measures for aligning output, such as Reinforcement Learning from Human Feedback and keyword filters.

Why it matters: AI models tend to reflect their developers’ values and legal constraints. Perplexity’s targeted fine-tuning approach addresses this barrier to international adoption of open-source models.

We’re thinking: As models with open weights are adopted by the global community, they become a source of soft power for their developers, since they tend to embody their developers’ values. This work is a welcome effort to customize a model to reflect the user’s values instead, though how many developers will seek out a fine-tuned version rather than the original remains to be seen.
