TL;DR
Developers have released an open-source project that aims to reproduce DeepSeek-R1’s full pipeline. It includes datasets, training scripts, and evaluation tools, a notable step for transparency and collaboration in AI research.
Developers have publicly released an open reproduction of DeepSeek-R1 covering datasets, training scripts, and evaluation tools, so that the AI research community can replicate and build on the model’s pipeline.
The project, hosted in a public repository, aims to recreate the entire DeepSeek-R1 pipeline, from data generation through model training to evaluation. It provides scripts for training models with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), and for generating training data from both the smaller distilled models and the original DeepSeek-R1.
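To make the GRPO step more concrete, here is a minimal sketch of the group-relative advantage that gives the method its name: several completions are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. This is an illustration only, not the repository's code. The function name, the `eps` value, the use of the population standard deviation, and the example rewards are all assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the statistics of its own group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` (an assumed value) avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt
# (1.0 = verified correct answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.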
On May 26, 2025, the team announced the completion of the first step: the release of Mixture-of-Thoughts, a reasoning dataset of 350,000 verified traces spanning mathematics, coding, and science tasks. The dataset is designed for training reasoning capabilities into language models and is a core component for reproducing DeepSeek-R1’s reasoning performance.
The project also includes datasets like CodeForces-CoTs and IOI24, which enable training models that outperform some existing systems on competitive programming and reasoning benchmarks. The repository provides detailed instructions for installation, environment setup, and training, supporting large-scale models on high-performance hardware such as H100 GPUs.
While the repository offers comprehensive scripts and datasets, some aspects, such as the exact training procedures used in DeepSeek-R1’s original pipeline, remain partially inferred from the published reports. The developers emphasize that the project is a work in progress, encouraging community contributions to refine and extend the reproduction efforts.
Implications for AI Transparency and Collaboration
This open release marks a significant step toward transparency in large language model development, allowing researchers to verify, reproduce, and improve upon DeepSeek-R1’s architecture and training methods. It democratizes access to advanced AI training pipelines, potentially accelerating innovation and reducing duplication of effort across the community.
By sharing datasets, code, and training protocols, the project fosters collaborative research, enabling more rigorous benchmarking and improvements in reasoning, coding, and scientific AI applications. It also sets a precedent for open science in AI, encouraging other organizations to share their models and methodologies.

Background and Progress of DeepSeek-R1 Reproduction
DeepSeek-R1 is a prominent large language model developed by DeepSeek AI and known for its reasoning and coding capabilities. DeepSeek released the model weights, but the training data and training code were not published, which limited external validation and collaborative development.
The current effort builds on earlier dataset releases such as OpenR1-Math-220k and CodeForces-CoTs, which showed strong results on specialized math and coding tasks. The project follows a multi-stage plan of data distillation, supervised fine-tuning, and reinforcement learning, as outlined in DeepSeek’s technical report.
Recent months have seen incremental releases of datasets and benchmarks, culminating in the May 26, 2025, release of Mixture-of-Thoughts. This marks a milestone in making DeepSeek-R1’s capabilities accessible for broader research and development.
“This open release is about transparency and community collaboration. We want to enable everyone to understand, verify, and improve upon DeepSeek-R1’s architecture.”
— Lead Developer of the Reproduction Project

Unresolved Aspects of the Reproduction Effort
While the datasets and scripts are publicly available, it remains unclear how closely the reproduced models match the original DeepSeek-R1 in terms of performance on all benchmarks. The exact training procedures and hyperparameter choices used in the original model are not fully disclosed, which may affect replication fidelity.
Additionally, the community has yet to verify the robustness of the reproduction across different hardware setups and to assess how well the models generalize beyond the provided benchmarks.

Future Steps for Community Engagement and Model Validation
Moving forward, the project team plans to encourage external validation and contributions, including refining datasets, optimizing training procedures, and benchmarking the reproduced models against the original. They aim to expand the dataset library and improve the reproducibility of the entire pipeline.
Community members are expected to test the models on diverse tasks, report performance metrics, and contribute improvements via the project’s GitHub repository. The team also plans to document best practices for training and evaluation to facilitate wider adoption.

Key Questions
Can I use the reproduced DeepSeek-R1 models for commercial purposes?
The repository’s licensing terms are not specified here; users should review the license files in the repository to determine usage rights, especially for commercial applications.
What hardware do I need to reproduce DeepSeek-R1?
The project recommends high-performance GPUs like H100s for training large models. Smaller models or datasets can be trained on less powerful hardware, but performance and fidelity may vary.
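As a rough guide to why multi-GPU hardware comes up at all, the sketch below estimates the memory needed just to hold model states during mixed-precision training with Adam. It uses the common rule of thumb of about 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for full-precision weights and the two Adam moments). The model sizes are hypothetical examples rather than figures from the project, and activations, batch size, and sequence length are ignored, so real requirements will be higher.

```python
import math

# Rule-of-thumb bytes per parameter for mixed-precision Adam training:
# 2 (half-precision weights) + 2 (half-precision gradients)
# + 12 (full-precision weights, momentum, and variance).
BYTES_PER_PARAM = 2 + 2 + 12

# An H100 carries 80 GB of memory; smaller cards can be passed in instead.
GPU_MEMORY_GB = 80


def model_state_gb(num_params):
    """Memory for weights, gradients, and optimizer state, in GB."""
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus_for_model_states(num_params, gpu_memory_gb=GPU_MEMORY_GB):
    """Lower bound on GPUs needed if model states are sharded evenly."""
    return math.ceil(model_state_gb(num_params) / gpu_memory_gb)


# Hypothetical model sizes chosen for illustration only.
for num_params in (1.5e9, 7e9):
    print(
        f"{num_params / 1e9:.1f}B params: "
        f"{model_state_gb(num_params):.0f} GB of model states, "
        f"at least {min_gpus_for_model_states(num_params)} GPU(s)"
    )
```

Under these assumptions a 7-billion-parameter model needs about 112 GB for model states alone, which already exceeds a single 80 GB card before any activations are counted. This is a lower bound for planning, not a substitute for the hardware guidance in the repository.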
Are there any limitations to the current reproduction effort?
Yes, some training details and hyperparameters are inferred, and the exact performance match with the original DeepSeek-R1 has not been fully validated. The project is ongoing and community feedback is encouraged.
How can I contribute to this open project?
Contributions are welcomed via the project’s GitHub repository, including code improvements, dataset expansion, and benchmarking reports. Follow the contribution guidelines provided in the repo.
Source: Hacker News