References
By DeepSeek
- DeepSeekMoE by Dai et al. (2024) Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. https://arxiv.org/pdf/2401.06066
- DeepSeek-V3 by Liu et al. (2024) DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
- DeepSeek-R1 by Guo et al. (2025) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/pdf/2501.12948
- DeepSeek-V3.2 (2025) Pushing the Frontier of Open Large Language Models. https://arxiv.org/pdf/2512.02556
- DeepSeek-V4 by DeepSeek (2026) DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- mHC: Manifold-Constrained Hyper-Connections by Xie et al. (2026) https://arxiv.org/pdf/2512.24880
Others
- Attention Is All You Need by Vaswani et al. (2017) https://arxiv.org/abs/1706.03762
- GPT-2 by Radford et al. (2019) Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- GPT-3 by Brown et al. (2020) Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
- GPT-4 by Achiam et al. (2023) GPT-4 Technical Report. https://arxiv.org/pdf/2303.08774
- RoFormer by Su et al. (2021/2024) RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864v5; https://doi.org/10.1016/j.neucom.2023.127063
- Multi-token Prediction by Gloeckle et al. (2024) Better & Faster Large Language Models via Multi-token Prediction. https://arxiv.org/pdf/2404.19737
- Hyper-Connections by Zhu et al. (ByteDance) (2025) https://openreview.net/pdf?id=9FqARW7dwB
Source Code
The companion code repository: https://github.com/VizuaraAI/DeepSeek-From-Scratch