Andy

Flash MLA curated references

Flash MLA Official GitHub Repo: FlashMLA - deepseek-ai - GitHub

DeepSeek Official Announcement of Flash MLA on X:

Hacker News Discussion: DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs | Hacker News

DeepSeek Open Source Week series

Day 1: Flash MLA

🚀 Day 1 of #OpenSourceWeek: FlashMLA

Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.

✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800

🔗 Explore on GitHub: https://github.com/deepseek-ai/FlashMLA
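
To make the "paged KV cache (block size 64)" item more concrete, here is a minimal sketch of the general idea: each sequence's cache is split into fixed-size blocks, and a per-sequence block table maps token positions to physical blocks. This is not FlashMLA's API. The names `blocks_needed`, `build_block_table`, and `locate` are hypothetical, and the sequential block allocation is an assumption made only for illustration.

```python
BLOCK_SIZE = 64  # block size quoted in the announcement


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size blocks required to hold seq_len cached tokens."""
    return (seq_len + block_size - 1) // block_size  # integer ceiling division


def build_block_table(seq_lens, block_size: int = BLOCK_SIZE):
    """Assign physical block ids to each sequence (hypothetical sequential allocator)."""
    table = []
    next_free = 0
    for seq_len in seq_lens:
        count = blocks_needed(seq_len, block_size)
        table.append(list(range(next_free, next_free + count)))
        next_free += count
    return table


def locate(table, seq_idx: int, token_pos: int, block_size: int = BLOCK_SIZE):
    """Map (sequence, token position) to (physical block id, offset inside the block)."""
    block_id = table[seq_idx][token_pos // block_size]
    return block_id, token_pos % block_size


if __name__ == "__main__":
    # Three sequences of different lengths share one pool of blocks.
    table = build_block_table([100, 64, 1])
    print(table)
    print(locate(table, 0, 70))
```

Because blocks have a fixed size, sequences of very different lengths can share one memory pool, and at most one partially filled block is wasted per sequence.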

Day 2: DeepEP

🚀 Day 2 of #OpenSourceWeek: DeepEP

Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference.

✅ Efficient and optimized all-to-all communication
✅ Both intranode and internode support with NVLink and RDMA
✅ High-throughput kernels for training and inference prefilling
✅ Low-latency kernels for inference decoding
✅ Native FP8 dispatch support
✅ Flexible GPU resource control for computation-communication overlapping

🔗 GitHub: https://github.com/deepseek-ai/DeepEP

Day 3: DeepGEMM

🚀 Day 3 of #OpenSourceWeek: DeepGEMM

Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.

⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs
✅ No heavy dependency, as clean as a tutorial
✅ Fully Just-In-Time compiled
✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
✅ Supports dense layout and two MoE layouts

🔗 GitHub: https://github.com/deepseek-ai/DeepGEMM

Day 4: Optimized Parallelism Strategies

🚀 Day 4 of #OpenSourceWeek: Optimized Parallelism Strategies

✅ DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
🔗 https://github.com/deepseek-ai/DualPipe

✅ EPLB - an expert-parallel load balancer for V3/R1.
🔗 https://github.com/deepseek-ai/eplb

📊 Analyze computation-communication overlap in V3/R1.
🔗 https://github.com/deepseek-ai/profile-data

Day 5: 3FS

🚀 Day 5 of #OpenSourceWeek: 3FS, Thruster for All DeepSeek Data Access

Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.

⚡ 6.6 TiB/s aggregate read throughput in a 180-node cluster
⚡ 3.66 TiB/min throughput on GraySort benchmark in a 25-node cluster
⚡ 40+ GiB/s peak throughput per client node for KVCache lookup
🧬 Disaggregated architecture with strong consistency semantics
✅ Training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search & KVCache lookups for inference in V3/R1

📥 3FS → https://github.com/deepseek-ai/3FS
⛲ Smallpond - data processing framework on 3FS → https://github.com/deepseek-ai/smallpond
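
For a rough sense of scale, the aggregate 3FS figures quoted above can be turned into per-node averages. This is a back-of-the-envelope sketch only: it assumes the load is spread evenly across every node in the stated cluster size, which the announcement does not claim, and the helper name `per_node_gib_per_s` is hypothetical.

```python
TIB_IN_GIB = 1024  # 1 TiB = 1024 GiB


def per_node_gib_per_s(aggregate_tib: float, nodes: int, seconds: float = 1.0) -> float:
    """Average GiB/s per node, assuming throughput is evenly spread (an assumption)."""
    return aggregate_tib * TIB_IN_GIB / seconds / nodes


# 6.6 TiB/s aggregate read throughput across 180 nodes
read_per_node = per_node_gib_per_s(6.6, 180)

# 3.66 TiB/min GraySort throughput across 25 nodes (per minute, so 60 seconds)
sort_per_node = per_node_gib_per_s(3.66, 25, seconds=60.0)

if __name__ == "__main__":
    print(f"read: {read_per_node:.1f} GiB/s per node")
    print(f"sort: {sort_per_node:.1f} GiB/s per node")
```

Under that even-spread assumption, the read figure works out to roughly 37.5 GiB/s per node and the GraySort figure to roughly 2.5 GiB/s per node.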
