More Photos of AWS Community Day in China (Shenzhen)
Long-Context problems:
- Concurrency and overall performance degrade as context length increases.
- Prefill latency grows superlinearly with context length (attention cost is roughly quadratic in the number of tokens).
- Decoding latency and context-switching costs grow roughly linearly as context length grows (see the scaling sketch below).
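A back-of-the-envelope sketch (not from the talk) of why prefill and decode scale so differently; the hidden size and the FLOP counting below are illustrative assumptions.

```python
# Rough attention FLOPs for one layer with hidden size d and context length n.
# Prefill attention scales ~n^2 (it processes the whole prompt at once), while
# each decode step only attends one new query over the n cached entries (~n).

def prefill_attention_flops(n: int, d: int) -> int:
    # QK^T: (n x d)(d x n) ~ 2*n*n*d, plus scores x V: (n x n)(n x d) ~ 2*n*n*d
    return 4 * n * n * d

def decode_step_attention_flops(n: int, d: int) -> int:
    # one new query against n cached keys, then against n cached values
    return 4 * n * d

d = 4096  # illustrative hidden size
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: prefill {prefill_attention_flops(n, d):.2e} FLOPs, "
          f"per-token decode {decode_step_attention_flops(n, d):.2e} FLOPs")
```

Going from 10k to 100k tokens multiplies prefill work by roughly 100x but per-token decode work by only 10x, which is the gap the optimizations below target.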
Long-Context optimization (Hardware):
- A100 memory hierarchy: leveraging the A100 GPU's memory hierarchy (large HBM plus fast on-chip SRAM) to improve performance for long-context models.
Long-Context optimization (Machine Learning Engineering):
- FlashAttention: an IO-aware attention kernel that avoids materializing the full attention matrix, reducing the computational and memory cost of attention for long sequences (see the sketch below).
- vLLM: a high-throughput inference and serving engine whose PagedAttention manages the KV cache in fixed-size blocks, making inference with very long contexts far more memory-efficient.
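The talk only names these projects. As an illustration, PyTorch 2.x exposes `scaled_dot_product_attention`, which can dispatch to a fused FlashAttention-style kernel on supported GPUs; the shapes below are made up and this is a stand-in, not the presenter's code.

```python
import torch
import torch.nn.functional as F

# Illustrative only: F.scaled_dot_product_attention can use a fused,
# FlashAttention-style kernel on supported GPUs, never materializing the
# full (seq_len x seq_len) attention matrix.
batch, heads, seq_len, head_dim = 1, 8, 2048, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

q = torch.randn(batch, heads, seq_len, head_dim, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# is_causal=True applies the autoregressive mask without building it explicitly.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 2048, 64])
```

vLLM is used through its own serving API rather than a single kernel call, so it is not sketched here.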
Long-Context optimization (Model Architecture):
- MoE (Mixture of Experts): a modular architecture that routes each token to a small number of specialized expert sub-networks, so only a fraction of the parameters is active per token (a minimal router sketch follows this list).
- Speculative decoding: a small draft model proposes several future tokens and the large target model verifies them in a single parallel pass, reducing end-to-end decoding latency.
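As a concrete illustration of the MoE bullet, here is a minimal top-k gated MoE feed-forward layer. The sizes and the two-expert routing are assumptions for the sketch, not details from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k gated Mixture-of-Experts feed-forward layer (illustrative)."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-k experts and
        # combine the expert outputs weighted by renormalized gate scores.
        scores = F.softmax(self.gate(x), dim=-1)        # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)               # which tokens picked expert e
            token_mask = mask.any(dim=-1)
            if token_mask.any():
                w = (weights * mask).sum(dim=-1)[token_mask].unsqueeze(-1)
                out[token_mask] += w * expert(x[token_mask])
        return out

x = torch.randn(16, 64)
print(TinyMoE(d_model=64, d_ff=256)(x).shape)  # torch.Size([16, 64])
```

The design point is that each token runs through only `top_k` experts, so model capacity grows with the number of experts while per-token compute stays roughly constant.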
Background of Prefill & Decode:
- Cost of LLM cluster inference: cost per token is proportional to Hardware Price / (Throughput × Hardware Utilization), so higher throughput and utilization push cost down while more expensive hardware pushes it up (worked example below).
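A tiny worked example of that relation; every number below is made up purely for illustration.

```python
# Illustrative numbers only (not from the talk): cost per generated token falls
# as throughput and utilization rise, and rises with the hourly hardware price.
hourly_price_usd = 30.0          # hypothetical cluster price per hour
throughput_tok_per_s = 5_000.0   # hypothetical aggregate tokens/second at full load
utilization = 0.6                # fraction of the hour spent doing useful work

tokens_per_hour = throughput_tok_per_s * 3600 * utilization
cost_per_million_tokens = hourly_price_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.2f} per 1M tokens")  # ~$2.78 with these numbers
```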
Impact of Prefill duration on throughput:
- Prefill is compute-bound and occupies all of the GPU's compute, so two Prefill tasks cannot be parallelized with each other.
- Decode needs comparatively little compute (it is largely memory-bandwidth bound) and can be parallelized with Prefill tasks.
Separate Prefill & Decode, cut 80% of cost:
- Introduce a dedicated Decode-only server.
- Achieve Prefill/Decode separation by transmitting the intermediate inference data over the network.
- The original architecture can then focus on optimizing Prefill tasks.
- Prefill no longer needs to keep the KV cache resident (the data is sent to the Decode server as soon as it is generated).
- Inference no longer requires servers with large GPU memory (a single-process sketch of the prefill/decode split follows this list).
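To make the split concrete, here is a minimal single-process sketch using Hugging Face transformers and gpt2 (chosen only so it runs anywhere): "prefill" processes the whole prompt once and produces the KV cache, and "decode" then generates token by token from that cache. This is not the disaggregated production architecture, where the cache would be shipped over the network to a Decode-only server.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model chosen only so the sketch is runnable
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Prefill and decode separation means"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one compute-bound pass over the whole prompt builds the KV cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: memory-bound, one token at a time, reusing the cached keys/values.
    generated = [next_id]
    for _ in range(20):
        out = model(generated[-1], use_cache=True, past_key_values=past)
        past = out.past_key_values
        generated.append(out.logits[:, -1].argmax(dim=-1, keepdim=True))

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```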
Retrieval Augmented Generation (RAG):
- A technique that enhances language models by retrieving external knowledge, so that responses are better informed and more relevant.
- RAG (includes: ETL, intent identification, retrieval)
- Model Lifecycle Management (includes: model, dataset, entity)
- Performance acceleration (includes: acceleration frameworks, quantization; a minimal quantization example follows this list)
- Infrastructure Operation (includes: custom chips, managed services)
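Quantization is only named in the talk; as an illustration, the snippet below applies PyTorch's post-training dynamic int8 quantization to a toy model. The model and sizes are assumptions, and this is one of the simplest quantization schemes rather than the one used on stage.

```python
import torch
import torch.nn as nn

# Illustrative only: post-training dynamic int8 quantization of the Linear
# layers in a toy model; weights are stored as int8 and dequantized on the fly.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10]); Linear layers now use int8 weights
```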
RAG Workflow:
- Data Preprocessing (ETL)
- Knowledge extraction
- Knowledge enhancement
- Knowledge vectorization
- Knowledge injection (a vectorization and retrieval sketch follows this list)
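A minimal sketch of the vectorization and injection steps using sentence-transformers and an in-memory index; the model name and sample chunks are illustrative, and a production pipeline would write into a managed vector store instead.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example chunks standing in for the output of the knowledge-extraction step.
chunks = [
    "Prefill computes the KV cache for the whole prompt in one pass.",
    "Decode generates one token at a time and is memory-bandwidth bound.",
    "FlashAttention avoids materializing the full attention matrix.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
index = embedder.encode(chunks, normalize_embeddings=True)  # (num_chunks, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Cosine similarity via dot product of normalized embeddings.
    q = embedder.encode([query], normalize_embeddings=True)
    scores = index @ q.T
    top = np.argsort(scores[:, 0])[::-1][:k]
    return [chunks[i] for i in top]

print(retrieve("why is decoding slow?"))
```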
LLM Orchestration:
- Intent identification
- Query rewriting for knowledge retrieval (multi-turn conversation rewrite)
- Retrieval (an end-to-end orchestration skeleton follows this list)
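A skeleton of how those three steps can be chained. `call_llm` is a hypothetical placeholder for whatever chat-completion client is actually used, the prompts are illustrative, and the `retrieve` argument could be the function from the vectorization sketch above.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: plug in your chat-completion client here.
    raise NotImplementedError

def answer(history: list[str], question: str, retrieve) -> str:
    # 1) Intent identification: decide whether retrieval is needed at all.
    intent = call_llm(
        "Classify the user's intent as 'knowledge_question' or 'chitchat'.\n"
        f"History: {history}\nQuestion: {question}"
    )
    if intent.strip() != "knowledge_question":
        return call_llm(f"Reply conversationally to: {question}")

    # 2) Multi-turn rewrite: fold conversation context into a standalone query.
    rewritten = call_llm(
        "Rewrite the question as a standalone search query.\n"
        f"History: {history}\nQuestion: {question}"
    )

    # 3) Retrieval plus grounded generation.
    context = "\n".join(retrieve(rewritten))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {rewritten}")
```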
Knowledge Enhancement:
- Q&A document synthesis
- Content summarization
- Content splitting (a simple chunking helper is sketched after this list)
- Keyword extraction
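An illustrative helper for the content-splitting step: fixed-size character chunks with overlap so facts that span a boundary are not lost. The chunk size and overlap are arbitrary choices, not values from the talk.

```python
def split_content(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of chunk_size characters, stepping by chunk_size - overlap
    # so consecutive chunks share `overlap` characters.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "word " * 1000
print(len(split_content(doc)), len(split_content(doc)[0]))  # 12 chunks of <=500 chars
```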
Editors
Danny Chan, AWS Community Builder (Hong Kong), specializing in FSI and Serverless
Kenny Chan, AWS Community Builder (Hong Kong), specializing in FSI and Machine Learning