After my previous exploration of local vs cloud GPU performance for LLMs, I wanted to dive deeper into comparing inference speeds across different cloud API providers. With all the buzz around Groq and Cerebras's blazing-fast inference claims, I was curious to see how they stack up in real-world usage.
The Testing Framework
I developed a simple Node.js-based framework to benchmark different LLM providers consistently (a minimal sketch of the core loop follows the list below). The framework:
- Runs a series of standardised prompts across different providers
- Measures inference time and response generation
- Writes results to structured output files
- Supports multiple providers including OpenAI, Anthropic, AWS Bedrock, Groq, and Cerebras
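To make the setup concrete, here is a minimal sketch of what the benchmark loop could look like. This is an illustrative TypeScript/Node.js version rather than the actual framework code: the `Provider` interface and its `complete()` method are hypothetical wrappers around each vendor's SDK, and the timing simply brackets the full request/response round trip.

```typescript
import { writeFile } from "node:fs/promises";
import { performance } from "node:perf_hooks";

// Hypothetical common interface: each provider's SDK call is wrapped behind
// complete(), so the timing and result-writing logic stays identical.
interface Provider {
  name: string;
  model: string;
  complete(prompt: string): Promise<string>;
}

interface BenchmarkResult {
  provider: string;
  model: string;
  prompt: string;
  latencyMs: number;
  outputChars: number;
}

async function runBenchmark(
  providers: Provider[],
  prompts: string[]
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  for (const provider of providers) {
    for (const prompt of prompts) {
      const start = performance.now();
      const output = await provider.complete(prompt);
      const latencyMs = performance.now() - start;
      results.push({
        provider: provider.name,
        model: provider.model,
        prompt,
        latencyMs,
        outputChars: output.length,
      });
    }
  }
  // Persist results as structured JSON so they can be pulled into a spreadsheet.
  await writeFile("results.json", JSON.stringify(results, null, 2));
  return results;
}
```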
The test prompts were designed to cover different scenarios (an illustrative set is sketched after this list):
- Mathematical computations (typically challenging for LLMs)
- Long-form text summarisation (high input tokens, lower output)
- Structured output generation (JSON, XML, CSV formats)
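For illustration, the prompt set might look something like the following. These are stand-in examples of each scenario type, not the exact prompts used in the benchmark.

```typescript
// Stand-in for a long input document used by the summarisation prompt.
const articleText = "<long article text goes here>";

const prompts: string[] = [
  // Mathematical computation: exact arithmetic is typically hard for LLMs.
  "What is 48,193 multiplied by 2,761? Show your working.",
  // Long-form summarisation: high input token count, short output.
  `Summarise the following article in three sentences:\n${articleText}`,
  // Structured output: the response must be machine-parseable.
  "List three European capitals and their populations as a JSON array of {city, population} objects.",
];
```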
Test Results
The complete benchmark results are available in this spreadsheet. While the GitHub repository contains the output from each LLM, we'll focus purely on performance metrics here.
One of the most interesting findings was the significant speed variation for identical models across different providers, which suggests that infrastructure and optimisation play a crucial role in inference speed.
The most dramatic differences emerged with larger models such as Llama 3 70B: providers optimised for fast inference showed that even a 70B-parameter model can be served at impressive speeds with the right infrastructure.
Groq's performance across model sizes reveals an intriguing pattern: inference speeds remain remarkably consistent whether running small or large models, suggesting they may have optimised specifically for bigger models.
Key Findings
- Groq and Cerebras: The hype is real. Both providers demonstrated exceptional performance, particularly with larger models like Llama 3 70B
- Ollama: With a decent GPU (e.g. an RTX 4090), smaller models (Llama 3.2 1B/3B) ran at speeds comparable to the quickest API-based models such as Anthropic's Claude Haiku 3 and Amazon's Nova Micro
- Speed rankings were fairly consistent across different prompts (math, summarisation, structured output)
- API throttling became an issue with larger models on AWS Bedrock (Claude Sonnet 3.5, Claude Opus 3, Nova Pro); a simple backoff sketch follows this list
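For what it's worth, a generic retry-with-exponential-backoff wrapper is one way to soften that throttling. The sketch below is a general pattern rather than code from the benchmark; in practice you would only retry on the provider's throttling error rather than on every failure.

```typescript
// Generic retry helper: re-runs a failed call with exponential backoff plus jitter.
async function withBackoff<T>(call: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, ... plus up to 250ms of jitter to avoid synchronised retries.
      const delayMs = 1000 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```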
Comments
A really good comparison. As a mere consumer I think we can take this at face value, but for implementing it as a company-wide solution we should also consider the cost involved. From a consumer's point of view, though, I think Groq's TSP and optimal cache eviction are really paying off.
Thanks - the intention was merely to get a ballpark idea of speed across models and providers, although I appreciate there are many variables I did not consider that might affect the results: output quality, cost, availability, scalability. Interesting that I hit throttling limits much faster on Bedrock than on Groq (free tier). It would be great to dive deeper into how Groq, Cerebras and similar providers optimise for particular models (e.g., can you make 1B and 3B models much faster?) and scale to serve more users and bigger models (e.g., Llama 405B).