After my previous exploration of local vs cloud GPU performance for LLMs, I wanted to dive deeper into comparing inference speeds across different cloud API providers. With all the buzz around Groq and Cerebras's blazing-fast inference claims, I was curious to see how they stack up in real-world usage.
The Testing Framework
I developed a simple Node.js-based framework to benchmark different LLM providers consistently (a minimal sketch of the core loop follows the list below). The framework:
- Runs a series of standardised prompts across different providers
- Measures inference time and response generation
- Writes results to structured output files
- Supports multiple providers including OpenAI, Anthropic, AWS Bedrock, Groq, and Cerebras
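To make the setup concrete, here is a minimal sketch of what the benchmark loop could look like. This is an illustrative TypeScript/Node.js version rather than the actual framework code: the `Provider` interface and its `complete()` method are hypothetical wrappers around each vendor's SDK, and the timing simply brackets the full request/response round trip.

```typescript
import { writeFile } from "node:fs/promises";
import { performance } from "node:perf_hooks";

// Hypothetical common interface: each provider's SDK call is wrapped behind
// complete(), so the timing and result-writing logic stays identical.
interface Provider {
  name: string;
  model: string;
  complete(prompt: string): Promise<string>;
}

interface BenchmarkResult {
  provider: string;
  model: string;
  prompt: string;
  latencyMs: number;
  outputChars: number;
}

async function runBenchmark(
  providers: Provider[],
  prompts: string[]
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  for (const provider of providers) {
    for (const prompt of prompts) {
      const start = performance.now();
      const output = await provider.complete(prompt);
      const latencyMs = performance.now() - start;
      results.push({
        provider: provider.name,
        model: provider.model,
        prompt,
        latencyMs,
        outputChars: output.length,
      });
    }
  }
  // Persist results as structured JSON so they can be pulled into a spreadsheet.
  await writeFile("results.json", JSON.stringify(results, null, 2));
  return results;
}
```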
The test prompts were designed to cover different scenarios (an illustrative set is sketched after this list):
- Mathematical computations (typically challenging for LLMs)
- Long-form text summarisation (high input tokens, lower output)
- Structured output generation (JSON, XML, CSV formats)
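For illustration, the prompt set might look something like the following. These are stand-in examples of each scenario type, not the exact prompts used in the benchmark.

```typescript
// Stand-in for a long input document used by the summarisation prompt.
const articleText = "<long article text goes here>";

const prompts: string[] = [
  // Mathematical computation: exact arithmetic is typically hard for LLMs.
  "What is 48,193 multiplied by 2,761? Show your working.",
  // Long-form summarisation: high input token count, short output.
  `Summarise the following article in three sentences:\n${articleText}`,
  // Structured output: the response must be machine-parseable.
  "List three European capitals and their populations as a JSON array of {city, population} objects.",
];
```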
Test Results
The complete benchmark results are available in this spreadsheet. While the GitHub repository contains the output from each LLM, we'll focus purely on performance metrics here.
One of the most interesting findings was the significant speed variation for identical models across different providers, which suggests that infrastructure and optimisation play a crucial role in inference speed.
The most dramatic differences emerged with larger models such as Llama 3 70B: providers optimised for fast inference showed that even a 70B-parameter model can be served at impressive speeds with the right infrastructure.
Groq's performance across model sizes reveals an intriguing pattern: inference speeds remain remarkably consistent whether running small or large models, suggesting they may have optimised specifically for bigger models.
Key Findings
- Groq and Cerebras: The hype is real. Both providers demonstrated exceptional performance, particularly with larger models like Llama 3 70B
- Ollama: With a decent GPU (e.g. an RTX 4090), smaller models (Llama 3.2 1B/3B) ran at speeds comparable to the quickest API-based models such as Anthropic's Claude Haiku 3 and Amazon's Nova Micro
- Speed rankings were fairly consistent across different prompts (math, summarisation, structured output)
- API throttling became an issue with larger models on AWS Bedrock (Claude Sonnet 3.5, Claude Opus 3, Nova Pro); a simple backoff sketch follows this list
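For what it's worth, a generic retry-with-exponential-backoff wrapper is one way to soften that throttling. The sketch below is a general pattern rather than code from the benchmark; in practice you would only retry on the provider's throttling error rather than on every failure.

```typescript
// Generic retry helper: re-runs a failed call with exponential backoff plus jitter.
async function withBackoff<T>(call: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, ... plus up to 250ms of jitter to avoid synchronised retries.
      const delayMs = 1000 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```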
Comments
A really good comparison. As a mere consumer I think we can take this at face value, but for implementing it as a company-wide solution we should also consider the cost involved. From a consumer's point of view, though, I think Groq's TSP and optimal cache eviction are really paying off.
Thanks - the intention was merely to get a ballpark idea of speed across models and providers, although I appreciate there are many variables I did not consider that might affect the results: output quality, cost, availability, scalability. Interesting that I hit throttling limits much faster on Bedrock than on Groq (free tier). It would be great to dive deeper into how Groq, Cerebras and similar providers optimise for particular models (e.g., can you make 1B and 3B models much faster?) and scale to serve more users and bigger models (e.g., Llama 405B).