Maxim Saplin

llama.cpp: CPU vs GPU, shared VRAM and Inference Speed

NVIDIA GPUs on Windows offer a Shared GPU Memory feature that lets the driver use up to 50% of system RAM as virtual VRAM. If your GPU runs out of dedicated video memory, the driver can implicitly spill into system memory without throwing out-of-memory errors—application execution is not interrupted. Yet there's a performance toll.

Memory is the key constraint when dealing with LLMs, and VRAM is way more expensive than ordinary DDR4/DDR5 system memory. E.g., an RTX 4090 with 24GB of GDDR6X on board costs around $1700, while an RTX 6000 with 48GB of GDDR6 goes above $5000. Two sticks of G.Skill DDR5 with a total capacity of 96GB will cost you around $300. My workstation has an RTX 4090 and 96GB of RAM, making 72GB of video memory (24GB dedicated + 48GB shared) available to the video card. Does it make sense to fill your PC with as much RAM as possible and have your LLM workloads use Shared GPU Memory?

Total video memory

I have already tested how GPU memory overflow into RAM influences LLM training speed. This time I tried inference via LM Studio/llama.cpp, using 4-bit quantized Llama 3.1 70B, which takes up 42.5GB.

LM Studio (a wrapper around llama.cpp) offers a setting for selecting the number of layers that can be offloaded to the GPU, with 100% making the GPU the sole processor. Alternatively, you can keep some of the layers in system RAM and have the CPU do part of the computations; the main purpose is to avoid VRAM overflows. Yet, as already mentioned, on Windows (unlike Linux) it is possible to overflow VRAM into shared memory.

LM Studio offload setting

I tried 3 offload settings (see the sketch after the list for how these map onto llama.cpp parameters):

  • 100% GPU - the model overflowed dedicated VRAM, so only ~50% of the weights stayed in VRAM while the other half spilled into Shared GPU Memory (system RAM)
  • 50% GPU and 50% CPU - this setting filled VRAM almost completely, without overflowing into Shared memory
  • 100% CPU
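
For reference, here's a minimal sketch of what the same three configurations look like with the llama-cpp-python bindings (LM Studio's offload slider maps to llama.cpp's n_gpu_layers). The GGUF filename and the 40-layer value are assumptions: Llama 3.1 70B has 80 transformer layers, so 40 roughly matches the 50% slider.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

MODEL = "Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf"  # assumed filename

# 100% GPU: offload all layers (-1 = all). On Windows the driver may silently
# spill past the 24GB of dedicated VRAM into Shared GPU Memory.
llm_gpu = Llama(model_path=MODEL, n_gpu_layers=-1)

# ~50/50 GPU/CPU: offload about half of the 80 layers, filling dedicated
# VRAM without overflowing into shared memory.
llm_split = Llama(model_path=MODEL, n_gpu_layers=40)

# 100% CPU: keep every layer in system RAM.
llm_cpu = Llama(model_path=MODEL, n_gpu_layers=0)
```

In practice you'd load only one of these at a time—each instance maps the full 42.5GB model.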

Here's my hardware setup:

  • Intel Core i5 13600KF (OC to 5.5GHz)
  • 96GB DDR5 RAM at 4800MT/s (CL 30, RCD 30, RCDW 30, RP 30; ~70GB/s read/write/copy in AIDA64)
  • RTX 4090 24GB VRAM (OC, core at 3030MHz, VRAM +1600MHz; ~37000 GPU score in Time Spy)

And here are the results:

| Offload setting | Tokens/s | Time-to-first token (s) | RAM used (GB) |
| --------------- | -------- | ----------------------- | ------------- |
| 100% GPU        | 0.69     | 4.66                    | 60            |
| 50/50 GPU/CPU   | 2.32     | 0.42                    | 42            |
| 100% CPU        | 1.42     | 0.71                    | 42            |

Please note that for the time-to-first-token I used the "warm" metric, i.e. the time measured on the second generation (load the model, generate a completion, then click "regenerate"). For the cold time-to-first-token I got the numbers below (a measurement sketch follows the list):

  • 100% GPU ~6.9s
  • 50/50 CPU/GPU ~2.4s
  • 100% CPU ~30s
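
If you want to reproduce the warm vs. cold distinction outside LM Studio's UI, here is a rough sketch using llama-cpp-python's streaming mode (assumed setup, reusing the hypothetical model file from above): the first streamed chunk marks the first token, and calling the function twice on the same prompt gives the cold and warm numbers.

```python
import time

from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",  # assumed filename
            n_gpu_layers=40)

def time_to_first_token(prompt: str) -> float:
    """Return seconds until the first generated token arrives."""
    start = time.perf_counter()
    # stream=True yields completion chunks as they are produced;
    # the first chunk corresponds to the first token.
    for _ in llm(prompt, max_tokens=16, stream=True):
        return time.perf_counter() - start

prompt = "Explain the KV cache in one sentence."
print("cold TTFT:", time_to_first_token(prompt))  # first run: caches are cold
print("warm TTFT:", time_to_first_token(prompt))  # second run: the "regenerate" case
```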

Besides, when using 100% GPU offload, ~20GB more system RAM was used (regardless of whether "use_mlock" was set).

Apparently, there's not much point in relying on Shared GPU Memory.

P.S. Screenshots...

  • 100% GPU
  • 50/50 CPU/GPU
  • 100% CPU

Top comments (4)

jim3692

So, besides the cold start time, it's more performant to just use the CPU for big models instead of paying for a 4090?

Maxim Saplin • Edited

GPU+CPU is still significantly faster than CPU alone; the more layers fit into VRAM and the fewer layers are processed by the CPU, the better. You just have to watch out for VRAM overflows and not let the GPU use RAM as an extension - that way you get performance that is worse than CPU alone.

jim3692

It's almost double the performance. That should mean that opting for a Ryzen 9 9950X, which has 75% better performance than the i5 (based on PassMark) while costing a third of an RTX 4090, would make the GPU upgrade not worth it.

Maxim Saplin • Edited

I wouldn't be so confident in scaling a PassMark score to LLM inference speed... E.g., here are my results for CPU-only inference of Llama 3.1 8B 8-bit on my i5 with 6 performance cores (with HT):

  • 12 threads - 5.37 tok/s
  • 6 threads - 5.33 tok/s
  • 3 threads - 4.76 tok/s
  • 2 threads - 3.8 tok/s
  • 1 thread - 2.3 tok/s

It doesn't seem the speed scales well with the number of cores (at least with llama.cpp/LM Studio; I changed the n_threads param). A sketch for reproducing this follows.
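
Here is roughly how that thread-scaling experiment looks with llama-cpp-python; the 8B Q8 filename is an assumption, and n_threads is the same parameter LM Studio exposes:

```python
import time

from llama_cpp import Llama

PROMPT = "Write a short story about memory bandwidth."

for n in (1, 2, 3, 6, 12):
    # n_gpu_layers=0 keeps the run CPU-only; the filename is hypothetical.
    llm = Llama(model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
                n_gpu_layers=0, n_threads=n, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=128)
    tokens = out["usage"]["completion_tokens"]
    # Note: this rate includes prompt processing, so it slightly
    # understates pure generation speed.
    print(f"{n:>2} threads: {tokens / (time.perf_counter() - start):.2f} tok/s")
```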
