For Local LLMs, Buy VRAM, Not Speed

Software Engineering · 2026

A friend in Sweden runs a better local-LLM setup than I do, on hardware three years older than mine. He has two RTX 3090s. I have a single RTX 5080: newer, faster, roughly what his pair cost. For gaming there’s no contest: mine wins comfortably. For running models at home his old pair walks all over my new card, and the reason comes down to one number that turns out to govern this whole hobby: how much VRAM you have.

VRAM Is the Whole Game

For local inference, the spec that matters isn’t compute. It’s memory. A model has to fit, weights plus its context, inside the GPU’s VRAM. If it fits, things are fast. If it doesn’t, things fall off a cliff I’ll get to in a minute.
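The "context" half of that budget is the KV cache, which grows linearly with context length. Here is a rough sketch of the arithmetic, assuming a Llama-3-8B-style layout (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache. The function name is mine, and real runtimes add their own overhead on top:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens

GIB = 1024 ** 3

# Assumed shape: Llama-3-8B-style (32 layers, 8 KV heads, head_dim 128), 16-bit cache.
for ctx in (4096, 8192, 32768):
    size = kv_cache_bytes(32, 8, 128, ctx)
    print(f"{ctx:>6} tokens -> {size / GIB:.2f} GiB of KV cache")
```

Under those assumptions, a long context can cost several gigabytes before a single weight is counted.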

My 5080 has 16GB. Each of his 3090s has 24, for 48 combined, since inference runtimes can split a model’s layers across both cards. That gap is the entire story. It’s why a card I’d never trade for his in a game is the worse buy for this particular job, and why the second-hand 3090 has stubbornly remained the value pick for local LLMs years after it stopped being a flagship: it’s the cheapest way to get 24GB onto a card.

What 16GB Actually Runs

First, quantization, because it’s how any of this fits at all. Model weights are usually trained at 16-bit precision, and you can compress them down to 8, 5, or 4 bits with surprisingly little quality loss. Q8 and Q6 are close enough to the original that I can’t tell the difference in normal use. Q4 is the sensible floor. Go below it and models start getting noticeably dumber.

With that, here’s what my 16GB gets me:

7B to 14B at Q6 to Q8. Comfortable and quick. An 8B model runs around 80 to 100 tokens a second, a 14B around 50 to 65. This is the sweet spot, and it covers a lot of ground: Qwen 14B, Phi-4, the Llama 8Bs, Qwen Coder 14B for code.
20B to 27B, only at a tight Q4, and you start trading away context length to make room. gpt-oss-20b is more or less built for this VRAM tier and fits almost exactly. A 27B will load at Q4, but you feel the squeeze.
70B and up: don’t. They’ll appear to load and then crawl, which is the cliff.
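The tiers above follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. Here is a back-of-the-envelope sketch. The function name is mine; real quantized files carry scaling factors alongside the weights, so they run somewhat larger than these nominal figures, and context memory comes on top:

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB (10^9 bytes), ignoring format overhead."""
    return params_billions * bits_per_weight / 8

# Nominal bit widths only; actual quantization formats use a bit more per weight.
for params in (8, 14, 27, 70):
    row = ", ".join(
        f"{bits}-bit: {weight_gb(params, bits):5.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{params:>2}B  {row}")
```

At a nominal 4 bits a 27B model is about 13.5GB before any context, which is why it only just squeezes into 16GB, and a 70B model is about 35GB, which doesn't come close.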

The Cliff Nobody Warns You About

When a model doesn’t fit in VRAM, it doesn’t fail with a clear error. It spills the overflow into system RAM and runs those layers on the CPU, and your comfortable 50 tokens a second collapses to one or two. The model “works.” It’s just unusable.
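One way to see why the drop is so steep: for a dense model, generating each token means reading essentially all of the weights once, so speed is capped by memory bandwidth. Here is a rough sketch of that ceiling, assuming about 960 GB/s for the 5080's VRAM and about 60 GB/s for dual-channel system RAM. Both are ballpark assumptions, not measurements, and the model ignores compute and PCIe transfer costs entirely:

```python
def tokens_per_second_ceiling(gpu_gb, cpu_gb, gpu_bw_gb_s=960.0, cpu_bw_gb_s=60.0):
    """Upper bound on decode speed when every weight is read once per token.

    gpu_gb: weights resident in VRAM; cpu_gb: weights spilled to system RAM.
    Bandwidths are in GB/s and are assumed ballpark values.
    """
    seconds_per_token = gpu_gb / gpu_bw_gb_s + cpu_gb / cpu_bw_gb_s
    return 1.0 / seconds_per_token

# A model of about 8GB held entirely in VRAM.
print(f"fits in VRAM: {tokens_per_second_ceiling(8, 0):.0f} tok/s ceiling")
# A model of about 35GB with 14GB in VRAM and 21GB spilled to system RAM.
print(f"spilled:      {tokens_per_second_ceiling(14, 21):.1f} tok/s ceiling")
```

The spilled portion dominates: once a meaningful share of the weights sits in system RAM, the slow memory sets the pace no matter how fast the card is.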

The frustrating part is that the tools mostly won’t tell you. LM Studio will happily load something too large and leave you wondering why each word takes a second to arrive. The honest signal is nvidia-smi: if VRAM is pinned at the ceiling and generation is glacial, you’ve gone over the edge. Learning to check that instead of trusting the app to warn me was the most useful habit I picked up running models locally, and it’s the whole reason I now think of VRAM as a hard wall rather than a soft preference.
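That check is easy to script. Here is a minimal sketch that shells out to nvidia-smi and flags any card close to full. The helper names and the 95% threshold are my own arbitrary choices, and a full card only means trouble when generation is also slow:

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, as bare numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_memory(csv_text):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib) per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def vram_report(threshold=0.95):
    """Print per-GPU usage and flag cards at or above the (arbitrary) threshold."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for index, (used, total) in enumerate(parse_memory(out)):
        flag = "  <- near the ceiling" if used / total >= threshold else ""
        print(f"GPU {index}: {used} / {total} MiB{flag}")
```

Call vram_report() from a second terminal while the model is generating, so the reading reflects the loaded weights and the context in use.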

Why the Old Cards Win

This is the whole reason my friend’s 3090s beat my 5080. His cards are slower silicon, an older generation, less efficient, worse at nearly everything on paper. But 24GB each means he runs a 27B model at high quality with room for a long context, sitting comfortably on the safe side of the cliff, while my newer and faster 16GB card chokes on the same model. For this one job you buy capacity, not the newest die.

The one place my card genuinely pulls ahead is Blackwell’s native FP4 support: dedicated silicon for 4-bit math, which can make some things fit that nominally shouldn’t. For image models the effect is dramatic: FP4 takes FLUX from around 23GB down to about 9. For text models it helps at the margins. What it doesn’t do is rewrite the basic arithmetic, which is still that 16GB is 16GB.

Is Local Even Worth It

Here’s the part the spec arguments skip. Even at its best, a 14B model I run at home is not the model I reach for when I want a good answer. The gap between a solid local model and the frontier one I use through an API every day is real, and no amount of VRAM or quantization on a single consumer card closes it.

So I’ve gotten clear about when local is the right tool, and it’s never because it’s smarter. It’s when the reason has nothing to do with quality. Privacy and offline capability, mostly. A meeting-notes tool I’ve been building captures system audio and summarizes it entirely on-device, with nothing leaving the machine, and for that an 8B model plus a 3GB Whisper model are plenty. “Good enough and completely private” wins that job outright over “frontier and sitting in someone’s cloud.” Cost at volume is another reason, and sometimes I just want to tinker close to the metal and watch what the hardware does. When I actually want the best answer, though, I open the API like everyone else. Local isn’t a replacement for frontier models. It’s a different tool for when the binding constraint is privacy, cost, or being offline, rather than raw quality.

The Number That Matters

For most of what a GPU does, the spec sheet roughly tells the truth. Clocks, cores, and the gaming benchmarks all track how the card feels in practice. For this one task they mostly lie. The headline numbers that make the 5080 a great gaming card barely predict how good a local-LLM machine it is, and the boring capacity figure nobody puts on the box predicts almost all of it.

So if you’re buying a card to run models at home, the question isn’t how fast it is. It’s how much fits. On that question I’d take my friend’s three-year-old pair over my new flagship without a second thought. Which is worth asking the next time the newest thing tempts you: what are you actually optimizing for?