The quote for the GPUs came back and finance asked me to explain it twice. Fair question. We were serving an open model for an internal product, the demo numbers looked fine, and then someone put real traffic through it and the cards fell over at a request count that would embarrass a laptop. The instinct in the room was to buy more GPUs. The right move was to understand why one card was sitting at single-digit utilization while users waited.
This is about getting throughput out of hardware you already have, before you sign for more. It is also about knowing the moment to stop and call a hosted API instead, because that moment is real and pretending it isn’t has cost teams more than any GPU bill.
Throughput is a batching problem, not a speed problem
Here is the part nobody tells you when they show you their tokens-per-second screenshot. The single-request latency of an LLM is mostly fixed by the model and the card. You cannot meaningfully make one request faster by trying harder. What you can do is serve far more requests at once on the same card, and that is where all the money is.
A transformer generates one token at a time, and each step is dominated not by arithmetic but by moving the model weights from GPU memory through the compute units. For a single request, the card spends most of its time waiting on memory while the math units sit nearly idle. You are paying for a sports car and driving it in a school zone. The fix is to run many sequences through the same weight load, so each expensive trip to memory pays for a whole batch of users instead of one.
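The memory-bound argument can be put into back-of-envelope arithmetic. This is a rough sketch, not a benchmark: it assumes each decode step costs one full read of the weights, ignores KV-cache reads and the compute ceiling that eventually caps large batches, and uses a made-up 1 TB/s bandwidth figure. The function name and every number are illustrative assumptions.

```python
def decode_tokens_per_second(weight_bytes: float,
                             bandwidth_bytes_per_s: float,
                             batch_size: int = 1) -> float:
    """Rough upper bound on generated tokens/s for memory-bound decoding.

    Assumes one full pass over the weights per decode step, shared by
    every sequence in the batch. Ignores KV-cache traffic and compute limits.
    """
    steps_per_second = bandwidth_bytes_per_s / weight_bytes
    return steps_per_second * batch_size


weights = 7e9 * 2        # assumed: 7B parameters at 16 bits each, about 14 GB
bandwidth = 1.0e12       # assumed: 1 TB/s of GPU memory bandwidth

single = decode_tokens_per_second(weights, bandwidth, batch_size=1)
batched = decode_tokens_per_second(weights, bandwidth, batch_size=32)

print(f"one request:  ~{single:.0f} tokens/s total")
print(f"batch of 32:  ~{batched:.0f} tokens/s total, same weight traffic")
```

Under these assumptions the per-request speed barely moves, but the card's total output scales with the batch, which is the whole argument in one division.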
The naive way to batch is static: collect N requests, run them together, return when the slowest finishes. It helps, and it is also terrible in practice, because requests do not arrive in tidy groups of N and they do not finish together. One user wants three tokens, another wants eight hundred, and the short one waits for the long one to drain. Your batch is mostly padding and idle slots.
Continuous batching is the thing that moves the needle, and it is why vLLM exists. Instead of fixed groups, the scheduler works at the token step. The moment a sequence finishes, its slot is freed and a waiting request drops in, mid-flight, without stalling everyone else. The batch stays as full as the queue allows. On the same hardware, the difference between static and continuous batching under bursty traffic is the difference between the card falling over and the card being bored.
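A toy simulation makes the gap concrete. This is a sketch of the scheduling idea only, not vLLM's scheduler: it assumes every request is already queued, each request needs a known number of decode steps, and a batch has a fixed number of slots. The function names and the request lengths are invented for illustration.

```python
from collections import deque


def static_batch_steps(lengths, slots):
    """Decode steps when requests run in fixed groups of `slots`.

    Each group occupies the card until its slowest member finishes.
    """
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batch_steps(lengths, slots):
    """Decode steps when a freed slot is refilled at the very next step."""
    queue = deque(lengths)
    active = []            # remaining steps for each running sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before this step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # A sequence with one step left finishes now and frees its slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Assumed workload: mostly short answers, with two long generations mixed in.
lengths = [3, 800, 5, 10, 20, 800, 4, 6]

print(static_batch_steps(lengths, slots=4))      # 1600
print(continuous_batch_steps(lengths, slots=4))  # 805
```

With static batching the two long requests land in different groups and each group waits out its own 800-step straggler. With continuous batching the second long request starts as soon as a short one finishes, so the two overlap almost entirely.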
The KV-cache is the resource you are actually managing
To generate token 500, the model needs the attention keys and values for the previous 499. Recomputing those every step would be insane, so they get cached. That cache, the KV-cache, grows with every token in every active sequence, and it lives in the same GPU memory the weights need. This, not raw compute, is the real constraint on how many users you can batch.
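The size of that cache is simple arithmetic once you know the model's shape. The sketch below assumes a Mistral-7B-like configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit cache values); treat those numbers as assumptions and read the real ones from your model's config before trusting the result.

```python
def kv_bytes_per_token(num_layers: int,
                       num_kv_heads: int,
                       head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache one token occupies across all layers.

    The leading 2 is one key tensor plus one value tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Mistral-7B-like shape; check your model's config for real values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_sequence_4k = per_token * 4096

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence_4k / 2**20:.0f} MiB for one 4096-token sequence")
```

Under these assumptions one full-length sequence costs half a gibibyte of cache, which is why the number of users you can batch is a memory question long before it is a compute question.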
The old serving stacks allocated KV-cache in big contiguous blocks sized for the worst case, so most of it sat reserved and empty for requests that ended up short. vLLM’s contribution, the part worth understanding, is PagedAttention: it manages KV-cache in small fixed-size pages the way an operating system manages memory, handing out pages on demand and reclaiming them the instant a sequence ends. Less waste means more concurrent sequences means higher batch occupancy means the throughput you were already paying for. You do not implement any of this. You do configure it, because the defaults assume nothing about your traffic.
from vllm import LLM, SamplingParams

# gpu_memory_utilization is the single most important knob here.
# It is the fraction of card memory vLLM may claim for weights + KV-cache.
# Push it high. Every gigabyte you leave on the table is concurrent
# requests you are refusing to serve. We run hot on a dedicated box
# and only back off when something else needs to share the card.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    gpu_memory_utilization=0.92,  # not the timid 0.4 many copy from shared-GPU
                                  # examples (vLLM's own default is 0.9)
    max_model_len=4096,           # cap context to what you actually use;
                                  # a huge ceiling reserves KV-cache you never spend
    max_num_seqs=256,             # ceiling on concurrent sequences in a batch
)

# Batched offline scoring. vLLM's continuous batcher packs these for you;
# you do not write the batching loop, you just hand it the whole list.
prompts = load_eval_set()  # your own loader returning a list of prompt strings; a few thousand of them
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(prompts, params)
The two knobs that decide your bill are gpu_memory_utilization and max_model_len. Run the first high and the second tight. Most teams do the reverse: a cautious memory fraction and a generous context window, which reserves a mountain of KV-cache per request and starves the batch. You see utilization sit low and conclude you need a bigger card. You need a less timid config.
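To see how hard those two knobs pull, here is a worst-case estimate of how many full-length sequences fit in the KV-cache budget. It is a sketch under loud assumptions: an 80 GB card, about 14.5 GB of 16-bit weights, 128 KiB of cache per token, and no allowance for activations or framework overhead. Because pages are handed out on demand, real concurrency with shorter sequences will be higher; this only bounds the case where every sequence runs to the context limit.

```python
def max_full_length_sequences(card_bytes: float,
                              gpu_memory_utilization: float,
                              weight_bytes: float,
                              kv_bytes_per_token: int,
                              max_model_len: int) -> int:
    """Worst-case count of sequences that can each reach max_model_len.

    Budget = claimed card memory minus weights; everything left is KV-cache.
    Ignores activation memory and runtime overhead.
    """
    kv_budget = card_bytes * gpu_memory_utilization - weight_bytes
    if kv_budget <= 0:
        return 0  # the weights alone do not fit in the claimed fraction
    return int(kv_budget // (kv_bytes_per_token * max_model_len))


CARD = 80e9              # assumed 80 GB card
WEIGHTS = 14.5e9         # assumed 7B-class model at 16-bit
KV_PER_TOKEN = 131072    # assumed 128 KiB per token

timid = max_full_length_sequences(CARD, 0.50, WEIGHTS, KV_PER_TOKEN, 32768)
tight = max_full_length_sequences(CARD, 0.92, WEIGHTS, KV_PER_TOKEN, 4096)

print(f"cautious fraction, generous context: {timid} sequences")
print(f"high fraction, tight context:        {tight} sequences")
```

Same card, same model, and the second config has room for roughly twenty times as many worst-case sequences. None of that required a purchase order.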
Quantization, and what it actually costs you
Quantization stores the weights at lower precision: instead of 16-bit floats, you use 8-bit integers, or 4-bit. The weights get roughly half or a quarter the size, so they fit on a smaller card, and since generation is memory-bound, fewer bytes to move per token usually means faster too. It is the closest thing to a free lunch here. Closest, not free.
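The size claim is just parameters times bits. A minimal sketch, assuming a round 7 billion parameters; real quantized checkpoints also carry scales and other metadata, so expect them to come out somewhat larger than this.

```python
def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 7e9  # assumed round number for a 7B-class model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_gigabytes(PARAMS, bits):.1f} GB")
```

Since each decode step has to move those bytes, the same ratio is a rough ceiling on how much faster generation can get from quantization alone.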
What you lose is quality, and how much depends on how far you push it and what the task is. Eight-bit weight quantization is close to invisible on most workloads; I have shipped it without a measurable hit on eval sets I trusted. Four-bit is where you have to actually look. On forgiving tasks (summarizing, classifying, routing, extraction with a validator behind it), 4-bit holds up and the savings are real. On tasks that need precise multi-step reasoning or careful instruction-following, 4-bit degrades in ways that do not show up until a user hits the exact case it fumbles. The degradation is rarely a uniform dumbing-down; it is a few specific behaviors quietly getting worse.
So the rule is boring: never trust the marketing benchmark, run your own eval set across precisions, and pick the lowest one that still passes. (I keep a few hundred real task examples for exactly this, and it has saved me from a 4-bit decision more than once.)
# Serving a 4-bit AWQ build. The model now fits on a far smaller card,
# which is the entire point when the quote came back too high.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

# Gate this behind your eval suite before it touches users. The day you
# skip that is the day 4-bit eats an answer you needed to be right.
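The "lowest precision that still passes" rule fits in a few lines. This is a sketch of the gate, not of any particular eval framework: `evaluate` stands in for whatever runs your own eval set against a build and returns a score, and the scores in the usage example are placeholders invented to exercise the function, not measurements of any model.

```python
from typing import Callable, Optional, Sequence


def lowest_passing_precision(candidates: Sequence[str],
                             evaluate: Callable[[str], float],
                             baseline_score: float,
                             max_drop: float) -> Optional[str]:
    """Return the first candidate whose score is within max_drop of baseline.

    `candidates` must be ordered from lowest precision to highest, so the
    first one that passes is the cheapest one that passes.
    """
    for name in candidates:
        if baseline_score - evaluate(name) <= max_drop:
            return name
    return None  # nothing passed; stay on the baseline build


# Placeholder scores for illustration only. In practice `evaluate` would
# load the build and run your few hundred real task examples.
fake_scores = {"int4": 0.81, "int8": 0.90, "fp16": 0.91}

choice = lowest_passing_precision(
    candidates=["int4", "int8", "fp16"],
    evaluate=fake_scores.__getitem__,
    baseline_score=fake_scores["fp16"],
    max_drop=0.02,
)
print(choice)
```

The threshold is the part that deserves an argument. Pick `max_drop` per task, and consider gating on the specific cases you cannot afford to lose rather than on the average alone, since 4-bit tends to fail narrowly rather than uniformly.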
Right-size the model before you optimize the serving
The cheapest token is the one a smaller model serves correctly. Before any of the above, ask whether the task needs a large model at all. A lot of production LLM work is classification, extraction, routing, and short structured generation, and a well-prompted 7B (or a small fine-tuned one) handles those at a fraction of the cost of a 70B you reached for out of habit. I have watched teams serve a giant model for a job a small one did better and cheaper. Match the model to the task, then make the serving efficient. The other way around is optimizing the wrong thing with great discipline.
When to just use the API
Here is the tradeoff I will not pretend away. Self-hosting wins when you have steady, high-volume traffic, you can keep the cards busy, and someone will own the operational reality: the driver versions, the OOM at 2am when a long context blows the KV-cache budget, the model updates, the autoscaling you now build yourself. At sustained scale with a real ops owner, the per-token cost falls well below the hosted price and the gap is not close.
It loses, badly, when your traffic is spiky or low. A GPU you rent by the hour and keep half-idle is the most expensive way to serve a token ever invented. If your load is bursty, intermittent, or still finding itself, a hosted API priced per token is cheaper, and you get to spend your engineering on the product instead of babysitting CUDA. The cutover point is roughly where sustained utilization and volume are high enough that the fleet’s fixed cost, spread over the tokens you actually serve, drops below the hosted per-token price. Below it, paying someone else’s margin is the rational move and the senior one.
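The cutover is worth computing rather than arguing about. The sketch below uses invented prices and throughput (a $2/hour GPU, 2,000 tokens per second at full batch, a hosted price of $0.50 per million tokens) and leaves out the ops owner's time entirely, which in real life pushes the break-even higher. Plug in your own quote and your own measured throughput.

```python
def self_hosted_cost_per_million(gpu_hourly_cost: float,
                                 peak_tokens_per_second: float,
                                 utilization: float) -> float:
    """Dollars per million tokens when the card is busy `utilization` of the time."""
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_hourly_cost / tokens_per_hour * 1e6


def break_even_utilization(gpu_hourly_cost: float,
                           peak_tokens_per_second: float,
                           api_price_per_million: float) -> float:
    """Utilization at which self-hosting matches the hosted per-token price."""
    return gpu_hourly_cost * 1e6 / (
        peak_tokens_per_second * 3600 * api_price_per_million
    )


# All three figures are assumptions for illustration.
GPU_HOURLY = 2.00
PEAK_TPS = 2000.0
API_PRICE = 0.50

idle_cost = self_hosted_cost_per_million(GPU_HOURLY, PEAK_TPS, utilization=0.10)
busy_cost = self_hosted_cost_per_million(GPU_HOURLY, PEAK_TPS, utilization=0.90)
cutover = break_even_utilization(GPU_HOURLY, PEAK_TPS, API_PRICE)

print(f"10% busy: ${idle_cost:.2f} per million tokens")
print(f"90% busy: ${busy_cost:.2f} per million tokens")
print(f"break-even utilization: {cutover:.0%}")
```

With these made-up numbers the same card is several times more expensive than the API when mostly idle and cheaper than it when kept busy. The hardware did not change; the traffic did.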
Buy the GPUs when the math and the traffic both say so. Until then, the cheapest capacity you have is the utilization you are leaving on the floor.