Doyoumoney

On an RTX PRO 6000, what is the max prefill throughput I can get for a 30B FP8 model with concurrent requests?

7 days ago

I'm a financial research assistant focused on stocks, securities, and market data. The tools I have are designed for financial analysis (loading price data, fundamentals, SEC filings, computing financial ratios, charting) — not for GPU hardware specifications or LLM inference benchmarks.

That said, I can try to find relevant technical info online for you. Want me to search for:

I'm a financial research assistant by design, but I searched the web for the most relevant technical data. Here's what published benchmarks show for max prefill on a 30B-class FP8 model on a single RTX PRO 6000 Blackwell (96GB):

Peak Prefill Speed (tokens/sec processed during prefill)

Concurrency	Peak Prefill	Best Context	Source
1 user	~38,700 tok/s	2K ctx	casualcomputer (SGLang, FP8)
1 user	~36,900 tok/s	8K ctx	Millstone (Qwen3-Coder-30B-A3B FP8, vLLM)
2 concurrent	~36,300 tok/s	8K ctx	Millstone
3 concurrent	~32,900 tok/s	8K ctx	Millstone
4 concurrent	~34,000 tok/s	8K ctx	Millstone
10 concurrent	still working (32K ctx)	–	r/LocalLLaMA notaDestroyer

Key numbers for your question

Max single-user prefill: ~36–39K tok/s (peaks at 2K–8K context depending on the engine, then drops as context grows and compute/memory-bandwidth saturate)
Sustained 4-concurrent prefill: ~33–34K tok/s per user (≈90% of single-user peak — the 1.8 TB/s GDDR7 bandwidth is the binding constraint)
At 10 concurrent users on Qwen3-30B-A3B FP8: prefill still completes successfully across 1K–256K contexts, though per-request prefill rate drops further as queue time dominates TTFT
TTFT scaling (cold prefill, no prompt cache): 37ms @ 1K → 6.8s @ 32K → 78.8s @ 256K
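
As a rough cross-check, the TTFT figures above can be turned into an effective prefill rate by dividing context length by TTFT. This is only a sketch: it assumes TTFT is dominated by prefill (no queueing or scheduling overhead) and that "1K" means 1,024 tokens, neither of which the cited benchmarks confirm.

```python
# Effective prefill rate implied by a cold time-to-first-token (TTFT) figure.
# Assumptions: TTFT is almost entirely prefill time, and 1K = 1,024 tokens.

def effective_prefill_rate(context_tokens: int, ttft_seconds: float) -> float:
    """Return prompt tokens processed per second of TTFT."""
    if ttft_seconds <= 0:
        raise ValueError("ttft_seconds must be positive")
    return context_tokens / ttft_seconds

# (context length in tokens, TTFT in seconds), taken from the line above
ttft_points = [(1 * 1024, 0.037), (32 * 1024, 6.8), (256 * 1024, 78.8)]

rates = {ctx: effective_prefill_rate(ctx, ttft) for ctx, ttft in ttft_points}

for ctx, rate in rates.items():
    print(f"{ctx // 1024:>4}K ctx: ~{rate / 1000:.1f}K tok/s effective")
```

Under those assumptions the implied rates are roughly 27.7K, 4.8K, and 3.3K tok/s, which is in line with the point that effective prefill speed falls sharply as context grows.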

What's actually limiting you

Memory bandwidth (1,792 GB/s GDDR7), not compute — prefill is memory-bound on Blackwell for MoE/transformer blocks
Chunked prefill caps in vLLM (typical 8K–16K) keep TTFT responsive under load at the cost of peak per-request prefill rate
VRAM headroom for KV cache is what determines how many concurrent users you can run, not raw prefill speed — on a 30B FP8 model you have ~65GB of KV budget, so concurrency is mostly bounded by context length × users, not prefill
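
The "context length × users" bound can be sketched as a back-of-the-envelope calculation. The model-shape values below (layer count, KV heads, head dimension) are illustrative assumptions for a 30B-class model with grouped-query attention, not figures from the benchmarks; the real values live in the model's config file, and real engines also reserve memory for activations and fragmentation.

```python
# Rough KV-cache capacity estimate.
# All model-shape values are ASSUMED for illustration only.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def max_concurrent_users(kv_budget_bytes: float, context_tokens: int,
                         per_token_bytes: int) -> int:
    """How many full-length contexts fit in the KV budget at once."""
    return int(kv_budget_bytes // (context_tokens * per_token_bytes))

KV_BUDGET_BYTES = 65e9  # the ~65GB KV budget quoted above

per_token = kv_bytes_per_token(
    num_layers=48,        # assumed
    num_kv_heads=4,       # assumed (grouped-query attention)
    head_dim=128,         # assumed
    bytes_per_element=1,  # FP8 KV cache: one byte per element
)

users_by_context = {
    ctx_k: max_concurrent_users(KV_BUDGET_BYTES, ctx_k * 1024, per_token)
    for ctx_k in (8, 32, 256)
}

for ctx_k, users in users_by_context.items():
    print(f"{ctx_k:>3}K ctx: up to ~{users} concurrent full-length requests")
```

With these assumed values the budget holds on the order of 160 requests at 8K context but only about 5 at 256K, which illustrates why context length, rather than prefill speed, tends to set the concurrency ceiling.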

Bottom line for "max prefill throughput concurrent"

If you're optimizing for aggregate prefill throughput across concurrent users (tok/s summed):
- Expect ~130–150K tok/s aggregate prefill at 4 users × 8K ctx
- Drops to ~40–50K tok/s aggregate at 10 users × 64K ctx (queue + memory contention)

Caveat: I'm not a GPU/ML perf engineer — these are aggregated from third-party benchmarks (Millstone, CloudRift, casualcomputer, r/LocalLLaMA) and real numbers will vary with your specific 30B model, engine (vLLM vs SGLang vs TRT-LLM), chunked-prefill config, and KV-cache format (FP8 vs NVFP4). If you want, I can pull a specific model/engine combination in more detail.

7 days ago

are you saying - at 4 users - throughput becomes almost 120K tok/sec?

7 days ago

No — let me correct that, I was imprecise. Let me re-read the actual numbers.

What the Millstone benchmark actually shows for 4 concurrent users (Qwen3-Coder-30B-A3B FP8)

Per-user prefill speed peaks at ~34,000 tok/s at 8K context with 4 concurrent requests.

But that's per-user, not aggregate. And the benchmark also says:
- "Prefill speed peaks at a certain context length and then declines as additional context increases computational overhead"
- The 34K tok/s is measured per user at the optimal 8K context point

So for aggregate prefill throughput at 4 users

If each of 4 users is prefill-processing at ~34K tok/s simultaneously at the 8K sweet spot:
- Aggregate prefill ≈ 4 × 34K = ~136K tok/s, but this assumes the GPU can sustain that across all 4 users at once, which the Millstone data implies (since they report it as "per-user" prefill at 4 concurrent, not as a degraded single-stream number)

However, the more conservative read:
- The casualcomputer benchmark shows 38,744 tok/s as the peak prefill throughput for the GPU at batch 1, 2K context, Qwen 32B FP8
- At batch 4 in casualcomputer's data, prefill speedup over DGX Spark is 6.20×, which on its own doesn't tell us absolute tok/s
- The note that "prefill speed peaks... then declines" suggests 34K per user at 4 concurrent is near the ceiling before bandwidth saturation

Honest answer

~120–140K tok/s aggregate prefill at 4 users is plausible if you take the per-user peak × 4 users at the 8K optimal context point. But:
- This is at short context (8K), not long context
- The 34K/user figure already reflects some degradation from the single-user 37K peak
- At 32K+ context, the per-user number drops into the ~5–10K tok/s range, so aggregate at 4 users × 32K is more like 20–40K tok/s

So the ~120–140K figure is the ceiling at short context, not a typical real-world number. For a realistic deployment (mixed contexts, sustained load), aggregate prefill throughput on a 30B FP8 model on one RTX PRO 6000 is more like 40–80K tok/s across 4 concurrent users, with peaks up to ~130K tok/s at short context.
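
For reference, the arithmetic behind these ranges is just per-user rate × number of users. The sketch below reproduces it under the key assumption flagged earlier: that the reported per-user rate really is sustained by all users simultaneously. If the benchmark's "per-user" figure is actually a shared total, the aggregate would be correspondingly lower.

```python
# Aggregate prefill = per-user prefill rate x concurrent users.
# Assumption: every user sustains the quoted per-user rate at the same time.

def aggregate_prefill(per_user_tok_s: float, users: int) -> float:
    """Summed prefill throughput across concurrent users, in tok/s."""
    return per_user_tok_s * users

USERS = 4

# Short context (8K): ~34K tok/s per user at 4 concurrent
short_ctx = aggregate_prefill(34_000, USERS)

# Long context (32K+): ~5-10K tok/s per user
long_low = aggregate_prefill(5_000, USERS)
long_high = aggregate_prefill(10_000, USERS)

# How much of the ~37K single-user peak each user keeps at 4 concurrent
retention = 34_000 / 37_000

print(f"8K ctx,   {USERS} users: ~{short_ctx / 1000:.0f}K tok/s aggregate")
print(f"32K+ ctx, {USERS} users: ~{long_low / 1000:.0f}-{long_high / 1000:.0f}K tok/s aggregate")
print(f"Per-user retention vs single-user peak: {retention:.0%}")
```

This gives ~136K tok/s at short context and 20–40K tok/s at long context, with each user keeping about 92% of the single-user peak.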

Apologies for the imprecision in the first reply.

7 days ago
