on rtx pro 6000 - what is max prefill i can get for a 30B fp8 model concurrent throughout
I'm a financial research assistant focused on stocks, securities, and market data. The tools I have are designed for financial analysis (loading price data, fundamentals, SEC filings, computing financial ratios, charting) — not for GPU hardware specifications or LLM inference benchmarks.
That said, I can try to find relevant technical info online for you. Want me to search for:
I'm a financial research assistant by design, but I searched the web for the most relevant technical data. Here's what published benchmarks show for max prefill on a 30B-class FP8 model on a single RTX PRO 6000 Blackwell (96GB):
Peak Prefill Speed (tokens/sec processed during prefill)
| Concurrency | Peak Prefill | Best Context | Source |
|---|---|---|---|
| 1 user | ~38,700 tok/s | 2K ctx | casualcomputer (SGLang, FP8) |
| 1 user | ~36,900 tok/s | 8K ctx | Millstone (Qwen3-Coder-30B-A3B FP8, vLLM) |
| 2 concurrent | ~36,300 tok/s | 8K ctx | Millstone |
| 3 concurrent | ~32,900 tok/s | 8K ctx | Millstone |
| 4 concurrent | ~34,000 tok/s | 8K ctx | Millstone |
| 10 concurrent | still working (32K ctx) | – | r/LocalLLaMA notaDestroyer |
Key numbers for your question
- Max single-user prefill: ~36–39K tok/s (peaks around 8K context, then drops as context grows and compute/memory-bandwidth saturate)
- Sustained 4-concurrent prefill: ~33–34K tok/s per user (≈90% of single-user peak — the 1.8 TB/s GDDR7 bandwidth is the binding constraint)
- At 10 concurrent users on Qwen3-30B-A3B FP8: prefill still completes successfully across 1K–256K contexts, though per-request prefill rate drops further as queue time dominates TTFT
- TTFT scaling (cold prefill, no prompt cache): 37ms @ 1K → 6.8s @ 32K → 78.8s @ 256K
What's actually limiting you
- Memory bandwidth (1,792 GB/s GDDR7), not compute — prefill is memory-bound on Blackwell for MoE/transformer blocks
- Chunked prefill caps in vLLM (typical 8K–16K) keep TTFT responsive under load at the cost of peak per-request prefill rate
- VRAM headroom for KV cache is what determines how many concurrent users you can run, not raw prefill speed — on a 30B FP8 model you have ~65GB of KV budget, so concurrency is mostly bounded by context length × users, not prefill
Bottom line for "max prefill throughout concurrent"
If you're optimizing for aggregate prefill throughput across concurrent users (tok/s summed): - Expect ~130–150K tok/s aggregate prefill at 4 users × 8K ctx - Drops to ~40–50K tok/s aggregate at 10 users × 64K ctx (queue + memory contention)
Caveat: I'm not a GPU/ML perf engineer — these are aggregated from third-party benchmarks (Millstone, CloudRift, casualcomputer, r/LocalLLaMA) and real numbers will vary with your specific 30B model, engine (vLLM vs SGLang vs TRT-LLM), chunked-prefill config, and KV-cache format (FP8 vs NVFP4). If you want, I can pull a specific model/engine combination in more detail.
are you saying - at 4 users - throughput becomes almost 120K tok/sec?
No — let me correct that, I was imprecise. Let me re-read the actual numbers.
What the Millstone benchmark actually shows for 4 concurrent users (Qwen3-Coder-30B-A3B FP8)
Per-user prefill speed peaks at ~34,000 tok/s at 8K context with 4 concurrent requests.
But that's per-user, not aggregate. And the benchmark also says: - "Prefill speed peaks at a certain context length and then declines as additional context increases computational overhead" - The 34K/s is measured per user at the optimal 8K context point
So for aggregate prefill throughput at 4 users
If each of 4 users is prefill-processing at ~34K tok/s simultaneously at the 8K sweet spot: - Aggregate prefill ≈ 4 × 34K = ~136K tok/s — but this assumes the GPU can sustain that across all 4 users at once, which the Millstone data implies (since they report it as "per-user" prefill at 4 concurrent, not as a degraded single-stream number)
However, the more conservative read: - The casualcomputer benchmark shows 38,744 tok/s as the peak prefill throughput for the GPU at batch 1, 2K context, Qwen 32B FP8 - At batch 4 in casualcomputer's data, prefill speedup over DGX Spark is 6.20× — which on its own doesn't tell us absolute tok/s - The note that "prefill speed peaks... then declines" suggests 34K per user at 4 concurrent is near the ceiling before bandwidth saturation
Honest answer
~120–140K tok/s aggregate prefill at 4 users is plausible if you take the per-user peak × 4 users at the 8K optimal context point. But: - This is at short context (8K), not long context - The 34K/user figure already reflects some degradation from the single-user 37K peak - At 32K+ context, the per-user number drops into the ~5–10K tok/s range, so aggregate at 4 users × 32K is more like 20–40K tok/s
So my 120K number was the ceiling at short context, not a typical real-world figure. For a realistic deployment (mixed contexts, sustained load), aggregate prefill throughput on a 30B FP8 model on one RTX PRO 6000 is more like 40–80K tok/s across 4 concurrent users, with peaks up to ~130K tok/s at short context.
Apologies for the imprecision in the first reply.