NVIDIA Rubin CPX Explained: Disaggregated Inference And The Cost Of Million-Token Context
NVIDIA’s Rubin platform splits long-context prefill from token decode: Rubin CPX handles the compute-heavy front half, while standard Rubin handles bandwidth-heavy generation. The NVL144 CPX rack is the first productized version of that idea. Here is what changes in silicon, memory, power, networking, and scheduling, and why it matters for costs in the million-token era.
What Rubin CPX actually is
Rubin CPX is a companion accelerator for Vera Rubin systems that targets the compute-heavy prefill phase of long-context inference. Prefill is where attention blows up with sequence length. Decode is different: it is more bandwidth and latency sensitive per token. NVIDIA’s answer is to split the work and size each die for the part it does best. That is the core of “disaggregated inference.”
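A back-of-envelope sketch makes the asymmetry concrete. The model numbers below are illustrative assumptions, not Rubin specs: prefill FLOPs carry an attention term that is quadratic in sequence length, while decode cost per token is dominated by the bytes read for weights and KV cache.

```python
# Back-of-envelope: why prefill is compute-bound and decode is
# bandwidth-bound. Model numbers are illustrative, not Rubin specs.

H = 8192             # hidden size (hypothetical model)
LAYERS = 80          # transformer layers
PARAMS = 70e9        # parameter count
BYTES_PER_PARAM = 1  # assume FP8/NVFP4-class storage

def prefill_flops(seq_len: int) -> float:
    """FLOPs to prefill a prompt: a matmul term linear in tokens plus
    an attention term quadratic in sequence length (constants elided)."""
    matmul = 2 * PARAMS * seq_len
    attention = 2 * LAYERS * seq_len**2 * H
    return matmul + attention

def decode_bytes_per_token(seq_len: int) -> float:
    """Bytes read to emit one token at batch size 1: the full weight
    set plus the KV cache for the current context (no GQA shrinkage)."""
    kv_cache = 2 * LAYERS * seq_len * H * BYTES_PER_PARAM
    return PARAMS * BYTES_PER_PARAM + kv_cache

for n in (8_000, 128_000, 1_000_000):
    print(f"ctx={n:>9,}: prefill ~{prefill_flops(n):.2e} FLOPs, "
          f"decode ~{decode_bytes_per_token(n) / 1e9:,.0f} GB/token")
```

At a million tokens the attention term dwarfs the matmul term in prefill, which is exactly the compute CPX is sized for, while decode turns into a memory-traffic problem.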
The first real product: NVL144 CPX
NVIDIA’s own briefings describe an NVL144 CPX rack that combines Rubin GPUs, Rubin CPX accelerators, and Vera CPUs in one cabinet. Claimed figures include on the order of 8 exaflops of NVFP4 compute, roughly 100 TB of fast memory, and ~1.7 PB/s of aggregate memory bandwidth per rack. Compared to GB300 NVL72, NVIDIA positions this as a very large generational jump in rack-level capability for long-context workloads, about 7.5x by its own comparison.
Why split prefill and decode
- Different bottlenecks: Prefill is dominated by attention and matmul compute; decode leans on memory bandwidth and cache locality.
- Right sizing: A compute-optimized CPX die with fast local memory can run prefill hot without paying for the larger HBM footprint standard Rubin needs to feed decode well.
- Scheduler freedom: With two distinct engines, orchestration can pipeline batches so CPX is never idle and Rubin never starved. A minimal pipeline sketch follows this list.
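Here is that pipelining idea as a minimal sketch. The worker names and the queue hand-off are hypothetical stand-ins, not NVIDIA APIs; the point is that queues decouple the stages, so a prefill hand-off never blocks decode and vice versa.

```python
# Minimal two-stage prefill/decode pipeline sketch. "prefill_worker"
# stands in for the CPX pool, "decode_worker" for the Rubin pool.

import queue
import threading

prefill_q: queue.Queue = queue.Queue(maxsize=8)  # prompts awaiting prefill
decode_q: queue.Queue = queue.Queue(maxsize=8)   # KV states awaiting decode

def prefill_worker():
    """CPX side: turn prompts into KV-cache state, then grab the next."""
    while True:
        prompt = prefill_q.get()
        if prompt is None:              # shutdown sentinel
            decode_q.put(None)
            return
        kv_state = f"kv({prompt})"      # stand-in for the real prefill kernel
        decode_q.put(kv_state)          # hand off; CPX moves on immediately

def decode_worker():
    """Rubin side: stream tokens from each KV state as it arrives."""
    while True:
        kv_state = decode_q.get()
        if kv_state is None:
            return
        print(f"decoding from {kv_state}")  # stand-in for the generation loop

threads = [threading.Thread(target=prefill_worker),
           threading.Thread(target=decode_worker)]
for t in threads:
    t.start()
for p in ["promptA", "promptB", "promptC"]:
    prefill_q.put(p)
prefill_q.put(None)
for t in threads:
    t.join()
```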
Memory: GDDR7 + HBM4 is not a downgrade, it is a role match
Standard Rubin is the bandwidth monster, with large HBM stacks and a dual-die configuration aimed at sustained generation. CPX is compute-dense but bandwidth-lean, with fast GDDR7 and attention-friendly kernels. In a million-token world, that split is logical: you pay for HBM where decode keeps it busy, and you switch to cheaper, denser GDDR7 where compute intensity dominates.
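A rough arithmetic-intensity check shows why the role match holds. All hardware numbers below are placeholders, not Rubin or CPX specs; the FLOPs-per-byte comparison is the point.

```python
# Rough arithmetic-intensity check for the memory role match.
# Hardware numbers are placeholders, not Rubin or CPX specs.

peak_flops = 1000e12   # hypothetical per-die compute, FLOP/s
gddr7_bw = 2e12        # hypothetical GDDR7 bandwidth, bytes/s
hbm4_bw = 13e12        # hypothetical HBM4 bandwidth, bytes/s

# Machine balance: FLOPs the die can spend per byte moved before
# memory becomes the bottleneck.
print(f"GDDR7 balance: {peak_flops / gddr7_bw:,.0f} FLOP/byte")
print(f"HBM4 balance:  {peak_flops / hbm4_bw:,.0f} FLOP/byte")

# Phase intensity, very roughly: each weight byte is reused across the
# whole prompt in prefill, but only across the batch in decode.
prompt_tokens, decode_batch = 128_000, 32
print(f"prefill intensity ~{2 * prompt_tokens:,} FLOP/byte")
print(f"decode intensity  ~{2 * decode_batch:,} FLOP/byte")
```

Prefill intensity sits far above even the GDDR7 balance point, so the cheaper memory is not the bottleneck there; small-batch decode sits below the HBM4 balance point, so bandwidth binds and the HBM spend is earned.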
What this means for racks and power
Expect very high nameplate power per rack and liquid cooling as the default. The shape is simple: heavy prefill windows light up CPX and the interconnect; decode windows shift draw to the standard Rubin GPUs and NVSwitch. Average the two and you still live in the high hundreds of kilowatts per cabinet once utilization is healthy. Designing halls and liquid plant for that sustained load is the new normal.
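As a toy duty-cycle model, with every wattage and window fraction assumed for illustration, the sustained design point lands where the text says it does:

```python
# Toy duty-cycle average for rack draw. All wattages and window
# fractions below are assumptions, not measured figures.

prefill_kw, decode_kw, idle_kw = 950.0, 780.0, 250.0
frac_prefill, frac_decode = 0.45, 0.45   # remainder is idle/transition

avg_kw = (frac_prefill * prefill_kw
          + frac_decode * decode_kw
          + (1 - frac_prefill - frac_decode) * idle_kw)
print(f"sustained design point ~{avg_kw:.0f} kW per cabinet")
```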
Healthy hall behavior
- Flat manifold pressures and pump curves under long soak tests.
- Thermal maps that do not oscillate between prefill and decode windows.
- Liquid segmentation that lets operators isolate trays without draining large zones.
Networking: keep tensors local, evacuate without pain
Disaggregation adds inter-engine traffic inside the rack and across racks when batches move between CPX trays and Rubin trays. NVLink/NVSwitch generations increase fabric headroom, but the cardinal rule remains: do not bounce data unless the batch justifies it. For multi-rack or multi-room deployments, the campus network must be sized so evacuation looks like a fast copy plus restart, not a rebuild.
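The "justifies it" test can be stated as a one-function sketch: ship KV state across the fabric only when the transfer time is clearly smaller than the wall-clock it buys. The link speed, state size, and overhead below are assumptions for illustration.

```python
# Sketch of the "do not bounce data unless the batch justifies it"
# rule. All numbers are assumptions, not measured fabric figures.

def should_migrate(kv_bytes: float, link_gbps: float,
                   remote_speedup_s: float, overhead_s: float = 0.002) -> bool:
    """Return True if shipping kv_bytes over a link_gbps fabric path
    saves more wall-clock time (remote_speedup_s) than it costs."""
    transfer_s = kv_bytes * 8 / (link_gbps * 1e9) + overhead_s
    return transfer_s < remote_speedup_s

# 40 GB of KV state over an 800 Gb/s effective path vs. 0.5 s saved:
print(should_migrate(kv_bytes=40e9, link_gbps=800, remote_speedup_s=0.5))
```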
Scheduling is the real product here
Once you split the work, the scheduler decides if the investment pays off. The goals are boring to say and hard to do: keep useful accelerator hours per installed megawatt high, keep state local, and keep fallbacks rare. If a decode-heavy night turns CPX into parked capacity for hours, the model mix and batch planner need a tune-up. If prefill saturates CPX and starves Rubin, that is also a waste. The right fix is not more hardware. It is better placement and batch shaping.
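A toy control loop captures the tune-up: shift the scheduler's admission mix toward whichever pool is under-used. The utilization readings, thresholds, and step size below are placeholders for real telemetry, not any shipping scheduler's logic.

```python
# Toy batch-shaping loop: nudge the admission mix toward the
# under-used engine pool. All values are hypothetical telemetry.

def rebalance(prefill_share: float, cpx_util: float, rubin_util: float,
              step: float = 0.05) -> float:
    """Shift the share of scheduler slots given to prefill work."""
    if cpx_util < rubin_util - 0.10:      # CPX parked: admit more prefill
        prefill_share = min(0.9, prefill_share + step)
    elif rubin_util < cpx_util - 0.10:    # Rubin starved: favor decode
        prefill_share = max(0.1, prefill_share - step)
    return prefill_share

share = 0.5
for cpx, rubin in [(0.55, 0.92), (0.60, 0.93), (0.91, 0.88)]:
    share = rebalance(share, cpx, rubin)
    print(f"cpx={cpx:.2f} rubin={rubin:.2f} -> prefill share {share:.2f}")
```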
Operator-level signals to watch
- Useful accelerator hours per MW: should sit high and stable week over week.
- Evacuation time vs. checkpoint interval: evacuation should finish well under the median checkpoint interval.
- Energy per token at fixed quality: should fall release by release as kernels and compilers mature.
- P99 latency during faults: should remain flat when a rack is drained and refilled. A sketch computing all four signals follows this list.
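Expressed as plain computations, the four signals are short formulas. The inputs below are made-up telemetry samples; the arithmetic is what matters.

```python
# The four operator signals as plain computations over made-up
# telemetry samples.

import statistics

# Useful accelerator hours per installed MW over a week:
useful_hours, installed_mw = 148_000.0, 12.0
print(f"useful hours/MW: {useful_hours / installed_mw:,.0f}")

# Evacuation vs. checkpoint: drain should finish well inside the
# median checkpoint interval.
evac_s, ckpt_intervals_s = 95.0, [300, 310, 290, 305]
print(f"evac ok: {evac_s < 0.5 * statistics.median(ckpt_intervals_s)}")

# Energy per token at fixed quality, release over release:
joules, tokens = 3.1e9, 5.2e9
print(f"J/token: {joules / tokens:.3f}")

# P99 latency flatness during a drain/refill drill:
p99_before_s, p99_during_s = 0.420, 0.435
print(f"p99 drift: {(p99_during_s - p99_before_s) / p99_before_s:+.1%}")
```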
Economics in one line: cost per million tokens at quality parity
Once racks are lit, the cost story collapses to energy, utilization, and rework. CPX helps by turning long-context prefill into a cheaper operation per token, but only if the scheduler keeps both engines busy and minimizes memory movement. The rack is the unit of cost. The scheduler is the lever.
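As a sketch with every input assumed for illustration, the one-line cost story looks like this; note how utilization multiplies straight through the denominator, which is why the scheduler is the lever.

```python
# One-line economics as code. Every input is an assumption for
# illustration, not a measured figure.

rack_kw = 800.0           # sustained draw (see the power section)
pue = 1.2                 # facility overhead factor
usd_per_kwh = 0.06
amort_usd_per_h = 250.0   # rack capex spread over its service life
tokens_per_s = 1e6        # rack throughput at target quality
utilization = 0.7         # the scheduler's lever

usd_per_h = rack_kw * pue * usd_per_kwh + amort_usd_per_h
tokens_per_h = tokens_per_s * 3600 * utilization
print(f"$ per 1M tokens: {usd_per_h / tokens_per_h * 1e6:.3f}")
```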
What could go wrong
- Toolchain maturity: if kernels and compilers lag, CPX will not earn its keep on prefill.
- Packaging and memory supply: HBM4 and substrates for the standard Rubin path remain brittle. If yields or speed bins slip, decode throughput slips with them.
- Thermal transitions: moving to more liquid-heavy distribution mid-ramp without downtime is surgical work.
- Over-eager disaggregation: splitting work across racks without sufficient fabric makes networking a tax.
My bottom line
Rubin CPX is not a marketing suffix. It is a response to the million-token era. Splitting prefill from decode lets NVIDIA right-size silicon and memory for each phase and pack far more useful work into each rack. The hardware is the easy part. The hard part is scheduling and memory placement that keep both engines busy without spraying tensors across the fabric. If operators get that right, cost per token drops and user-visible latency stays predictable. If not, the most expensive component in the rack will be time.