AI Engineer · 10 min

Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri Chimeh, NVIDIA

TL;DR

  • NVIDIA’s pitch is local-first development, not cloud replacement — Mozhgan Kabiri Chimeh frames DGX Spark as a way to keep AI work close to the developer for cost predictability, data residency, deterministic latency, and faster iteration without waiting on shared infrastructure.

  • The hardware claim is big: 128 GB unified memory and models up to ~200B parameters on a desk-sized system — Spark uses the GB10 Grace Blackwell superchip plus NVIDIA’s production AI stack, so workflows can move from desktop to data center or cloud with minimal changes.

  • She built a reproducible benchmark harness instead of hand-wavy demos — Every vLLM run from 1.5B to 14B used Docker isolation, three warm-up runs, and GPU metrics logged every second, with versioned artifacts capturing full responses and metadata.

  • Quantization was the headline result, not just raw silicon — A 14B NVFP4 model hit 20.19 tokens/sec versus 8.40 tokens/sec for the 14B base model, showing that on Blackwell hardware the precision format can matter as much as the hardware itself.

  • Time to first token is treated as the real UX metric — Kabiri emphasizes that a model feels responsive when the first streamed token arrives quickly, and her 14B NVFP4 setup delivered 3.4× faster time-to-first-token than the unoptimized 14B base model.

  • Her practical sweet spot is ‘intelligence per byte’ — The takeaway is that memory capacity lets you fit large models, but memory bandwidth decides how responsive they feel, which is why NVFP4 is presented as the hero for local prototyping on Spark.

The Breakdown

The core problem: developers get pushed to the cloud by default

Kabiri opens with a very practical complaint: modern AI work either runs out of memory or hits software-stack friction, so teams end up shoving everything into the cloud or data center. That creates its own pain — unpredictable cost, data residency issues, latency concerns, and development delays when your experiments are competing with everyone else’s jobs.

What Spark is supposed to solve

She positions DGX Spark as a desk-side system built specifically to bring serious AI development back to where developers actually work. The specs are the hook: a GB10 Grace Blackwell superchip, 128 GB of unified memory, an NVLink-C2C chip-to-chip interconnect, and enough headroom to work with models up to roughly 200 billion parameters locally while using the same NVIDIA AI software stack as production.

Her setup was built to be reproducible, not flashy

Instead of a glossy benchmark slide, she walks through her actual setup: vLLM serving quantized models of different sizes inside an NVIDIA-optimized container. She even jokes that she wants to show “the how,” then details the automated benchmarking harness with Docker isolation, three mandatory warm-up runs, and GPU logging at one-second intervals so every result can be verified later.
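The talk doesn't publish the harness source, but its shape — discard warm-up runs, time the real runs, persist a verifiable artifact — is easy to sketch. The `generate` callable below is a hypothetical stand-in for a request to the vLLM server, and the real harness would also log GPU metrics (e.g. via `nvidia-smi`) every second in a background thread:

```python
import time
from statistics import mean

WARMUP_RUNS = 3  # the talk's harness discards three warm-up runs


def run_benchmark(generate, prompt, timed_runs=5):
    """Time `generate(prompt) -> token count` and return a result artifact.

    `generate` is a placeholder for a call to the model server; swap in a
    real client to benchmark an actual endpoint.
    """
    for _ in range(WARMUP_RUNS):  # warm caches/compilation; results discarded
        generate(prompt)

    rates = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        rates.append(n_tokens / (time.perf_counter() - start))

    # Versioned artifact: enough metadata to re-verify the number later.
    return {
        "prompt": prompt,
        "timed_runs": timed_runs,
        "tokens_per_sec_runs": rates,
        "tokens_per_sec_mean": mean(rates),
    }
```

Running this inside a pinned Docker image gives the same isolation property she describes: the artifact plus the image tag reproduces the measurement.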

The measurement detail that matters: streaming and first token timing

Kabiri makes a point that end-to-end latency matters, but time to first token is what users actually feel. She highlights the script that handles streaming responses from the vLLM server and timestamps the first chunk, because this is the difference between an app feeling instant versus feeling broken.
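The measurement she describes boils down to one timestamp: when the first streamed chunk arrives. A minimal sketch, assuming the chunks come from an iterator (in a real client this would be the SSE stream of a vLLM server's OpenAI-compatible endpoint with `"stream": true`):

```python
import time


def time_to_first_token(chunks):
    """Return (seconds to first chunk, full text) for a streaming response.

    `chunks` is any iterable of text pieces — here a stand-in for iterating
    a streaming HTTP response from the model server.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # The moment the UI can start rendering — what users actually feel.
            ttft = time.perf_counter() - start
        parts.append(chunk)
    return ttft, "".join(parts)
```

End-to-end latency falls out of the same loop by timestamping the last chunk too; her point is that the first timestamp is the one that decides whether the app feels instant.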

Raw throughput numbers: small models fly, but 14B with NVFP4 is the real story

The 1.5B instruct model leads the chart at 61.73 tokens per second, but she clearly cares most about the 14B NVFP4 result: 20.19 tokens per second. Her framing is that this is the engineering sweet spot — nearly 10× the parameters of the 1.5B model, but still faster than average human reading speed.
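Her "faster than reading speed" claim is easy to sanity-check with back-of-envelope figures. The reading-speed and tokens-per-word numbers below are assumptions (roughly 240 words per minute and about 1.3 tokens per English word), not figures from the talk:

```python
# Assumed: ~240 words/min average reading speed, ~1.3 tokens per word.
WORDS_PER_MIN = 240
TOKENS_PER_WORD = 1.3

reading_tok_per_sec = WORDS_PER_MIN / 60 * TOKENS_PER_WORD  # about 5.2 tok/s

measured = {"1.5B instruct": 61.73, "14B NVFP4": 20.19, "14B base": 8.40}
for label, rate in measured.items():
    print(f"{label}: {rate:.2f} tok/s = {rate / reading_tok_per_sec:.1f}x reading speed")
```

Under those assumptions the 14B NVFP4 model streams close to 4× faster than a reader consumes text, which is why she can call a 20 tok/s result a sweet spot rather than a compromise.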

Quantization changed everything

The contrast that makes her point land is the 14B base model at just 8.40 tokens per second. Her takeaway is blunt: on Blackwell, quantization format is just as important as the hardware, and NVFP4 is what lets Spark bridge research experimentation and production-style prototyping on a single local machine.

Why responsiveness beats capacity in practice

She returns to time to first token and calls out the side-by-side 14B comparison again: the NVFP4 version is 3.4× faster to first token than the unoptimized base model. That leads to her broader lesson that memory capacity and memory bandwidth are not the same thing — 128 GB lets you fit big models, but efficient data movement determines whether those models actually feel usable.
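The capacity-versus-bandwidth distinction has a simple mechanical reading: during decode, every generated token streams essentially all the weight bytes through memory once, so throughput is capped near bandwidth ÷ model size. The figures below are assumptions, not from the talk — roughly 273 GB/s unified-memory bandwidth for Spark, 2 bytes per parameter for the FP16 base model, and about 0.5 bytes per parameter for NVFP4:

```python
# Assumed figures: ~273 GB/s unified-memory bandwidth; 14B parameters;
# 2 bytes/param (FP16) vs ~0.5 bytes/param (NVFP4, 4-bit weights).
BANDWIDTH_GB_S = 273
PARAMS_B = 14  # billions of parameters

ceilings = {}
for fmt, bytes_per_param in {"FP16 base": 2.0, "NVFP4": 0.5}.items():
    model_gb = PARAMS_B * bytes_per_param      # weight bytes streamed per token
    ceilings[fmt] = BANDWIDTH_GB_S / model_gb  # rough tokens/sec upper bound
    print(f"{fmt}: {model_gb:.0f} GB of weights -> ceiling ~{ceilings[fmt]:.1f} tok/s")
```

Under these assumptions the FP16 ceiling lands near 10 tok/s and the NVFP4 ceiling near 39 tok/s — consistent with the measured 8.40 and 20.19 — which is exactly her point: shrinking bytes per weight raises the responsiveness ceiling even though the memory capacity never changed.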

Where she thinks Spark fits in the workflow

Her closing recommendation is specific: use Spark for steady-state workloads, privacy-sensitive work, and rapid local prototyping. The bigger idea is a workflow loop — run locally, iterate fast, then move the exact same stack to cloud or data center when you’re ready to scale.