AI Engineer · 13 min

Under 5 minutes to a deployed LLM endpoint — Audry Hsu, RunPod

TL;DR

  • RunPod's pitch is simple: bring your code or a Hugging Face model, and RunPod handles the GPUs, containers, and deployment plumbing so developers can focus on building instead of infrastructure.

  • The company started with basement GPUs: founders Zhen and Pardeep turned failed crypto-mining rigs into the first version of RunPod in 2022 and posted on Reddit offering free GPU access in exchange for feedback. The company has generated revenue ever since.

  • The scale is no longer tiny: Audry says RunPod now serves more than 500,000 developers across 30-plus data centers and has reached $120 million in annual recurring revenue.

  • Serverless is the main product for inference: teams can set max workers, spending caps, and always-on workers, paying only while requests are actually being processed instead of keeping containers running all the time.

  • The live demo shows the tradeoff clearly: deployment from a Hub listing was fast, but the first request sat in queue for about 41 seconds because workers were initializing and downloading the model, while actual execution took only about 1.5 seconds.

  • The Hub is the shortcut: pre-vetted AI repos with preconfigured Dockerfiles and defaults let users fork a listing, tweak environment variables, and deploy popular open-source projects, such as a vLLM-backed LLM endpoint, in a few clicks.
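
The pay-per-request point above can be made concrete with a little arithmetic. The sketch below compares billing only for processing time against keeping one worker running all day. The price per worker-second and the daily request count are hypothetical placeholders, not RunPod pricing; only the 1.5-second execution time comes from the demo. It also ignores any billing for cold-start time, which is a simplification.

```python
# Minimal cost sketch. HYPOTHETICAL_PRICE and REQUESTS_PER_DAY are made-up
# illustration values, not figures from the talk or from RunPod's pricing.

def pay_per_request_cost(requests: int, seconds_per_request: float,
                         price_per_second: float) -> float:
    """Cost when a worker is billed only while it processes requests."""
    return requests * seconds_per_request * price_per_second


def always_on_cost(hours: float, price_per_second: float) -> float:
    """Cost when one worker stays up for the whole period."""
    return hours * 3600 * price_per_second


HYPOTHETICAL_PRICE = 0.0005   # USD per worker-second (assumption)
REQUESTS_PER_DAY = 10_000     # assumption
EXECUTION_SECONDS = 1.5       # execution time observed in the demo

serverless = pay_per_request_cost(REQUESTS_PER_DAY, EXECUTION_SECONDS,
                                  HYPOTHETICAL_PRICE)
always_on = always_on_cost(24, HYPOTHETICAL_PRICE)

print(f"pay-per-request: ${serverless:.2f}/day")
print(f"always-on:       ${always_on:.2f}/day")
```

Under these assumed numbers the worker is busy for only 15,000 of the day's 86,400 seconds, which is where the gap comes from. An always-on worker trades that saving for the absence of a cold start.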

The Breakdown

RunPod claims you can go from zero to a deployed LLM endpoint in under five minutes, and Audry Hsu more or less proves it live by spinning up a vLLM-based serverless endpoint from the Hub. The first response arrives after a cold start of about 41 seconds. Along the way, she positions RunPod as the abstraction layer for GPU chaos: 500,000 developers, 30-plus data centers, and $120 million ARR built from a couple of failed crypto-mining rigs in a basement.
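
To put the demo's timing in proportion, here is a small sketch that splits the first request's latency into queue time and execution time, using the approximate figures reported above (about 41 seconds queued, about 1.5 seconds executing). The function name and the returned keys are illustrative choices, not part of any RunPod API.

```python
# Latency breakdown for the first (cold) request in the demo.
# Inputs are the approximate timings reported in the digest.

def latency_breakdown(queue_s: float, execution_s: float) -> dict:
    """Return total latency and the fraction of it spent waiting in queue."""
    total = queue_s + execution_s
    return {"total_s": total, "queue_share": queue_s / total}


first_request = latency_breakdown(queue_s=41.0, execution_s=1.5)

print(f"total {first_request['total_s']:.1f}s, "
      f"queue share {first_request['queue_share']:.1%}")
# prints "total 42.5s, queue share 96.5%"
```

Nearly all of the wait on that first request is worker initialization and model download, not inference.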
