Latent Space · 1h 29m

The Agent-Native Cloud: 3M Users, 100K Signups/Wk, Data Centers, & Death PRs — Jake Cooper, Railway

TL;DR

  • Railway is betting the cloud should be agent-native, not human-native — Jake Cooper’s core claim is that we’ve moved from “assembly to C to C++ to JavaScript to now words,” so infrastructure has to support thousands of agents making safe, versioned changes in parallel.

  • The company hit escape velocity by fixing the business before chasing growth — Railway went from losing roughly $500,000/month on a free tier with about $50,000 MRR and a $20 million bank account to a lean 35-person team serving 2-3 million users and adding 100,000 signups per week.

  • Owning metal is the economic unlock for agent workloads — Cooper says Railway’s payback period on self-hosted hardware versus cloud is about 3 months, with metal margins around 70%, which is what makes “run a thousand agents in parallel” even remotely affordable.

  • The real moat is safe iteration in production-like environments — Cooper is skeptical of “AI SRE” magic unless the platform can clone services, snapshot storage, fork environments, progressively roll out changes, and keep staging from drifting away from prod.

  • Coding agents have already changed Railway’s internal operating model — After winter break, Cooper told the team that “if you are writing code by hand, you are doing this wrong,” arguing engineers should review and reconcile AI-written code rather than generate code they already know how to write.

  • Railway’s internal tools hint at a bigger product surface beyond deployment — Central Station clusters customer feedback, incidents, and support context in real time, feature flagging is already built in-house, and Cooper repeatedly frames the canvas as shifting from an input UI for humans to an output layer for approving agent actions.

The Breakdown

From frictionless bikes to kernel patches

Cooper frames his whole career as a chase for frictionless user experience: from front-end work to distributed systems at Uber, and now all the way down to patching the Linux kernel for Railway. His line is basically: we’ll “swim to the bottom of the swimming pool” if that’s what it takes to make deployment feel effortless.

What Railway actually is, and why the canvas matters

He defines Railway as “the easiest way to ship anything” — deploy a Postgres instance, a GitHub repo, or code by talking to Claude or using the canvas. But the deeper pitch is versioned infrastructure: clone environments, fork “into a parallel universe,” copy production-like data, validate changes, then merge back without the usual Docker/Kubernetes/Ansible entropy pile.
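The fork, validate, merge-back flow described above can be sketched in a few lines. This is a toy model only, not Railway's implementation or API: the `Environment` class, the `merge_back` helper, the service names, and the version strings are all hypothetical and exist purely to illustrate the idea.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Environment:
    """Hypothetical stand-in for a versioned environment (not a Railway type)."""
    name: str
    services: Dict[str, str] = field(default_factory=dict)  # service -> version

    def fork(self, name: str) -> "Environment":
        # Deep-copy so changes in the fork never touch the parent.
        return Environment(name=name, services=copy.deepcopy(self.services))


def merge_back(base: Environment, fork: Environment,
               validate: Callable[[Environment], bool]) -> bool:
    """Apply the fork's state to the base only if validation passes."""
    if not validate(fork):
        return False
    base.services = copy.deepcopy(fork.services)
    return True


# Illustrative usage with made-up services and versions.
prod = Environment("production", {"api": "v1", "postgres": "16"})
trial = prod.fork("parallel-universe")
trial.services["api"] = "v2"            # change lands in the fork only
untouched = prod.services["api"]         # still "v1" at this point

merged = merge_back(prod, trial, validate=lambda env: "postgres" in env.services)
print(untouched, merged, prod.services["api"])
```

The point of the sketch is the ordering: the change is made and checked in an isolated copy, and the base environment only changes once validation succeeds.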

The six-year grind, free-tier pain, and sudden acceleration

The early days were brutally manual: the support link on the site went straight to Discord, and Cooper kept notifications on a second monitor so he could greet every new user himself. Later came the free-tier boom — and a mess of Reddit bots, crypto miners, and abuse — plus a business that was losing about $500K a month while making roughly $50K in revenue, forcing Railway to shut the free floodgates, compact the product, and rebuild around a viable business.

Why Railway went all the way to bare metal

Railway now runs two data centers in every region except Singapore, where a second is coming in Q3, and Cooper makes the economics sound absurd: cloud-to-metal payback in about 3 months on hardware depreciated over four years. He says hardware has appreciated so fast that Railway’s servers plus cash are worth more than the total money raised, largely because RAM prices climbed and compute supply got tight.
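For readers who want the payback arithmetic spelled out, here is a minimal sketch. The dollar figures are invented so that the result lands on the roughly 3-month payback Cooper cites; the episode summary does not give Railway's actual hardware cost or cloud bill, and the `payback_months` function name is an assumption.

```python
def payback_months(hardware_cost: float,
                   monthly_cloud_bill: float,
                   monthly_metal_opex: float) -> float:
    """Months until monthly savings cover the upfront hardware purchase."""
    monthly_savings = monthly_cloud_bill - monthly_metal_opex
    if monthly_savings <= 0:
        raise ValueError("metal must cost less per month than cloud to pay back")
    return hardware_cost / monthly_savings


# Hypothetical numbers, chosen only to illustrate the calculation.
months = payback_months(
    hardware_cost=300_000,       # assumed upfront spend on servers
    monthly_cloud_bill=130_000,  # assumed cost of the same capacity in the cloud
    monthly_metal_opex=30_000,   # assumed power, space, and bandwidth per month
)
print(months)  # 3.0
```

With a payback this short relative to a multi-year depreciation schedule, most of the hardware's useful life is spent after it has already paid for itself, which is the economic argument the paragraph above is making.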

The compute crunch is real, and Railway already got burned by it

At one point this year Railway became compute-constrained because an upstream provider couldn’t deliver quota fast enough, which triggered reliability problems. Cooper spent a weekend rebuilding the network overlay to straddle five clouds — Oracle, AWS, GCP, Railway’s own metal, and another provider — just to keep up with growth and avoid being bottlenecked by hyperscaler scarcity.

Agents want the same primitives as humans, just 1000x better

Cooper’s answer to “what do agents want?” is basically: version control, observability, storage, compute, networking, and orchestration — same as humans, just at insane scale and speed. He thinks today’s stack will melt under that load, from CI/CD to Git itself, and says the future is safe, progressive, production-like iteration where agents can fork services, test changes, and know when to raise a hand instead of becoming an “interrupt factory.”

CLI over canvas, and the death of today’s deployment loop

One of the sharper shifts in the conversation is that Railway’s famous canvas is becoming less of an input surface and more of an output surface for humans approving agent work. Cooper says CLIs are perfect for agents because what’s annoying for a human — “40 arguments and 600 flags” — is actually a rich set of handles for Claude or Codex, and he predicts the whole push/pull/build loop will disappear in favor of directly versioned changes merged into live infrastructure.

Internal tooling, incident culture, and the Heroku handoff

Cooper shows off Central Station, Railway’s internal system for clustering support issues, incidents, and user feedback so a tiny team can operate with huge leverage; it already helped catch a recent cache invalidation issue that affected about 3,000 users. He ties that same philosophy to transparent incident reporting, progressive rollouts, and even Heroku’s decline under Salesforce: when a platform isn’t your core business, it languishes, and Railway sees itself not as “the new Heroku” so much as the platform for how software — increasingly agent-written software — gets built, tested, and shipped.

“If you are writing code by hand, you are doing this wrong”

Near the end, Cooper goes full evangelist: Railway spends about $300K/month on coding agents, he personally uses roughly $25K of that, and he told the team after break that hand-writing code is now the wrong default. His view is that architecture, specs, and tests matter more than ever, but raw code generation doesn’t — engineers should prompt, review, reconcile, and use the newfound leverage to ship roadmaps in months instead of years.
