Wes Roth · 40m

Hermes Agent is INSANE...

TL;DR

  • Wes built a full AI benchmark with almost no hand-coding: the Gravity Well game, the website, and the ship-control scripts were all generated by LLMs, and Wes says he manually entered only about three lines, for sensitive API keys.

  • The benchmark measures iterative coding skill, not just one-shot cleverness: models get 20 tries to improve their ship-piloting code from English instructions, and the best script is then tested across 100 seeds. That loop let Claude Opus 4.5 climb from weak early runs to a high score of 276 while smaller models plateaued far lower.

  • Hermes Agent’s real superpower is orchestration — Wes used it to spin up sub-agents like Claude Code and Codex, hand them tasks, collect diagnostics, and keep them grinding overnight from 2:17 a.m. to 5:32 a.m. across models like GPT-5.4, GPT-5.5 Pro, Grok 4, DeepSeek V4 Pro, and Gemini 3.1 Pro Preview.

  • Hermes helped Wes escape 'tight spots' even though Codex did most of the heavy lifting — he says you don’t strictly need Hermes, but it became especially useful for managing parallel agent workflows, troubleshooting, and setting up new PvP and duel modes for the benchmark.

  • The tutorial is basically a case for an AI-native workflow over step-by-step human instruction: instead of memorizing install docs, Wes repeatedly asks Claude Opus 4.7 what to do next on a fresh Ubuntu VPS and frames that as the new way to learn technical systems 'at light speed.'

  • Wes is bullish on GPT-5.5 for long-horizon coding despite imperfect benchmark results — he suspects the lower score came from routing through OpenRouter instead of a direct OpenAI setup, and in Hermes-powered duel tests GPT-5.5 High beat Claude Code Opus 4.7 seven rounds to three after about an hour.

The Breakdown

The Gravity Well benchmark that AI built end-to-end

Wes opens by showing off Gravity Well: four suns, three blue ships, real gravity, momentum, collisions, fuel limits, and a moving scoring circle the bots have to anticipate rather than just chase. The hook is that the entire thing — site, gameplay, and pilot scripts — was built by large language models, with Wes acting more like a director than a coder.
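
To give a feel for what those pilot scripts are fighting, here is a minimal sketch of the physics loop in Python. It is an illustration under assumptions, not Wes's generated code: the constants, `Body`, and `step_ship` are all made up, and the real game layers on collisions, fuel accounting, and the moving scoring circle.

```python
import math
from dataclasses import dataclass

# Minimal sketch of a Gravity Well-style update (illustrative only;
# constants and structure are assumptions, not the actual game code).

@dataclass
class Body:
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    mass: float = 1.0

G = 50.0          # made-up gravitational constant
DT = 1.0 / 60.0   # fixed timestep, one frame

def step_ship(ship: Body, suns: list[Body], thrust: tuple[float, float]) -> None:
    """Advance one frame: sum gravity from every sun, add the pilot's thrust."""
    ax, ay = thrust
    for sun in suns:
        dx, dy = sun.x - ship.x, sun.y - ship.y
        r2 = dx * dx + dy * dy + 1e-6        # soften to avoid divide-by-zero
        a = G * sun.mass / r2                # acceleration magnitude
        r = math.sqrt(r2)
        ax += a * dx / r
        ay += a * dy / r
    ship.vx += ax * DT                       # momentum carries between frames
    ship.vy += ay * DT
    ship.x += ship.vx * DT
    ship.y += ship.vy * DT
```

Momentum is the whole trick: a pilot script has to pick `thrust` so the ship drifts into where the scoring circle will be, not where it currently is.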

What the benchmark is really testing

The benchmark feeds models an English description of the game, asks them to write control code, then gives them 20 iterations to improve, with the single best script scored across 100 seeds. Wes loves this because it tests whether a model can translate instructions into precise code and then actually learn from feedback; Claude Opus 4.5 eventually hit 276, while Claude Sonnet variants learned a bit but plateaued much earlier.
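
The harness internals aren't shown on screen, so the sketch below is only a guess at the shape of the loop: `ask_model` and `run_episode` are hypothetical stand-ins, though the 20-iteration, 100-seed structure is straight from the video.

```python
import random

ITERATIONS = 20   # tries the model gets to improve its pilot script
SEEDS = 100       # seeds the best script is finally scored on

def ask_model(model: str, rules: str, feedback: str) -> str:
    """Placeholder for an LLM call that returns pilot code as a string."""
    return f"# pilot script from {model} after feedback: {feedback!r}"

def run_episode(script: str, seed: int) -> float:
    """Placeholder for running one game with the given script and seed."""
    return random.Random(hash(script) ^ seed).uniform(0, 300)

def evaluate(model: str, rules: str) -> float:
    best_script, best_score = "", float("-inf")
    feedback = "First attempt: write pilot code from the rules."
    for i in range(ITERATIONS):
        script = ask_model(model, rules, feedback)
        score = run_episode(script, seed=i)
        if score > best_score:
            best_script, best_score = script, score
        feedback = f"Iteration {i}: scored {score:.0f}. Improve the script."
    # Score the single best script across many fresh seeds so one
    # lucky run can't inflate the final number.
    return sum(run_episode(best_script, seed=s) for s in range(SEEDS)) / SEEDS
```

The 100-seed averaging at the end is what separates "actually learned to fly" from "got a friendly sun layout once."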

Letting agents work the night shift

One of the most persuasive moments is Wes showing logs from 2:17 a.m. to 5:32 a.m. while his agents ran evaluations as he slept. For him, that’s the point: daytime is collaborative AI work, nighttime is automated agents doing the grind across GPT-5.4, GPT-5.5, Grok 4, DeepSeek V4 Pro, Gemini 3.1 Pro Preview, and Anthropic models.
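
Nothing about the night shift is exotic; it is roughly a job queue with timestamps. A toy version, with the model list from the video and a stubbed-out `evaluate`:

```python
import concurrent.futures as cf
import datetime
import random

# Illustrative overnight runner, not Wes's setup: queue one evaluation
# per model, run a few in parallel, timestamp each result as it lands.

MODELS = ["gpt-5.4", "gpt-5.5", "grok-4", "deepseek-v4-pro",
          "gemini-3.1-pro-preview", "claude-opus-4.5"]

def evaluate(model: str) -> float:
    """Stand-in for the full 20-iteration benchmark loop."""
    return random.uniform(0, 300)

def run_overnight() -> None:
    with cf.ThreadPoolExecutor(max_workers=3) as pool:
        futures = {pool.submit(evaluate, m): m for m in MODELS}
        for done in cf.as_completed(futures):
            stamp = datetime.datetime.now().strftime("%I:%M %p")
            print(f"[{stamp}] {futures[done]} finished: {done.result():.0f}")
```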

Why he still bothers making his own benchmarks

Wes says public benchmarks are messy because companies may train on them directly, which makes leaderboard bragging less meaningful. So he’s building his own battery of tests — this one took roughly 40 hours — and wants something he can trust when a new model drops and he needs to know, quickly, if it’s actually smart.

Installing Hermes manually on a VPS

The tutorial portion walks through setting up Hermes Agent on a Hostinger KVM2 VPS with Ubuntu 24.04 LTS, SSH access, and a root password. Wes keeps the vibe loose and funny — praising Hostinger for making cancellation easy and joking they “don’t weaponize my ADHD against me” — while still giving the actual path: plain OS, latest LTS, log in over terminal, then run the installer.
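
If you want that path as a script rather than keystrokes, a minimal paramiko sketch looks like this. The host, password, and especially the installer command are placeholders; take the real install command from the Hermes docs (or, in the spirit of the next section, from Claude).

```python
import paramiko

# Sketch of the manual VPS flow as a script: SSH in as root, update,
# run the installer. HOST, PASSWORD, and INSTALL_CMD are placeholders.

HOST = "203.0.113.10"   # your Hostinger KVM2 VPS IP
USER = "root"
PASSWORD = "..."        # better: key-based auth
INSTALL_CMD = "curl -fsSL https://example.com/install-hermes.sh | bash"  # hypothetical URL

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # fine for a throwaway box
client.connect(HOST, username=USER, password=PASSWORD)
for cmd in ("apt-get update -y", INSTALL_CMD):
    _, stdout, stderr = client.exec_command(cmd)
    print(stdout.read().decode(), stderr.read().decode())
client.close()
```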

Using chatbots as your install guide instead of YouTube priests

A big recurring idea is that the chatbot should be your real setup companion. Wes asks Claude Opus 4.7 which Ubuntu version to use, how to install Hermes on fresh Ubuntu, and what setup choices matter, and he argues this beats passively following someone else’s technical tutorial line by line.
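
The loop he is describing is easy to reproduce in code. Here is a minimal sketch with the Anthropic SDK, where the model id is a placeholder for whatever Opus model your key actually exposes:

```python
import anthropic

# Sketch of the "ask the model what to do next" loop Wes describes.
# MODEL is a placeholder id; substitute a model your key exposes.

MODEL = "claude-opus-4-7"
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

history = [{"role": "user", "content":
            "I have a fresh Ubuntu 24.04 VPS and want to install Hermes Agent. "
            "What is the very first command I should run?"}]

while True:
    reply = client.messages.create(model=MODEL, max_tokens=1024, messages=history)
    answer = reply.content[0].text
    print(answer)
    history.append({"role": "assistant", "content": answer})
    result = input("Paste the command's output (or 'done'): ")
    if result == "done":
        break
    history.append({"role": "user", "content": f"That produced:\n{result}\nWhat next?"})
```

The edge over a static tutorial is the feedback step: you paste your real terminal output back in, so the guide adapts to your box instead of to whatever machine the tutorial author had.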

News Portal, model selection, and the Docker gotcha

Once Hermes is installed, Wes goes through full setup, picks Nous Portal instead of OpenRouter, and explains the appeal: one subscription bundles model access plus extras like web search, image generation, text-to-speech, and browser automation. He also drops a practical warning from experience: on a fresh Ubuntu box, choosing Docker too early can crash the setup, so start local and sandbox later.

Safety, blast radius, and the 'Boston nuclear meltdown' joke

Wes is very explicit that he often runs agents with approvals bypassed because he wants them fully autonomous for 48-hour builds, but he also says that’s only sane if you contain the blast radius. His solution is isolated VPSes, old laptops, mini PCs, Docker, password managers, and remote SSH control — with a very Wes line that if a meltdown happens on the faraway server, well, 'I’ve never been to Boston.'

Hermes as manager: Claude Code vs Codex in live duels

The closing demo is the most concrete proof of Hermes’ value: it acts like a foreman, opening fresh instances of Claude Code and Codex, passing each the game state, collecting diagnostics, and running an iterative duel loop. After about an hour, Hermes had even created a reusable skill for the workflow, and in that run GPT-5.5 High beat Claude Code Opus 4.7 seven rounds to three, with replay footage showing the leap from rookie chaos to smooth 'ace pilot' behavior over iterations.
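
As a concrete picture of the foreman pattern, here is a stripped-down duel loop. The two CLI invocations mirror the agents' non-interactive modes, but the prompts and the `run_match` scoring are assumptions standing in for whatever workflow Hermes actually generated:

```python
import random
import subprocess

# Stripped-down duel loop: two coding-agent CLIs each revise a pilot
# script, the game scores both, and the diagnostics feed the next round.
# Prompts and run_match() are stand-ins for Hermes's generated skill.

AGENTS = {
    "codex":  lambda prompt: subprocess.run(
        ["codex", "exec", prompt], capture_output=True, text=True).stdout,
    "claude": lambda prompt: subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True).stdout,
}

def run_match(script_a: str, script_b: str) -> tuple[float, float]:
    """Placeholder for actually flying both pilot scripts in the game."""
    rng = random.Random(script_a + script_b)
    return rng.uniform(0, 300), rng.uniform(0, 300)

wins = {"codex": 0, "claude": 0}
diag = "Round 1: no telemetry yet. Write a pilot script for Gravity Well."
for rnd in range(10):
    scripts = {name: call(diag) for name, call in AGENTS.items()}
    a, b = run_match(scripts["codex"], scripts["claude"])
    winner = "codex" if a >= b else "claude"
    wins[winner] += 1
    diag = f"Round {rnd + 1}: codex={a:.0f}, claude={b:.0f}. Revise your pilot."
print(wins)
```

Ten rounds is the shape of the run Wes shows; in his version the tally landed 7-3 for GPT-5.5 High.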