Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI
TL;DR
5-million-token training is possible with standard transformers: Max Ryabinin says Together AI pushed Llama-style training to multi-million-token contexts by combining existing systems tricks with a new context parallelism optimization.
Memory, not just compute, is the hidden wall: He highlights two bottlenecks in long-context training, quadratic attention compute and linearly growing memory, arguing the second one is often the more practical blocker.
No single trick solves it: Fully Sharded Data Parallel (FSDP), DeepSpeed Ulysses context parallelism, activation checkpointing, CPU offloading, and sequence tiling each cut memory, but only the full stack made 3 million tokens fit on an 8x H100 setup.
Untied Ulysses saves memory by reusing smaller attention buffers: Together AI found that a single chunk of heads already saturates a GPU, so instead of allocating large buffers for many heads at once, it processes head chunks one after another and reuses the same memory, with little throughput loss at smaller scales.
The trade-off is clear and tunable: Larger attention chunks run faster but use more memory, while smaller chunks conserve memory and extend context length, giving teams a knob to balance throughput against sequence length.
Profiling matters because bottlenecks show up in weird places: Ryabinin closes by urging people to inspect training with tools like the PyTorch profiler, since long-context scaling depends on finding the unexpected memory hogs, not just knowing the theory.
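To make the chunk-size knob from the points above concrete, here is a minimal back-of-the-envelope sketch. It is not Together AI's implementation. The head count, head dimension, bf16 precision, and the choice to count four equally sized buffers (Q, K, V, and the attention output) are all assumptions for illustration; with grouped-query attention the K and V buffers would be smaller.

```python
def attention_buffer_gib(seq_len, heads_in_flight, head_dim=128,
                         bytes_per_elem=2, num_buffers=4):
    """Approximate GiB held at once by the attention buffers on one GPU.

    Assumes num_buffers tensors (Q, K, V, output) of shape
    [seq_len, heads_in_flight, head_dim]. All defaults are hypothetical.
    """
    total_bytes = seq_len * heads_in_flight * head_dim * bytes_per_elem * num_buffers
    return total_bytes / 2**30


SEQ_LEN = 5_000_000   # full sequence length seen by attention on each GPU
HEADS_PER_GPU = 4     # hypothetical: 32 heads split across 8 GPUs

# Fewer heads per pass means smaller buffers but more sequential passes.
for chunk in (4, 2, 1):
    passes = HEADS_PER_GPU // chunk
    size = attention_buffer_gib(SEQ_LEN, chunk)
    print(f"{chunk} head(s) per pass, {passes} pass(es): {size:.1f} GiB of buffers")
```

Under these assumptions, buffer memory shrinks in direct proportion to the number of heads processed per pass, while the number of passes grows by the same factor. That is the throughput-versus-context-length trade-off described above.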
The Breakdown
Together AI says it can train standard transformer models at up to 5 million tokens of context by stacking a series of memory-saving tricks, then adding its own "Untied Ulysses" tweak to squeeze attention activations even further. The punchline is that the real barrier is often memory, not just quadratic compute. Careful profiling plus the right combination of sharding, checkpointing, offloading, and chunking can make seemingly impossible context lengths fit on hardware like an 8x H100 node.
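The two bottlenecks can be compared with rough arithmetic. The sketch below is only an illustration: the hidden size of 4096 and bf16 storage are assumptions, the FLOP count covers just the two attention matmuls in one layer's forward pass, and causal masking is ignored.

```python
def attention_flops(seq_len, hidden_size=4096):
    """Rough forward FLOPs for QK^T and attention-times-V in one layer.

    Each matmul costs about 2 * seq_len**2 * hidden_size FLOPs summed
    over all heads, so the total grows quadratically with seq_len.
    """
    return 4 * seq_len**2 * hidden_size


def activation_bytes(seq_len, hidden_size=4096, bytes_per_elem=2):
    """Bytes for a single [seq_len, hidden_size] activation tensor (linear in seq_len)."""
    return seq_len * hidden_size * bytes_per_elem


BASE = 128_000  # hypothetical baseline context length

for n in (128_000, 1_000_000, 5_000_000):
    compute_ratio = attention_flops(n) / attention_flops(BASE)
    memory_ratio = activation_bytes(n) / activation_bytes(BASE)
    gb = activation_bytes(n) / 1e9
    print(f"{n:>9,} tokens: compute x{compute_ratio:,.0f}, "
          f"memory x{memory_ratio:,.1f}, one activation tensor = {gb:.1f} GB")
```

With these assumed sizes, a single bf16 hidden-state tensor at 5 million tokens is about 41 GB, more than half of an 80 GB H100 before weights, gradients, optimizer state, or any other activation is counted. Compute grows faster in relative terms, but memory is the limit that stops a run outright.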