AI Engineer · 17 min

How Transformers Finally Ate Vision – Isaac Robinson, Roboflow

TL;DR

  • Vision Transformers won by scaling pretraining, not by having better built-in priors: Isaac Robinson argues that ViTs beat CNNs despite weaker inductive bias and worse raw compute scaling because ViT-specific pretraining such as MAE, DINOv2, and DINOv3 teaches the missing visual structure back into the model.

  • Every attempt to 'fix' ViTs eventually circled back to plain ViTs — Swin added local windows, ConvNeXt ported transformer lessons back into convolutions, and Hiera stripped inductive biases back out again, but the simple ViT kept winning once large-scale pretraining and LLM-era infrastructure kicked in.

  • MAE is the hinge point because CNNs can't really use it the same way: masking image patches and reconstructing them, BERT-style, is a ViT-native trick that lets transformers learn locality and semantics from scale, while convolutional architectures don't naturally support patch dropout in the same form.

  • DINO-style pretraining turns ViTs into strong visual foundation models before task-specific training even begins — Robinson highlights DINOv3 feature maps where cat paws and satellite regions separate cleanly, and says linear probes on frozen features are getting very close to fully supervised performance.

  • LLM optimizations helped vision transformers more than architectural cleverness did: Hiera showed speed gains over ViT until you add FlashAttention back in, at which point the supposedly 'silly' n-to-the-fourth scaling (global attention cost growing with the fourth power of the image side length n) becomes much less of a practical disadvantage.

  • The remaining problem is deployment, not whether ViTs are the backbone winner — Robinson says models like SAM 3 are powerful but blunt instruments at 800M parameters and roughly 300 ms on a T4 GPU, so Roboflow's answer is architecture-search-based adaptation that gets about 40x speedup at similar accuracy on object detection transfer.

The Breakdown

The old champion versus the weird newcomer

Robinson opens with the clean contrast: CNNs have beautiful inductive bias borrowed from how vision works, while transformers are basically generic set-to-set machines with almost no visual prior baked in. That makes the original Vision Transformer feel kind of absurd — split an image into 16x16 patches, add positional embeddings, and eat the brutal resolution scaling anyway.
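The patch-and-embed step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original ViT code: the embedding width `dim`, the random projection `w_proj`, and the random `pos_emb` are stand-ins for parameters that would be learned, and the 224x224 input size is an assumption chosen so that 16x16 patches tile it evenly.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image: np.ndarray, w_proj: np.ndarray, pos_emb: np.ndarray,
          patch: int = 16) -> np.ndarray:
    """Project each patch linearly, then add one positional embedding per token."""
    return patchify(image, patch) @ w_proj + pos_emb

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                      # assumed input size
dim = 64                                               # illustrative width
w_proj = rng.normal(size=(16 * 16 * 3, dim)) * 0.02    # stand-in for learned weights
pos_emb = rng.normal(size=(196, dim)) * 0.02           # stand-in for learned positions
tokens = embed(image, w_proj, pos_emb)
print(tokens.shape)
```

Nothing in the code tells the model which tokens are neighbours beyond the additive `pos_emb`, which is the "almost no visual prior" point in concrete form.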

Swin tried to make transformers act more like convolutions

The first big response was Swin: stop doing global attention everywhere, and restrict attention to local windows that shift between layers. Robinson points out that this starts looking suspiciously like convolution again: overlapping local operations, locality bias, and much better scaling, with attention cost growing as n squared in image side length (linear in token count) rather than n to the fourth, as long as window size stays fixed.
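The scaling contrast between global and windowed attention can be checked with simple counting. The sketch below uses the number of query-key pairs as a rough proxy for attention cost. It ignores projections and MLP layers, and the patch size of 16 and window of 7x7 tokens are illustrative assumptions rather than the exact settings of any one model.

```python
def num_tokens(side_px: int, patch: int = 16) -> int:
    """Tokens produced by a square image cut into patch x patch tiles."""
    return (side_px // patch) ** 2

def global_attention_pairs(side_px: int, patch: int = 16) -> int:
    """Every token attends to every token: quadratic in token count."""
    n = num_tokens(side_px, patch)
    return n * n

def windowed_attention_pairs(side_px: int, patch: int = 16, window: int = 7) -> int:
    """Each token attends only to the window x window tokens in its window."""
    n = num_tokens(side_px, patch)
    return n * window * window

for side in (224, 448, 896):
    print(side, num_tokens(side),
          global_attention_pairs(side), windowed_attention_pairs(side))
```

Doubling the image side multiplies the global count by 16 and the windowed count by 4, which is the fourth-power versus second-power behaviour in side length.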

ConvNeXt was the 'fine, let's go back to convolutions' moment

Then the field tried the reverse move: take everything transformers taught us and pour it back into a convolutional net. ConvNeXt used patchifying, mixer/feed-forward structure, layer norm, and a more transformer-like design language, and on standard ImageNet evaluation it beat both ViT and Swin. Robinson's reaction is basically: finally, something that makes intuitive sense.

Hiera showed the real game is bias versus pretraining

Meta's Hiera asks a sharper question: which inductive biases actually matter, and which can be learned instead? Robinson loves this as a case study in the tradeoff — remove specialized architectural bias, gain speed, and recover the lost structure through large-scale pretraining, especially with MAE.

MAE and DINO taught ViTs the visual priors they were missing

He explains MAE as BERT for images: drop patches, reconstruct them from context, and let the model absorb visual structure from scale. Then he pushes further with DINOv2 and DINOv3, where frozen ViT features already carve up cat paws and satellite imagery in semantically meaningful ways, to the point that linear probes are nearing the best supervised results.
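The "drop patches, reconstruct them" idea can be sketched as a masking step plus a loss that is scored only on the hidden patches. This is a simplified illustration under stated assumptions, not the MAE reference implementation: the function names are hypothetical, the 0.75 mask ratio is the commonly cited MAE default rather than something stated in the talk summary, and the encoder and decoder are omitted entirely.

```python
import numpy as np

def random_masking(tokens: np.ndarray, mask_ratio: float = 0.75, rng=None):
    """Keep a random subset of patch tokens; return them, their indices, and a mask."""
    if rng is None:
        rng = np.random.default_rng()
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)      # True = hidden patch, to be reconstructed
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error computed over the hidden patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
patches = rng.random((196, 768))       # e.g. 14x14 patches of 16x16x3 pixels
visible, keep_idx, mask = random_masking(patches, rng=rng)
print(visible.shape, int(mask.sum()))
```

Only `visible` would be fed to the encoder, so a transformer simply processes a shorter token sequence. A convolution expects a dense grid, which is one way to read the claim that CNNs cannot use this trick in the same form.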

FlashAttention helped settle the speed argument

The obvious objection is still compute: ViTs scale horribly with resolution. But Robinson says the broader LLM ecosystem cared so much about attention efficiency that tools like FlashAttention erased much of the practical edge from architectures designed to be faster, and he notes Hiera's own paper avoided measuring with FlashAttention enabled.
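One way to see why FlashAttention changes the practical picture: it computes exactly the same attention output without ever materializing the full token-by-token score matrix. The NumPy sketch below shows only the underlying online-softmax arithmetic, processing keys and values block by block. It is an assumption-laden illustration, not the fused GPU kernel, and the function names and block size are hypothetical.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference softmax attention that builds the full (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=32):
    """Same result, but only an (n, block) slice of scores exists at a time."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)   # rescale earlier accumulators
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(196, 16)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The arithmetic count is still quadratic in tokens. What shrinks is the memory footprint and memory traffic, which is the sense in which the scaling becomes "much less of a practical disadvantage" rather than disappearing.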

SAM's backbone history tells the same story

He uses Segment Anything as the practical proof: SAM starts with a ViT plus MAE, MobileSAM swaps in TinyViT, SAM 2 uses Hiera with MAE, and then SAM 3 basically stops the architecture soul-searching and goes back to a massively pretrained transformer backbone. In his telling, the entire family reenacts the same industry-wide arc.

The real bottleneck now is deployment flexibility

Robinson closes on Roboflow's angle: giant foundation models are great, but a one-size-fits-all 800M-parameter SAM 3 taking about 300 ms on a T4 is useless for many edge or real-time settings. Their answer is RF100VL plus architecture-search-based adaptation of a shared foundation model family, yielding about 40x speedup at similar accuracy for transfer and, at publication time, outperforming leading real-time convolutional instance segmentation systems.

Quick Q&A: video is active, JEPA is still open-ended

In questions, Robinson says multimodal and video-capable architectures are absolutely being worked on, and points to SAM 3's object tracking as one concrete vision-video example. On JEPA and V-JEPA, he's measured but skeptical: image JEPA hasn't clearly beaten other image pretraining methods for him yet, and he hasn't seen V-JEPA meaningfully win downstream video transfer so far.
