
Superlinked built an open-source small-model inference stack because the market had a real gap between model serving and production infra — Filip Makraduli says existing tools helped you run models, but not handle routing, autoscaling, queuing, GPU provisioning, Prometheus/Grafana monitoring, and deployment end to end.
Small models matter because they fight 'context rot' in agent systems before the big model ever sees the data — citing Chroma’s research, he argues that preprocessing, filtering, taxonomy classification, named entity recognition, and tool-calling with small models can shrink context and improve downstream agent performance.
Throwing more GPUs at inference is the wrong mental model for small models. When models like Stella embeddings, rerankers, and GLiNER take only a few GB each, dedicating one GPU per model wastes memory, so Superlinked focused on hot-swapping models on a single GPU with least-recently-used eviction.
Supporting 'hundreds of models' is much harder than it sounds because BERT, Qwen, ColBERT, rerankers, and cross-encoders all behave differently under the hood — Makraduli walks through mismatches in FlashAttention, normalization, fused QKV, grouped-query attention, and positional embeddings like absolute lookup vs rotary.
Their core pitch is the 'yin and yang' of inference: model support plus infrastructure — he says inference is only useful if you both support fast-moving open-source models from Hugging Face and provide the cluster layer to run them, with API primitives like encode, score, and extract backed by KEDA autoscaling and spot-instance orchestration.
This talk doubles as a soft launch for SIE, the Superlinked Inference Engine — the repo has already been tested with partners including Chroma, Qdrant, Weaviate, and LanceDB, and Makraduli frames it as the practical way he closed his own blind spot around production inference.
Filip Makraduli opens with a little humility: he’d written a Substack post explaining FlashAttention, memory-bound vs compute-bound workloads, and felt pretty good about it — until people pointed out he’d missed the thing that makes models fast in the real world: inference. Instead of hand-waving past it, he treated it like a personal bug report and decided to learn by building.
That learning path led him to Superlinked, where he teamed up with infrastructure engineers to build an open-source inference engine for small models used in AI search and document processing. He presents the repo as a soft launch and notes it’s already been tested with Chroma, Qdrant, Weaviate, and LanceDB — not a toy, but something partners have actually kicked the tires on.
His case for this category starts with “context rot,” pointing to Chroma’s research showing that quality degrades as context windows grow. The answer, he says, is to use small models upstream for preprocessing, filtering, named entity recognition, taxonomy classification, and tool-calling so agents get cleaner inputs and less token bloat; he mentions Andrej Karpathy’s graph-based knowledge base work as part of the same broader response.
Makraduli argues that the usual scaling instinct breaks down for small models because many only occupy a few gigabytes, so pinning one GPU to one model leaves expensive hardware sitting idle. His team built hot-swapping so multiple models can share a GPU, plus a least-recently-used eviction policy, which cuts waste and makes it easier to switch between tools like rerankers and retrievers on demand.
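To make the eviction idea concrete, here is a minimal sketch of least-recently-used eviction under a fixed memory budget. It is an illustration only, not Superlinked's implementation: the class name `LruModelCache`, the model names, and the sizes are all hypothetical, and a real engine would load and unload weights where this sketch only tracks sizes.

```python
from collections import OrderedDict


class LruModelCache:
    """Tracks which models are resident within a memory budget (hypothetical sketch)."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        self.resident = OrderedDict()  # model name -> size in GB, least recently used first

    def used_gb(self) -> float:
        return sum(self.resident.values())

    def acquire(self, name: str, size_gb: float) -> list[str]:
        """Mark `name` as in use, admitting it if needed. Returns the names evicted."""
        if size_gb > self.budget_gb:
            raise ValueError(f"{name} needs {size_gb} GB, budget is {self.budget_gb} GB")
        evicted: list[str] = []
        if name in self.resident:
            self.resident.move_to_end(name)  # refresh recency; nothing to load
            return evicted
        while self.used_gb() + size_gb > self.budget_gb:
            victim, _ = self.resident.popitem(last=False)  # drop the least recently used
            evicted.append(victim)
        self.resident[name] = size_gb  # a real engine would load weights here
        return evicted


cache = LruModelCache(budget_gb=10.0)
cache.acquire("embedder", 4.0)
cache.acquire("reranker", 3.0)
cache.acquire("embedder", 4.0)       # touching the embedder makes the reranker the oldest
evicted = cache.acquire("ner", 5.0)  # 4 + 3 + 5 exceeds 10, so the oldest model goes
print(evicted, list(cache.resident))
```

In this run the reranker is evicted because the embedder was touched more recently, which is the behavior that lets frequently used models stay warm while rarely used ones give up their memory.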
Another misconception: inference is not just spinning up a server with vLLM, TGI, or an API wrapper. The hard part is everything around it — routing, autoscaling, queuing, monitoring, and provisioning GPUs — and he says there wasn’t an open-source stack that took teams all the way from model runtime to production deployment for this small-model use case.
The first half of his “yin and yang of inference” is model support, and he makes the point plainly: inference is worthless if you don’t support the models people actually want to use. With millions of models on Hugging Face, strong open-source results on narrow benchmarks like MTEB, and examples like low-parameter Gemma models earning high Elo scores, he says open source is no longer a compromise.
This is the most technical stretch of the talk: different models disagree on normalization, FlashAttention implementation, fused query-key-value projections, positional encoding, and output format. BERT, Qwen, ColBERT, cross-encoders, and rerankers all need different handling, so Superlinked re-implemented forward passes, added variable-length FlashAttention to avoid wasting compute on padding, and built support for oddballs like multi-vector late-interaction models.
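The padding point can be illustrated with a small sketch. Variable-length attention kernels commonly take a flat buffer of tokens plus cumulative sequence-length offsets instead of a padded rectangle. The code below only builds those inputs and compares token counts; it does not call any attention kernel, the function names are hypothetical, and the token IDs are made up.

```python
from itertools import accumulate


def pack_sequences(batch: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate token lists and return (flat_tokens, cu_seqlens).

    flat_tokens[cu_seqlens[i]:cu_seqlens[i + 1]] is sequence i.
    """
    flat = [tok for seq in batch for tok in seq]
    cu_seqlens = [0, *accumulate(len(seq) for seq in batch)]
    return flat, cu_seqlens


def padded_token_count(batch: list[list[int]]) -> int:
    """Tokens processed if every sequence is padded to the longest one."""
    return len(batch) * max(len(seq) for seq in batch)


# Made-up token IDs; sequence lengths are 4, 3, and 8.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 11, 12, 102]]
flat, cu_seqlens = pack_sequences(batch)
print(cu_seqlens)
print(len(flat), padded_token_count(batch))
```

For this batch the packed layout holds 15 tokens where a padded layout would process 24, so more than a third of the padded work would be spent on filler.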
The infrastructure side wraps those models in three API primitives (encode, score, and extract), then layers in routers, queues, GPU pools, spot instances, larger GPUs, and KEDA autoscaling driven by Prometheus metrics. His pitch is simple: users shouldn’t have to stitch model support and cluster ops together themselves; with SIE, models become config, deployment becomes terraform apply, and the whole thing ships with Helm charts and Docker images. He closes by revealing the opening slide’s mystery background: a sinusoidal positional encoding visualization, which nicely ties the joke back to the technical core.
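For readers who have not seen one, a sinusoidal positional encoding table can be generated in a few lines. This sketch uses the standard Transformer formulation, with sine on even dimensions and cosine on odd ones; the function name, the base of 10000, and the tiny table size are illustrative choices, not details from the talk.

```python
import math


def sinusoidal_encoding(num_positions: int, dim: int, base: float = 10000.0) -> list[list[float]]:
    """PE[pos][2i] = sin(pos / base^(2i/dim)); PE[pos][2i+1] = cos(pos / base^(2i/dim))."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = pos / (base ** (2 * i / dim))
            row.extend([math.sin(angle), math.cos(angle)])  # even index, then odd index
        table.append(row)
    return table


pe = sinusoidal_encoding(num_positions=4, dim=8)
print(pe[0])  # position 0 alternates 0.0 and 1.0
```

Plotting a larger table as a heatmap, with position on one axis and dimension on the other, gives the banded wave pattern that such visualizations typically show.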