It's tempting to point an entire workflow at the best model available and move on. It's also wasteful. A lot of the steps in a real crew — classifying an input, extracting a field, formatting a result — are easy work that a small, fast model handles perfectly well. Spending premium tokens on them buys you nothing but a bigger bill and higher latency.
Per-step model routing
In LoopLlama, a model isn't a property of the whole run — it's chosen per step. A workflow has a default model, and any agent in the crew can override it. So the planner and the final synthesizer can run on a frontier model while the classifier and the extractor run on something cheap and fast, all within the same run.
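As a rough illustration of the default-plus-override idea, here is a minimal sketch in plain Python. It is not LoopLlama's actual API: the `Agent` and `Workflow` classes, the `model_for` helper, and the model names `frontier-large` and `small-fast` are all hypothetical and exist only to show how the lookup might resolve.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    name: str
    # Hypothetical field: None means "fall back to the workflow default".
    model: Optional[str] = None


@dataclass
class Workflow:
    default_model: str
    agents: list[Agent] = field(default_factory=list)

    def model_for(self, agent_name: str) -> str:
        """Return the agent's override if it has one, else the workflow default."""
        for agent in self.agents:
            if agent.name == agent_name:
                return agent.model or self.default_model
        raise KeyError(agent_name)


# Model names are placeholders, not real model identifiers.
crew = Workflow(
    default_model="frontier-large",
    agents=[
        Agent("planner"),                         # inherits the default
        Agent("classifier", model="small-fast"),  # per-agent override
        Agent("extractor", model="small-fast"),   # per-agent override
        Agent("synthesizer"),                     # inherits the default
    ],
)
```

In this sketch, `crew.model_for("planner")` resolves to the default while `crew.model_for("classifier")` resolves to the override, so both kinds of step can coexist in a single run.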
Cost and latency aware
The payoff compounds across a multi-step run. If four of a workflow's six steps are routine and you move them to a model that's an order of magnitude cheaper, you cut most of the cost while keeping quality where it actually matters: on the two steps that do the hard reasoning. With similar token counts per step, the bill falls from six units to 2.4, a 60% reduction. Latency drops too, because the fast steps finish fast.
- Route classification, extraction, and formatting to small, fast models.
- Reserve frontier models for planning, synthesis, and review.
- Keep latency-sensitive steps on whichever model responds quickest.
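The arithmetic behind the six-step example can be checked with a few lines of Python. The prices, token counts, and step names below are assumptions chosen for illustration (a 10x price gap and equal token usage per step), not measured figures.

```python
# Hypothetical prices in arbitrary cost units per 1,000 tokens (a 10x gap).
FRONTIER_PRICE = 10.0
SMALL_PRICE = 1.0

# (step name, tokens used, is_routine). Equal token counts are an assumption.
STEPS = [
    ("plan", 1000, False),
    ("classify", 1000, True),
    ("extract", 1000, True),
    ("format", 1000, True),
    ("tag", 1000, True),
    ("synthesize", 1000, False),
]


def run_cost(steps, route_routine_to_small: bool) -> float:
    """Total cost of one run, optionally moving routine steps to the small model."""
    total = 0.0
    for _name, tokens, routine in steps:
        use_small = routine and route_routine_to_small
        price = SMALL_PRICE if use_small else FRONTIER_PRICE
        total += tokens / 1000 * price
    return total


baseline = run_cost(STEPS, route_routine_to_small=False)
routed = run_cost(STEPS, route_routine_to_small=True)
savings = 1 - routed / baseline
```

Under these assumptions the run costs 60 units on one model and 24 units when routed, a 60% reduction. If the routine steps use more tokens than the reasoning steps, the saving is larger; if they use fewer, it is smaller.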
Same workflow definition
The point of doing this at the routing layer is that your workflow logic doesn't change. The crew, the roles, the order, the tools — all of it stays the same. You're swapping which model executes a step, not rewriting how the step works. That means you can tune the cost/quality trade-off after the fact, as models and prices change, without touching the design of the workflow itself.
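One way to picture that separation is to keep the routing table apart from the step list, so retuning touches only the table. The sketch below is an assumption about shape, not LoopLlama's implementation: `STEPS`, `run`, `fake_call`, and the model names are hypothetical.

```python
from typing import Callable

# Hypothetical workflow definition: step names in order. No model is named here.
STEPS = ["classify", "extract", "plan", "synthesize"]


def run(
    steps: list[str],
    routing: dict[str, str],
    default_model: str,
    call_model: Callable[[str, str], str],
) -> list[str]:
    """Execute each step on its routed model, falling back to the default."""
    return [call_model(routing.get(step, default_model), step) for step in steps]


def fake_call(model: str, step: str) -> str:
    # Stand-in for a real model call; it only records which model ran the step.
    return f"{step}@{model}"


# Two routing tables for the same, unchanged STEPS.
all_frontier: dict[str, str] = {}
tuned = {"classify": "small-fast", "extract": "small-fast"}

before = run(STEPS, all_frontier, "frontier-large", fake_call)
after = run(STEPS, tuned, "frontier-large", fake_call)
```

Only the dictionary changes between the two runs; the step list and the `run` function are identical, which is the property that lets routing be adjusted later without redesigning the workflow.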
Model choice stops being an architectural commitment and becomes a dial you can turn. Most teams start with everything on one capable model, watch the per-step traces, and then push the obvious routine steps down to cheaper models once they can see where the tokens are going.
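Finding "where the tokens are going" amounts to aggregating per-step traces. The record shape below (`step`, `model`, `tokens`) and the numbers in it are made up for illustration; a real trace export may look different.

```python
from collections import Counter

# Hypothetical trace records from a run where everything used one model.
traces = [
    {"step": "plan", "model": "frontier-large", "tokens": 1800},
    {"step": "classify", "model": "frontier-large", "tokens": 2400},
    {"step": "classify", "model": "frontier-large", "tokens": 1200},
    {"step": "extract", "model": "frontier-large", "tokens": 900},
    {"step": "synthesize", "model": "frontier-large", "tokens": 2000},
]


def tokens_by_step(records):
    """Sum token usage per step and return (step, tokens) pairs, largest first."""
    totals = Counter()
    for record in records:
        totals[record["step"]] += record["tokens"]
    return totals.most_common()


ranking = tokens_by_step(traces)
```

In this made-up data the routine `classify` step tops the ranking, which would make it the first candidate to move to a cheaper model.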