Open source · OpenClaw, Codex, Claude + any MCP client

You're paying for the wrong model.

Smart Spawn scores every model against 5 benchmark sources and routes to the one that fits your task and budget.

smart-spawn
Claude Opus 4.6 / DeepSeek R1 / GPT-5 / Gemini 2.5 Flash / Kimi K2.5 / Llama 4 Maverick / Mistral Large / Claude Sonnet 4.5 / Qwen 3 235B / Grok 3 / Gemini 2.5 Pro / Command R+

Four ways to route

Different problems need different strategies.

Single

One task. Best model. Done.

Task ——→ [Score] ——→ Best Model

Describe what you need. Smart Spawn scores every model in your budget tier for that category and picks the winner. Coding tasks get coders. Research gets readers.
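A minimal sketch of what single-mode routing might look like. The model names, tiers, and scores below are illustrative placeholders, not Smart Spawn's real catalog or API:

```python
# Illustrative catalog: names, tiers, and scores are made up, not real data.
CATALOG = [
    {"name": "deepseek-r1",      "tier": "budget",  "scores": {"coding": 0.82, "research": 0.70}},
    {"name": "gemini-2.5-flash", "tier": "budget",  "scores": {"coding": 0.74, "research": 0.81}},
    {"name": "gpt-5",            "tier": "premium", "scores": {"coding": 0.91, "research": 0.88}},
]

def route_single(category: str, budget: str) -> str:
    """Pick the highest-scoring model for this category within the budget tier."""
    eligible = [m for m in CATALOG if m["tier"] == budget]
    best = max(eligible, key=lambda m: m["scores"].get(category, 0.0))
    return best["name"]
```

On this toy catalog, a budget coding task routes to `deepseek-r1` while a budget research task routes to `gemini-2.5-flash` — same tier, different winners per category.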

Collective

Cheap models, expensive results.

         ┌──→ Model A ──┐
Task ────├──→ Model B ──┤──→ Merge
         └──→ Model C ──┘

Same prompt, three diverse models, parallel execution. Merge the best parts. Budget models brainstorming together regularly match a premium model on creative work, at about 1/50th the price.
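Sketched in Python, collective mode is a fan-out/merge. `call_model` here is a stub standing in for real provider API calls:

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would call each provider's API here.
    return f"[{model}] draft for: {prompt}"

def collective(prompt: str, models: list[str]) -> list[str]:
    # Fan the same prompt out to diverse models in parallel...
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        drafts = list(pool.map(lambda m: call_model(m, prompt), models))
    # ...then merge. In practice a merge step combines the best parts;
    # this sketch simply returns all drafts.
    return drafts
```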

Cascade

Start cheap. Escalate only if you have to.

Task ──→ Cheap ──→ Good?
                    ├── Yes ─→ Done (saved 90%)
                    └── No ──→ Premium ─→ Done

The $0.10 model goes first. If it handles the task (and it usually does), you save 90%+. If not, premium takes over. Most routine tasks never need to escalate.
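A cascade is a few lines of control flow. What counts as "good" depends on your quality check; everything below is a hedged sketch with stub models:

```python
from typing import Callable

def cascade(task: str,
            cheap: Callable[[str], str],
            premium: Callable[[str], str],
            is_good: Callable[[str], bool]) -> tuple[str, str]:
    """Try the cheap model first; escalate to premium only on failure."""
    draft = cheap(task)
    if is_good(draft):
        return draft, "budget"       # premium call avoided
    return premium(task), "premium"  # escalation path
```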

Swarm

Big problems, decomposed.

        ┌── Research ─→ Gemini Flash
Task ───├── Code ─────→ DeepSeek R1
        ├── Design ───→ Claude Sonnet
        └── Review ───→ GPT-5

Break a complex project into a dependency graph of subtasks. Each piece gets the model that scores highest for that specific job. Research to a context specialist, code to a code machine, review to a reasoning engine.
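The dependency graph above can be sketched with the standard library's topological sorter. The subtask-to-model assignments mirror the diagram and are illustrative, not Smart Spawn's actual output:

```python
from graphlib import TopologicalSorter

# Each subtask lists the subtasks it depends on.
DEPS = {
    "research": [],
    "code":     ["research"],
    "design":   ["research"],
    "review":   ["code", "design"],
}

# Illustrative assignments: each subtask goes to the model that
# scores highest for that category.
ASSIGN = {
    "research": "gemini-2.5-flash",
    "code":     "deepseek-r1",
    "design":   "claude-sonnet-4.5",
    "review":   "gpt-5",
}

# Run subtasks in dependency order, each with its assigned model.
plan = [(task, ASSIGN[task]) for task in TopologicalSorter(DEPS).static_order()]
```

Research runs first, code and design can run in parallel once it finishes, and review waits for both.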

Cheap models need better prompts

A $0.10 model with a well-structured expert prompt regularly outperforms a $15 model with a lazy one. Smart Spawn builds that prompt for you.

You provide
task: "Build a billing dashboard"
persona: fullstack-engineer
stack: [nextjs, typescript, stripe]
domain: saas
Smart Spawn builds

Expert system prompt with role context, stack conventions, domain knowledge, output format constraints, and guardrails.

15+ personas, 30+ stack blocks, 8 domains, configurable guardrails. The API composes them into a single prompt before routing.
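Prompt composition can be as simple as concatenating blocks. The block text below is invented for illustration; Smart Spawn's actual persona, stack, and domain blocks differ:

```python
# Invented block libraries, standing in for the real persona/stack/domain blocks.
PERSONAS = {"fullstack-engineer": "You are a senior full-stack engineer."}
STACKS = {
    "nextjs":     "Use Next.js App Router conventions.",
    "typescript": "All code in strict TypeScript.",
    "stripe":     "Use Stripe's official SDK for billing.",
}
DOMAINS = {"saas": "Context: a multi-tenant SaaS product."}

def compose_prompt(persona: str, stack: list[str], domain: str,
                   guardrails: list[str]) -> str:
    """Concatenate persona, stack, domain, and guardrail blocks into one system prompt."""
    parts = [PERSONAS[persona], *(STACKS[s] for s in stack),
             DOMAINS[domain], *guardrails]
    return "\n".join(parts)
```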

This is why collective mode works so well with budget models. Three cheap agents with expert-crafted prompts, brainstorming in parallel, routinely match a single premium model on creative and architectural tasks.

Real numbers

6,000 requests · 168M input tokens · 15M output tokens
Without routing (Opus 4.6) $2,550/mo
With Smart Spawn $600/mo
You save $1,950/mo (~76%)

Based on power-user workload. Routing splits 15-20% to premium models, remainder to cost-optimized alternatives. All pricing from published API rates.
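The arithmetic behind the headline numbers, for anyone who wants to check it:

```python
without_routing = 2550   # $/mo, everything on the premium model
with_routing    = 600    # $/mo, 15-20% premium, rest cost-optimized
saved = without_routing - with_routing   # $1,950/mo
pct   = saved / without_routing * 100    # ~76.5%
```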

~80%
Tasks resolve at budget tier
Cascade catches the 20% that actually need premium. The rest? A $0.10 model handles it.
50×
Cheaper for brainstorming
3 budget models collectively ($0.30/M) vs 1 premium ($15/M). The collective wins on creative work.
6hr
Benchmark refresh
New models ship weekly. Scores go stale. Smart Spawn re-pulls from all 5 sources every 6 hours.

How scoring works

Five benchmark sources. Z-score normalization. Category-specific weighting.

Data Sources

OpenRouter
Model catalog, pricing, capabilities, context lengths
Artificial Analysis
Intelligence, coding, and math quality indices
HuggingFace Leaderboard
MMLU, BBH, and academic benchmarks
LMArena (Chatbot Arena)
ELO ratings from human preference battles
LiveBench
Contamination-free coding and reasoning scores

The Pipeline

01 Normalize

Different benchmarks use different scales. An "intelligence index" of 65 means something completely different from an Arena ELO of 1350. Everything gets z-score normalized, so 2σ above average on any benchmark maps to the same score.
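Z-score normalization fits in a few lines, here assuming population statistics over the tracked models:

```python
from statistics import mean, pstdev

def z_normalize(raw: dict[str, float]) -> dict[str, float]:
    """Map raw benchmark scores to 'standard deviations above average'."""
    mu, sigma = mean(raw.values()), pstdev(raw.values())
    return {model: (score - mu) / sigma for model, score in raw.items()}
```

After this step a model 2σ above the pack scores 2.0 whether the source scale was a 0-100 quality index or a four-digit ELO.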

02 Categorize

Models get scored per category: coding, reasoning, creative, vision, research, general. Coding benchmarks carry more weight for coding tasks, creativity benchmarks for creative ones. Each task type gets its own ranking.
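Category scoring is then a weighted sum over the normalized benchmarks. The benchmark names and weights below are invented for illustration, not the shipped values:

```python
# Illustrative per-category weights; real values are Smart Spawn's own.
WEIGHTS = {
    "coding":   {"livebench_coding": 0.5, "aa_coding": 0.3, "arena_elo": 0.2},
    "creative": {"arena_elo": 0.6, "aa_intelligence": 0.4},
}

def category_score(z: dict[str, float], category: str) -> float:
    """Weighted sum of one model's z-scores for a task category."""
    return sum(w * z.get(bench, 0.0) for bench, w in WEIGHTS[category].items())
```

A model with strong coding z-scores but an average Arena ELO ranks high for coding tasks and middling for creative ones.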

03 Blend

Final score = benchmarks + your personal feedback + community ratings + context signals. Your ratings feed back into future picks.
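The blend itself might look like a convex combination of the four signals. The weights here are assumptions for the sketch, not the shipped values:

```python
def blend(benchmark: float, personal: float, community: float, context: float,
          weights: tuple[float, float, float, float] = (0.6, 0.2, 0.1, 0.1)) -> float:
    """Convex combination of the four scoring signals (weights are assumed)."""
    signals = (benchmark, personal, community, context)
    return sum(w * s for w, s in zip(weights, signals))
```

Because the weights sum to 1, the blended score stays on the same scale as its inputs, and your personal feedback nudges future picks without overriding the benchmarks.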

Live from the API

Real numbers. Right now.


Try it yourself

Describe a task, pick a budget, see what comes back. No signup, no API key.

Three ways in

OpenClaw plugin, lightweight skill, or MCP server. All talk to the same scoring engine.

Install
openclaw plugins install @deeflectcom/smart-spawn
Config (optional)
# openclaw.yaml
plugins:
  smart-spawn:
    budget: medium
    mode: single

Full tool integration for OpenClaw. Your agent gets smart_spawn as a native command with all spawn modes, feedback loops, and local scoring.

Ready to stop picking models by hand?

One command. No config. Your agent starts picking the right model immediately.