Foundation Models: LLM vs SLM vs Multimodal
Model selection guide — when to use large vs small language models, hybrid routing, cost vs quality trade-offs, and the MoE architecture.
LLM vs SLM: Size and Performance Trade-offs
Large Language Models (LLMs)
- Parameters: 70B–175B+
- Training cost: Massive (GPT-4 reportedly required thousands of A100 GPUs over several months; exact figures unconfirmed by OpenAI)
- Inference cost: 10–30× more expensive than equivalent SLM
- Use case: Complex reasoning, general-purpose capability
- Latency: Higher per-token time
- When to use: Complex multi-step reasoning, broad knowledge needed, quality over cost
Small Language Models (SLMs)
- Parameters: Typically 1B–10B (emerging SLMs <1B for edge, <13B for specialized tasks)
- Training cost: Significantly lower
- Inference cost: 10–30× cheaper than LLM equivalent
- Real-time inference: Possible on smartphones and edge devices
- Fine-tuning: Faster and cheaper
- Use case: Real-time applications, mobile deployment, cost-sensitive operations
- Emerging trend: SLMs are increasingly favored for agentic AI, where tight latency requirements reward smaller models
- When to use: Tool-use loops, real-time decision making, cost-constrained harnesses
Hybrid Approach (2025+ Trend)
- Use SLMs for real-time decision-making and tool invocation
- Use LLMs for complex reasoning or final verification
- Route tasks to appropriate model based on capability and cost
- Examples: Phi-4, LLaMA 4 (with Mixture-of-Experts)
- Pattern: SLM for agent loop speed, LLM for verification/complex steps
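The routing pattern above can be sketched in a few lines. This is a minimal illustration, not a production router: `call_slm` and `call_llm` are hypothetical placeholders for real model clients, and the keyword-based classifier stands in for what would normally be an SLM-based or trained intent classifier.

```python
# Minimal sketch of the hybrid SLM/LLM routing pattern.
# `call_slm` and `call_llm` are hypothetical stand-ins for real model clients.

def call_slm(prompt: str) -> str:
    """Placeholder for a fast, cheap small-model call."""
    return f"slm:{prompt}"

def call_llm(prompt: str) -> str:
    """Placeholder for a slower, higher-quality large-model call."""
    return f"llm:{prompt}"

# Crude complexity signals; a real router would classify with a model.
COMPLEX_MARKERS = ("why", "plan", "multi-step", "prove")

def route(prompt: str) -> str:
    """Send simple requests to the SLM; escalate complex ones to the LLM."""
    needs_llm = any(marker in prompt.lower() for marker in COMPLEX_MARKERS)
    return call_llm(prompt) if needs_llm else call_slm(prompt)
```

In practice the classification step is often the SLM itself (classify first, then answer or escalate), which is the pattern costed out in the customer-support example later in this document.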
Multimodal Models
Definition: Models that integrate information from multiple modalities (text, images, video, audio) within a single architecture.
Architecture Approach:
- Typically use an LLM as backbone
- Integrate vision encoder for image understanding
- Can handle text-to-image, image-to-text, and mixed reasoning tasks
Recent Examples:
- Phi-4 (vision + language)
- LLaMA 4 (Meta’s first MoE architecture, combining large total capacity with efficient per-token inference)
- Claude 3 family (multimodal reasoning)
Practical Advantage: Can build more capable systems without proportional increase in model size or inference cost (especially with MoE architectures).
Training vs Inference
Training Phase
- One-time cost to create/adapt a model
- Requires large GPU clusters and extended time
- Produces weights and parameters
- Fundamental cost barrier to creating new models
Inference Phase (typically the dominant share of operational cost)
- Per-request cost, happens every time model is used
- KV cache optimization focuses here
- Quantization applied at this stage
- Batching strategies improve throughput
Key Economics
- Inference on a 7B SLM typically costs 10–30× less than on a 70B LLM
- Full fine-tuning becomes uneconomical for many teams above roughly 20B parameters
- Quantization can deliver up to ~4× throughput improvements for inference
- Cost implication: SLMs win for long-running harnesses
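The economics above reduce to simple per-token arithmetic. The sketch below uses illustrative per-million-token prices (the same example figures used in the customer-support scenario later in this document), not live rates from any provider; the exact multiple depends entirely on the prices you plug in.

```python
# Back-of-the-envelope inference spend, using illustrative
# per-million-token prices (not live rates from any provider).

def cost_per_request(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request at per-million-token prices."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

llm = cost_per_request(500, 500, 3.00, 15.00)   # 70B-class example pricing
slm = cost_per_request(500, 500, 0.10, 0.30)    # 7B-class example pricing

print(f"LLM ${llm:.4f}/req vs SLM ${slm:.4f}/req ({llm / slm:.0f}x gap)")
```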
Practical Selection Guide
| Use Case | Recommended | Reasoning |
|---|---|---|
| Agent tool loop | 7B–10B SLM | Speed critical, cost-effective |
| Complex reasoning | 70B+ LLM | Need advanced capability |
| Verification step | 70B+ LLM | Quality gate |
| Mobile/Edge agent | <1B SLM | Size/power constraints |
| Hybrid harness | SLM + LLM | Use each for right task |
Real-World Decision Example: Customer Support Agent
Scenario: Building a customer support chatbot that answers FAQs, escalates complex issues, and gathers ticket info in real-time.
Cost vs Quality Analysis
Option A: LLM-Only (70B Claude or Llama)
- Inference cost: $3 per 1M input tokens, $15 per 1M output tokens
- Typical request: ~500 input tokens + ~500 output tokens
- Cost per request: (500/1M × $3) + (500/1M × $15) = $0.0015 + $0.0075 = ~$0.009
- Annual cost for 10,000 requests/day: ~$33K
- Quality: Excellent reasoning, handles all issues
- Latency: 2–5 seconds per response
- Problem: More expensive than SLM at scale
Option B: SLM-Only (7B Mistral or Phi-4)
- Inference cost: $0.10 per 1M input tokens, $0.30 per 1M output tokens
- Cost per request: ~$0.0002
- Annual cost for 10,000 requests/day: ~$730
- Quality: Good for FAQs, sometimes fails on edge cases
- Latency: 200–500ms per response
- Problem: Misses complex issues, lower quality
Option C: Hybrid (SLM + LLM Router)
- Step 1: SLM classifies intent (cost: ~$0.0002)
- If FAQ match -> Answer directly (SLM, cost: ~$0.0002)
- If complex -> Route to LLM (cost: ~$0.009)
- Estimated: 70% FAQ = cheap, 30% complex = expensive
- Average cost per request: 0.7 × $0.0004 + 0.3 × $0.0092 = ~$0.003
- Annual cost for 10,000 requests/day: ~$11K
- Quality: High for common cases, excellent for edge cases
- Latency: 200ms for FAQ, 2s for complex (acceptable)
- Result: ~67% cheaper than LLM-only, and far more reliable than SLM-only on complex cases
Note: Prices approximate as of early 2025. Check provider websites for current rates.
Recommendation: Use Option C (Hybrid) for production customer support.
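The Option C arithmetic can be reproduced in a few lines (same illustrative prices as above; the router classification itself costs one SLM pass):

```python
# Reproducing the Option C blended-cost arithmetic from this section.
# All prices are the illustrative figures above, not live rates.

SLM_REQUEST = 0.0002      # one SLM call
LLM_REQUEST = 0.009       # one LLM call
ROUTER_CALL = 0.0002      # SLM-based intent classification

faq_cost = ROUTER_CALL + SLM_REQUEST       # classify, then answer with the SLM
complex_cost = ROUTER_CALL + LLM_REQUEST   # classify, then escalate to the LLM

blended = 0.7 * faq_cost + 0.3 * complex_cost
annual = blended * 10_000 * 365            # 10,000 requests/day

print(f"blended ${blended:.4f}/request, ~${annual:,.0f}/year")
```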
Implementation Checklist
- Identify your use cases — What 80/20 split of requests do you handle?
- Measure baseline latency — How fast does the user expect responses?
- Calculate cost sensitivity — What’s your budget threshold?
- Test SLM performance — Does SLM handle 70% of your workload correctly?
- Design router logic — How do you classify requests into “SLM-friendly” vs “needs LLM”?
- Implement cost tracking — Log which model handled each request
- Monitor quality metrics — Track success rate, user satisfaction
- Plan escalation path — What happens when SLM fails? Retry with LLM or ask human?
- Set up A/B testing — Compare SLM-only vs Hybrid in production
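The cost-tracking item on the checklist can start as simply as logging which model handled each request. A minimal sketch, with illustrative class and method names rather than any specific library's API:

```python
# Sketch of per-request cost tracking: log which model handled each
# request so routing decisions can be audited later. Names are
# illustrative, not a specific library's API.

from collections import Counter

class CostTracker:
    def __init__(self):
        self.requests = []          # (model, cost) per request

    def log(self, model: str, cost: float) -> None:
        self.requests.append((model, cost))

    def summary(self) -> dict:
        """Per-model request counts and total spend so far."""
        counts = Counter(model for model, _ in self.requests)
        total = sum(cost for _, cost in self.requests)
        return {"counts": dict(counts), "total_cost": round(total, 4)}

tracker = CostTracker()
tracker.log("slm", 0.0004)
tracker.log("slm", 0.0004)
tracker.log("llm", 0.0092)
print(tracker.summary())
```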
Decision Tree: Which Model Should You Use?
START: "I need to process requests"
│
├─ "Do I need real-time response?" (< 500ms)
│ ├─ YES → "Is quality critical?"
│ │ ├─ YES (healthcare, finance) → 70B LLM + GPU
│ │ └─ NO → 7B SLM
│ └─ NO (reports, batch) → 70B+ LLM (quality > speed)
│
├─ "What's my compute budget?"
│ ├─ UNLIMITED → 175B+ LLM (best quality)
│ ├─ $10K+/month → 70B LLM
│ ├─ $1K-10K/month → Hybrid (SLM + LLM router)
│ └─ <$1K/month → SLM only
│
├─ "What's my deployment target?"
│ ├─ Cloud (AWS, GCP, Azure) → Any (scale as needed)
│ ├─ On-prem GPU → 7B-13B SLM
│ ├─ Local machine → <3B SLM or quantized
│  └─ Mobile/Edge → <1B SLM (quantized)
│
└─ "Is this a 'reasoning' or 'retrieval' task?"
├─ Reasoning (math, logic, planning) → 70B+ LLM
└─ Retrieval (search, extraction, classification) → 7B SLM
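The tree above can be encoded as a function for use in planning or tooling. This is a rough guide that mirrors the document's categories and thresholds, not a benchmark-backed recommendation; the parameter names and return strings are illustrative.

```python
# The decision tree above, encoded as a function. Categories and
# thresholds mirror the document; treat the output as a starting point.

def pick_model(realtime: bool, quality_critical: bool,
               monthly_budget_usd: float, target: str) -> str:
    """Return a rough model-class recommendation."""
    if target in ("mobile", "edge"):
        return "sub-1B SLM (quantized)"
    if target == "local":
        return "<3B SLM or quantized"
    if realtime and not quality_critical:
        return "7B SLM"
    if monthly_budget_usd < 1_000:
        return "SLM only"
    if monthly_budget_usd < 10_000:
        return "hybrid SLM + LLM router"
    return "70B+ LLM"

print(pick_model(realtime=True, quality_critical=False,
                 monthly_budget_usd=5_000, target="cloud"))
```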
Advanced: Mixture of Experts (MoE) Models
What It Is: A model architecture that routes tokens to specialized sub-networks (“experts”), enabling large effective model size with lower compute.
Examples:
- Mixtral 8x7B: 8 experts of ~7B each, activates 2 per token (~12.9B active of ~46.7B total)
- LLaMA 4: Meta’s first MoE architecture, with its own expert configuration
- Phi-4-MoE (emerging)
Key Property (Mixtral example): Mixtral 8x7B has ~46.7B total parameters but only activates ~12.9B per token, giving quality competitive with much larger dense models at a fraction of the per-token compute cost.
When to Consider:
- You need reasoning capacity of 70B models
- But inference latency/cost constraints force SLM
- MoE bridges this gap
Trade-off: Not all frameworks support MoE efficiently (requires CUDA kernel optimizations).
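The sparse-activation idea can be illustrated with a toy example. In a real MoE layer the gate is a learned linear projection over hidden states inside a transformer block; here each "expert" is just a function and the gate logits are given directly, purely to show that only the top-k experts do any work per token.

```python
# Toy illustration of MoE token routing: a gate scores experts per token,
# only the top-2 run, and their outputs are mixed by renormalized gate
# weight. Real MoE layers gate over learned logits inside a transformer.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token_value: float, experts, gate_logits, top_k: int = 2):
    """Run only the top_k experts and mix their outputs by gate weight."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i],
                    reverse=True)[:top_k]
    weights = softmax([gate_logits[i] for i in ranked])  # renormalize over top-k
    return sum(w * experts[i](token_value) for w, i in zip(weights, ranked))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
out = moe_forward(3.0, experts, gate_logits=[0.1, 2.0, 0.5, -1.0])
# Experts 1 and 2 win the gate, so only 2 of the 4 experts do any work.
print(round(out, 3))
```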
How This Connects to Other Docs
- Doc 02 (KV Cache Optimization): Once you choose your model, optimize its inference
- Doc 03 (Hugging Face): Find and download your chosen model
- Doc 21 (Model Fundamentals): Understand how weights and parameters work
- Doc 22 (Knowledge Transfer): Consider fine-tuning your SLM for your domain
- Doc 24 (Hardware): Choose hardware based on model size and inference latency
Reasoning Models: A Different Kind of Intelligence
Everything above compares models by size — how many parameters, how much RAM, how fast. But there is a second axis that matters just as much: what the model was trained to do.
Instruction Models
Instruction-tuned models are trained to follow directions and predict the next token. They excel at generating content in specified formats, following templates, and producing fluent text. When you give an instruction model a prompt, it responds immediately with its best prediction.
Examples: Qwen 2.5 Instruct, Llama 3.1 Instruct, Mistral Instruct, Phi-4.
Reasoning Models
Reasoning models are trained to chain through logic steps before answering. They think step-by-step internally, then respond. This explicit reasoning process makes them dramatically better at multi-step inference, logical chains, and problems where the answer depends on getting intermediate steps right.
Examples: DeepSeek-R1, QwQ, OpenAI o1/o3.
Key Insight
Model selection is not just about size — it is about what the model was trained to DO.
A 14B reasoning model outperforms a 14B instruction model on multi-step inference tasks. The training objective matters more than the parameter count for reasoning-heavy workloads.
Comparison
| | Instruction Model | Reasoning Model |
|---|---|---|
| Training approach | Follow directions, predict next token | Chain through logic steps before answering |
| Process | Prompt in, response out | Prompt in, think step-by-step, then respond |
| Strength | Content generation, format compliance | Multi-step inference, logical chains |
| Speed | Fast (immediate response) | Slower (thinks before answering) |
| Best for | Formatting, summarisation, content generation | Strategic analysis, multi-step reasoning, inference |
| Examples | Qwen Instruct, Llama Instruct, Mistral Instruct | DeepSeek-R1, QwQ, OpenAI o1/o3 |
When to Use Each
- Instruction model: Content generation, format compliance, summarisation, classification, tool-calling in agent loops where speed matters.
- Reasoning model: Strategic analysis, multi-step inference, problems where getting intermediate steps wrong corrupts the final answer, verification tasks.
The Trade-off
Reasoning models are slower because they think before answering. A 14B reasoning model might take ~173 seconds where a 14B instruction model takes ~25 seconds. But for tasks that require chained logic, the reasoning model produces dramatically better results — the extra time buys accuracy.
Updated Decision Rule
If your task requires multi-step logical reasoning, choose a reasoning model regardless of parameter count. A 14B reasoning model will outperform a 14B instruction model on inference tasks, and may match or exceed a 70B instruction model on reasoning-specific benchmarks.
Specialist vs Generalist: Depth vs Breadth
Even within reasoning models, there is a critical distinction: how focused is the problem?
A 14B reasoning model excels at deep, focused reasoning — one problem, one context, step-by-step logic. Give it a single customer support ticket with full history and ask “what is the root cause?” and it chains through five inference steps flawlessly.
The same model fails at broad reasoning — many problems, large context, cross-referencing. Give it fifty family summaries and ask “which families are connected?” and it produces generic advice. It may not even follow explicit instructions in the system prompt when the context is large.
The diagnostic: if a model can’t reliably follow a direct instruction in the system prompt (e.g., “always include X in your suggestions”), it won’t make subtle inferences across a large context. Instruction-following is the floor — if that fails, reasoning about the content won’t succeed either.
Practical rule: use local models (7B-14B) as specialists — one focused problem per call with rich, relevant context. For generalist reasoning across many inputs, route to a larger model or API (70B+, Claude, GPT-4). See the tiered inference pattern in Harness Architecture.
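The instruction-following floor check above lends itself to a simple escalation loop: call the local specialist, verify the required instruction was followed, and escalate only when it was not. A minimal sketch, where `local_model`, `frontier_model`, and the marker string are hypothetical placeholders:

```python
# Sketch of tiered inference with the "instruction-following floor" check:
# try the local specialist first, verify it followed a required system-prompt
# instruction, and escalate to a larger model if not.
# `local_model` and `frontier_model` are hypothetical placeholders.

REQUIRED_MARKER = "Suggested next step:"   # instruction the system prompt demands

def local_model(prompt: str) -> str:
    """Placeholder 7B-14B specialist; sometimes drops the required marker."""
    if "root cause" in prompt:
        return "Suggested next step: check the billing sync job."
    return "Generic advice without the marker."

def frontier_model(prompt: str) -> str:
    """Placeholder 70B+/API generalist."""
    return "Suggested next step: cross-reference the related accounts."

def tiered_answer(prompt: str):
    answer = local_model(prompt)
    if REQUIRED_MARKER in answer:   # floor check passed: trust the specialist
        return "local", answer
    return "frontier", frontier_model(prompt)   # escalate

tier, _ = tiered_answer("What is the root cause of this ticket?")
print(tier)
```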
Validation Checklist: Did You Choose Correctly?
Run this checklist after selecting your model(s):
- Performance: Test model on your actual use cases (not just examples)
  - Target: 90%+ success rate on representative sample
- Latency: Measure end-to-end response time (model + other systems)
  - Target: Meets user expectations (e.g., <500ms for chat, <5s for reports)
- Cost: Calculate monthly cost at your projected volume
  - Formula: (avg_input_tokens + avg_output_tokens) × price_per_token × daily_requests × 30
  - Target: Within budget (usually 5–15% of overall system cost)
- Quality: Run evaluation on 50+ test cases
  - Target: ≥90% success rate on FAQ cases, ≥80% on complex cases
- Fallback: If SLM fails, what’s the recovery plan?
  - Target: Defined escalation path (retry with LLM, ask human, or error gracefully)
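The cost formula from the checklist, implemented with input and output priced separately (the single price_per_token in the formula is a simplification; most providers charge different rates for the two):

```python
# The monthly-cost formula from the checklist above, with input and
# output tokens priced separately, as most providers do.

def monthly_cost(avg_in: int, avg_out: int, in_price_per_m: float,
                 out_price_per_m: float, daily_requests: int) -> float:
    per_request = (avg_in / 1e6 * in_price_per_m
                   + avg_out / 1e6 * out_price_per_m)
    return per_request * daily_requests * 30

# The document's LLM-only scenario works out to ~$2,700/month.
print(round(monthly_cost(500, 500, 3.00, 15.00, 10_000)))
```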
Summary: Foundation Models in Your Harness
Key Insight: Model selection is the foundational architectural decision. It determines:
- Inference cost (10–100× variation between SLM and LLM)
- Latency (200ms SLM vs 5s LLM)
- Capability (SLM handles 70% of tasks, LLM handles 100%)
Recommendation for Starting: Use hybrid approach
- SLM for fast, cheap decisions (tool routing, FAQs, triage)
- LLM for complex reasoning and final verification
- Router to decide which model handles each request
- This pattern is proven at scale (Claude Code, Claw-Code, production harnesses)
Next Steps:
- Identify your use cases and calculate cost sensitivity
- Test SLM performance on representative samples
- If SLM handles 70%+, design a router
- If SLM handles <70%, accept LLM cost or invest in fine-tuning
See Also
- Doc 02 (KV Cache Optimization) — Understand how to optimize models you’ve selected for longer contexts and faster inference
- Doc 03 (Hugging Face Ecosystem) — Find and evaluate specific models on Hugging Face; includes quantization options (AWQ, GPTQ) and performance trade-offs
- Doc 21 (Model Fundamentals) — Understand how weights, parameters, and transformers work; essential background for model selection
- Doc 22 (Knowledge Transfer Methods) — Learn distillation and fine-tuning to adapt models to your specific use case