Foundation Models: LLM vs SLM vs Multimodal
Model selection guide — when to use large vs small language models, hybrid routing, cost vs quality trade-offs, and the MoE architecture.
LLM vs SLM: Size and Performance Trade-offs
Large Language Models (LLMs)
- Parameters: 70B–175B+
- Training cost: Massive (GPT-4 reportedly required thousands of A100 GPUs over several months; exact figures unconfirmed by OpenAI)
- Inference cost: 10–30× more expensive than equivalent SLM
- Use case: Complex reasoning, general-purpose capability
- Latency: Higher per-token time
- When to use: Complex multi-step reasoning, broad knowledge needed, quality over cost
Small Language Models (SLMs)
- Parameters: Typically 1B–10B (emerging SLMs <1B for edge, <13B for specialized tasks)
- Training cost: Significantly lower
- Inference cost: 10–30× cheaper than LLM equivalent
- Real-time inference: Possible on smartphones and edge devices
- Fine-tuning: Faster and cheaper
- Use case: Real-time applications, mobile deployment, cost-sensitive operations
- Emerging trend: SLMs are increasingly favored for agentic AI, where tight latency requirements reward smaller models
- When to use: Tool-use loops, real-time decision making, cost-constrained harnesses
Hybrid Approach (2025+ Trend)
- Use SLMs for real-time decision-making and tool invocation
- Use LLMs for complex reasoning or final verification
- Route tasks to appropriate model based on capability and cost
- Examples: Phi-4, LLaMA 4 (with Mixture-of-Experts)
- Pattern: SLM for agent loop speed, LLM for verification/complex steps
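The routing pattern above can be sketched in a few lines. This is a minimal illustration, not a production router: `call_slm` and `call_llm` are hypothetical placeholders for real model clients, and the keyword-based classifier stands in for what would normally be an SLM-based or trained intent classifier.

```python
# Minimal sketch of the hybrid SLM/LLM routing pattern.
# `call_slm` and `call_llm` are hypothetical stand-ins for real model clients.

def call_slm(prompt: str) -> str:
    """Placeholder for a fast, cheap small-model call."""
    return f"slm:{prompt}"

def call_llm(prompt: str) -> str:
    """Placeholder for a slower, higher-quality large-model call."""
    return f"llm:{prompt}"

# Crude complexity signals; a real router would classify with a model.
COMPLEX_MARKERS = ("why", "plan", "multi-step", "prove")

def route(prompt: str) -> str:
    """Send simple requests to the SLM; escalate complex ones to the LLM."""
    needs_llm = any(marker in prompt.lower() for marker in COMPLEX_MARKERS)
    return call_llm(prompt) if needs_llm else call_slm(prompt)
```

In practice the classification step is often the SLM itself (classify first, then answer or escalate), which is the pattern costed out in the customer-support example later in this document.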
Multimodal Models
Definition: Models that integrate information from multiple modalities (text, images, video, audio) within a single architecture.
Architecture Approach:
- Typically use an LLM as backbone
- Integrate vision encoder for image understanding
- Can handle text-to-image, image-to-text, and mixed reasoning tasks
Recent Examples:
- Phi-4 (vision + language)
- LLaMA 4 (Meta’s first MoE architecture, combining large total capacity with efficient per-token inference)
- Claude 3 family (multimodal reasoning)
Practical Advantage: Can build more capable systems without proportional increase in model size or inference cost (especially with MoE architectures).
Training vs Inference
Training Phase
- One-time cost to create/adapt a model
- Requires large GPU clusters and extended time
- Produces weights and parameters
- Fundamental cost barrier to creating new models
Inference Phase (typically the dominant share of operational cost)
- Per-request cost, happens every time model is used
- KV cache optimization focuses here
- Quantization applied at this stage
- Batching strategies improve throughput
Key Economics
- Inference on a 7B SLM typically costs 10–30× less than on a 70B LLM
- Full fine-tuning becomes uneconomical for many teams above roughly 20B parameters
- Quantization can deliver up to ~4× throughput improvements for inference
- Cost implication: SLMs win for long-running harnesses
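The economics above reduce to simple per-token arithmetic. The sketch below uses illustrative per-million-token prices (the same example figures used in the customer-support scenario later in this document), not live rates from any provider; the exact multiple depends entirely on the prices you plug in.

```python
# Back-of-the-envelope inference spend, using illustrative
# per-million-token prices (not live rates from any provider).

def cost_per_request(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request at per-million-token prices."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

llm = cost_per_request(500, 500, 3.00, 15.00)   # 70B-class example pricing
slm = cost_per_request(500, 500, 0.10, 0.30)    # 7B-class example pricing

print(f"LLM ${llm:.4f}/req vs SLM ${slm:.4f}/req ({llm / slm:.0f}x gap)")
```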
Practical Selection Guide
| Use Case | Recommended | Reasoning |
|---|---|---|
| Agent tool loop | 7B–10B SLM | Speed critical, cost-effective |
| Complex reasoning | 70B+ LLM | Need advanced capability |
| Verification step | 70B+ LLM | Quality gate |
| Mobile/Edge agent | <1B SLM | Size/power constraints |
| Hybrid harness | SLM + LLM | Use each for right task |
Real-World Decision Example: Customer Support Agent
Scenario: Building a customer support chatbot that answers FAQs, escalates complex issues, and gathers ticket info in real-time.
Cost vs Quality Analysis
Option A: LLM-Only (70B Claude or Llama)
- Inference cost: $3 per 1M input tokens, $15 per 1M output tokens
- Typical request: ~500 input tokens + ~500 output tokens
- Cost per request: (500/1M × $3) + (500/1M × $15) = $0.0015 + $0.0075 = ~$0.009
- Annual cost for 10,000 requests/day: ~$33K
- Quality: Excellent reasoning, handles all issues
- Latency: 2–5 seconds per response
- Problem: More expensive than SLM at scale
Option B: SLM-Only (7B Mistral or Phi-4)
- Inference cost: $0.10 per 1M input tokens, $0.30 per 1M output tokens
- Cost per request: ~$0.0002
- Annual cost for 10,000 requests/day: ~$730
- Quality: Good for FAQs, sometimes fails on edge cases
- Latency: 200–500ms per response
- Problem: Misses complex issues, lower quality
Option C: Hybrid (SLM + LLM Router)
- Step 1: SLM classifies intent (cost: ~$0.0002)
- If FAQ match -> Answer directly (SLM, cost: ~$0.0002)
- If complex -> Route to LLM (cost: ~$0.009)
- Estimated: 70% FAQ = cheap, 30% complex = expensive
- Average cost per request: 0.7 × $0.0004 + 0.3 × $0.0092 = ~$0.003
- Annual cost for 10,000 requests/day: ~$11K
- Quality: High for common cases, excellent for edge cases
- Latency: 200ms for FAQ, 2s for complex (acceptable)
- Result: ~67% cheaper than LLM-only, and far more reliable than SLM-only on complex cases
Note: Prices approximate as of early 2025. Check provider websites for current rates.
Recommendation: Use Option C (Hybrid) for production customer support.
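The Option C arithmetic can be reproduced in a few lines (same illustrative prices as above; the router classification itself costs one SLM pass):

```python
# Reproducing the Option C blended-cost arithmetic from this section.
# All prices are the illustrative figures above, not live rates.

SLM_REQUEST = 0.0002      # one SLM call
LLM_REQUEST = 0.009       # one LLM call
ROUTER_CALL = 0.0002      # SLM-based intent classification

faq_cost = ROUTER_CALL + SLM_REQUEST       # classify, then answer with the SLM
complex_cost = ROUTER_CALL + LLM_REQUEST   # classify, then escalate to the LLM

blended = 0.7 * faq_cost + 0.3 * complex_cost
annual = blended * 10_000 * 365            # 10,000 requests/day

print(f"blended ${blended:.4f}/request, ~${annual:,.0f}/year")
```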
Implementation Checklist
- Identify your use cases — What 80/20 split of requests do you handle?
- Measure baseline latency — How fast does the user expect responses?
- Calculate cost sensitivity — What’s your budget threshold?
- Test SLM performance — Does SLM handle 70% of your workload correctly?
- Design router logic — How do you classify requests into “SLM-friendly” vs “needs LLM”?
- Implement cost tracking — Log which model handled each request
- Monitor quality metrics — Track success rate, user satisfaction
- Plan escalation path — What happens when SLM fails? Retry with LLM or ask human?
- Set up A/B testing — Compare SLM-only vs Hybrid in production
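The cost-tracking item on the checklist can start as simply as logging which model handled each request. A minimal sketch, with illustrative class and method names rather than any specific library's API:

```python
# Sketch of per-request cost tracking: log which model handled each
# request so routing decisions can be audited later. Names are
# illustrative, not a specific library's API.

from collections import Counter

class CostTracker:
    def __init__(self):
        self.requests = []          # (model, cost) per request

    def log(self, model: str, cost: float) -> None:
        self.requests.append((model, cost))

    def summary(self) -> dict:
        """Per-model request counts and total spend so far."""
        counts = Counter(model for model, _ in self.requests)
        total = sum(cost for _, cost in self.requests)
        return {"counts": dict(counts), "total_cost": round(total, 4)}

tracker = CostTracker()
tracker.log("slm", 0.0004)
tracker.log("slm", 0.0004)
tracker.log("llm", 0.0092)
print(tracker.summary())
```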
Decision Tree: Which Model Should You Use?
START: "I need to process requests"
│
├─ "Do I need real-time response?" (< 500ms)
│ ├─ YES → "Is quality critical?"
│ │ ├─ YES (healthcare, finance) → 70B LLM + GPU
│ │ └─ NO → 7B SLM
│ └─ NO (reports, batch) → 70B+ LLM (quality > speed)
│
├─ "What's my compute budget?"
│ ├─ UNLIMITED → 175B+ LLM (best quality)
│ ├─ $10K+/month → 70B LLM
│ ├─ $1K-10K/month → Hybrid (SLM + LLM router)
│ └─ <$1K/month → SLM only
│
├─ "What's my deployment target?"
│ ├─ Cloud (AWS, GCP, Azure) → Any (scale as needed)
│ ├─ On-prem GPU → 7B-13B SLM
│ ├─ Local machine → <3B SLM or quantized
│  └─ Mobile/Edge → <1B SLM (quantized)
│
└─ "Is this a 'reasoning' or 'retrieval' task?"
├─ Reasoning (math, logic, planning) → 70B+ LLM
└─ Retrieval (search, extraction, classification) → 7B SLM
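The tree above can be encoded as a function for use in planning or tooling. This is a rough guide that mirrors the document's categories and thresholds, not a benchmark-backed recommendation; the parameter names and return strings are illustrative.

```python
# The decision tree above, encoded as a function. Categories and
# thresholds mirror the document; treat the output as a starting point.

def pick_model(realtime: bool, quality_critical: bool,
               monthly_budget_usd: float, target: str) -> str:
    """Return a rough model-class recommendation."""
    if target in ("mobile", "edge"):
        return "sub-1B SLM (quantized)"
    if target == "local":
        return "<3B SLM or quantized"
    if realtime and not quality_critical:
        return "7B SLM"
    if monthly_budget_usd < 1_000:
        return "SLM only"
    if monthly_budget_usd < 10_000:
        return "hybrid SLM + LLM router"
    return "70B+ LLM"

print(pick_model(realtime=True, quality_critical=False,
                 monthly_budget_usd=5_000, target="cloud"))
```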
Advanced: Mixture of Experts (MoE) Models
What It Is: A model architecture that routes tokens to specialized sub-networks (“experts”), enabling large effective model size with lower compute.
Examples:
- Mixtral 8x7B: 8 experts of ~7B each, activates 2 per token (~12.9B active of ~46.7B total)
- LLaMA 4: Meta’s first MoE architecture, with its own expert configuration
- Phi-4-MoE (emerging)
Key Property (Mixtral example): Mixtral 8x7B has ~46.7B total parameters but only activates ~12.9B per token, giving quality competitive with much larger dense models at a fraction of the per-token compute cost.
When to Consider:
- You need reasoning capacity of 70B models
- But inference latency/cost constraints force SLM
- MoE bridges this gap
Trade-off: Not all frameworks support MoE efficiently (requires CUDA kernel optimizations).
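The sparse-activation idea can be illustrated with a toy example. In a real MoE layer the gate is a learned linear projection over hidden states inside a transformer block; here each "expert" is just a function and the gate logits are given directly, purely to show that only the top-k experts do any work per token.

```python
# Toy illustration of MoE token routing: a gate scores experts per token,
# only the top-2 run, and their outputs are mixed by renormalized gate
# weight. Real MoE layers gate over learned logits inside a transformer.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token_value: float, experts, gate_logits, top_k: int = 2):
    """Run only the top_k experts and mix their outputs by gate weight."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i],
                    reverse=True)[:top_k]
    weights = softmax([gate_logits[i] for i in ranked])  # renormalize over top-k
    return sum(w * experts[i](token_value) for w, i in zip(weights, ranked))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
out = moe_forward(3.0, experts, gate_logits=[0.1, 2.0, 0.5, -1.0])
# Experts 1 and 2 win the gate, so only 2 of the 4 experts do any work.
print(round(out, 3))
```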
How This Connects to Other Docs
- Doc 02 (KV Cache Optimization): Once you choose your model, optimize its inference
- Doc 03 (Hugging Face): Find and download your chosen model
- Doc 21 (Model Fundamentals): Understand how weights and parameters work
- Doc 22 (Knowledge Transfer): Consider fine-tuning your SLM for your domain
- Doc 24 (Hardware): Choose hardware based on model size and inference latency
Reasoning Models: A Different Kind of Intelligence
Everything above compares models by size — how many parameters, how much RAM, how fast. But there is a second axis that matters just as much: what the model was trained to do.
Instruction Models
Instruction-tuned models are trained to follow directions and predict the next token. They excel at generating content in specified formats, following templates, and producing fluent text. When you give an instruction model a prompt, it responds immediately with its best prediction.
Examples: Qwen 2.5 Instruct, Llama 3.1 Instruct, Mistral Instruct, Phi-4.
Reasoning Models
Reasoning models are trained to chain through logic steps before answering. They think step-by-step internally, then respond. This explicit reasoning process makes them dramatically better at multi-step inference, logical chains, and problems where the answer depends on getting intermediate steps right.
Examples: DeepSeek-R1, QwQ, OpenAI o1/o3.
Key Insight
Model selection is not just about size — it is about what the model was trained to DO.
A 14B reasoning model outperforms a 14B instruction model on multi-step inference tasks. The training objective matters more than the parameter count for reasoning-heavy workloads.
Comparison
| | Instruction Model | Reasoning Model |
|---|---|---|
| Training approach | Follow directions, predict next token | Chain through logic steps before answering |
| Process | Prompt in, response out | Prompt in, think step-by-step, then respond |
| Strength | Content generation, format compliance | Multi-step inference, logical chains |
| Speed | Fast (immediate response) | Slower (thinks before answering) |
| Best for | Formatting, summarisation, content generation | Strategic analysis, multi-step reasoning, inference |
| Examples | Qwen Instruct, Llama Instruct, Mistral Instruct | DeepSeek-R1, QwQ, OpenAI o1/o3 |
When to Use Each
- Instruction model: Content generation, format compliance, summarisation, classification, tool-calling in agent loops where speed matters.
- Reasoning model: Strategic analysis, multi-step inference, problems where getting intermediate steps wrong corrupts the final answer, verification tasks.
The Trade-off
Reasoning models are slower because they think before answering. A 14B reasoning model might take ~173 seconds where a 14B instruction model takes ~25 seconds. But for tasks that require chained logic, the reasoning model produces dramatically better results — the extra time buys accuracy.
Updated Decision Rule
If your task requires multi-step logical reasoning, choose a reasoning model regardless of parameter count. A 14B reasoning model will outperform a 14B instruction model on inference tasks, and may match or exceed a 70B instruction model on reasoning-specific benchmarks.
Specialist vs Generalist: Depth vs Breadth
Even within reasoning models, there is a critical distinction: how focused is the problem?
A 14B reasoning model excels at deep, focused reasoning — one problem, one context, step-by-step logic. Give it a single customer support ticket with full history and ask “what is the root cause?” and it chains through five inference steps flawlessly.
The same model fails at broad reasoning — many problems, large context, cross-referencing. Give it fifty family summaries and ask “which families are connected?” and it produces generic advice. It may not even follow explicit instructions in the system prompt when the context is large.
The diagnostic: if a model can’t reliably follow a direct instruction in the system prompt (e.g., “always include X in your suggestions”), it won’t make subtle inferences across a large context. Instruction-following is the floor — if that fails, reasoning about the content won’t succeed either.
Practical rule: use local models (7B-14B) as specialists — one focused problem per call with rich, relevant context. For generalist reasoning across many inputs, route to a larger model or API (70B+, Claude, GPT-4). See the tiered inference pattern in Harness Architecture.
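The instruction-following floor check above lends itself to a simple escalation loop: call the local specialist, verify the required instruction was followed, and escalate only when it was not. A minimal sketch, where `local_model`, `frontier_model`, and the marker string are hypothetical placeholders:

```python
# Sketch of tiered inference with the "instruction-following floor" check:
# try the local specialist first, verify it followed a required system-prompt
# instruction, and escalate to a larger model if not.
# `local_model` and `frontier_model` are hypothetical placeholders.

REQUIRED_MARKER = "Suggested next step:"   # instruction the system prompt demands

def local_model(prompt: str) -> str:
    """Placeholder 7B-14B specialist; sometimes drops the required marker."""
    if "root cause" in prompt:
        return "Suggested next step: check the billing sync job."
    return "Generic advice without the marker."

def frontier_model(prompt: str) -> str:
    """Placeholder 70B+/API generalist."""
    return "Suggested next step: cross-reference the related accounts."

def tiered_answer(prompt: str):
    answer = local_model(prompt)
    if REQUIRED_MARKER in answer:   # floor check passed: trust the specialist
        return "local", answer
    return "frontier", frontier_model(prompt)   # escalate

tier, _ = tiered_answer("What is the root cause of this ticket?")
print(tier)
```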
Validation Checklist: Did You Choose Correctly?
Run this checklist after selecting your model(s):
- Performance: Test model on your actual use cases (not just examples)
  - Target: 90%+ success rate on representative sample
- Latency: Measure end-to-end response time (model + other systems)
  - Target: Meets user expectations (e.g., <500ms for chat, <5s for reports)
- Cost: Calculate monthly cost at your projected volume
  - Formula: (avg_input_tokens + avg_output_tokens) × price_per_token × daily_requests × 30
  - Target: Within budget (usually 5–15% of overall system cost)
- Quality: Run evaluation on 50+ test cases
  - Target: ≥90% success rate on FAQ cases, ≥80% on complex cases
- Fallback: If SLM fails, what’s the recovery plan?
  - Target: Defined escalation path (retry with LLM, ask human, or error gracefully)
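The cost formula from the checklist, implemented with input and output priced separately (the single price_per_token in the formula is a simplification; most providers charge different rates for the two):

```python
# The monthly-cost formula from the checklist above, with input and
# output tokens priced separately, as most providers do.

def monthly_cost(avg_in: int, avg_out: int, in_price_per_m: float,
                 out_price_per_m: float, daily_requests: int) -> float:
    per_request = (avg_in / 1e6 * in_price_per_m
                   + avg_out / 1e6 * out_price_per_m)
    return per_request * daily_requests * 30

# The document's LLM-only scenario works out to ~$2,700/month.
print(round(monthly_cost(500, 500, 3.00, 15.00, 10_000)))
```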
Summary: Foundation Models in Your Harness
Key Insight: Model selection is the foundational architectural decision. It determines:
- Inference cost (10–100× variation between SLM and LLM)
- Latency (200ms SLM vs 5s LLM)
- Capability (SLM handles 70% of tasks, LLM handles 100%)
Recommendation for Starting: Use hybrid approach
- SLM for fast, cheap decisions (tool routing, FAQs, triage)
- LLM for complex reasoning and final verification
- Router to decide which model handles each request
- This pattern is proven at scale (Claude Code, Claw-Code, production harnesses)
Next Steps:
- Identify your use cases and calculate cost sensitivity
- Test SLM performance on representative samples
- If SLM handles 70%+, design a router
- If SLM handles <70%, accept LLM cost or invest in fine-tuning
See Also
- Doc 02 (KV Cache Optimization) — Understand how to optimize models you’ve selected for longer contexts and faster inference
- Doc 03 (Hugging Face Ecosystem) — Find and evaluate specific models on Hugging Face; includes quantization options (AWQ, GPTQ) and performance trade-offs
- Doc 21 (Model Fundamentals) — Understand how weights, parameters, and transformers work; essential background for model selection
- Doc 22 (Knowledge Transfer Methods) — Learn distillation and fine-tuning to adapt models to your specific use case