
Hugging Face Ecosystem: Model Selection & Quantization

Finding, downloading, and running models from Hugging Face — AWQ, GPTQ, GGUF quantization, Apple Silicon guide, and harness integration.

What It Is

Hugging Face is the primary hub for open-source AI models, datasets, and related tools. As of 2025, it hosts:

  • 2 million+ public models
  • 500,000+ public datasets
  • 13 million users
  • A community shift from pure consumption to active creation (fine-tuning, adapters, benchmarks, applications)

Finding and Evaluating Models

Key Metrics on Hugging Face

  1. Downloads: Popular models typically have 1M+ downloads/month
  2. Library support: Built on transformers (HF), diffusers, or specialized frameworks
  3. Hardware requirements: CPU/GPU/Memory needed (clearly documented)
  4. License: MIT, Apache 2.0, OpenRAIL, etc. (check for restrictions)
  5. Task type: Text generation, classification, embeddings, vision, speech, etc.
  6. Benchmark scores: MMLU, HellaSwag, ARC, TruthfulQA (compare across models)
  7. Model card quality: Detailed documentation = better maintained model
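The download and license criteria above can be applied programmatically. A minimal sketch in pure Python, using hypothetical metadata dicts shaped like the records `huggingface_hub` returns (the model IDs and numbers here are made up for illustration):

```python
# Hypothetical metadata records, shaped like Hub model listings.
CANDIDATES = [
    {"id": "org/model-a", "downloads": 2_400_000, "license": "apache-2.0", "task": "text-generation"},
    {"id": "org/model-b", "downloads": 180_000, "license": "apache-2.0", "task": "text-generation"},
    {"id": "org/model-c", "downloads": 3_100_000, "license": "other", "task": "text-generation"},
]

def shortlist(models, min_downloads=1_000_000, allowed_licenses=("apache-2.0", "mit")):
    """Keep models that clear the popularity bar and carry a permissive license."""
    keep = [
        m for m in models
        if m["downloads"] >= min_downloads and m["license"] in allowed_licenses
    ]
    # Most-downloaded first, mirroring the Hub's default sort
    return sorted(keep, key=lambda m: m["downloads"], reverse=True)

for m in shortlist(CANDIDATES):
    print(m["id"], m["downloads"])
```

The same filter-then-sort shape works on live data if you swap `CANDIDATES` for the results of a Hub search.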

Finding the Right Model: Decision Tree

Are you building an agent?
├─ YES: Look for instruction-tuned models (Llama, Mistral, Phi family)
│       Size: 7B–13B SLM range (balance speed + capability)
│       Check: Does it support function calling/tool use?

├─ NO: What's your task?
    ├─ Classification/Tagging: 
    │   └─ Try: DistilBERT, ALBERT, or <1B models
    │       Preference: CPU inference possible

    ├─ Embeddings/RAG:
    │   └─ Try: nomic-embed-text, BGE, or OpenAI embeddings
    │       Check: Embedding dimension, max sequence length

    ├─ Code generation:
    │   └─ Try: CodeLlama (7B–34B), Codestral, DeepSeek-Coder
    │       Check: Language support, context window

    └─ General chat:
        └─ Try: Llama 2/3, Mistral, Claude (API)

Performance vs Size Trade-off (2025)

| Parameter Count | Use Cases | Speed | Cost |
|---|---|---|---|
| 100M–500M | Classification, tagging, fast inference | Excellent | Minimal |
| 1B–3B | Document processing, search tagging, edge | Very good | Low |
| 7B–13B | Coding, reasoning, general chat, agents | Good | Medium |
| 20B–34B | Complex multi-step reasoning | Fair | Higher |
| 70B+ | Advanced reasoning, specialized tasks | Slower | Very high |

Try Before Integrating: Hugging Face Spaces

  • Most popular models have free Spaces (hosted demos)
  • Test model behavior without integration cost
  • Check latency, output quality, edge cases
  • Community comments surface common issues

Model Download and First Inference (Zero to Running in 15 Minutes)

This section gets you from a fresh Python environment to running your first local inference. No GPU required for small models.

Step 1: Install Dependencies

# Create a virtual environment (recommended)
python3 -m venv hf-env
source hf-env/bin/activate

# Install core packages
pip install torch transformers accelerate

# For quantized models (optional, install when needed)
pip install autoawq          # AWQ quantization
pip install auto-gptq        # GPTQ quantization
pip install bitsandbytes     # 8-bit/4-bit via bitsandbytes

# For GGUF / llama.cpp (alternative path)
pip install llama-cpp-python

# Hugging Face CLI (for downloading models directly)
pip install huggingface-hub

Step 2: Authenticate (Required for Gated Models)

Some models (Llama, Gemma) require accepting a license on the Hugging Face website before download.

# Login to Hugging Face (creates ~/.cache/huggingface/token)
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens

Step 3: Download and Run First Inference

Option A: Small model, no GPU needed (recommended first test)

# first_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "microsoft/phi-2"  # 2.7B params, runs on CPU

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("Loading model (this takes 1-3 minutes on first run)...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # CPU needs float32
    device_map="cpu",
)

prompt = "Write a Python function that reverses a string:"
inputs = tokenizer(prompt, return_tensors="pt")

print("Generating...")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Expected output (first run):

Loading tokenizer...
Loading model (this takes 1-3 minutes on first run)...
Generating...
Write a Python function that reverses a string:

def reverse_string(s):
    return s[::-1]

# Example usage:
print(reverse_string("hello"))  # Output: "olleh"

Option B: With GPU (faster, larger models)

# first_inference_gpu.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,       # Half precision for GPU
    device_map="auto",               # Auto-distribute across GPUs
)

messages = [
    {"role": "user", "content": "Explain what a Python decorator is in 3 sentences."}
]

# Use the chat template (instruction-tuned models need this)
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Step 4: Common First-Run Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| `OutOfMemoryError: CUDA out of memory` | Model too large for GPU VRAM | Use quantized model or `device_map="cpu"` |
| `OSError: You are trying to access a gated repo` | Model requires license acceptance | Visit model page on HF, accept license, then `huggingface-cli login` |
| `ImportError: No module named 'torch'` | PyTorch not installed | `pip install torch` |
| `RuntimeError: Expected all tensors on same device` | Mixed CPU/GPU tensors | Use `device_map="auto"` or move inputs with `.to(model.device)` |
| `ValueError: Tokenizer class ... not found` | Old transformers version | `pip install --upgrade transformers` |
| `KeyError: 'model.safetensors'` | Incomplete download | Delete `~/.cache/huggingface/hub/models--<name>` and re-download |
| Model produces gibberish | Wrong prompt format | Use `tokenizer.apply_chat_template()` for instruction-tuned models |
| Extremely slow generation | Running FP32 on CPU for large model | Use quantized model (AWQ/GGUF) or switch to GPU with FP16 |

Using the Hugging Face Pipeline (Simpler API)

For quick experimentation, the pipeline API abstracts away tokenizer/model management:

from transformers import pipeline

# One-liner inference
generator = pipeline("text-generation", model="microsoft/phi-2", device_map="auto")
result = generator("Explain recursion:", max_new_tokens=150)
print(result[0]["generated_text"])

Pipelines are convenient for testing but give less control. Use the explicit tokenizer + model approach for production harness code.


Quantization Guide: Practical Code for Each Format

Quantization reduces model size and speeds inference by using lower-precision data types.
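The memory math behind that claim is simple: weight memory is roughly parameter count times bytes per parameter. A quick sketch comparing precisions for a 7B model (weights only; KV cache and activations add more on top):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB: 1e9 params * (bits/8) bytes ~= GB."""
    return params_billions * bits_per_param / 8

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"7B @ {label}: ~{weight_memory_gb(7, bits):.1f} GB")
# prints 28.0, 14.0, 7.0, and 3.5 GB respectively
```

This is why a 7B model that needs a 28GB GPU at full precision fits in under 4GB at 4-bit.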

Supported Formats and Use Cases

| Technique | Precision | Memory Savings | Speed Gain | Quality | Use Case |
|---|---|---|---|---|---|
| GQA (built-in) | KV cache sharing | 2-4x cache | 2-4x cache ops | Minimal | Models with GQA (Llama 3, Mistral) |
| AWQ | 4-bit | ~75% | 3-4x | Minimal loss | Production inference (best choice) |
| GPTQ | 4-bit | ~75% | 2-3x | Minimal loss | Smaller devices, older GPUs |
| GGUF | 2-8 bit | 50-87% | Varies | Varies by quant | llama.cpp, local CPU/Apple Silicon |
| 8-bit (bitsandbytes) | 8-bit | ~50% | 1-2x | Negligible | Medium hardware, research |
| ONNX optimized | FP32/FP16/int8 | 0-75% | 2-5x | Varies | Cross-platform deployment |

Quantization Decision Tree

What hardware are you running on?

├─ Apple Silicon (M1/M2/M3/M4)?
│   └─ Use GGUF format with llama.cpp or MLX
│       Why: Native Metal acceleration, no CUDA needed

├─ NVIDIA GPU with 8GB+ VRAM?
│   ├─ Need maximum speed?
│   │   └─ AWQ (fastest inference, best quality retention)
│   └─ Need widest compatibility?
│       └─ GPTQ (works on older CUDA GPUs)

├─ NVIDIA GPU with <8GB VRAM?
│   └─ GGUF with llama.cpp (offload layers to CPU as needed)

├─ CPU only (Intel/AMD)?
│   └─ GGUF with llama.cpp (optimized CPU kernels)

└─ Cloud/server deployment?
    └─ AWQ (best throughput per dollar)

AWQ Quantization (Best Speed and Quality)

AWQ (Activation-aware Weight Quantization) preserves the most important weights at higher precision, giving the best quality at 4-bit.

Using a pre-quantized AWQ model (easiest path):

# awq_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

# AWQ models use the same inference code as full-precision models
messages = [{"role": "user", "content": "What is gradient descent?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Quantizing your own model to AWQ:

# quantize_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.3"
quant_path = "./mistral-7b-awq"

# Load full-precision model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # Use GEMM for best speed on modern GPUs
}

# Quantize (takes 15-30 min on a single GPU for 7B model)
model.quantize(tokenizer, quant_config=quant_config)

# Save locally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Quantized model saved to {quant_path}")

AWQ memory requirements for quantization (you need the full model in memory first):

  • 7B model: ~16GB GPU VRAM to quantize
  • 13B model: ~28GB GPU VRAM to quantize
  • 70B model: ~140GB GPU VRAM (multi-GPU or cloud instance)

GPTQ Quantization (Most Compatible)

GPTQ has the widest hardware support and the largest library of pre-quantized models. Slightly slower than AWQ but works on older CUDA versions.

Using a pre-quantized GPTQ model:

# gptq_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Quantizing your own model to GPTQ:

# quantize_gptq.py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPTQ needs calibration data (representative text samples)
calibration_text = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a versatile programming language.",
    # Add 100-200 representative samples for best results
]

gptq_config = GPTQConfig(
    bits=4,
    dataset=calibration_text,
    tokenizer=tokenizer,
    group_size=128,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained("./mistral-7b-gptq")
tokenizer.save_pretrained("./mistral-7b-gptq")

GGUF Format (For llama.cpp and Local Inference)

GGUF is the standard format for llama.cpp. It supports CPU inference, partial GPU offloading, and runs natively on Apple Silicon. This is the go-to format for local development on a Mac.

Using GGUF with llama-cpp-python:

# gguf_inference.py
from llama_cpp import Llama

# Download a GGUF file first:
# huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
#   mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,        # Context window
    n_threads=8,       # CPU threads (match your core count)
    n_gpu_layers=0,    # Set to -1 for full GPU offload, 0 for CPU only
    verbose=False,
)

output = llm(
    "Q: What is the capital of France? A:",
    max_tokens=100,
    temperature=0.7,
    stop=["Q:", "\n\n"],
)

print(output["choices"][0]["text"])

Using GGUF with ollama (simplest path):

# Install ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (auto-downloads GGUF)
ollama pull mistral:7b-instruct-q4_K_M

# Run inference
ollama run mistral:7b-instruct-q4_K_M "What is recursion?"

# Use via API (for integration with harness)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct-q4_K_M",
  "prompt": "What is recursion?",
  "stream": false
}'

GGUF quantization levels explained:

| Suffix | Bits | Size (7B) | Quality | Use When |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.7GB | Poor | Extreme memory constraints only |
| Q3_K_M | 3-bit | ~3.3GB | Acceptable | Tight memory, acceptable quality loss |
| Q4_K_M | 4-bit | ~4.1GB | Good | Best default choice |
| Q5_K_M | 5-bit | ~4.8GB | Very good | When quality matters more than size |
| Q6_K | 6-bit | ~5.5GB | Excellent | Near full-precision quality |
| Q8_0 | 8-bit | ~7.2GB | Near-perfect | Maximum quality, still saves ~50% |
| F16 | 16-bit | ~14GB | Full | Baseline, no quantization |

The sweet spot is Q4_K_M: good quality, 70%+ size reduction, fast inference.
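You can estimate file sizes for other model sizes from the table. K-quants store some tensors at higher precision, so the effective bits per weight exceed the nominal bit count; the constants below are rough assumptions tuned to match the 7B column above, not values from the GGUF spec:

```python
# Rough effective bits per weight for common GGUF quants (assumed, approximate)
EFFECTIVE_BITS = {
    "Q2_K": 3.3, "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Approximate GGUF file size: params * effective bits / 8 bits-per-byte."""
    return params_billions * EFFECTIVE_BITS[quant] / 8

for q in ("Q4_K_M", "Q8_0", "F16"):
    print(f"7B {q}: ~{gguf_size_gb(7.2, q):.1f} GB")
```

For a 13B model at Q4_K_M this gives roughly 8GB, matching the Apple Silicon memory table later in this page.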

How to Choose Quantization Format

Start here:

├─ Running on Apple Silicon Mac?
│   └─ GGUF (Q4_K_M) via llama.cpp or ollama

├─ Running on NVIDIA GPU?
│   ├─ Want best speed + quality?
│   │   └─ AWQ (4-bit GEMM)
│   ├─ Older GPU (pre-Ampere)?
│   │   └─ GPTQ (4-bit, widest CUDA support)
│   └─ Just experimenting?
│       └─ bitsandbytes 8-bit (pip install, one-line config)

├─ CPU only (Intel/AMD server)?
│   └─ GGUF (Q4_K_M) via llama.cpp

└─ Deploying to production?
    ├─ Single GPU server → AWQ
    ├─ Multi-GPU server → AWQ with tensor parallelism
    └─ Edge/mobile → GGUF (smallest quant that meets quality bar)
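The tree above can be captured as a small lookup helper for harness config code. A sketch; the function name and hardware categories are my own, not a library API:

```python
def pick_quant_format(hardware: str, vram_gb: float = 0.0) -> str:
    """Map a hardware description to the format the decision tree recommends."""
    if hardware == "apple_silicon":
        return "GGUF Q4_K_M (llama.cpp or MLX)"   # native Metal, no CUDA
    if hardware == "nvidia":
        if vram_gb < 8:
            return "GGUF (llama.cpp with partial GPU offload)"
        return "AWQ 4-bit"                         # best speed/quality on 8GB+ GPUs
    if hardware == "nvidia_old":
        return "GPTQ 4-bit"                        # widest CUDA support (pre-Ampere)
    if hardware == "cpu":
        return "GGUF Q4_K_M (llama.cpp CPU kernels)"
    return "AWQ (server deployment default)"       # cloud/server fallback

print(pick_quant_format("nvidia", vram_gb=6))      # GGUF path: not enough VRAM for AWQ 7B
```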

Model Comparison Table (April 2026)

7B-Class Models Head-to-Head

These are the models most relevant for building an agent harness. All benchmarks from public evaluations as of April 2026.

| Model | Params | License | Context | MMLU | HumanEval | Speed (tok/s, AWQ) | VRAM (AWQ) | Best For |
|---|---|---|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7.3B | Apache 2.0 | 32K | 62.5 | 40.2 | ~95 | ~5GB | General agent, good all-rounder |
| Llama 3.1 8B | 8.0B | Llama 3.1 | 128K | 66.6 | 62.2 | ~85 | ~5.5GB | Long-context tasks, coding |
| Phi-4 ¹ | 14B | MIT | 16K | 78.0 | 67.8 | ~55 | ~8.5GB | Reasoning, math, coding |
| Gemma 2 9B | 9.2B | Gemma | 8K | 64.3 | 54.1 | ~75 | ~6GB | Multilingual, instruction following |
| Qwen 2.5 7B | 7.6B | Apache 2.0 | 128K | 68.4 | 61.5 | ~90 | ~5GB | Coding, multilingual, long context |
| DeepSeek-Coder-V2-Lite | 6.7B | MIT | 128K | 60.1 | 73.8 | ~100 | ~4.5GB | Code-only tasks |

¹ Note: Phi-4 is 14B parameters, nearly 2x larger than other models in this table, which partly explains its higher benchmark scores.

Interpretation

  • Best all-rounder for agent harness: Llama 3.1 8B or Qwen 2.5 7B (great benchmarks, 128K context, permissive license)
  • Best for coding: DeepSeek-Coder-V2-Lite (highest HumanEval, smallest size, fastest)
  • Best for reasoning/math: Phi-4 (highest MMLU by far, but larger at 14B and shorter context)
  • Best for constrained hardware: Mistral 7B (smallest effective size, Apache license, very fast)
  • Avoid: Gemma 2 9B for agent use (short 8K context is limiting)

70B-Class Models (For Verification / Complex Reasoning)

| Model | Params | License | Context | MMLU | Best For |
|---|---|---|---|---|---|
| Llama 3.1 70B | 70.6B | Llama 3.1 | 128K | 79.3 | Verification agent, complex reasoning |
| Qwen 2.5 72B | 72.7B | Apache 2.0 | 128K | 80.1 | Best open-source benchmark scores |
| Mixtral 8x22B | 141B (39B active) | Apache 2.0 | 65K | 77.8 | MoE efficiency, lower per-token cost |

These require multi-GPU setups or cloud instances for local inference. For a harness, route complex tasks to a 70B+ model via API (see Hybrid Routing section below).


Apple Silicon / Local Development

Running models locally on Apple Silicon (M1/M2/M3/M4) is one of the best local inference experiences available. The unified memory architecture means your GPU and CPU share RAM, so a 32GB M-series Mac can run models that would need a dedicated GPU on x86.

Memory Requirements Per Model Size

| Model Size | Quantization | RAM Needed | Runs On |
|---|---|---|---|
| 1-3B | Q4_K_M | ~2-3GB | Any M-series Mac |
| 7B | Q4_K_M | ~5-6GB | M1 8GB (tight), M1 16GB comfortable |
| 7B | Q8_0 | ~8-9GB | M1 16GB minimum |
| 13B | Q4_K_M | ~8-9GB | M1 16GB minimum |
| 13B | Q8_0 | ~15GB | M2/M3 16GB (tight), 32GB comfortable |
| 34B | Q4_K_M | ~20GB | M2/M3/M4 32GB minimum |
| 70B | Q4_K_M | ~40GB | M2/M3/M4 Pro/Max 64GB+ |
| 70B | Q8_0 | ~75GB | M3/M4 Max 96GB or Ultra 128GB |

Rule of thumb: Model size in GB (quantized) plus 2-3GB overhead for context and system. If your total RAM is less than that, expect heavy swapping and terrible performance.

Path 1: llama.cpp + GGUF (Most Battle-Tested)

This is the most battle-tested path for Apple Silicon: llama.cpp has first-class Metal support.

# Install llama-cpp-python with Metal support
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# Or use Homebrew
brew install llama.cpp

# Download a GGUF model
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models

# apple_silicon_inference.py
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,    # -1 = offload ALL layers to Metal GPU
    n_threads=8,        # Set to your chip's performance-core count (M1=4, M2=4, M3 Pro=6, M4 Pro=10)
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers."},
    ],
    max_tokens=300,
    temperature=0.7,
)

print(output["choices"][0]["message"]["content"])

Expected performance (Mistral 7B Q4_K_M):

  • M1 8GB: ~15-20 tokens/sec
  • M2 Pro 16GB: ~25-35 tokens/sec
  • M3 Pro 36GB: ~35-45 tokens/sec
  • M4 Pro 48GB: ~50-65 tokens/sec

Path 2: MLX Framework (Apple’s Native Option)

MLX is Apple’s machine learning framework, built specifically for Apple Silicon. It gives the best raw performance on M-series chips but has a smaller model ecosystem.

# Install MLX and mlx-lm (model loading utility)
pip install mlx mlx-lm

# mlx_inference.py
from mlx_lm import load, generate

# MLX models are available on Hugging Face with the "mlx" tag
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Explain what a hash table is in simple terms."
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=200,
    temp=0.7,
)
print(response)

MLX vs llama.cpp on Apple Silicon:

|  | MLX | llama.cpp |
|---|---|---|
| Speed | 10-20% faster | Slightly slower |
| Model availability | Limited (mlx-community) | Huge (all GGUF models) |
| API | Pythonic, clean | C++ with Python bindings |
| Maturity | Newer, evolving | Battle-tested |
| Memory efficiency | Excellent | Excellent |
| Community | Growing | Very large |

Recommendation: Use llama.cpp/GGUF for the widest model selection. Use MLX if you want maximum speed on Apple Silicon and the model you need is available in MLX format.

Path 3: Ollama (Simplest)

Ollama wraps llama.cpp with a clean CLI and REST API. Easiest way to get started.

# Install on macOS
brew install ollama

# Start the server
ollama serve

# In another terminal, pull and run a model
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M

# Use via Python
pip install ollama

# ollama_inference.py
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",
    messages=[
        {"role": "user", "content": "What is the difference between a stack and a queue?"},
    ],
)
print(response["message"]["content"])

Apple Silicon Tips

  1. Check your RAM before downloading: Run system_profiler SPHardwareDataType | grep Memory to see total RAM
  2. Close memory-hungry apps: Safari, Chrome, and Xcode each eat 2-8GB. Close them before running large models
  3. Monitor with Activity Monitor: Watch “Memory Pressure” gauge. Green = fine. Yellow = swapping (slow). Red = stop and use a smaller model
  4. Use Q4_K_M as default: Best balance of quality and size for Apple Silicon
  5. Set n_gpu_layers=-1: Always offload all layers to Metal GPU. Mixed CPU/GPU is slower than all-GPU on Apple Silicon
  6. Thermal throttling: Sustained inference on a MacBook will throttle. Desktop Macs (Mac Mini, Mac Studio, Mac Pro) sustain full speed

Integration with Your Harness

This section shows how to wire HF models into the agent harness you are building.

Loading a Model in harness.py

# harness.py — model loading module

class ModelProvider:
    """Abstract interface for model providers (local and API)."""
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        raise NotImplementedError


class LocalLlamaProvider(ModelProvider):
    """Local inference via llama.cpp (GGUF models)."""
    
    def __init__(self, model_path: str, n_ctx: int = 4096, n_gpu_layers: int = -1):
        from llama_cpp import Llama
        self.llm = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,
            verbose=False,
        )
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        output = self.llm.create_chat_completion(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
        )
        return output["choices"][0]["message"]["content"]


class OllamaProvider(ModelProvider):
    """Local inference via Ollama REST API."""
    
    def __init__(self, model: str = "llama3.1:8b-instruct-q4_K_M"):
        self.model = model
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        import ollama
        response = ollama.chat(
            model=self.model,
            messages=messages,
            options={"num_predict": max_tokens, "temperature": temperature},
        )
        return response["message"]["content"]


class AnthropicProvider(ModelProvider):
    """Cloud inference via Claude API."""
    
    # Note: Claude model IDs include date suffixes that change with releases
    def __init__(self, model: str = "claude-sonnet-4", api_key: str | None = None):
        import anthropic
        self.client = anthropic.Anthropic(api_key=api_key)  # Uses ANTHROPIC_API_KEY env var if None
        self.model = model
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        # Convert from OpenAI-style messages to Anthropic format
        system = ""
        user_messages = []
        for msg in messages:
            if msg["role"] == "system":
                system = msg["content"]
            else:
                user_messages.append(msg)
        
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            system=system,
            messages=user_messages,
            temperature=temperature,
        )
        return response.content[0].text

Switching Between Local and API Models

# config.py — model selection

import os

def get_provider(mode: str = "auto") -> ModelProvider:
    """
    Get the right model provider based on mode.
    
    Modes:
        "local"  — Always use local GGUF model
        "api"    — Always use cloud API
        "auto"   — Local if model file exists, else API
    """
    local_model_path = os.environ.get(
        "LOCAL_MODEL_PATH",
        "./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
    )
    
    if mode == "local":
        return LocalLlamaProvider(local_model_path)
    
    if mode == "api":
        return AnthropicProvider()
    
    # Auto: prefer local, fall back to API
    if os.path.exists(local_model_path):
        return LocalLlamaProvider(local_model_path)
    
    if os.environ.get("ANTHROPIC_API_KEY"):
        return AnthropicProvider()
    
    raise RuntimeError(
        "No model available. Either download a GGUF model to ./models/ "
        "or set ANTHROPIC_API_KEY environment variable."
    )

Hybrid Routing: Local for Cheap, API for Complex

The most cost-effective pattern: use a fast local model for simple tasks (classification, extraction, formatting) and route complex reasoning to a powerful cloud model.

# router.py — task complexity routing

class HybridRouter:
    """Routes tasks to local or cloud model based on estimated complexity."""
    
    def __init__(self):
        self.local = LocalLlamaProvider("./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf")
        self.cloud = AnthropicProvider(model="claude-sonnet-4")
    
    def route(self, messages: list[dict], task_type: str = "auto") -> str:
        """
        Route to appropriate provider.
        
        task_type:
            "simple"   — formatting, extraction, classification → local
            "complex"  — multi-step reasoning, code generation, analysis → cloud
            "auto"     — estimate complexity from prompt length and keywords
        """
        if task_type == "simple":
            return self.local.generate(messages)
        
        if task_type == "complex":
            return self.cloud.generate(messages)
        
        # Auto-detect complexity
        last_message = messages[-1]["content"]
        
        # Heuristics for routing
        complex_signals = [
            len(last_message) > 2000,                    # Long prompts usually need more reasoning
            "step by step" in last_message.lower(),      # Explicit reasoning request
            "analyze" in last_message.lower(),            # Analysis tasks
            "compare" in last_message.lower(),            # Comparison tasks
            "debug" in last_message.lower(),              # Debugging needs strong reasoning
            last_message.count("\n") > 20,                # Multi-part problems
        ]
        
        if sum(complex_signals) >= 2:
            return self.cloud.generate(messages, max_tokens=2000)
        
        return self.local.generate(messages)


# Usage in agent loop
router = HybridRouter()

# Simple task — routed to local Mistral 7B (no API cost, low latency)
result = router.route(
    [{"role": "user", "content": "Extract the email from: Contact us at [email protected]"}],
    task_type="simple",
)

# Complex task — routed to Claude (paid, but much better quality)
result = router.route(
    [{"role": "user", "content": "Analyze this code for security vulnerabilities and suggest fixes:\n..."}],
    task_type="complex",
)

Cost comparison with hybrid routing:

| Approach | Cost per 1M tokens | When |
|---|---|---|
| Always cloud (Claude Sonnet) | ~$3.00 input / ~$15.00 output | Every task |
| Always local (Mistral 7B) | $0.00 (electricity only) | Every task |
| Hybrid (80% local, 20% cloud) | ~$0.60 input / ~$3.00 output | Smart routing |

Note: Prices approximate as of early 2025. Check provider websites for current rates.

The hybrid approach can cut spend by 80-90% compared to always using a cloud API, provided the majority of requests can be served by the local model. Quality loss is minimal because simple tasks do not need a powerful model.
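The arithmetic behind the hybrid row in the table is just a weighted average, treating local inference as free:

```python
def blended_cost(local_fraction: float, cloud_price_per_m: float) -> float:
    """Blended $/1M tokens when the local share of traffic costs $0."""
    return (1 - local_fraction) * cloud_price_per_m

# 80% of traffic local, cloud pricing from the table above (approximate early-2025 rates)
print(round(blended_cost(0.80, 3.00), 2))   # input side:  0.6
print(round(blended_cost(0.80, 15.00), 2))  # output side: 3.0
```

Raising the local fraction to 95% drops the blended input cost to about $0.15/1M tokens, which is why good routing heuristics pay for themselves quickly.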


Common Mistakes

These are the mistakes that waste the most time when starting with HF models. Each one is something real developers hit.

Mistake 1: “I downloaded a 70B model on my 16GB laptop”

The problem: A 70B parameter model in FP16 needs ~140GB of RAM. Even quantized to Q4_K_M, it needs ~40GB. A 16GB laptop cannot run it.

The fix: Match model size to your hardware.

# Check available memory before loading
import psutil

available_gb = psutil.virtual_memory().available / (1024 ** 3)
print(f"Available RAM: {available_gb:.1f} GB")

# Rule of thumb for GGUF Q4_K_M:
# Model RAM needed ≈ (params_billions * 0.6) + 2 GB overhead
# 7B  → ~6 GB
# 13B → ~10 GB
# 34B → ~22 GB
# 70B → ~44 GB

model_params_b = 7  # Change this
estimated_ram = (model_params_b * 0.6) + 2
if available_gb < estimated_ram:
    print(f"WARNING: Need ~{estimated_ram:.0f}GB but only {available_gb:.1f}GB available")
    print(f"Use a smaller model or close other applications")

Mistake 2: “I’m using FP32 instead of quantized”

The problem: Loading a model in full FP32 precision uses 4x more memory than necessary and runs 3-4x slower. There is almost never a reason to use FP32 for inference.

The wrong way:

# DON'T do this — loads in FP32, wastes memory
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

The right way:

# DO this — use FP16 minimum on GPU
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# BETTER — use a pre-quantized AWQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.3-AWQ",
    device_map="auto",
)

# BEST for Mac — use GGUF
llm = Llama(model_path="./models/mistral-7b.Q4_K_M.gguf", n_gpu_layers=-1)

Mistake 3: “I forgot to set device_map=‘auto’”

The problem: Without device_map, the model loads entirely on CPU even if you have a GPU. Inference is 10-50x slower.

The fix: Always pass device_map="auto" when loading with transformers:

# This automatically puts the model on the best available device
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # <-- never forget this
    torch_dtype=torch.float16,
)

Mistake 4: “I’m not using the chat template”

The problem: Instruction-tuned models expect a specific prompt format. Without it, they produce worse output or gibberish.

The wrong way:

# DON'T do this with instruction-tuned models
inputs = tokenizer("What is Python?", return_tensors="pt")

The right way:

# DO this — use the model's chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)

Mistake 5: “I’m downloading models to the wrong place”

The problem: HF models cache in ~/.cache/huggingface/hub/ by default. On macOS, this is on your boot drive which may be small. A single 7B model is 4-15GB.

The fix: Set a custom cache directory:

# In your shell profile (.zshrc / .bashrc)
export HF_HOME="/Volumes/ExternalDrive/huggingface"
# Or for a project-specific cache (note: TRANSFORMERS_CACHE is deprecated in
# newer transformers releases; prefer HF_HOME for new setups)
export TRANSFORMERS_CACHE="./models/cache"

Mistake 6: “My model is slow because I’m re-loading it every request”

The problem: Loading a model takes 5-30 seconds. If your harness loads the model on every inference call, it is unusable.

The fix: Load once, reuse the instance:

# WRONG — loads model every call
def generate(prompt):
    model = AutoModelForCausalLM.from_pretrained(...)  # 10 sec each time!
    # ...

# RIGHT — load once at startup
class Harness:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(...)  # Load once
    
    def generate(self, prompt):
        # Use self.model — already loaded
        pass
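In a plain-function codebase (no harness class), memoizing the loader enforces the load-once rule just as well. A sketch with a stand-in loader; the `lru_cache` pattern is standard library, while `get_model` and the string return value are hypothetical stand-ins for a real `from_pretrained` call:

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}  # instrumentation to show the expensive load runs once

@lru_cache(maxsize=None)
def get_model(model_name: str):
    """The expensive load happens only on the first call per model name."""
    LOAD_COUNT["n"] += 1
    return f"<model {model_name}>"  # stand-in for AutoModelForCausalLM.from_pretrained(...)

def generate(prompt: str) -> str:
    model = get_model("microsoft/phi-2")  # cached after the first call
    return f"{model} generated for: {prompt}"

generate("hello")
generate("world")
print(LOAD_COUNT["n"])  # 1 — the model was loaded exactly once
```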

Decision Tree: “Which Model Should I Download?”

Follow this tree from top to bottom. The first match is your answer.

START: What is your primary use case?

├─ Building an agent harness (tool use, ReAct loop)?
│   ├─ Mac with 16GB+ RAM?
│   │   └─ Llama 3.1 8B, GGUF Q4_K_M, via llama.cpp
│   │       Download: a community GGUF build, e.g. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (Q4_K_M)
│   │
│   ├─ Mac with 8GB RAM?
│   │   └─ Phi-2 (2.7B) or Qwen 2.5 3B, GGUF Q4_K_M
│   │       Download: TheBloke/phi-2-GGUF (Q4_K_M)
│   │
│   ├─ NVIDIA GPU with 8GB+ VRAM?
│   │   └─ Mistral 7B Instruct, AWQ 4-bit
│   │       Download: TheBloke/Mistral-7B-Instruct-v0.2-AWQ
│   │
│   └─ No local hardware / cloud only?
│       └─ Use API: Claude Haiku (cheap) + Claude Sonnet (complex)
│           No download needed. Set ANTHROPIC_API_KEY.

├─ Code generation / code review?
│   └─ DeepSeek-Coder-V2-Lite (6.7B) or CodeLlama 7B
│       GGUF for Mac, AWQ for NVIDIA

├─ Embeddings / RAG?
│   └─ nomic-embed-text (137M)
│       Runs on CPU, no GPU needed
│       pip install sentence-transformers

├─ Complex reasoning / verification agent?
│   ├─ Have 64GB+ RAM or multi-GPU?
│   │   └─ Llama 3.1 70B, GGUF Q4_K_M or AWQ
│   │
│   └─ Less hardware?
│       └─ Use API: Claude Sonnet or Opus for verification
│           Hybrid route: local 7B for simple, API for complex

├─ Just learning / experimenting?
│   └─ ollama pull llama3.1:8b
│       Simplest path, works everywhere, good enough for learning

└─ Not sure?
    └─ Start with: ollama pull mistral:7b-instruct-q4_K_M
        It is fast, permissively licensed, and good quality.
        You can always switch later.
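The tree above can be encoded directly as a routing function, which is handy if your harness picks a model at startup. The model names mirror the tree; treat the thresholds as rules of thumb rather than hard limits, and the check order here (GPU before RAM) as one reasonable choice.

```python
# Sketch of the decision tree as code. Thresholds are rules of thumb.
def pick_model(use_case: str, ram_gb: int = 0, nvidia_vram_gb: int = 0) -> str:
    if use_case == "agent":
        if nvidia_vram_gb >= 8:
            return "Mistral 7B Instruct (AWQ 4-bit)"
        if ram_gb >= 16:
            return "Llama 3.1 8B (GGUF Q4_K_M)"
        if ram_gb >= 8:
            return "Phi-2 or Qwen 2.5 3B (GGUF Q4_K_M)"
        return "API: Claude Haiku + Sonnet"
    if use_case == "code":
        return "DeepSeek-Coder-V2-Lite or CodeLlama 7B"
    if use_case == "embeddings":
        return "nomic-embed-text"
    if use_case == "reasoning":
        return "Llama 3.1 70B" if ram_gb >= 64 else "API: Claude Sonnet/Opus"
    return "ollama pull mistral:7b-instruct-q4_K_M"  # the "not sure" default

print(pick_model("agent", ram_gb=16))
```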

Quick Reference: Model Download Commands

# GGUF models (for llama.cpp / Apple Silicon / CPU)
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models

# AWQ models (for NVIDIA GPUs)
# These download automatically when you load with from_pretrained()
# Model name: TheBloke/Mistral-7B-Instruct-v0.2-AWQ

# Ollama (simplest, manages downloads for you)
ollama pull mistral:7b-instruct-q4_K_M
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull deepseek-coder:6.7b-instruct-q4_K_M

What’s Changing

  • Model size distribution: Mean model size grew from 827M (2023) to 20.8B (2025)

    • Driven by quantization making large models practical
    • Mixture-of-Experts (MoE) enables efficiency at scale
  • Community focus shift: From training to adapting, quantizing, redistributing

    • Fine-tuning adapters (LoRA) more popular than full fine-tune
    • Quantized variants dominate downloads
    • Speed of deployment matters more than raw capability
  • Dominant pattern: Use base model + quantize + optionally fine-tune adapters

    • Example: Start with Llama 3 -> quantize to AWQ -> add a LoRA adapter for domain knowledge
    • Cost: ~5% of training a model from scratch

Practical Selection for Your Harness

Agent / Tool-Use Harness

Recommended models (as of April 2026):

  • 7B SLM: Llama 3.1 8B, Mistral 7B, Phi-4
  • Quantization: AWQ 4-bit
  • Context: 16K-32K tokens
  • Speed: Fast enough for real-time loops
  • Cost: <$0.01 per million input tokens

Long-Context Harness

Enable:

  • GQA models with INT8/INT4 KV cache quantization
  • For maximum compression: TurboQuant (3-bit, 6x memory reduction, zero accuracy loss — see Doc 02)
  • 32K-100K+ context window models
  • Example: Llama 3.1 (128K context, GQA enabled) + INT8 KV cache
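The memory arithmetic behind KV cache quantization is worth seeing once. The sketch below uses Llama 3.1 8B configuration values (32 layers, 8 KV heads via GQA, head dim 128) — stated to the best of my knowledge; verify against the model's config.json before relying on them.

```python
# KV cache memory: 2 tensors (K and V) per layer, per KV head, per head dim.
# GQA already shrinks this (8 KV heads instead of 32 query heads); cache
# quantization shrinks it further via bytes_per_val.
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
    return seq_len * per_token / 2**30

print(kv_cache_gib(128_000))                   # FP16 cache at full 128K context
print(kv_cache_gib(128_000, bytes_per_val=1))  # INT8 halves it
```

At full 128K context the FP16 cache alone is in the mid-teens of GiB — comparable to the quantized weights themselves — which is why long-context harnesses quantize the cache, not just the weights.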

Specialized Tasks

  • Code: CodeLlama-13B (quantized) or DeepSeek-Coder
  • Math: Qwen Math specialist, DeepSeek-Math
  • Embeddings: nomic-embed-text (137M parameters; more cost-effective than OpenAI embeddings)
  • RAG: Combine Llama 7B + embedding model + vector store (Qdrant, ChromaDB)

Implementation Checklist

  • Identify your use case (agent, classification, embeddings, code)
  • Browse Hugging Face for top models in category
  • Check model card for license, training data, limitations
  • Try model on Spaces (if available)
  • Choose quantization (start with AWQ)
  • Download and benchmark locally before integration
  • Track model version in requirements.txt / lock file
  • Plan for model updates (check for new releases quarterly)

Reasoning Model Recommendations

The models above are all instruction-tuned. For tasks requiring multi-step logical reasoning (strategic analysis, inference chains, verification), reasoning models offer dramatically better quality. See Doc 01 for the full reasoning vs instruction model comparison.

| Model | Params | Type | 4-bit Size | RAM Needed | Runs On | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-14B | 14B | Reasoning | ~9GB | 32GB | Apple Silicon M2/M3/M4 32GB+ | Strategic analysis, multi-step reasoning |
| QwQ-32B | 32B | Reasoning | ~18GB | 48GB+ | M4 Max 48GB+ or multi-GPU | Complex reasoning, higher quality than 14B |
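The size figures above follow from a simple rule of thumb: weight bytes ≈ parameters × bits per weight / 8. The effective bits-per-weight values below are assumptions — Q4_K_M averages closer to ~5 bits than exactly 4 because some tensors stay at higher precision.

```python
# Back-of-envelope quantized weight size. Bits-per-weight values are
# approximations; real files also carry embeddings and metadata.
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # params in billions -> GB

print(quantized_size_gb(14, 5.0))   # close to the ~9GB in the 14B row
print(quantized_size_gb(32, 4.5))   # matches the ~18GB in the QwQ-32B row
```

Add a few GB of headroom for the KV cache and OS before deciding a model "fits" — that margin is what rules QwQ-32B out on 32GB machines.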

Practical Guidance

  • For 32GB Apple Silicon: DeepSeek-R1-Distill-Qwen-14B is the clear choice. It fits comfortably in memory at 4-bit quantization and delivers reasoning quality far beyond instruction models of the same size.
  • QwQ-32B will not fit on a 32GB Mac — at ~18GB quantized plus context overhead, it needs 48GB+ RAM. Only viable on M4 Max 48GB+ or dedicated GPU setups.
  • Speed trade-off: Reasoning models are significantly slower. Expect ~173 seconds for a complex reasoning task on a 14B reasoning model versus ~25 seconds on a 14B instruction model. The quality difference on reasoning tasks justifies the wait.
  • When to use: For strategic analysis, multi-step inference, and verification tasks, use DeepSeek-R1 over instruction models of the same size. For content generation, formatting, and agent tool loops where speed matters, stick with instruction models.
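The when-to-use guidance reduces to a small routing rule: reasoning-heavy task types go to the reasoning model, everything latency-sensitive goes to the instruction model. The task labels and model tags below are illustrative placeholders, not a fixed taxonomy.

```python
# Minimal router for the guidance above. Task labels are illustrative.
REASONING_TASKS = {"strategic_analysis", "multi_step_inference", "verification"}

def route(task_type: str) -> str:
    if task_type in REASONING_TASKS:
        return "deepseek-r1:14b"       # slow (~173s on hard tasks) but much stronger reasoning
    return "qwen2.5:7b-instruct"       # fast path for generation and tool loops

print(route("verification"))
```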

Download Commands

# DeepSeek-R1-Distill-Qwen-14B via Ollama
ollama pull deepseek-r1:14b

# Or download GGUF for llama.cpp
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf --local-dir ./models

Choosing a Local Inference Runtime

Three options dominate local model inference on Apple Silicon. Each makes different trade-offs between ease of use, structured output guarantees, and raw speed.

| | Ollama | llama-cpp-python | MLX |
| --- | --- | --- | --- |
| Built for | Interactive chat | Cross-platform inference | Apple Silicon specifically |
| Integration | HTTP server (localhost:11434) | In-process Python calls | In-process Python-native |
| Setup | `brew install ollama` then `ollama pull model` | `CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python` | `pip install mlx-lm` |
| JSON enforcement | `format: json` flag (request, not guarantee) | GBNF grammar constraints (token-level guarantee) | Validate + retry |
| Speed on Apple Silicon | Fast (wraps llama.cpp) | Fast (Metal bolt-on) | Fastest (native Apple framework) |
| Model format | GGUF (auto-downloaded) | Any GGUF file | MLX format (mlx-community on Hugging Face) |
| Best for | Interactive use, quick prototyping | Production agents needing guaranteed JSON, cross-platform | Apple Silicon production agents prioritising speed |

When to Use Each

  • Ollama: You want to try a model in 2 minutes. Great for exploration, not for automated agents (HTTP overhead on every call).
  • llama-cpp-python: Your agent needs mechanically guaranteed JSON output (GBNF grammar constraints force valid output at the token level). Or you need cross-platform support (Linux/Mac/Windows).
  • MLX: You’re running on Apple Silicon and speed matters more than grammar constraints. Apple’s own framework optimised for unified memory. Pair with validate-and-retry for JSON reliability.

Quick Code Examples

Ollama (HTTP-based, simplest setup):

import ollama

response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "What is a hash table?"}],
)
print(response["message"]["content"])

llama-cpp-python (in-process, with GBNF grammar for guaranteed JSON):

from llama_cpp import Llama

llm = Llama(model_path="./models/qwen2.5-7b.Q4_K_M.gguf", n_gpu_layers=-1, verbose=False)
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is a hash table?"}],
    max_tokens=200,
    response_format={"type": "json_object"},  # or use grammar= for GBNF
)
print(output["choices"][0]["message"]["content"])

MLX (Apple Silicon native, fastest):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
response = generate(model, tokenizer, prompt="What is a hash table?", max_tokens=200)
print(response)
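Since MLX has no grammar constraints, the validate-and-retry pattern mentioned above carries the JSON reliability load. Here is a minimal sketch; generate_fn stands in for a wrapped mlx_lm generate() call, and the retry prompt wording is just one reasonable choice.

```python
import json

# Validate-and-retry: parse the model's output as JSON, re-prompt with the
# parse error on failure, give up after max_attempts.
def generate_json(generate_fn, prompt, max_attempts=3):
    last_err = None
    for attempt in range(max_attempts):
        raw = generate_fn(prompt if attempt == 0
                          else f"{prompt}\nReturn ONLY valid JSON. Previous error: {last_err}")
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_err = str(e)
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_err}")

# Usage with a fake model that fails once, then behaves:
replies = iter(["not json", '{"answer": 42}'])
result = generate_json(lambda p: next(replies), "Give the answer as JSON.")
print(result)
```

In practice a 7B instruction model produces valid JSON most of the time, so one retry usually suffices; budget the retry latency into your harness timeouts.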

Validation Checklist

How do you know you got this right?

Performance Checks

  • Model loads without CUDA/memory errors on your target hardware
  • First inference completes in <10 seconds (local) or <2 seconds (API)
  • Quantized model (AWQ 4-bit) runs at 50+ tokens/sec on target GPU
  • Memory usage matches documentation (no unexpected OOM mid-session)
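A quick way to check the throughput item above is to time a generation call and divide tokens produced by wall time. The generate and count_tokens arguments below are placeholders for your runtime's equivalents (e.g. a wrapped llm.create_chat_completion or mlx_lm generate).

```python
import time

# Minimal tokens/sec measurement. Pass your runtime's generate function and
# a token counter (len of tokenizer.encode() in real use).
def tokens_per_sec(generate, count_tokens, prompt):
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    return count_tokens(text) / elapsed

# Stub demonstration — a real harness passes the model call instead:
def fake_generate(prompt):
    time.sleep(0.05)
    return "one two three four five"

tps = tokens_per_sec(fake_generate, lambda t: len(t.split()), "hello")
print(f"{tps:.0f} tokens/sec")
```

Measure on a prompt representative of your workload; short prompts understate prefill cost and overstate steady-state throughput.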

Implementation Checks

  • Downloaded and pinned exact model version in lock file
  • Model card checked: license acceptable for your use case
  • Tested on 2+ example inputs from your domain
  • Ran 5+ inference samples to verify consistent output
  • Know which quantization variant you’re using (AWQ/GPTQ/GGUF)
  • Benchmarked against at least one alternative model in the same category
  • Have a fallback model listed if primary one becomes unavailable

Integration Checks

  • Model integrates cleanly with harness (correct API for transformers/ollama/etc)
  • Tool integration works: web search, code execution, file ops don’t break with this model
  • Token limits checked: model’s context window matches your use case
  • Understood quantization impact: quality loss acceptable? latency gain worth it?

Common Failure Modes

  • Model loads but inference errors: Wrong tokenizer/processor for model
  • Out of memory on quantized model: Quantization still too large for GPU; try smaller base model
  • Inference super slow: Quantization not actually enabled; check model loading code
  • Model produces gibberish: Wrong prompt format for instruction-tuned model; check model card
  • “Model not found” errors: Incorrect HF repo name or no internet access during download

Sign-Off Criteria

  • Model runs end-to-end in your harness on a real task
  • Performance metrics (latency, memory) match or exceed benchmarks
  • Quality on test cases acceptable (measured by metric in doc 16)
  • Quantization validated: speed gain vs quality trade-off acceptable
  • Model version pinned and documented in your project

See Also

  • Doc 02 (KV Cache Optimization): Understand how quantization affects attention computation
  • Doc 06 (Harness Architecture): Integrate model selection into component 1 (LLM/AI Model)
  • Doc 08 (Claw-Code Python): Reference implementation with multi-provider model support
  • Doc 16 (Evaluation & Benchmarking): Measure quality metrics to validate model choice