
Hugging Face Ecosystem: Model Selection & Quantization

Finding, downloading, and running models from Hugging Face — AWQ, GPTQ, GGUF quantization, Apple Silicon guide, and harness integration.

What It Is

Hugging Face is the primary hub for open-source AI models, datasets, and related tools. As of 2025, it hosts:

  • 2 million+ public models
  • 500,000+ public datasets
  • 13 million users
  • A community shift from pure consumption to active creation (fine-tuning, adapters, benchmarks, applications)

Finding and Evaluating Models

Key Metrics on Hugging Face

  1. Downloads: Popular models typically have 1M+ downloads/month
  2. Library support: Built on transformers (HF), diffusers, or specialized frameworks
  3. Hardware requirements: CPU/GPU/Memory needed (clearly documented)
  4. License: MIT, Apache 2.0, OpenRAIL, etc. (check for restrictions)
  5. Task type: Text generation, classification, embeddings, vision, speech, etc.
  6. Benchmark scores: MMLU, HellaSwag, ARC, TruthfulQA (compare across models)
  7. Model card quality: Detailed documentation = better maintained model
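The download and license criteria above can be applied programmatically. A minimal sketch in pure Python, using hypothetical metadata dicts shaped like the records `huggingface_hub` returns (the model IDs and numbers here are made up for illustration):

```python
# Hypothetical metadata records, shaped like Hub model listings.
CANDIDATES = [
    {"id": "org/model-a", "downloads": 2_400_000, "license": "apache-2.0", "task": "text-generation"},
    {"id": "org/model-b", "downloads": 180_000, "license": "apache-2.0", "task": "text-generation"},
    {"id": "org/model-c", "downloads": 3_100_000, "license": "other", "task": "text-generation"},
]

def shortlist(models, min_downloads=1_000_000, allowed_licenses=("apache-2.0", "mit")):
    """Keep models that clear the popularity bar and carry a permissive license."""
    keep = [
        m for m in models
        if m["downloads"] >= min_downloads and m["license"] in allowed_licenses
    ]
    # Most-downloaded first, mirroring the Hub's default sort
    return sorted(keep, key=lambda m: m["downloads"], reverse=True)

for m in shortlist(CANDIDATES):
    print(m["id"], m["downloads"])
```

The same filter-then-sort shape works on live data if you swap `CANDIDATES` for the results of a Hub search.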

Finding the Right Model: Decision Tree

Are you building an agent?
├─ YES: Look for instruction-tuned models (Llama, Mistral, Phi family)
│       Size: 7B–13B SLM range (balance speed + capability)
│       Check: Does it support function calling/tool use?

├─ NO: What's your task?
    ├─ Classification/Tagging: 
    │   └─ Try: DistilBERT, ALBERT, or <1B models
    │       Preference: CPU inference possible

    ├─ Embeddings/RAG:
    │   └─ Try: nomic-embed-text, BGE, or OpenAI embeddings
    │       Check: Embedding dimension, max sequence length

    ├─ Code generation:
    │   └─ Try: CodeLlama (7B–34B), Codestral, DeepSeek-Coder
    │       Check: Language support, context window

    └─ General chat:
        └─ Try: Llama 2/3, Mistral, Claude (API)

Performance vs Size Trade-off (2025)

| Parameter Count | Use Cases | Speed | Cost |
|---|---|---|---|
| 100M–500M | Classification, tagging, fast inference | Excellent | Minimal |
| 1B–3B | Document processing, search tagging, edge | Very good | Low |
| 7B–13B | Coding, reasoning, general chat, agents | Good | Medium |
| 20B–34B | Complex multi-step reasoning | Fair | Higher |
| 70B+ | Advanced reasoning, specialized tasks | Slower | Very high |

Try Before Integrating: Hugging Face Spaces

  • Most popular models have free Spaces (hosted demos)
  • Test model behavior without integration cost
  • Check latency, output quality, edge cases
  • Community comments surface common issues

Model Download and First Inference (Zero to Running in 15 Minutes)

This section gets you from a fresh Python environment to running your first local inference. No GPU required for small models.

Step 1: Install Dependencies

# Create a virtual environment (recommended)
python3 -m venv hf-env
source hf-env/bin/activate

# Install core packages
pip install torch transformers accelerate

# For quantized models (optional, install when needed)
pip install autoawq          # AWQ quantization
pip install auto-gptq        # GPTQ quantization
pip install bitsandbytes     # 8-bit/4-bit via bitsandbytes

# For GGUF / llama.cpp (alternative path)
pip install llama-cpp-python

# Hugging Face CLI (for downloading models directly)
pip install huggingface-hub

Step 2: Authenticate (Required for Gated Models)

Some models (Llama, Gemma) require accepting a license on the Hugging Face website before download.

# Login to Hugging Face (creates ~/.cache/huggingface/token)
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens

Step 3: Download and Run First Inference

Option A: Small model, no GPU needed (recommended first test)

# first_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "microsoft/phi-2"  # 2.7B params, runs on CPU

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("Loading model (this takes 1-3 minutes on first run)...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # CPU needs float32
    device_map="cpu",
)

prompt = "Write a Python function that reverses a string:"
inputs = tokenizer(prompt, return_tensors="pt")

print("Generating...")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Expected output (first run):

Loading tokenizer...
Loading model (this takes 1-3 minutes on first run)...
Generating...
Write a Python function that reverses a string:

def reverse_string(s):
    return s[::-1]

# Example usage:
print(reverse_string("hello"))  # Output: "olleh"

Option B: With GPU (faster, larger models)

# first_inference_gpu.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,       # Half precision for GPU
    device_map="auto",               # Auto-distribute across GPUs
)

messages = [
    {"role": "user", "content": "Explain what a Python decorator is in 3 sentences."}
]

# Use the chat template (instruction-tuned models need this)
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Step 4: Common First-Run Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| `OutOfMemoryError: CUDA out of memory` | Model too large for GPU VRAM | Use quantized model or `device_map="cpu"` |
| `OSError: You are trying to access a gated repo` | Model requires license acceptance | Visit model page on HF, accept license, then `huggingface-cli login` |
| `ImportError: No module named 'torch'` | PyTorch not installed | `pip install torch` |
| `RuntimeError: Expected all tensors on same device` | Mixed CPU/GPU tensors | Use `device_map="auto"` or move inputs with `.to(model.device)` |
| `ValueError: Tokenizer class ... not found` | Old transformers version | `pip install --upgrade transformers` |
| `KeyError: 'model.safetensors'` | Incomplete download | Delete `~/.cache/huggingface/hub/models--<name>` and re-download |
| Model produces gibberish | Wrong prompt format | Use `tokenizer.apply_chat_template()` for instruction-tuned models |
| Extremely slow generation | Running FP32 on CPU for large model | Use quantized model (AWQ/GGUF) or switch to GPU with FP16 |

Using the Hugging Face Pipeline (Simpler API)

For quick experimentation, the pipeline API abstracts away tokenizer/model management:

from transformers import pipeline

# One-liner inference
generator = pipeline("text-generation", model="microsoft/phi-2", device_map="auto")
result = generator("Explain recursion:", max_new_tokens=150)
print(result[0]["generated_text"])

Pipelines are convenient for testing but give less control. Use the explicit tokenizer + model approach for production harness code.


Quantization Guide: Practical Code for Each Format

Quantization reduces model size and speeds inference by using lower-precision data types.
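The memory math behind that claim is simple: weight memory is roughly parameter count times bytes per parameter. A quick sketch comparing precisions for a 7B model (weights only; KV cache and activations add more on top):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB: 1e9 params * (bits/8) bytes ~= GB."""
    return params_billions * bits_per_param / 8

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"7B @ {label}: ~{weight_memory_gb(7, bits):.1f} GB")
# prints 28.0, 14.0, 7.0, and 3.5 GB respectively
```

This is why a 7B model that needs a 28GB GPU at full precision fits in under 4GB at 4-bit.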

Supported Formats and Use Cases

| Technique | Precision | Memory Savings | Speed Gain | Quality | Use Case |
|---|---|---|---|---|---|
| GQA (built-in) | KV cache sharing | 2-4x cache | 2-4x cache ops | Minimal | Models with GQA (Llama 3, Mistral) |
| AWQ | 4-bit | ~75% | 3-4x | Minimal loss | Production inference (best choice) |
| GPTQ | 4-bit | ~75% | 2-3x | Minimal loss | Smaller devices, older GPUs |
| GGUF | 2-8 bit | 50-87% | Varies | Varies by quant | llama.cpp, local CPU/Apple Silicon |
| 8-bit (bitsandbytes) | 8-bit | ~50% | 1-2x | Negligible | Medium hardware, research |
| ONNX optimized | FP32/FP16/int8 | 0-75% | 2-5x | Varies | Cross-platform deployment |

Quantization Decision Tree

What hardware are you running on?

├─ Apple Silicon (M1/M2/M3/M4)?
│   └─ Use GGUF format with llama.cpp or MLX
│       Why: Native Metal acceleration, no CUDA needed

├─ NVIDIA GPU with 8GB+ VRAM?
│   ├─ Need maximum speed?
│   │   └─ AWQ (fastest inference, best quality retention)
│   └─ Need widest compatibility?
│       └─ GPTQ (works on older CUDA GPUs)

├─ NVIDIA GPU with <8GB VRAM?
│   └─ GGUF with llama.cpp (offload layers to CPU as needed)

├─ CPU only (Intel/AMD)?
│   └─ GGUF with llama.cpp (optimized CPU kernels)

└─ Cloud/server deployment?
    └─ AWQ (best throughput per dollar)

AWQ Quantization (Best Speed and Quality)

AWQ (Activation-aware Weight Quantization) preserves the most important weights at higher precision, giving the best quality at 4-bit.

Using a pre-quantized AWQ model (easiest path):

# awq_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

# AWQ models use the same inference code as full-precision models
messages = [{"role": "user", "content": "What is gradient descent?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Quantizing your own model to AWQ:

# quantize_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.3"
quant_path = "./mistral-7b-awq"

# Load full-precision model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # Use GEMM for best speed on modern GPUs
}

# Quantize (takes 15-30 min on a single GPU for 7B model)
model.quantize(tokenizer, quant_config=quant_config)

# Save locally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Quantized model saved to {quant_path}")

AWQ memory requirements for quantization (you need the full model in memory first):

  • 7B model: ~16GB GPU VRAM to quantize
  • 13B model: ~28GB GPU VRAM to quantize
  • 70B model: ~140GB GPU VRAM (multi-GPU or cloud instance)

GPTQ Quantization (Most Compatible)

GPTQ has the widest hardware support and the largest library of pre-quantized models. Slightly slower than AWQ but works on older CUDA versions.

Using a pre-quantized GPTQ model:

# gptq_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Quantizing your own model to GPTQ:

# quantize_gptq.py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPTQ needs calibration data (representative text samples)
calibration_text = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a versatile programming language.",
    # Add 100-200 representative samples for best results
]

gptq_config = GPTQConfig(
    bits=4,
    dataset=calibration_text,
    tokenizer=tokenizer,
    group_size=128,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained("./mistral-7b-gptq")
tokenizer.save_pretrained("./mistral-7b-gptq")

GGUF Format (For llama.cpp and Local Inference)

GGUF is the standard format for llama.cpp. It supports CPU inference, partial GPU offloading, and runs natively on Apple Silicon. This is the go-to format for local development on a Mac.

Using GGUF with llama-cpp-python:

# gguf_inference.py
from llama_cpp import Llama

# Download a GGUF file first:
# huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
#   mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,        # Context window
    n_threads=8,       # CPU threads (match your core count)
    n_gpu_layers=0,    # Set to -1 for full GPU offload, 0 for CPU only
    verbose=False,
)

output = llm(
    "Q: What is the capital of France? A:",
    max_tokens=100,
    temperature=0.7,
    stop=["Q:", "\n\n"],
)

print(output["choices"][0]["text"])

Using GGUF with ollama (simplest path):

# Install ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (auto-downloads GGUF)
ollama pull mistral:7b-instruct-q4_K_M

# Run inference
ollama run mistral:7b-instruct-q4_K_M "What is recursion?"

# Use via API (for integration with harness)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct-q4_K_M",
  "prompt": "What is recursion?",
  "stream": false
}'

GGUF quantization levels explained:

| Suffix | Bits | Size (7B) | Quality | Use When |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.7GB | Poor | Extreme memory constraints only |
| Q3_K_M | 3-bit | ~3.3GB | Acceptable | Tight memory, acceptable quality loss |
| Q4_K_M | 4-bit | ~4.1GB | Good | Best default choice |
| Q5_K_M | 5-bit | ~4.8GB | Very good | When quality matters more than size |
| Q6_K | 6-bit | ~5.5GB | Excellent | Near full-precision quality |
| Q8_0 | 8-bit | ~7.2GB | Near-perfect | Maximum quality, still saves ~50% |
| F16 | 16-bit | ~14GB | Full | Baseline, no quantization |

The sweet spot is Q4_K_M: good quality, 70%+ size reduction, fast inference.
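You can estimate file sizes for other model sizes from the table. K-quants store some tensors at higher precision, so the effective bits per weight exceed the nominal bit count; the constants below are rough assumptions tuned to match the 7B column above, not values from the GGUF spec:

```python
# Rough effective bits per weight for common GGUF quants (assumed, approximate)
EFFECTIVE_BITS = {
    "Q2_K": 3.3, "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Approximate GGUF file size: params * effective bits / 8 bits-per-byte."""
    return params_billions * EFFECTIVE_BITS[quant] / 8

for q in ("Q4_K_M", "Q8_0", "F16"):
    print(f"7B {q}: ~{gguf_size_gb(7.2, q):.1f} GB")
```

For a 13B model at Q4_K_M this gives roughly 8GB, matching the Apple Silicon memory table later in this page.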

How to Choose Quantization Format

Start here:

├─ Running on Apple Silicon Mac?
│   └─ GGUF (Q4_K_M) via llama.cpp or ollama

├─ Running on NVIDIA GPU?
│   ├─ Want best speed + quality?
│   │   └─ AWQ (4-bit GEMM)
│   ├─ Older GPU (pre-Ampere)?
│   │   └─ GPTQ (4-bit, widest CUDA support)
│   └─ Just experimenting?
│       └─ bitsandbytes 8-bit (pip install, one-line config)

├─ CPU only (Intel/AMD server)?
│   └─ GGUF (Q4_K_M) via llama.cpp

└─ Deploying to production?
    ├─ Single GPU server → AWQ
    ├─ Multi-GPU server → AWQ with tensor parallelism
    └─ Edge/mobile → GGUF (smallest quant that meets quality bar)
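The tree above can be captured as a small lookup helper for harness config code. A sketch; the function name and hardware categories are my own, not a library API:

```python
def pick_quant_format(hardware: str, vram_gb: float = 0.0) -> str:
    """Map a hardware description to the format the decision tree recommends."""
    if hardware == "apple_silicon":
        return "GGUF Q4_K_M (llama.cpp or MLX)"   # native Metal, no CUDA
    if hardware == "nvidia":
        if vram_gb < 8:
            return "GGUF (llama.cpp with partial GPU offload)"
        return "AWQ 4-bit"                         # best speed/quality on 8GB+ GPUs
    if hardware == "nvidia_old":
        return "GPTQ 4-bit"                        # widest CUDA support (pre-Ampere)
    if hardware == "cpu":
        return "GGUF Q4_K_M (llama.cpp CPU kernels)"
    return "AWQ (server deployment default)"       # cloud/server fallback

print(pick_quant_format("nvidia", vram_gb=6))      # GGUF path: not enough VRAM for AWQ 7B
```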

Model Comparison Table (April 2026)

7B-Class Models Head-to-Head

These are the models most relevant for building an agent harness. All benchmarks from public evaluations as of April 2026.

| Model | Params | License | Context | MMLU | HumanEval | Speed (tok/s, AWQ) | VRAM (AWQ) | Best For |
|---|---|---|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7.3B | Apache 2.0 | 32K | 62.5 | 40.2 | ~95 | ~5GB | General agent, good all-rounder |
| Llama 3.1 8B | 8.0B | Llama 3.1 | 128K | 66.6 | 62.2 | ~85 | ~5.5GB | Long-context tasks, coding |
| Phi-4 ¹ | 14B | MIT | 16K | 78.0 | 67.8 | ~55 | ~8.5GB | Reasoning, math, coding |
| Gemma 2 9B | 9.2B | Gemma | 8K | 64.3 | 54.1 | ~75 | ~6GB | Multilingual, instruction following |
| Qwen 2.5 7B | 7.6B | Apache 2.0 | 128K | 68.4 | 61.5 | ~90 | ~5GB | Coding, multilingual, long context |
| DeepSeek-Coder-V2-Lite | 6.7B | MIT | 128K | 60.1 | 73.8 | ~100 | ~4.5GB | Code-only tasks |

¹ Note: Phi-4 is 14B parameters, nearly 2x larger than other models in this table, which partly explains its higher benchmark scores.

Interpretation

  • Best all-rounder for agent harness: Llama 3.1 8B or Qwen 2.5 7B (great benchmarks, 128K context, permissive license)
  • Best for coding: DeepSeek-Coder-V2-Lite (highest HumanEval, smallest size, fastest)
  • Best for reasoning/math: Phi-4 (highest MMLU by far, but larger at 14B and shorter context)
  • Best for constrained hardware: Mistral 7B (smallest effective size, Apache license, very fast)
  • Avoid: Gemma 2 9B for agent use (short 8K context is limiting)

70B-Class Models (For Verification / Complex Reasoning)

| Model | Params | License | Context | MMLU | Best For |
|---|---|---|---|---|---|
| Llama 3.1 70B | 70.6B | Llama 3.1 | 128K | 79.3 | Verification agent, complex reasoning |
| Qwen 2.5 72B | 72.7B | Apache 2.0 | 128K | 80.1 | Best open-source benchmark scores |
| Mixtral 8x22B | 141B (39B active) | Apache 2.0 | 65K | 77.8 | MoE efficiency, lower per-token cost |

These require multi-GPU setups or cloud instances for local inference. For a harness, route complex tasks to a 70B+ model via API (see Hybrid Routing section below).


Apple Silicon / Local Development

Running models locally on Apple Silicon (M1/M2/M3/M4) is one of the best local inference experiences available. The unified memory architecture means your GPU and CPU share RAM, so a 32GB M-series Mac can run models that would need a dedicated GPU on x86.

Memory Requirements Per Model Size

| Model Size | Quantization | RAM Needed | Runs On |
|---|---|---|---|
| 1-3B | Q4_K_M | ~2-3GB | Any M-series Mac |
| 7B | Q4_K_M | ~5-6GB | M1 8GB (tight), M1 16GB comfortable |
| 7B | Q8_0 | ~8-9GB | M1 16GB minimum |
| 13B | Q4_K_M | ~8-9GB | M1 16GB minimum |
| 13B | Q8_0 | ~15GB | M2/M3 16GB (tight), 32GB comfortable |
| 34B | Q4_K_M | ~20GB | M2/M3/M4 32GB minimum |
| 70B | Q4_K_M | ~40GB | M2/M3/M4 Pro/Max 64GB+ |
| 70B | Q8_0 | ~75GB | M3/M4 Max 96GB or Ultra 128GB |

Rule of thumb: Model size in GB (quantized) plus 2-3GB overhead for context and system. If your total RAM is less than that, expect heavy swapping and terrible performance.

Path 1: llama.cpp + GGUF (Most Battle-Tested)

This is the most battle-tested path for Apple Silicon: llama.cpp has first-class Metal support.

# Install llama-cpp-python with Metal support
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# Or use Homebrew
brew install llama.cpp

# Download a GGUF model
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models

# apple_silicon_inference.py
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,    # -1 = offload ALL layers to Metal GPU
    n_threads=8,        # Set to your chip's performance-core count (M1=4, M2=4, M3 Pro=6, M4 Pro=10)
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers."},
    ],
    max_tokens=300,
    temperature=0.7,
)

print(output["choices"][0]["message"]["content"])

Expected performance (Mistral 7B Q4_K_M):

  • M1 8GB: ~15-20 tokens/sec
  • M2 Pro 16GB: ~25-35 tokens/sec
  • M3 Pro 36GB: ~35-45 tokens/sec
  • M4 Pro 48GB: ~50-65 tokens/sec

Path 2: MLX Framework (Apple’s Native Option)

MLX is Apple’s machine learning framework, built specifically for Apple Silicon. It gives the best raw performance on M-series chips but has a smaller model ecosystem.

# Install MLX and mlx-lm (model loading utility)
pip install mlx mlx-lm

# mlx_inference.py
from mlx_lm import load, generate

# MLX models are available on Hugging Face with the "mlx" tag
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Explain what a hash table is in simple terms."
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=200,
    temp=0.7,
)
print(response)

MLX vs llama.cpp on Apple Silicon:

|  | MLX | llama.cpp |
|---|---|---|
| Speed | 10-20% faster | Slightly slower |
| Model availability | Limited (mlx-community) | Huge (all GGUF models) |
| API | Pythonic, clean | C++ with Python bindings |
| Maturity | Newer, evolving | Battle-tested |
| Memory efficiency | Excellent | Excellent |
| Community | Growing | Very large |

Recommendation: Use llama.cpp/GGUF for the widest model selection. Use MLX if you want maximum speed on Apple Silicon and the model you need is available in MLX format.

Path 3: Ollama (Simplest)

Ollama wraps llama.cpp with a clean CLI and REST API. Easiest way to get started.

# Install on macOS
brew install ollama

# Start the server
ollama serve

# In another terminal, pull and run a model
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M

# Use via Python
pip install ollama

# ollama_inference.py
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",
    messages=[
        {"role": "user", "content": "What is the difference between a stack and a queue?"},
    ],
)
print(response["message"]["content"])

Apple Silicon Tips

  1. Check your RAM before downloading: Run system_profiler SPHardwareDataType | grep Memory to see total RAM
  2. Close memory-hungry apps: Safari, Chrome, and Xcode each eat 2-8GB. Close them before running large models
  3. Monitor with Activity Monitor: Watch “Memory Pressure” gauge. Green = fine. Yellow = swapping (slow). Red = stop and use a smaller model
  4. Use Q4_K_M as default: Best balance of quality and size for Apple Silicon
  5. Set n_gpu_layers=-1: Always offload all layers to Metal GPU. Mixed CPU/GPU is slower than all-GPU on Apple Silicon
  6. Thermal throttling: Sustained inference on a MacBook will throttle. Desktop Macs (Mac Mini, Mac Studio, Mac Pro) sustain full speed

Integration with Your Harness

This section shows how to wire HF models into the agent harness you are building.

Loading a Model in harness.py

# harness.py — model loading module

class ModelProvider:
    """Abstract interface for model providers (local and API)."""
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        raise NotImplementedError


class LocalLlamaProvider(ModelProvider):
    """Local inference via llama.cpp (GGUF models)."""
    
    def __init__(self, model_path: str, n_ctx: int = 4096, n_gpu_layers: int = -1):
        from llama_cpp import Llama
        self.llm = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,
            verbose=False,
        )
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        output = self.llm.create_chat_completion(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
        )
        return output["choices"][0]["message"]["content"]


class OllamaProvider(ModelProvider):
    """Local inference via Ollama REST API."""
    
    def __init__(self, model: str = "llama3.1:8b-instruct-q4_K_M"):
        self.model = model
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        import ollama
        response = ollama.chat(
            model=self.model,
            messages=messages,
            options={"num_predict": max_tokens, "temperature": temperature},
        )
        return response["message"]["content"]


class AnthropicProvider(ModelProvider):
    """Cloud inference via Claude API."""
    
    # Note: Claude model IDs include date suffixes that change with releases
    def __init__(self, model: str = "claude-sonnet-4", api_key: str | None = None):
        import anthropic
        self.client = anthropic.Anthropic(api_key=api_key)  # Uses ANTHROPIC_API_KEY env var if None
        self.model = model
    
    def generate(self, messages: list[dict], max_tokens: int = 500, temperature: float = 0.7) -> str:
        # Convert from OpenAI-style messages to Anthropic format
        system = ""
        user_messages = []
        for msg in messages:
            if msg["role"] == "system":
                system = msg["content"]
            else:
                user_messages.append(msg)
        
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            system=system,
            messages=user_messages,
            temperature=temperature,
        )
        return response.content[0].text

Switching Between Local and API Models

# config.py — model selection

import os

def get_provider(mode: str = "auto") -> ModelProvider:
    """
    Get the right model provider based on mode.
    
    Modes:
        "local"  — Always use local GGUF model
        "api"    — Always use cloud API
        "auto"   — Local if model file exists, else API
    """
    local_model_path = os.environ.get(
        "LOCAL_MODEL_PATH",
        "./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
    )
    
    if mode == "local":
        return LocalLlamaProvider(local_model_path)
    
    if mode == "api":
        return AnthropicProvider()
    
    # Auto: prefer local, fall back to API
    if os.path.exists(local_model_path):
        return LocalLlamaProvider(local_model_path)
    
    if os.environ.get("ANTHROPIC_API_KEY"):
        return AnthropicProvider()
    
    raise RuntimeError(
        "No model available. Either download a GGUF model to ./models/ "
        "or set ANTHROPIC_API_KEY environment variable."
    )

Hybrid Routing: Local for Cheap, API for Complex

The most cost-effective pattern: use a fast local model for simple tasks (classification, extraction, formatting) and route complex reasoning to a powerful cloud model.

# router.py — task complexity routing

class HybridRouter:
    """Routes tasks to local or cloud model based on estimated complexity."""
    
    def __init__(self):
        self.local = LocalLlamaProvider("./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf")
        self.cloud = AnthropicProvider(model="claude-sonnet-4")
    
    def route(self, messages: list[dict], task_type: str = "auto") -> str:
        """
        Route to appropriate provider.
        
        task_type:
            "simple"   — formatting, extraction, classification → local
            "complex"  — multi-step reasoning, code generation, analysis → cloud
            "auto"     — estimate complexity from prompt length and keywords
        """
        if task_type == "simple":
            return self.local.generate(messages)
        
        if task_type == "complex":
            return self.cloud.generate(messages)
        
        # Auto-detect complexity
        last_message = messages[-1]["content"]
        
        # Heuristics for routing
        complex_signals = [
            len(last_message) > 2000,                    # Long prompts usually need more reasoning
            "step by step" in last_message.lower(),      # Explicit reasoning request
            "analyze" in last_message.lower(),            # Analysis tasks
            "compare" in last_message.lower(),            # Comparison tasks
            "debug" in last_message.lower(),              # Debugging needs strong reasoning
            last_message.count("\n") > 20,                # Multi-part problems
        ]
        
        if sum(complex_signals) >= 2:
            return self.cloud.generate(messages, max_tokens=2000)
        
        return self.local.generate(messages)


# Usage in agent loop
router = HybridRouter()

# Simple task — routed to local Mistral 7B (no API cost, low latency)
result = router.route(
    [{"role": "user", "content": "Extract the email from: Contact us at [email protected]"}],
    task_type="simple",
)

# Complex task — routed to Claude (paid, but much better quality)
result = router.route(
    [{"role": "user", "content": "Analyze this code for security vulnerabilities and suggest fixes:\n..."}],
    task_type="complex",
)

Cost comparison with hybrid routing:

| Approach | Cost per 1M tokens | When |
|---|---|---|
| Always cloud (Claude Sonnet) | ~$3.00 input / ~$15.00 output | Every task |
| Always local (Mistral 7B) | $0.00 (electricity only) | Every task |
| Hybrid (80% local, 20% cloud) | ~$0.60 input / ~$3.00 output | Smart routing |

Note: Prices approximate as of early 2025. Check provider websites for current rates.

The hybrid approach can cut spend by 80-90% compared to always using a cloud API, provided the majority of requests can be served by the local model. Quality loss is minimal because simple tasks do not need a powerful model.
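The arithmetic behind the hybrid row in the table is just a weighted average, treating local inference as free:

```python
def blended_cost(local_fraction: float, cloud_price_per_m: float) -> float:
    """Blended $/1M tokens when the local share of traffic costs $0."""
    return (1 - local_fraction) * cloud_price_per_m

# 80% of traffic local, cloud pricing from the table above (approximate early-2025 rates)
print(round(blended_cost(0.80, 3.00), 2))   # input side:  0.6
print(round(blended_cost(0.80, 15.00), 2))  # output side: 3.0
```

Raising the local fraction to 95% drops the blended input cost to about $0.15/1M tokens, which is why good routing heuristics pay for themselves quickly.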


Common Mistakes

These are the mistakes that waste the most time when starting with HF models. Each one is something real developers hit.

Mistake 1: “I downloaded a 70B model on my 16GB laptop”

The problem: A 70B parameter model in FP16 needs ~140GB of RAM. Even quantized to Q4_K_M, it needs ~40GB. A 16GB laptop cannot run it.

The fix: Match model size to your hardware.

# Check available memory before loading
import psutil

available_gb = psutil.virtual_memory().available / (1024 ** 3)
print(f"Available RAM: {available_gb:.1f} GB")

# Rule of thumb for GGUF Q4_K_M:
# Model RAM needed ≈ (params_billions * 0.6) + 2 GB overhead
# 7B  → ~6 GB
# 13B → ~10 GB
# 34B → ~22 GB
# 70B → ~44 GB

model_params_b = 7  # Change this
estimated_ram = (model_params_b * 0.6) + 2
if available_gb < estimated_ram:
    print(f"WARNING: Need ~{estimated_ram:.0f}GB but only {available_gb:.1f}GB available")
    print(f"Use a smaller model or close other applications")

Mistake 2: “I’m using FP32 instead of quantized”

The problem: Loading a model in full FP32 precision uses 4x more memory than necessary and runs 3-4x slower. There is almost never a reason to use FP32 for inference.

The wrong way:

# DON'T do this — loads in FP32, wastes memory
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

The right way:

# DO this — use FP16 minimum on GPU
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# BETTER — use a pre-quantized AWQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.3-AWQ",
    device_map="auto",
)

# BEST for Mac — use GGUF
llm = Llama(model_path="./models/mistral-7b.Q4_K_M.gguf", n_gpu_layers=-1)

Mistake 3: “I forgot to set device_map=‘auto’”

The problem: Without device_map, the model loads entirely on CPU even if you have a GPU. Inference is 10-50x slower.

The fix: Always pass device_map="auto" when loading with transformers:

# This automatically puts the model on the best available device
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # <-- never forget this
    torch_dtype=torch.float16,
)

Mistake 4: “I’m not using the chat template”

The problem: Instruction-tuned models expect a specific prompt format. Without it, they produce worse output or gibberish.

The wrong way:

# DON'T do this with instruction-tuned models
inputs = tokenizer("What is Python?", return_tensors="pt")

The right way:

# DO this — use the model's chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)

Mistake 5: “I’m downloading models to the wrong place”

The problem: HF models cache in ~/.cache/huggingface/hub/ by default. On macOS, this is on your boot drive which may be small. A single 7B model is 4-15GB.

The fix: Set a custom cache directory:

# In your shell profile (.zshrc / .bashrc)
export HF_HOME="/Volumes/ExternalDrive/huggingface"
# Or for a project-specific cache (note: TRANSFORMERS_CACHE is deprecated in
# newer transformers releases; prefer HF_HOME for new setups)
export TRANSFORMERS_CACHE="./models/cache"

Mistake 6: “My model is slow because I’m re-loading it every request”

The problem: Loading a model takes 5-30 seconds. If your harness loads the model on every inference call, it is unusable.

The fix: Load once, reuse the instance:

# WRONG — loads model every call
def generate(prompt):
    model = AutoModelForCausalLM.from_pretrained(...)  # 10 sec each time!
    # ...

# RIGHT — load once at startup
class Harness:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(...)  # Load once
    
    def generate(self, prompt):
        # Use self.model — already loaded
        pass
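In a plain-function codebase (no harness class), memoizing the loader enforces the load-once rule just as well. A sketch with a stand-in loader; the `lru_cache` pattern is standard library, while `get_model` and the string return value are hypothetical stand-ins for a real `from_pretrained` call:

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}  # instrumentation to show the expensive load runs once

@lru_cache(maxsize=None)
def get_model(model_name: str):
    """The expensive load happens only on the first call per model name."""
    LOAD_COUNT["n"] += 1
    return f"<model {model_name}>"  # stand-in for AutoModelForCausalLM.from_pretrained(...)

def generate(prompt: str) -> str:
    model = get_model("microsoft/phi-2")  # cached after the first call
    return f"{model} generated for: {prompt}"

generate("hello")
generate("world")
print(LOAD_COUNT["n"])  # 1 — the model was loaded exactly once
```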

Decision Tree: “Which Model Should I Download?”

Follow this tree from top to bottom. The first match is your answer.

START: What is your primary use case?

├─ Building an agent harness (tool use, ReAct loop)?
│   ├─ Mac with 16GB+ RAM?
│   │   └─ Llama 3.1 8B, GGUF Q4_K_M, via llama.cpp
│   │       Download: a community GGUF build, e.g. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (Q4_K_M)
│   │
│   ├─ Mac with 8GB RAM?
│   │   └─ Phi-2 (2.7B) or Qwen 2.5 3B, GGUF Q4_K_M
│   │       Download: TheBloke/phi-2-GGUF (Q4_K_M)
│   │
│   ├─ NVIDIA GPU with 8GB+ VRAM?
│   │   └─ Mistral 7B Instruct, AWQ 4-bit
│   │       Download: TheBloke/Mistral-7B-Instruct-v0.2-AWQ
│   │
│   └─ No local hardware / cloud only?
│       └─ Use API: Claude Haiku (cheap) + Claude Sonnet (complex)
│           No download needed. Set ANTHROPIC_API_KEY.

├─ Code generation / code review?
│   └─ DeepSeek-Coder-V2-Lite (6.7B) or CodeLlama 7B
│       GGUF for Mac, AWQ for NVIDIA

├─ Embeddings / RAG?
│   └─ nomic-embed-text (137M)
│       Runs on CPU, no GPU needed
│       pip install sentence-transformers

├─ Complex reasoning / verification agent?
│   ├─ Have 64GB+ RAM or multi-GPU?
│   │   └─ Llama 3.1 70B, GGUF Q4_K_M or AWQ
│   │
│   └─ Less hardware?
│       └─ Use API: Claude Sonnet or Opus for verification
│           Hybrid route: local 7B for simple, API for complex

├─ Just learning / experimenting?
│   └─ ollama pull llama3.1:8b
│       Simplest path, works everywhere, good enough for learning

└─ Not sure?
    └─ Start with: ollama pull mistral:7b-instruct-q4_K_M
        It is fast, permissively licensed, and good quality.
        You can always switch later.
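The tree above can be encoded directly as a routing function, which is handy if your harness picks a model at startup. The model names mirror the tree; treat the thresholds as rules of thumb rather than hard limits, and the check order here (GPU before RAM) as one reasonable choice.

```python
# Sketch of the decision tree as code. Thresholds are rules of thumb.
def pick_model(use_case: str, ram_gb: int = 0, nvidia_vram_gb: int = 0) -> str:
    if use_case == "agent":
        if nvidia_vram_gb >= 8:
            return "Mistral 7B Instruct (AWQ 4-bit)"
        if ram_gb >= 16:
            return "Llama 3.1 8B (GGUF Q4_K_M)"
        if ram_gb >= 8:
            return "Phi-2 or Qwen 2.5 3B (GGUF Q4_K_M)"
        return "API: Claude Haiku + Sonnet"
    if use_case == "code":
        return "DeepSeek-Coder-V2-Lite or CodeLlama 7B"
    if use_case == "embeddings":
        return "nomic-embed-text"
    if use_case == "reasoning":
        return "Llama 3.1 70B" if ram_gb >= 64 else "API: Claude Sonnet/Opus"
    return "ollama pull mistral:7b-instruct-q4_K_M"  # the "not sure" default

print(pick_model("agent", ram_gb=16))
```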

Quick Reference: Model Download Commands

# GGUF models (for llama.cpp / Apple Silicon / CPU)
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models

# AWQ models (for NVIDIA GPUs)
# These download automatically when you load with from_pretrained()
# Model name: TheBloke/Mistral-7B-Instruct-v0.2-AWQ

# Ollama (simplest, manages downloads for you)
ollama pull mistral:7b-instruct-q4_K_M
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull deepseek-coder:6.7b-instruct-q4_K_M

What’s Changing

  • Model size distribution: Mean model size grew from 827M (2023) to 20.8B (2025)

    • Driven by quantization making large models practical
    • Mixture-of-Experts (MoE) enables efficiency at scale
  • Community focus shift: From training to adapting, quantizing, redistributing

    • Fine-tuning adapters (LoRA) more popular than full fine-tune
    • Quantized variants dominate downloads
    • Speed of deployment matters more than raw capability
  • Dominant pattern: Use base model + quantize + optionally fine-tune adapters

    • Example: Start with Llama 3 -> quantize to AWQ -> add a LoRA adapter for domain knowledge
    • Cost: ~5% of training a model from scratch

Practical Selection for Your Harness

Agent / Tool-Use Harness

Recommended models (as of April 2026):

  • 7B SLM: Llama 3.1 8B, Mistral 7B, Phi-4
  • Quantization: AWQ 4-bit
  • Context: 16K-32K tokens
  • Speed: Fast enough for real-time loops
  • Cost: <$0.01 per million input tokens

Long-Context Harness

Enable:

  • GQA models with INT8/INT4 KV cache quantization
  • For maximum compression: TurboQuant (3-bit, 6x memory reduction, zero accuracy loss — see Doc 02)
  • 32K-100K+ context window models
  • Example: Llama 3.1 (128K context, GQA enabled) + INT8 KV cache
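The memory arithmetic behind KV cache quantization is worth seeing once. The sketch below uses Llama 3.1 8B configuration values (32 layers, 8 KV heads via GQA, head dim 128) — stated to the best of my knowledge; verify against the model's config.json before relying on them.

```python
# KV cache memory: 2 tensors (K and V) per layer, per KV head, per head dim.
# GQA already shrinks this (8 KV heads instead of 32 query heads); cache
# quantization shrinks it further via bytes_per_val.
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
    return seq_len * per_token / 2**30

print(kv_cache_gib(128_000))                   # FP16 cache at full 128K context
print(kv_cache_gib(128_000, bytes_per_val=1))  # INT8 halves it
```

At full 128K context the FP16 cache alone is in the mid-teens of GiB — comparable to the quantized weights themselves — which is why long-context harnesses quantize the cache, not just the weights.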

Specialized Tasks

  • Code: CodeLlama-13B (quantized) or DeepSeek-Coder
  • Math: Qwen Math specialist, DeepSeek-Math
  • Embeddings: nomic-embed-text (137M parameters; more cost-effective than OpenAI embeddings)
  • RAG: Combine Llama 7B + embedding model + vector store (Qdrant, ChromaDB)

Implementation Checklist

  • Identify your use case (agent, classification, embeddings, code)
  • Browse Hugging Face for top models in category
  • Check model card for license, training data, limitations
  • Try model on Spaces (if available)
  • Choose quantization (start with AWQ)
  • Download and benchmark locally before integration
  • Track model version in requirements.txt / lock file
  • Plan for model updates (check for new releases quarterly)

Reasoning Model Recommendations

The models above are all instruction-tuned. For tasks requiring multi-step logical reasoning (strategic analysis, inference chains, verification), reasoning models offer dramatically better quality. See Doc 01 for the full reasoning vs instruction model comparison.

| Model | Params | Type | 4-bit Size | RAM Needed | Runs On | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-14B | 14B | Reasoning | ~9GB | 32GB | Apple Silicon M2/M3/M4 32GB+ | Strategic analysis, multi-step reasoning |
| QwQ-32B | 32B | Reasoning | ~18GB | 48GB+ | M4 Max 48GB+ or multi-GPU | Complex reasoning, higher quality than 14B |
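The size figures above follow from a simple rule of thumb: weight bytes ≈ parameters × bits per weight / 8. The effective bits-per-weight values below are assumptions — Q4_K_M averages closer to ~5 bits than exactly 4 because some tensors stay at higher precision.

```python
# Back-of-envelope quantized weight size. Bits-per-weight values are
# approximations; real files also carry embeddings and metadata.
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # params in billions -> GB

print(quantized_size_gb(14, 5.0))   # close to the ~9GB in the 14B row
print(quantized_size_gb(32, 4.5))   # matches the ~18GB in the QwQ-32B row
```

Add a few GB of headroom for the KV cache and OS before deciding a model "fits" — that margin is what rules QwQ-32B out on 32GB machines.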

Practical Guidance

  • For 32GB Apple Silicon: DeepSeek-R1-Distill-Qwen-14B is the clear choice. It fits comfortably in memory at 4-bit quantization and delivers reasoning quality far beyond instruction models of the same size.
  • QwQ-32B will not fit on a 32GB Mac — at ~18GB quantized plus context overhead, it needs 48GB+ RAM. Only viable on M4 Max 48GB+ or dedicated GPU setups.
  • Speed trade-off: Reasoning models are significantly slower. Expect ~173 seconds for a complex reasoning task on a 14B reasoning model versus ~25 seconds on a 14B instruction model. The quality difference on reasoning tasks justifies the wait.
  • When to use: For strategic analysis, multi-step inference, and verification tasks, use DeepSeek-R1 over instruction models of the same size. For content generation, formatting, and agent tool loops where speed matters, stick with instruction models.
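The when-to-use guidance reduces to a small routing rule: reasoning-heavy task types go to the reasoning model, everything latency-sensitive goes to the instruction model. The task labels and model tags below are illustrative placeholders, not a fixed taxonomy.

```python
# Minimal router for the guidance above. Task labels are illustrative.
REASONING_TASKS = {"strategic_analysis", "multi_step_inference", "verification"}

def route(task_type: str) -> str:
    if task_type in REASONING_TASKS:
        return "deepseek-r1:14b"       # slow (~173s on hard tasks) but much stronger reasoning
    return "qwen2.5:7b-instruct"       # fast path for generation and tool loops

print(route("verification"))
```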

Download Commands

# DeepSeek-R1-Distill-Qwen-14B via Ollama
ollama pull deepseek-r1:14b

# Or download GGUF for llama.cpp
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf --local-dir ./models

Choosing a Local Inference Runtime

Three options dominate local model inference on Apple Silicon. Each makes different trade-offs between ease of use, structured output guarantees, and raw speed.

| | Ollama | llama-cpp-python | MLX |
| --- | --- | --- | --- |
| Built for | Interactive chat | Cross-platform inference | Apple Silicon specifically |
| Integration | HTTP server (localhost:11434) | In-process Python calls | In-process Python-native |
| Setup | `brew install ollama` then `ollama pull model` | `CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python` | `pip install mlx-lm` |
| JSON enforcement | `format: json` flag (request, not guarantee) | GBNF grammar constraints (token-level guarantee) | Validate + retry |
| Speed on Apple Silicon | Fast (wraps llama.cpp) | Fast (Metal bolt-on) | Fastest (native Apple framework) |
| Model format | GGUF (auto-downloaded) | Any GGUF file | MLX format (mlx-community on Hugging Face) |
| Best for | Interactive use, quick prototyping | Production agents needing guaranteed JSON, cross-platform | Apple Silicon production agents prioritising speed |

When to Use Each

  • Ollama: You want to try a model in 2 minutes. Great for exploration, not for automated agents (HTTP overhead on every call).
  • llama-cpp-python: Your agent needs mechanically guaranteed JSON output (GBNF grammar constraints force valid output at the token level). Or you need cross-platform support (Linux/Mac/Windows).
  • MLX: You’re running on Apple Silicon and speed matters more than grammar constraints. Apple’s own framework optimised for unified memory. Pair with validate-and-retry for JSON reliability.

Quick Code Examples

Ollama (HTTP-based, simplest setup):

import ollama

response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "What is a hash table?"}],
)
print(response["message"]["content"])

llama-cpp-python (in-process, with GBNF grammar for guaranteed JSON):

from llama_cpp import Llama

llm = Llama(model_path="./models/qwen2.5-7b.Q4_K_M.gguf", n_gpu_layers=-1, verbose=False)
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is a hash table?"}],
    max_tokens=200,
    response_format={"type": "json_object"},  # or use grammar= for GBNF
)
print(output["choices"][0]["message"]["content"])

MLX (Apple Silicon native, fastest):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
response = generate(model, tokenizer, prompt="What is a hash table?", max_tokens=200)
print(response)
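Since MLX has no grammar constraints, the validate-and-retry pattern mentioned above carries the JSON reliability load. Here is a minimal sketch; generate_fn stands in for a wrapped mlx_lm generate() call, and the retry prompt wording is just one reasonable choice.

```python
import json

# Validate-and-retry: parse the model's output as JSON, re-prompt with the
# parse error on failure, give up after max_attempts.
def generate_json(generate_fn, prompt, max_attempts=3):
    last_err = None
    for attempt in range(max_attempts):
        raw = generate_fn(prompt if attempt == 0
                          else f"{prompt}\nReturn ONLY valid JSON. Previous error: {last_err}")
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_err = str(e)
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_err}")

# Usage with a fake model that fails once, then behaves:
replies = iter(["not json", '{"answer": 42}'])
result = generate_json(lambda p: next(replies), "Give the answer as JSON.")
print(result)
```

In practice a 7B instruction model produces valid JSON most of the time, so one retry usually suffices; budget the retry latency into your harness timeouts.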

Validation Checklist

How do you know you got this right?

Performance Checks

  • Model loads without CUDA/memory errors on your target hardware
  • First inference completes in <10 seconds (local) or <2 seconds (API)
  • Quantized model (AWQ 4-bit) runs at 50+ tokens/sec on target GPU
  • Memory usage matches documentation (no unexpected OOM mid-session)
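A quick way to check the throughput item above is to time a generation call and divide tokens produced by wall time. The generate and count_tokens arguments below are placeholders for your runtime's equivalents (e.g. a wrapped llm.create_chat_completion or mlx_lm generate).

```python
import time

# Minimal tokens/sec measurement. Pass your runtime's generate function and
# a token counter (len of tokenizer.encode() in real use).
def tokens_per_sec(generate, count_tokens, prompt):
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    return count_tokens(text) / elapsed

# Stub demonstration — a real harness passes the model call instead:
def fake_generate(prompt):
    time.sleep(0.05)
    return "one two three four five"

tps = tokens_per_sec(fake_generate, lambda t: len(t.split()), "hello")
print(f"{tps:.0f} tokens/sec")
```

Measure on a prompt representative of your workload; short prompts understate prefill cost and overstate steady-state throughput.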

Implementation Checks

  • Downloaded and pinned exact model version in lock file
  • Model card checked: license acceptable for your use case
  • Tested on 2+ example inputs from your domain
  • Ran 5+ inference samples to verify consistent output
  • Know which quantization variant you’re using (AWQ/GPTQ/GGUF)
  • Benchmarked against at least one alternative model in the same category
  • Have a fallback model listed if primary one becomes unavailable

Integration Checks

  • Model integrates cleanly with harness (correct API for transformers/ollama/etc)
  • Tool integration works: web search, code execution, file ops don’t break with this model
  • Token limits checked: model’s context window matches your use case
  • Understood quantization impact: quality loss acceptable? latency gain worth it?

Common Failure Modes

  • Model loads but inference errors: Wrong tokenizer/processor for model
  • Out of memory on quantized model: Quantization still too large for GPU; try smaller base model
  • Inference super slow: Quantization not actually enabled; check model loading code
  • Model produces gibberish: Wrong prompt format for instruction-tuned model; check model card
  • “Model not found” errors: Incorrect HF repo name or no internet access during download

Sign-Off Criteria

  • Model runs end-to-end in your harness on a real task
  • Performance metrics (latency, memory) match or exceed benchmarks
  • Quality on test cases acceptable (measured by metric in doc 16)
  • Quantization validated: speed gain vs quality trade-off acceptable
  • Model version pinned and documented in your project

See Also

  • Doc 02 (KV Cache Optimization): Understand how quantization affects attention computation
  • Doc 06 (Harness Architecture): Integrate model selection into component 1 (LLM/AI Model)
  • Doc 08 (Claw-Code Python): Reference implementation with multi-provider model support
  • Doc 16 (Evaluation & Benchmarking): Measure quality metrics to validate model choice