Hardware Landscape
GPU vs CPU comparison, NVIDIA (H100, RTX), Apple M-series, mobile chips, Broadcom AI networking, Qualcomm Hexagon NPU, Intel OpenVINO — hardware detection scripts, benchmarks, cost-per-TFLOP analysis, and a hardware selector tool.
Overview
Choosing hardware for AI is a cost-performance-power trade-off. You need to match your workload (training, inference, local, cloud) to the right chip. This guide covers what’s available, why you’d buy it, and how much it costs.
TL;DR: For local development use Apple M-series or RTX 4070. For production use cloud GPUs (H100/H200). For edge inference use mobile chips or M1/M2.
1. CPU vs GPU vs AI Chips: Fundamentals
| Hardware | What It Does | Best For | Cost | Power |
|---|---|---|---|---|
| CPU | Sequential execution, smart branching, all general tasks | Orchestration, serving, glue code, small-scale inference | $50–$500 | 10–150W |
| GPU | Parallel processing, 10,000+ threads, linear algebra | Training, batch inference, matrix ops | $200–$12,000 | 200–600W |
| TPU | Custom-built tensor operations | Google Cloud training/inference only | Cloud only | 250–500W |
| Neural Engine | Optimized for 8-bit/16-bit inference | On-device AI (Apple, Qualcomm) | Built-in | 1–10W |
| FPGA | Programmable hardware | Custom inference, latency-critical | $500–$5,000 | 50–300W |
Why the differences?
- CPUs are like smart generalists. They handle branching, complex logic, and sequential work. One core can do one thing at a time, but it’s flexible.
- GPUs are like dumb sprinters. They have 10,000+ cores that run the same instruction on different data. Perfect for matrix multiplication (what neural networks do), terrible at decision-making.
- Neural Engines (NPUs) are specialized accelerators optimized for inference at 8-bit or 16-bit precision. They use less power and die area than a GPU but can’t train models.
- TPUs are Google’s custom silicon—not available to the public except via Google Cloud.
Practical implication: If you’re training, you need a GPU (or TPU cloud). If you’re running inference on a server, GPU or CPU works (GPU is faster for batch, CPU is fine for single requests). If you’re running on a phone or laptop, use the Neural Engine or Apple M-series.
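In PyTorch this decision usually reduces to a device check at startup. A minimal sketch (assumes the torch package is installed; CUDA covers NVIDIA, MPS covers Apple Silicon):

import torch

def pick_device() -> torch.device:
    """Prefer CUDA (NVIDIA), then MPS (Apple Silicon), then plain CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(1024, 1024, device=device)
y = x @ x  # the matrix multiply runs on whichever accelerator was found
print(f"Running on: {device}")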
2. NVIDIA Ecosystem: The Default GPU Choice
NVIDIA dominates because they own CUDA (the software that lets you use GPUs), have the best drivers, and have been optimizing for AI for 15 years. Here’s what’s available:
Data Center & Training
| GPU | VRAM | Price | TFLOPS (FP32) | Best For | Cloud Availability |
|---|---|---|---|---|---|
| H200 | 141GB | $38,000 | 67 | Large models, training | AWS, GCP, Azure (early) |
| H100 | 80GB | $32,000 | 67 | Training, large inference | AWS, GCP, Azure |
| A100 | 40/80GB | $10,000–$18,000 | 19.5 | Training, batch inference | AWS, GCP, Azure, on-prem |
| A6000 | 48GB | $6,500 | 38.7 | Research, production inference | AWS, on-prem |
What these numbers mean:
- VRAM: Bigger = larger models fit in memory. H200’s 141GB holds massive models without offloading.
- TFLOPS: Floating-point operations per second. FP32 is shown here, but practical ML workloads use TF32 or bfloat16 (2× throughput). Higher = faster, but not everything scales linearly (memory bandwidth matters too).
- Price per TFLOP: H100 ≈ $478/TFLOP, A100 ≈ $513–$923/TFLOP. The H100 is expensive up front, but cloud providers absorb that cost; the short snippet below reproduces the arithmetic.
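The $/TFLOP figures are simple division over the table values above (prices are the illustrative list prices from the table, not street prices):

# Dollars per FP32 TFLOP = purchase price / peak FP32 TFLOPS (table values above)
cards = {
    "H100": {"price": 32_000, "tflops_fp32": 67},
    "A100": {"price": 14_000, "tflops_fp32": 19.5},  # midpoint of the $10K–$18K range
    "RTX 4090": {"price": 1_500, "tflops_fp32": 82.6},
}

for name, card in cards.items():
    print(f"{name:10s} ${card['price'] / card['tflops_fp32']:.0f} per FP32 TFLOP")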
Consumer/Enthusiast GPUs
| GPU | VRAM | Price | TFLOPS (FP32) | Best For |
|---|---|---|---|---|
| RTX 4090 | 24GB | $1,500 | 82.6 | Local training, research |
| RTX 4080 Super | 16GB | $1,200 | 52 | High-end gaming + some training |
| RTX 4070 Ti Super | 16GB | $900 | 44 | Good training, better inference |
| RTX 4070 | 12GB | $600 | 29 | Solid all-rounder |
| RTX 4070 Mobile | 8GB | $1,500–$2,500 (laptop) | 21 | Laptop training |
| L40 | 48GB | $10,000 | 90.5 | Inference-optimized, data center |
| L4 | 24GB | $3,000 | 30.3 | Edge inference, data center |
Decision points:
- If you’re buying one GPU for local work, RTX 4070 is the sweet spot: $600, handles 7B–13B models, good for most projects.
- If budget allows, RTX 4090 is best for research (~82.6 TFLOPS FP32), but requires good cooling and a 1500W+ PSU.
- If you only care about inference (not training), L40 or L4 are more cost-effective in data centers.
AMD Alternative
| GPU | VRAM | Price | Best For |
|---|---|---|---|
| RX 7900 XTX | 24GB | $700 | Budget alternative to RTX 4080 |
| MI300X | 192GB | $20,000 | Cloud training (AMD alternative to H100) |
Trade-off: AMD is cheaper but ROCm (AMD’s CUDA equivalent) is less mature. Libraries like PyTorch support it, but fewer optimizations exist. Use if you must save money.
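If you do go the AMD route, it is worth confirming which backend your PyTorch build actually targets before chasing performance issues. ROCm builds expose a HIP version; a quick check (assumes torch is installed):

import torch

print("CUDA toolkit:", torch.version.cuda)               # e.g. "12.1" on NVIDIA builds, None otherwise
print("ROCm/HIP:", getattr(torch.version, "hip", None))  # set on ROCm builds of PyTorch
print("GPU visible:", torch.cuda.is_available())         # ROCm GPUs also report through the torch.cuda API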
Power & Thermal
| GPU | Power Draw | PSU Required | Cooling Notes |
|---|---|---|---|
| RTX 4090 | 450W | 1500W | Needs aftermarket cooling, loud at full load |
| RTX 4080 | 320W | 1000W | Standard tower cooler sufficient |
| RTX 4070 | 200W | 750W | Quiet operation possible |
| H100 | 700W | Data center PSU | Requires liquid cooling in data centers |
| A100 | 400W | Data center PSU | Requires good ventilation |
Cost to run 24/7: RTX 4090 at $0.15/kWh = 450W × 8,760 hours × $0.15 = ~$590/year. M-series laptops: ~$20/year.
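That annual figure is just watts × hours × rate; a small helper you can adapt to your own electricity price (defaults are the assumptions used above):

def annual_electricity_cost(watts: float, rate_per_kwh: float = 0.15, hours: float = 8_760) -> float:
    """Annual electricity cost in USD for a device drawing `watts` for `hours` per year."""
    return watts / 1000 * hours * rate_per_kwh

print(round(annual_electricity_cost(450)))  # RTX 4090 at full load 24/7 -> ~591 ($/year)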
3. Apple Silicon (M-series): The Unified Memory Advantage
Apple’s M-series chips are the secret weapon for local development and edge inference. The magic word: unified memory.
The Unified Memory Difference
Traditional GPU (NVIDIA):
CPU → PCIe → GPU Memory
Data copy: CPU has data → send to GPU (slow)
Compute: GPU does work
Result copy: GPU memory → send back to CPU (slow)
Apple M-series (Unified Memory):
CPU + GPU share the same memory
No copying. CPU and GPU access the same data instantly.
Performance impact: 20–40% faster for many workloads because there is no data-copy overhead. NVIDIA is addressing this with NVLink and coherent CPU-GPU links on data-center hardware, but consumer NVIDIA GPUs still have this limitation.
Apple M-Series Lineup
| Chip | Cores | Unified Memory | Price (laptop) | Best For |
|---|---|---|---|---|
| M3 | 8-core CPU, 10-core GPU | 8/16/24GB | $1,500–$2,000 | Local dev, 7B models |
| M3 Max | 14–16-core CPU, 30–40-core GPU | 48GB | $3,500 | Serious local training, large models |
| M4 | 10-core CPU, 10-core GPU | 16/24GB | $1,600–$2,100 | Faster than M3 (especially CPU) |
| M4 Pro | 12-core CPU, 20-core GPU | 36GB | $2,500 | Best price-to-performance |
| M4 Max | 14–16-core CPU, 32–40-core GPU | up to 128GB | $3,500–$4,000 | High-end local work |
| M2 Ultra (Mac Studio) | 24-core CPU, 60–76-core GPU | 192GB | $7,000 | Enterprise-class local |
Real-World Examples
- MacBook Air M3 with 16GB: Runs 7B models (Llama 2) at ~15 tokens/sec locally. Great for development.
- MacBook Pro M3 Max with 48GB: Runs 13B models (Mistral, Llama 13B) at ~5–10 tokens/sec. Can fine-tune small adapters.
- Mac Studio M2 Ultra with 192GB: Runs 70B models (Llama 70B) at ~1–2 tokens/sec. Can train small models.
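All of these examples come down to running a quantized GGUF model through a Metal-accelerated runtime. A minimal local-inference sketch with llama-cpp-python (the model path is a placeholder; the same library is used by the benchmark script in section 14):

# pip install llama-cpp-python   (build with Metal support on Apple Silicon if it isn't already enabled)
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=2048,
    n_gpu_layers=-1,   # offload all layers (Metal on M-series, CUDA on NVIDIA)
    verbose=False,
)

out = llm("Explain unified memory in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])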
Cost-Effectiveness
RTX 4090 + high-end PC: $2,500 total, 450W power, needs cooling setup. MacBook Pro M3 Max: $3,500, 35W typical power, completely portable.
If you value silence, portability, and efficiency, M-series wins. If you need maximum raw compute per dollar, NVIDIA wins.
4. Intel Arc: The Underdog GPU
Intel is trying to challenge NVIDIA with the Arc series. Results are mixed.
| GPU | VRAM | Price | TFLOPS (FP32) | Status |
|---|---|---|---|---|
| Arc A770 | 8/16GB | $300–$400 | 19.7 | Competes with mid-range RTX cards at a lower price |
| Arc A750 | 8GB | $200 | 17.2 | Entry-level alternative |
| Flex 170 | 16GB | $2,500 | 17.6 | Data center inference |
Pros: Cheaper than NVIDIA, decent performance, integrated into some laptops. Cons: Driver support is immature (crashes, performance variance), fewer library optimizations, harder to debug issues.
When to buy: If you’re desperate for cheap GPU compute and can tolerate driver instability. Otherwise, RTX 4070 at $600 is safer.
Driver maturity timeline: Intel has been improving this, but NVIDIA is still the safe choice for production.
5. Consumer GPUs for Local AI: Decision Guide
The Choice Matrix
| Budget | Primary Use | Best GPU | Price | Notes |
|---|---|---|---|---|
| $0–$500 | Dev/inference | MacBook Air M3 or RTX 4070 | $1,500–$2,000 (laptop) or $600 (card) | M3 is portable; RTX 4070 is powerful |
| $500–$1,200 | Training + inference | RTX 4080 Super or Arc A770 | $1,200 or $400 | NVIDIA for safety; Arc for budget |
| $1,500–$3,000 | High-end research | RTX 4090 or MacBook Pro M3 Max | $1,500 or $3,500 | RTX 4090 = power; M3 Max = mobility |
| $3,000+ | Enterprise/lab | Mac Studio M2 Ultra or RTX 4090 cluster | $7,000 or $1,500×N | Unified memory vs raw speed |
My Recommendation for 2026
For local development: MacBook Pro M3 16GB ($1,800). Unified memory, zero config, great for 7B models.
If you need raw speed: RTX 4070 ($600) in a desktop PC ($500 for case/PSU/mobo). Total ~$1,100. Beats M3 in training speed, costs less.
If you have budget: RTX 4090 ($1,500). Best single GPU for research. Needs good cooling and a 1500W PSU.
For inference only: L40 ($10,000, enterprise) or RTX 4070 if building your own.
6. Mobile & Edge Chips: On-Device AI
The Hardware
| Chip | Device | AI Performance | Power | Use Case |
|---|---|---|---|---|
| Apple A17 Pro | iPhone 15 Pro | 16 TOPS | 2–3W active | On-device vision, speech |
| Qualcomm Snapdragon 8 Gen 3 | Android flagship | 45 TOPS | 2–4W active | On-device AI, gaming |
| Google Tensor G4 | Pixel 9 | 8 TOPS | 2–3W active | Tensor optimization for Pixel apps |
| MediaTek Dimensity 9300 | Android flagship (OnePlus 12, etc.) | 37 TOPS | 1–2W active | On-device AI |
Performance vs Servers
- NVIDIA H100: ~67 TFLOPS (FP32); ~989 TFLOPS (FP16 Tensor Core, more practical for ML)
- iPhone A17 Pro: ~16 TOPS (INT8, Neural Engine)
TOPS and TFLOPS aren’t directly comparable (INT8 integer ops vs floating point), but as a rough order of magnitude the phone’s NPU delivers a percent or two of an H100’s FP16 Tensor Core throughput, and it can’t do FP32 training-grade math at all. But here’s the trade-off:
| Metric | Phone | Server GPU |
|---|---|---|
| Latency | 50–100ms | 10–50ms (batch) |
| Power | 2–3W | 400–700W |
| Privacy | On-device, no upload | Shared infrastructure |
| Cost per inference | $0.0001 (amortized) | $0.001–$0.01 |
Real-World Usage
- On-device models: Whisper (speech), Vision Transformer (image), small LLMs (3B or 7B with quantization)
- Typical latency: 200–500ms for inference on 3B models
- Battery impact: Minimal for occasional use, noticeable for continuous
Use mobile AI for:
- Privacy-first features (voice command, on-device translation)
- Reducing server load
- Features that work offline
7. Specialized Hardware: Enterprise & Research
When Available
These are cloud/enterprise only. You can’t buy them for your home.
| Hardware | Provider | Cost | Best For |
|---|---|---|---|
| TPU v5e | Google Cloud | $2–$5/hour per accelerator | Training, huge models |
| AWS Trainium | AWS | Custom pricing | Training optimization, lower cost than GPU |
| AWS Inferentia | AWS | Custom pricing | High-throughput inference |
| Graphcore IPU | Graphcore (cloud partners) | Custom pricing | Custom AI workloads, research |
| Cerebras CS-3 | Cerebras (cloud) | Custom pricing | Wafer-scale training with massive on-chip memory |
When to Use
TPU: If you’re training huge models (100B+) on Google Cloud. Google optimizes TPUs for Tensor processing, and they’re cheaper than H100s if you’re doing heavy work.
AWS Trainium: If training cost is your main concern. Generally cheaper per hour than GPUs for the same training job.
Others: Research only. Not production-ready or cost-effective for most teams.
8. Power and Thermal Considerations
Desktop PC Power Budget
| GPU | Power Draw | Recommended PSU | Cooling Difficulty | Noise Level |
|---|---|---|---|---|
| RTX 4090 | 450W | 1500W | High (needs good air or water) | Loud at full load |
| RTX 4080 | 320W | 1000W | Medium (good tower cooler) | Moderate |
| RTX 4070 | 200W | 750W | Low (standard cooler) | Quiet |
| RTX 4070 Mobile | 140W | Laptop PSU | Built-in | Laptop fan noise |
Real Operating Conditions
H100 in a data center: 700W + air/liquid cooling + rack space + facilities cost (~$20K/year total ownership for one GPU).
RTX 4090 on a desk: 450W continuous. At full load 24/7: 450W × 8,760 hours × $0.15/kWh = $590/year in electricity. Most people don’t run it 24/7, so ~$200–$300/year is realistic.
MacBook M3: ~35W under typical load, peaking near 70W on the higher-end chips. Battery: ~15–20 hours per charge. At $0.15/kWh: ~$20/year if plugged in constantly.
Data Center Considerations
If you’re running GPUs in a data center:
- Cooling: Proper airflow required. H100s need 200+ CFM per card.
- Power distribution: Dedicated circuits, UPS backup.
- Space: 2U rack space per 2–4 GPUs.
- Cost: Rack space $500–$2,000/month, plus power, plus labor.
Bottom line: If you need sustained compute, cloud is often cheaper than owning hardware due to shared infrastructure costs.
9. Decision Matrix: What Hardware to Buy
Scenario 1: Solo Developer Learning AI
| Decision | Choice | Why |
|---|---|---|
| Budget: $1,500–$2,000 | MacBook Air M3 16GB | Portable, unified memory, sufficient for 7B models, good battery |
| Alternative (if desktop preferred) | RTX 4070 + PC | $1,100 total, faster training, more room for growth |
| Timeline: Immediate | Buy now | Both will be viable for years |
Scenario 2: AI Research Team
| Decision | Choice | Why |
|---|---|---|
| Local GPUs: Yes | 2–4 RTX 4090s | $3K–$6K in GPUs, 5–10x faster than M-series |
| Cloud complement: Yes | AWS with H100s on-demand | For massive experiments, leave on-prem for iteration |
| Storage: Local NVMe RAID | 4TB RAID 10 | Working dataset cache, faster than cloud storage |
Scenario 3: Production Inference API
| Decision | Choice | Why |
|---|---|---|
| Where to run: AWS/GCP cloud | A100 or H100 clusters | Elasticity, no hardware to own, pay only for what you use |
| GPU count: 4–8 | Batch inference on multiple GPUs | Higher throughput per dollar |
| Load balancing: Kubernetes + vLLM | Auto-scale, queue requests | Efficient, fault-tolerant |
| On-prem alternative: Only if >10K req/sec | Buy A100s, need IT team | Once you exceed cloud cost, on-prem makes sense |
Scenario 4: Budget Startup
| Decision | Choice | Why |
|---|---|---|
| GPU for training | RTX 4070 | $600, good for quick iteration, 12GB VRAM |
| Dev environment | MacBook M3 + RTX 4070 desktop | Portable dev on M3, heavy compute on RTX |
| Production inference | AWS Lambda + GPU (part-time) or EC2 with L4 | No upfront cost, scale with usage |
Scenario 5: Edge Deployment
| Decision | Choice | Why |
|---|---|---|
| Phone/tablet | Existing hardware (A17/Snapdragon) | No extra purchase, on-device AI free |
| Custom inference device | Raspberry Pi 5 + M.2 accelerator or NVIDIA Jetson Orin | $200–$600, runs 3B models at roughly 50–100ms per token |
| Low-power IoT | Google Coral TPU or NVIDIA Jetson Nano | <$100, runs <100MB models, very fast |
10. Cloud vs On-Premise Economics
Cost Model
Cloud (AWS Example)
Training on H100: $3.00/hour per GPU
- 100-hour training job: 100 × $3 = $300
- No upfront cost, no hardware to manage
Production inference on A100: $2.00/hour per GPU
- ~26M inferences/month (an average of 10 req/sec)
- 1 GPU handles ~200 req/sec = 0.05 GPUs needed
- 30 days × 24 hours × 0.05 GPU = 36 GPU-hours = $72/month
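The sizing arithmetic above generalizes; a short sketch using the same assumed throughput and request rate (both are the illustrative values from the bullets, not measurements):

req_per_sec = 10            # average incoming request rate (assumption from above)
gpu_throughput = 200        # requests/sec one A100 can serve (assumption from above)
hourly_rate = 2.00          # cloud A100 $/hour

gpu_fraction = req_per_sec / gpu_throughput        # 0.05 of a GPU's capacity
gpu_hours_per_month = gpu_fraction * 24 * 30       # 36 GPU-hours
print(f"Monthly inference cost: ${gpu_hours_per_month * hourly_rate:.0f}")  # ~$72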
On-Premise (Break-Even Analysis)
RTX 4090 for training:
- Hardware cost: $1,500
- Power: $590/year
- Cooling/space: $500/year (rough estimate for home)
- 3-year amortization: ($1,500 + $590×3 + $500×3) / 3 = $1,590/year, or about $0.18/hour
H100 in data center:
- Hardware cost: $32,000
- Power: 700W × 8,760 hours × $0.12/kWh ≈ $736/year
- Space/cooling/labor: $15,000/year
- 3-year amortization: ($32,000 + $736×3 + $15,000×3) / 3 ≈ $26,400/year, or about $3.00/hour
Break-even:
- Cloud H100 at $3/hour vs on-prem at roughly $3/hour is near parity, but only at very high utilization
- If you run close to 8,760 hours/year (24/7), on-prem roughly breaks even
- If you run <4,000 hours/year, cloud is clearly cheaper (and flexibility matters)
Decision Rule
| Annual GPU Hours | Best Option |
|---|---|
| <2,000 hours | Pure cloud (AWS on-demand) |
| 2,000–8,000 hours | Hybrid (cloud for spikes, local for baseline) |
| >8,000 hours | On-prem (one GPU) |
| >50,000 hours | On-prem cluster (multiple GPUs) |
Hybrid Approach (Recommended for Teams)
- Local: RTX 4070 or M-series for development and prototyping
- Cloud: AWS H100 for large training jobs (spin up, train, spin down)
- Cost: Development is local (low), big experiments are cloud (cheaper per compute hour due to scale)
11. Unified Memory Advantage Deep Dive
Why It Matters
NVIDIA GPU Memory Architecture (PCIe bottleneck):
Typical PCIe 4.0 bandwidth: 32 GB/sec
Training 70B model with 2 GPUs needs ~140 GB data
Moving data GPU→GPU: 140 GB / 32 GB/sec = 4.4 seconds per iteration
(This is why NVLink exists on H100s—but not on consumer GPUs)
Apple Unified Memory (no PCIe):
Memory bandwidth: 100+ GB/sec (system memory)
CPU and GPU access same data: zero copy overhead
For inference: 20–40% faster because no data copy
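You can measure the PCIe transfer cost directly on an NVIDIA machine by timing a pinned host-to-device copy. A rough sketch with PyTorch (assumes a CUDA build and roughly 1GB of free RAM and VRAM):

import time
import torch

assert torch.cuda.is_available(), "needs an NVIDIA GPU with a CUDA build of PyTorch"

# ~1 GiB of pinned host memory (268M float32 values)
x = torch.empty(1024**3 // 4, dtype=torch.float32, pin_memory=True)
torch.cuda.synchronize()
start = time.perf_counter()
x_gpu = x.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"Copied 1 GiB in {elapsed:.3f}s -> {1 / elapsed:.1f} GB/s over PCIe")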
Practical Example: 7B Model Inference
NVIDIA RTX 4090:
- Load 7B model (FP16, ~14GB) from storage into CPU memory
- Copy to GPU memory: 14GB / 32 GB/sec PCIe ≈ 0.44 seconds (a one-time cost at load)
- Inference: roughly 40–50 tokens/sec, bounded by the 4090’s ~1 TB/s VRAM bandwidth
- Copy results back: negligible
Apple M3 (16GB unified):
- Load 7B model: ~14GB, already in unified memory; no separate copy step
- Inference: roughly 5–10 tokens/sec, bounded by ~100 GB/s system memory bandwidth
Result: once a model fits in VRAM, the copy saving is mostly a one-time cost at load, and the 4090’s far higher memory bandwidth wins on steady-state tokens/sec. The unified-memory advantage matters most for models that fit in a Mac’s large unified memory but not in consumer VRAM, where NVIDIA would have to offload layers over PCIe.
When Unified Memory Doesn’t Matter
- Large models (70B+): Don’t fit in M3 Max 48GB, need offloading anyway (loses advantage)
- Batch training: NVIDIA’s CUDA libraries are optimized for batching; Apple’s are not
- Server inference: data-center GPUs pair large VRAM with very high-bandwidth HBM, so the unified-memory advantage doesn’t apply there
Why NVIDIA Doesn’t Have This (Consumer)
NVIDIA’s architecture separates CPU and GPU—they’re different instruction sets. It’s hard to merge them without redesigning everything. A100/H100 have NVLink (connects GPUs at high bandwidth), but consumer GPUs use PCIe, which is slow.
Apple unified CPU + GPU because they control the whole stack (chip design, software). NVIDIA can’t do this without breaking 20 years of CUDA.
12. Future Hardware: 2026 and Beyond
Expected Releases
| Vendor | Hardware | Expected | What’s New |
|---|---|---|---|
| NVIDIA | Blackwell (H100 successor) | Q2 2025 (likely shipping now in 2026) | 2x performance, better power efficiency, NvLink 5.0 |
| NVIDIA | RTX 5000 series | Q4 2025 | consumer Blackwell, ~3x faster than RTX 4090 |
| Apple | M5 chip | Spring 2026 | Likely 20% faster than M4, more GPU cores |
| Intel | Arc 4-series (Battlemage) | Q2–Q4 2025 | Driver improvements, better performance/watt |
| AMD | RDNA4 | Q1–Q2 2026 | Competitor to RTX 5000 series |
| Cerebras | Wafer-Scale Engine 4 | 2026 | On-chip, not PCIe; massive memory, research only |
| Google | TPU v5e | Now available | Better cost per training TFLOP |
What Will Actually Matter
- Power efficiency: As electricity costs rise, watts-per-TFLOP becomes critical
- HBM memory: Blackwell data-center GPUs move to HBM3e (faster and higher bandwidth than Hopper’s HBM3); consumer cards stay on GDDR
- Unified memory adoption: May see more ARM-based chips with unified memory
- Sparse compute: Models with fewer parameters become standard (efficiency wins)
- On-device AI: Phones get better Neural Engines; less need to send data to servers
Safe Bets for Buying Now
- RTX 4070: Will work for years. If new cards are 3x faster, so what—4070 still runs 7B models fine.
- M3/M4: Will be supported for development for 5+ years minimum (Apple’s track record).
- Cloud compute: Always flexible. Doesn’t matter if you’re using H100 or Blackwell; pay per hour.
Quick Reference: Hardware by Use Case
Local Development (Laptop)
- Best: MacBook Pro M4 16GB ($2,500)
- Runner-up: MacBook Air M3 16GB ($1,800)
- Why: Unified memory, portable, zero setup
Local Development (Desktop)
- Best: RTX 4070 + PC ($1,100 total)
- Runner-up: RTX 4090 if you have $2,000+ budget
- Why: Fastest, most expandable
Training (Home Lab)
- Best: RTX 4090 ($1,500) or RTX 4080 ($1,200)
- Setup: i9 CPU, 64GB RAM, 1500W PSU, good cooling
- Cost: $3,000–$4,000 total for GPU + system
Training (Cloud)
- Best: AWS with on-demand H100s or Trainium
- Cost: $3–$10/hour per GPU depending on instance type
- Recommendation: Always start here. Buy hardware only if you exceed cloud cost.
Production Inference
- Scale: <10K req/sec: AWS A100 or H100 on-demand
- Scale: 10K–100K req/sec: Dedicated instances (cheaper per request)
- Scale: >100K req/sec: Own cluster (break-even on hardware)
Edge (Phone/Tablet)
- Use built-in Neural Engine: A17, Snapdragon 8, Tensor 4
- Cost: $0 (already in device)
- Typical latency: 100–500ms for 3B models
Edge (Custom Device)
- Best: Google Coral TPU ($50–$100) or NVIDIA Jetson Nano ($100–$200)
- For: Running pre-trained 100MB–1GB models offline
- Latency: 50–200ms
Summary: The Cost-Performance Frontier
As of April 2026:
Best value: RTX 4070 ($600 GPU + $500 system = $1,100 total). Handles 7B–13B models for training and inference. Most people should buy this.
Best mobility: MacBook Air M3 ($1,800). Unified memory, silent, 15–20 hour battery, sufficient for most dev work.
Best raw power: RTX 4090 ($1,500) for single GPU. Needs good cooling and power supply.
Best for production: AWS with H100 or A100 on-demand. Pay per use, elasticity, no hardware to manage.
Best for edge: Use existing phone chips (A17, Snapdragon, Tensor). Or Raspberry Pi 5 + Coral TPU (~$200) for custom devices.
Future-proof: Whatever you buy in 2026 will be obsolete in 3–5 years. Don’t overspend on hardware you’ll replace. Buy what solves today’s problem, assume you’ll upgrade.
13. Hardware Detection Script
Before choosing models or optimizations, know what you have. This script detects your hardware and recommends what models you can run.
"""
hardware_detect.py — Detect AI-relevant hardware and recommend model sizes.
Works on Linux (NVIDIA/AMD GPUs), macOS (Apple Silicon), and Windows.
Requires: psutil (pip install psutil)
Optional: torch, pynvml (for GPU details)
"""
import platform
import subprocess
import shutil
import json
from dataclasses import dataclass, field
@dataclass
class GPUInfo:
name: str = "Unknown"
vram_gb: float = 0.0
cuda_version: str = "N/A"
driver_version: str = "N/A"
compute_capability: str = "N/A"
@dataclass
class CPUInfo:
name: str = "Unknown"
cores_physical: int = 0
cores_logical: int = 0
architecture: str = "Unknown"
@dataclass
class SystemInfo:
cpu: CPUInfo = field(default_factory=CPUInfo)
gpus: list = field(default_factory=list)
ram_gb: float = 0.0
os_name: str = "Unknown"
has_neural_engine: bool = False
neural_engine_cores: int = 0
unified_memory: bool = False
apple_chip: str = ""
def detect_cpu() -> CPUInfo:
"""Detect CPU type, cores, and architecture."""
import psutil
cpu = CPUInfo()
cpu.cores_physical = psutil.cpu_count(logical=False) or 0
cpu.cores_logical = psutil.cpu_count(logical=True) or 0
cpu.architecture = platform.machine()
system = platform.system()
if system == "Darwin":
try:
result = subprocess.run(
["sysctl", "-n", "machdep.cpu.brand_string"],
capture_output=True, text=True, timeout=5
)
cpu.name = result.stdout.strip() or "Apple Silicon"
except (subprocess.TimeoutExpired, FileNotFoundError):
cpu.name = "Apple Silicon (detection failed)"
elif system == "Linux":
try:
with open("/proc/cpuinfo", "r") as f:
for line in f:
if "model name" in line:
cpu.name = line.split(":")[1].strip()
break
except FileNotFoundError:
cpu.name = "Unknown Linux CPU"
elif system == "Windows":
cpu.name = platform.processor() or "Unknown Windows CPU"
return cpu
def detect_nvidia_gpu() -> list[GPUInfo]:
"""Detect NVIDIA GPUs using nvidia-smi (no Python deps needed)."""
gpus = []
if not shutil.which("nvidia-smi"):
return gpus
try:
result = subprocess.run(
[
"nvidia-smi",
"--query-gpu=name,memory.total,driver_version",
"--format=csv,noheader,nounits",
],
capture_output=True, text=True, timeout=10,
)
if result.returncode != 0:
return gpus
for line in result.stdout.strip().split("\n"):
parts = [p.strip() for p in line.split(",")]
if len(parts) >= 3:
gpu = GPUInfo()
gpu.name = parts[0]
gpu.vram_gb = round(float(parts[1]) / 1024, 1)
gpu.driver_version = parts[2]
gpus.append(gpu)
# Get CUDA version separately
cuda_result = subprocess.run(
["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
capture_output=True, text=True, timeout=10,
)
if cuda_result.returncode == 0:
caps = cuda_result.stdout.strip().split("\n")
for i, cap in enumerate(caps):
if i < len(gpus):
gpus[i].compute_capability = cap.strip()
# Get CUDA toolkit version
cuda_ver = subprocess.run(
["nvcc", "--version"],
capture_output=True, text=True, timeout=10,
)
if cuda_ver.returncode == 0:
for line in cuda_ver.stdout.split("\n"):
if "release" in line.lower():
version = line.split("release")[-1].split(",")[0].strip()
for gpu in gpus:
gpu.cuda_version = version
except (subprocess.TimeoutExpired, FileNotFoundError):
pass
return gpus
def detect_apple_silicon() -> dict:
"""Detect Apple Silicon details including Neural Engine."""
info = {
"chip": "",
"neural_engine": False,
"neural_engine_cores": 0,
"unified_memory": False,
"gpu_cores": 0,
}
if platform.system() != "Darwin" or platform.machine() != "arm64":
return info
info["unified_memory"] = True
try:
result = subprocess.run(
["sysctl", "-n", "hw.optional.arm.FEAT_FP16"],
capture_output=True, text=True, timeout=5,
)
# All Apple Silicon has Neural Engine
info["neural_engine"] = True
except (subprocess.TimeoutExpired, FileNotFoundError):
pass
# Detect chip name from system_profiler
try:
result = subprocess.run(
["system_profiler", "SPHardwareDataType", "-json"],
capture_output=True, text=True, timeout=15,
)
if result.returncode == 0:
data = json.loads(result.stdout)
hw = data.get("SPHardwareDataType", [{}])[0]
chip_name = hw.get("chip_type", "")
info["chip"] = chip_name
# Neural Engine core counts by generation
ne_cores = {
"M1": 16, "M2": 16, "M3": 16, "M4": 16,
"M1 Pro": 16, "M1 Max": 16, "M1 Ultra": 32,
"M2 Pro": 16, "M2 Max": 16, "M2 Ultra": 32,
"M3 Pro": 16, "M3 Max": 16,
"M4 Pro": 16, "M4 Max": 16,
}
for chip, cores in ne_cores.items():
if chip in chip_name:
info["neural_engine_cores"] = cores
break
else:
if "Apple" in chip_name:
info["neural_engine_cores"] = 16 # default
# GPU core count from system_profiler
gpu_cores_str = hw.get("number_processors", "")
if "gpu" in str(gpu_cores_str).lower():
info["gpu_cores"] = int(
"".join(c for c in str(gpu_cores_str) if c.isdigit()) or "0"
)
except (subprocess.TimeoutExpired, FileNotFoundError, json.JSONDecodeError):
pass
return info
def detect_ram_gb() -> float:
"""Detect total system RAM in GB."""
import psutil
return round(psutil.virtual_memory().total / (1024 ** 3), 1)
def recommend_model_size(system: SystemInfo) -> dict:
"""Recommend maximum model size based on detected hardware."""
recommendations = {
"max_model_params": "",
"quantization": "",
"framework": "",
"reasoning": [],
}
# Determine available memory for models
available_vram = 0.0
has_gpu = False
if system.gpus:
has_gpu = True
available_vram = max(gpu.vram_gb for gpu in system.gpus)
elif system.unified_memory:
# Apple Silicon: ~75% of RAM usable for models
available_vram = system.ram_gb * 0.75
# Model size estimates (quantized with AWQ/GGUF Q4):
# 7B = ~4GB, 13B = ~8GB, 34B = ~20GB,
# 70B = ~40GB, 180B = ~100GB
if available_vram >= 100:
recommendations["max_model_params"] = "180B"
recommendations["quantization"] = "AWQ 4-bit or FP16 for 70B"
recommendations["reasoning"].append(
f"{available_vram:.0f}GB available — can run 180B quantized or 70B at FP16"
)
elif available_vram >= 40:
recommendations["max_model_params"] = "70B"
recommendations["quantization"] = "AWQ 4-bit recommended"
recommendations["reasoning"].append(
f"{available_vram:.0f}GB available — 70B fits with 4-bit quantization"
)
elif available_vram >= 20:
recommendations["max_model_params"] = "34B"
recommendations["quantization"] = "AWQ 4-bit or GGUF Q4_K_M"
recommendations["reasoning"].append(
f"{available_vram:.0f}GB available — 34B fits comfortably quantized"
)
elif available_vram >= 8:
recommendations["max_model_params"] = "13B"
recommendations["quantization"] = "GGUF Q4_K_M recommended"
recommendations["reasoning"].append(
f"{available_vram:.0f}GB available — 13B fits with quantization"
)
elif available_vram >= 4:
recommendations["max_model_params"] = "7B"
recommendations["quantization"] = "GGUF Q4_K_M required"
recommendations["reasoning"].append(
f"{available_vram:.0f}GB available — 7B at 4-bit quantization"
)
else:
recommendations["max_model_params"] = "3B or smaller"
recommendations["quantization"] = "GGUF Q4_0 (most aggressive)"
recommendations["reasoning"].append(
f"Only {available_vram:.0f}GB available — limited to small models"
)
# Framework recommendation
if system.unified_memory:
recommendations["framework"] = "llama.cpp (Metal) or MLX"
recommendations["reasoning"].append(
"Apple Silicon detected — use MLX or llama.cpp with Metal acceleration"
)
elif has_gpu and any("NVIDIA" in g.name or "GeForce" in g.name or "RTX" in g.name
for g in system.gpus):
recommendations["framework"] = "vLLM, TGI, or llama.cpp (CUDA)"
recommendations["reasoning"].append(
"NVIDIA GPU detected — use CUDA-accelerated inference"
)
elif has_gpu:
recommendations["framework"] = "llama.cpp (ROCm or Vulkan)"
recommendations["reasoning"].append(
"Non-NVIDIA GPU — use llama.cpp with ROCm or Vulkan backend"
)
else:
recommendations["framework"] = "llama.cpp (CPU mode)"
recommendations["reasoning"].append(
"No GPU detected — CPU inference only, expect slow performance"
)
return recommendations
def detect_all() -> SystemInfo:
"""Run all detection and return a SystemInfo object."""
system = SystemInfo()
system.os_name = f"{platform.system()} {platform.release()}"
system.cpu = detect_cpu()
system.ram_gb = detect_ram_gb()
system.gpus = detect_nvidia_gpu()
apple = detect_apple_silicon()
system.has_neural_engine = apple["neural_engine"]
system.neural_engine_cores = apple["neural_engine_cores"]
system.unified_memory = apple["unified_memory"]
system.apple_chip = apple["chip"]
return system
def print_report(system: SystemInfo):
"""Print a formatted hardware report with recommendations."""
print("=" * 60)
print(" AI HARDWARE DETECTION REPORT")
print("=" * 60)
print(f"\n--- Operating System ---")
print(f" OS: {system.os_name}")
print(f"\n--- CPU ---")
print(f" Model: {system.cpu.name}")
print(f" Architecture: {system.cpu.architecture}")
print(f" Cores: {system.cpu.cores_physical} physical, "
f"{system.cpu.cores_logical} logical")
print(f"\n--- Memory ---")
print(f" Total RAM: {system.ram_gb} GB")
if system.unified_memory:
print(f" Type: Unified Memory (shared CPU/GPU)")
else:
print(f" Type: System RAM (separate from GPU VRAM)")
if system.gpus:
print(f"\n--- GPU(s) ---")
for i, gpu in enumerate(system.gpus):
print(f" GPU {i}: {gpu.name}")
print(f" VRAM: {gpu.vram_gb} GB")
print(f" CUDA: {gpu.cuda_version}")
print(f" Driver: {gpu.driver_version}")
print(f" Compute: {gpu.compute_capability}")
else:
print(f"\n--- GPU ---")
print(f" No NVIDIA GPU detected")
if system.apple_chip:
print(f" Apple chip: {system.apple_chip} (integrated GPU)")
if system.has_neural_engine:
print(f"\n--- Neural Engine ---")
print(f" Present: Yes")
print(f" Cores: {system.neural_engine_cores}")
# Recommendations
recs = recommend_model_size(system)
print(f"\n--- Recommendations ---")
print(f" Max model: {recs['max_model_params']} parameters")
print(f" Quantization: {recs['quantization']}")
print(f" Framework: {recs['framework']}")
for reason in recs["reasoning"]:
print(f" * {reason}")
print("\n" + "=" * 60)
if __name__ == "__main__":
system = detect_all()
print_report(system)
Example output on a MacBook Pro M4 Max with 64GB:
============================================================
AI HARDWARE DETECTION REPORT
============================================================
--- Operating System ---
OS: Darwin 25.3.0
--- CPU ---
Model: Apple M4 Max
Architecture: arm64
Cores: 14 physical, 14 logical
--- Memory ---
Total RAM: 64.0 GB
Type: Unified Memory (shared CPU/GPU)
--- GPU ---
No NVIDIA GPU detected
Apple chip: Apple M4 Max (integrated GPU)
--- Neural Engine ---
Present: Yes
Cores: 16
--- Recommendations ---
 Max model: 70B parameters
 Quantization: AWQ 4-bit recommended
 Framework: llama.cpp (Metal) or MLX
 * 48GB available — 70B fits with 4-bit quantization
* Apple Silicon detected — use MLX or llama.cpp with Metal acceleration
============================================================
14. Inference Benchmark Script
Numbers in spec sheets are theoretical. This script measures actual performance on your hardware: tokens per second, latency, and memory usage.
"""
benchmark_inference.py — Measure real inference performance on your hardware.
Requires: llama-cpp-python (pip install llama-cpp-python)
psutil (pip install psutil)
Usage:
python benchmark_inference.py --model path/to/model.gguf
python benchmark_inference.py --model path/to/model.gguf --prompt "Explain gravity"
python benchmark_inference.py --model path/to/model.gguf --runs 5
"""
import argparse
import time
import os
import statistics
from dataclasses import dataclass
@dataclass
class BenchmarkResult:
model_name: str
model_size_gb: float
prompt_tokens: int
generated_tokens: int
time_to_first_token_ms: float
tokens_per_second: float
total_time_seconds: float
peak_memory_gb: float
hardware: str
def get_memory_usage_gb() -> float:
"""Get current process memory usage in GB."""
import psutil
process = psutil.Process(os.getpid())
return process.memory_info().rss / (1024 ** 3)
def get_model_size_gb(model_path: str) -> float:
"""Get model file size in GB."""
return os.path.getsize(model_path) / (1024 ** 3)
def get_hardware_name() -> str:
"""Get a short hardware description."""
import platform
system = platform.system()
machine = platform.machine()
if system == "Darwin" and machine == "arm64":
import subprocess
try:
result = subprocess.run(
["sysctl", "-n", "machdep.cpu.brand_string"],
capture_output=True, text=True, timeout=5,
)
return result.stdout.strip()
except Exception:
return "Apple Silicon"
import shutil
if shutil.which("nvidia-smi"):
import subprocess
try:
result = subprocess.run(
["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
capture_output=True, text=True, timeout=10,
)
gpus = result.stdout.strip().split("\n")
return gpus[0] if gpus else "NVIDIA GPU"
except Exception:
return "NVIDIA GPU"
return f"{system} {machine} (CPU only)"
def run_single_benchmark(
model_path: str,
prompt: str,
max_tokens: int = 128,
n_ctx: int = 2048,
n_gpu_layers: int = -1,
) -> BenchmarkResult:
"""Run a single inference benchmark."""
from llama_cpp import Llama
hardware = get_hardware_name()
model_size = get_model_size_gb(model_path)
model_name = os.path.basename(model_path)
# Measure memory before loading
mem_before = get_memory_usage_gb()
# Load model (this is not part of inference timing)
print(f" Loading model: {model_name} ({model_size:.1f} GB)...")
load_start = time.perf_counter()
llm = Llama(
model_path=model_path,
n_ctx=n_ctx,
n_gpu_layers=n_gpu_layers,
verbose=False,
)
load_time = time.perf_counter() - load_start
print(f" Model loaded in {load_time:.1f}s")
# Measure memory after loading
mem_after_load = get_memory_usage_gb()
# Run inference
print(f" Running inference (max {max_tokens} tokens)...")
tokens_generated = 0
first_token_time = None
start_time = time.perf_counter()
output = llm(
prompt,
max_tokens=max_tokens,
temperature=0.7,
top_p=0.9,
echo=False,
)
end_time = time.perf_counter()
total_time = end_time - start_time
# Extract results
generated_text = output["choices"][0]["text"]
tokens_generated = output["usage"]["completion_tokens"]
prompt_tokens = output["usage"]["prompt_tokens"]
# Peak memory
mem_peak = get_memory_usage_gb()
# Calculate metrics
tokens_per_second = tokens_generated / total_time if total_time > 0 else 0
    # Rough proxy for time-to-first-token: the simple llama.cpp API doesn't expose
    # prompt-eval time separately, so we report the average ms per generated token instead.
    ttft_ms = (total_time / tokens_generated * 1000) if tokens_generated > 0 else 0
result = BenchmarkResult(
model_name=model_name,
model_size_gb=model_size,
prompt_tokens=prompt_tokens,
generated_tokens=tokens_generated,
time_to_first_token_ms=ttft_ms,
tokens_per_second=tokens_per_second,
total_time_seconds=total_time,
peak_memory_gb=mem_peak,
hardware=hardware,
)
# Clean up
del llm
return result
def run_benchmark(
model_path: str,
prompt: str = "Explain the theory of relativity in simple terms.",
max_tokens: int = 128,
runs: int = 3,
n_gpu_layers: int = -1,
):
"""Run multiple benchmark iterations and report statistics."""
print("=" * 60)
print(" INFERENCE BENCHMARK")
print("=" * 60)
if not os.path.exists(model_path):
print(f"\nError: Model file not found: {model_path}")
return
results = []
for i in range(runs):
print(f"\n--- Run {i + 1}/{runs} ---")
result = run_single_benchmark(
model_path=model_path,
prompt=prompt,
max_tokens=max_tokens,
n_gpu_layers=n_gpu_layers,
)
results.append(result)
print(f" Tokens/sec: {result.tokens_per_second:.1f}")
print(f" Total time: {result.total_time_seconds:.2f}s")
print(f" Tokens generated: {result.generated_tokens}")
# Statistics
tps_values = [r.tokens_per_second for r in results]
latency_values = [r.total_time_seconds for r in results]
memory_values = [r.peak_memory_gb for r in results]
print("\n" + "=" * 60)
print(" RESULTS SUMMARY")
print("=" * 60)
print(f"\n Hardware: {results[0].hardware}")
print(f" Model: {results[0].model_name}")
print(f" Model size: {results[0].model_size_gb:.1f} GB")
print(f" Runs: {runs}")
print(f"\n Tokens/sec: {statistics.mean(tps_values):.1f} "
f"(min={min(tps_values):.1f}, max={max(tps_values):.1f})")
if runs > 1:
print(f" Std dev: {statistics.stdev(tps_values):.1f} tok/s")
print(f" Avg latency: {statistics.mean(latency_values):.2f}s "
f"for {max_tokens} tokens")
print(f" Peak memory: {max(memory_values):.1f} GB")
# Compare to reference numbers
print(f"\n --- Reference Comparison ---")
print_reference_comparison(results[0])
print("\n" + "=" * 60)
# Reference benchmarks: approximate tokens/sec for common hardware + model combos
REFERENCE_BENCHMARKS = {
"7B-Q4": {
"RTX 4090": 90,
"RTX 4070": 55,
"RTX 4070 Ti Super": 65,
"M3 (16GB)": 15,
"M3 Max (48GB)": 25,
"M4 Pro (36GB)": 30,
"M4 Max (64GB)": 35,
"A100 (80GB)": 120,
"H100 (80GB)": 180,
"CPU only (8-core)": 5,
},
"13B-Q4": {
"RTX 4090": 55,
"RTX 4070": 30,
"M3 Max (48GB)": 12,
"M4 Max (64GB)": 20,
"A100 (80GB)": 70,
"H100 (80GB)": 110,
"CPU only (8-core)": 2,
},
"34B-Q4": {
"RTX 4090": 25,
"M4 Max (64GB)": 10,
"A100 (80GB)": 40,
"H100 (80GB)": 65,
},
"70B-Q4": {
"RTX 4090": 8,
"M2 Ultra (192GB)": 5,
"A100 (80GB)": 20,
"H100 (80GB)": 35,
},
}
def print_reference_comparison(result: BenchmarkResult):
"""Print how the result compares to known reference benchmarks."""
# Determine model size category
size_gb = result.model_size_gb
if size_gb < 6:
category = "7B-Q4"
elif size_gb < 10:
category = "13B-Q4"
elif size_gb < 25:
category = "34B-Q4"
else:
category = "70B-Q4"
refs = REFERENCE_BENCHMARKS.get(category, {})
if not refs:
print(" No reference data for this model size.")
return
print(f" Category: {category} (based on {size_gb:.1f}GB file size)")
print(f" Your result: {result.tokens_per_second:.1f} tok/s")
print(f" Reference numbers for {category}:")
for hw, tps in sorted(refs.items(), key=lambda x: x[1], reverse=True):
marker = ""
if result.tokens_per_second > 0:
ratio = result.tokens_per_second / tps
if 0.8 <= ratio <= 1.2:
marker = " <-- similar to your hardware"
print(f" {hw:25s} {tps:>6} tok/s{marker}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Benchmark LLM inference")
parser.add_argument("--model", required=True, help="Path to GGUF model file")
parser.add_argument("--prompt", default="Explain the theory of relativity.",
help="Prompt to use")
parser.add_argument("--max-tokens", type=int, default=128,
help="Max tokens to generate")
parser.add_argument("--runs", type=int, default=3,
help="Number of benchmark runs")
parser.add_argument("--cpu-only", action="store_true",
help="Force CPU-only inference")
args = parser.parse_args()
n_gpu = 0 if args.cpu_only else -1
run_benchmark(
model_path=args.model,
prompt=args.prompt,
max_tokens=args.max_tokens,
runs=args.runs,
n_gpu_layers=n_gpu,
)
Usage:
# Basic benchmark
python benchmark_inference.py --model models/llama-7b-q4.gguf
# Custom prompt, more runs
python benchmark_inference.py --model models/mistral-7b-q4.gguf --runs 5 --prompt "Write a Python function"
# Force CPU-only (to compare GPU vs CPU)
python benchmark_inference.py --model models/llama-7b-q4.gguf --cpu-only
15. Cost-Per-TFLOP Analysis
Raw specs are meaningless without cost context. This section breaks down the actual cost to compute on each GPU.
Consumer GPU Cost Per TFLOP
| GPU | TFLOPS (FP16) | Purchase Price | $/TFLOP (Purchase) | Effective $/TFLOP (3yr amortized) |
|---|---|---|---|---|
| RTX 4070 | 58 | $600 | $10.34 | $14.06 (incl. power) |
| RTX 4070 Ti Super | 74 | $900 | $12.16 | $16.22 |
| RTX 4080 Super | 94 | $1,200 | $12.77 | $16.74 |
| RTX 4090 | 164 | $1,500 | $9.15 | $13.83 (power-hungry) |
| RX 7900 XTX | 61 | $700 | $11.48 | $15.06 |
Data Center GPU Cost Per TFLOP
| GPU | TFLOPS (FP16) | Purchase Price | $/TFLOP (Purchase) | Cloud $/hr | Cloud $/TFLOP-hr |
|---|---|---|---|---|---|
| A100 (80GB) | 312 | $15,000 | $48.08 | $2.00 | $0.0064 |
| H100 (80GB) | 990 | $32,000 | $32.32 | $3.00 | $0.0030 |
| H200 (141GB) | 990 | $38,000 | $38.38 | $4.50 | $0.0045 |
| L40 | 181 | $10,000 | $55.25 | $1.50 | $0.0083 |
Cloud vs On-Prem Break-Even Calculator
"""
cost_breakeven.py — Calculate cloud vs on-prem break-even point.
No dependencies required — pure Python.
"""
from dataclasses import dataclass
@dataclass
class GPUSpec:
name: str
purchase_price: float # USD
tflops_fp16: float
power_watts: float
cloud_hourly: float # $/hr for equivalent cloud instance
# Common GPU specs (April 2026 pricing)
GPU_CATALOG = {
"rtx_4070": GPUSpec("RTX 4070", 600, 58, 200, 0.50),
"rtx_4080": GPUSpec("RTX 4080 Super", 1200, 94, 320, 0.80),
"rtx_4090": GPUSpec("RTX 4090", 1500, 164, 450, 1.20),
"a100_80": GPUSpec("A100 80GB", 15000, 312, 400, 2.00),
"h100": GPUSpec("H100 80GB", 32000, 990, 700, 3.00),
"h200": GPUSpec("H200 141GB", 38000, 990, 4.50, 4.50),
"l40": GPUSpec("L40 48GB", 10000, 181, 300, 1.50),
}
def calculate_onprem_hourly(
gpu: GPUSpec,
electricity_per_kwh: float = 0.15,
cooling_overhead: float = 1.3, # PUE (power usage effectiveness)
annual_maintenance: float = 500, # IT labor, replacements
amortization_years: int = 3,
) -> float:
"""Calculate effective hourly cost of on-prem GPU."""
hours_per_year = 8760
# Hardware amortization
hardware_hourly = gpu.purchase_price / (amortization_years * hours_per_year)
# Power cost (GPU + cooling overhead)
power_hourly = (gpu.power_watts / 1000) * electricity_per_kwh * cooling_overhead
# Maintenance
maintenance_hourly = annual_maintenance / hours_per_year
return hardware_hourly + power_hourly + maintenance_hourly
def find_breakeven_hours(
gpu: GPUSpec,
electricity_per_kwh: float = 0.15,
amortization_years: int = 3,
) -> float:
"""Find annual hours where on-prem cost equals cloud cost."""
# On-prem fixed costs (annual)
annual_hardware = gpu.purchase_price / amortization_years
annual_maintenance = 500
# On-prem variable costs (per hour)
power_per_hour = (gpu.power_watts / 1000) * electricity_per_kwh * 1.3
# Cloud variable cost (per hour)
cloud_per_hour = gpu.cloud_hourly
# Break-even: annual_fixed + hours * power_per_hour = hours * cloud_per_hour
# hours * (cloud - power) = annual_fixed
# hours = annual_fixed / (cloud - power)
cost_diff = cloud_per_hour - power_per_hour
if cost_diff <= 0:
return float("inf") # Cloud is cheaper per hour — on-prem never breaks even
annual_fixed = annual_hardware + annual_maintenance
return annual_fixed / cost_diff
def print_analysis(electricity_per_kwh: float = 0.15):
"""Print full cost analysis for all GPUs."""
print("=" * 80)
print(" CLOUD vs ON-PREM COST ANALYSIS")
print(f" Electricity rate: ${electricity_per_kwh}/kWh | "
f"Amortization: 3 years | PUE: 1.3")
print("=" * 80)
print(f"\n{'GPU':20s} {'Cloud $/hr':>10s} {'On-Prem $/hr':>12s} "
f"{'Break-Even':>12s} {'Annual Power':>12s}")
print("-" * 70)
for key, gpu in GPU_CATALOG.items():
onprem_hourly = calculate_onprem_hourly(gpu, electricity_per_kwh)
breakeven = find_breakeven_hours(gpu, electricity_per_kwh)
annual_power = (gpu.power_watts / 1000) * 8760 * electricity_per_kwh
breakeven_str = f"{breakeven:.0f} hrs/yr" if breakeven < 50000 else "Never"
print(f"{gpu.name:20s} ${gpu.cloud_hourly:>8.2f} ${onprem_hourly:>10.3f} "
f"{breakeven_str:>12s} ${annual_power:>10.0f}")
print(f"\n Break-even = annual hours where on-prem becomes cheaper than cloud")
print(f" On-prem cost includes hardware amortization, power, cooling (PUE), "
f"and $500/yr maintenance")
# Scenario analysis
print(f"\n{'':=<80}")
print(" SCENARIO ANALYSIS: RTX 4090")
print(f"{'':=<80}")
gpu = GPU_CATALOG["rtx_4090"]
scenarios = [
("Hobby (4 hrs/week)", 4 * 52),
("Part-time (20 hrs/week)", 20 * 52),
("Full-time (40 hrs/week)", 40 * 52),
("Always-on (24/7)", 8760),
]
for label, hours in scenarios:
cloud_cost = hours * gpu.cloud_hourly
onprem_cost = (
gpu.purchase_price / 3 # amortization
+ (gpu.power_watts / 1000) * hours * electricity_per_kwh * 1.3
+ 500 # maintenance
)
cheaper = "On-prem" if onprem_cost < cloud_cost else "Cloud"
savings = abs(cloud_cost - onprem_cost)
print(f" {label:30s} Cloud: ${cloud_cost:>8,.0f}/yr "
f"On-prem: ${onprem_cost:>8,.0f}/yr "
f"Winner: {cheaper} (saves ${savings:,.0f})")
if __name__ == "__main__":
print_analysis(electricity_per_kwh=0.15)
print("\n--- With cheap electricity ($0.08/kWh) ---\n")
print_analysis(electricity_per_kwh=0.08)
Example output:
================================================================================
CLOUD vs ON-PREM COST ANALYSIS
Electricity rate: $0.15/kWh | Amortization: 3 years | PUE: 1.3
================================================================================
GPU Cloud $/hr On-Prem $/hr Break-Even Annual Power
----------------------------------------------------------------------
RTX 4070              $     0.50 $     0.119  1518 hrs/yr $       263
RTX 4090              $     1.20 $     0.202   899 hrs/yr $       591
A100 80GB             $     2.00 $     0.706  2862 hrs/yr $       526
H100 80GB             $     3.00 $     1.411  3900 hrs/yr $       920
SCENARIO ANALYSIS: RTX 4090
  Hobby (4 hrs/week)             Cloud: $      250/yr On-prem: $    1,018/yr Winner: Cloud (saves $769)
  Part-time (20 hrs/week)        Cloud: $    1,248/yr On-prem: $    1,091/yr Winner: On-prem (saves $157)
  Full-time (40 hrs/week)        Cloud: $    2,496/yr On-prem: $    1,183/yr Winner: On-prem (saves $1,313)
  Always-on (24/7)               Cloud: $   10,512/yr On-prem: $    1,769/yr Winner: On-prem (saves $8,743)
16. Mobile & Edge Hardware: Expanded Comparison
Detailed Mobile SoC Comparison (2026)
| Chip | Device | CPU Cores | GPU Cores | NPU TOPS | RAM | Process | Release |
|---|---|---|---|---|---|---|---|
| Apple A18 Pro | iPhone 16 Pro | 6 (2P+4E) | 6-core | 35 TOPS | 8GB | 3nm | Sep 2024 |
| Apple A17 Pro | iPhone 15 Pro | 6 (2P+4E) | 6-core | 16 TOPS | 8GB | 3nm | Sep 2023 |
| Snapdragon 8 Gen 3 | Galaxy S24 Ultra, etc. | 8 (1+5+2) | Adreno 750 | 45 TOPS | 8–16GB | 4nm | Nov 2023 |
| Snapdragon 8 Elite | Galaxy S25 Ultra, etc. | 8 (2+6) | Adreno 830 | 75 TOPS | 12–16GB | 3nm | Oct 2024 |
| Google Tensor G4 | Pixel 9 | 8 (1+3+4) | Mali-G715 | 8 TOPS | 12GB | 4nm | Aug 2024 |
| MediaTek Dimensity 9300 | OnePlus 12, etc. | 8 (1+3+4) | Immortalis-G720 | 37 TOPS | 8–16GB | 4nm | Nov 2023 |
| Samsung Exynos 2400 | Galaxy S24 (select) | 10 (1+2+3+4) | Xclipse 940 | 14.7 TOPS | 8–12GB | 4nm | Jan 2024 |
Edge Compute Devices for AI
| Device | Processor | AI Performance | RAM | Power | Price | Best For |
|---|---|---|---|---|---|---|
| Raspberry Pi 5 | Cortex-A76 (4-core) | ~2 TOPS (CPU) | 4–8GB | 5–12W | $60–$80 | Prototyping, IoT |
| RPi 5 + Coral M.2 TPU | Cortex-A76 + Edge TPU | 4 TOPS (TPU) + 2 (CPU) | 4–8GB | 8–15W | $100–$140 | Edge inference |
| NVIDIA Jetson Orin Nano | Cortex-A78AE + Ampere GPU | 40 TOPS | 4–8GB | 7–15W | $200–$300 | Robotics, CV |
| NVIDIA Jetson AGX Orin | Cortex-A78AE + Ampere GPU | 275 TOPS | 32–64GB | 15–60W | $900–$2,000 | High-end edge |
| Intel NUC (Arc GPU) | i7 + Arc A770M | ~13 TFLOPS FP16 | 16–32GB | 35–100W | $800–$1,200 | Compact workstation |
| Orange Pi 5 Plus | RK3588 (Mali-G610) | ~6 TOPS (NPU) | 4–32GB | 5–20W | $90–$200 | Budget edge AI |
What Can Actually Run Where (Practical Model Sizes)
"""
edge_model_fit.py — Check which models fit on which edge devices.
No dependencies — pure Python reference table.
"""
EDGE_DEVICES = {
"Raspberry Pi 5 (8GB)": {
"ram_gb": 8, "usable_gb": 5, "compute": "CPU",
"expected_tok_s": {"3B-Q4": 1.5, "1.5B-Q4": 3},
},
"RPi 5 + Coral TPU": {
"ram_gb": 8, "usable_gb": 5, "compute": "TPU+CPU",
"expected_tok_s": {"3B-Q4": 2, "1.5B-Q4": 5},
},
"Jetson Orin Nano (8GB)": {
"ram_gb": 8, "usable_gb": 6, "compute": "GPU",
"expected_tok_s": {"7B-Q4": 8, "3B-Q4": 20, "1.5B-Q4": 35},
},
"Jetson AGX Orin (64GB)": {
"ram_gb": 64, "usable_gb": 55, "compute": "GPU",
"expected_tok_s": {"34B-Q4": 5, "13B-Q4": 15, "7B-Q4": 40},
},
"iPhone 15 Pro (A17)": {
"ram_gb": 8, "usable_gb": 4, "compute": "Neural Engine",
"expected_tok_s": {"3B-Q4": 12, "1.5B-Q4": 25},
},
"iPhone 16 Pro (A18)": {
"ram_gb": 8, "usable_gb": 4, "compute": "Neural Engine",
"expected_tok_s": {"3B-Q4": 18, "1.5B-Q4": 35},
},
"Galaxy S25 Ultra (8 Elite)": {
"ram_gb": 16, "usable_gb": 8, "compute": "NPU",
"expected_tok_s": {"7B-Q4": 5, "3B-Q4": 15, "1.5B-Q4": 30},
},
"Pixel 9 Pro (Tensor G4)": {
"ram_gb": 12, "usable_gb": 5, "compute": "TPU",
"expected_tok_s": {"3B-Q4": 8, "1.5B-Q4": 18},
},
}
MODEL_SIZES_GB = {
"1.5B-Q4": 1.0,
"3B-Q4": 2.0,
"7B-Q4": 4.0,
"13B-Q4": 8.0,
"34B-Q4": 20.0,
"70B-Q4": 40.0,
}
def check_compatibility():
"""Print device/model compatibility matrix."""
models = list(MODEL_SIZES_GB.keys())
print(f"\n{'Device':30s}", end="")
for m in models:
print(f" {m:>10s}", end="")
print()
print("-" * (30 + 11 * len(models)))
for device_name, specs in EDGE_DEVICES.items():
print(f"{device_name:30s}", end="")
for model in models:
size = MODEL_SIZES_GB[model]
if size <= specs["usable_gb"]:
tok_s = specs["expected_tok_s"].get(model, "?")
if isinstance(tok_s, (int, float)):
print(f" {tok_s:>7.0f}t/s", end="")
else:
print(f" {'yes':>5s}", end="")
else:
print(f" {'---':>5s}", end="")
print()
if __name__ == "__main__":
print("=" * 96)
print(" EDGE DEVICE / MODEL COMPATIBILITY MATRIX")
print(" Values show expected tokens/second. '---' = does not fit in memory.")
print("=" * 96)
check_compatibility()
Output:
Device                            1.5B-Q4      3B-Q4      7B-Q4     13B-Q4     34B-Q4     70B-Q4
------------------------------------------------------------------------------------------------
Raspberry Pi 5 (8GB)                 3t/s       2t/s        yes        ---        ---        ---
RPi 5 + Coral TPU                    5t/s       2t/s        yes        ---        ---        ---
Jetson Orin Nano (8GB)              35t/s      20t/s       8t/s        ---        ---        ---
Jetson AGX Orin (64GB)                yes        yes      40t/s      15t/s       5t/s        yes
iPhone 15 Pro (A17)                 25t/s      12t/s        yes        ---        ---        ---
iPhone 16 Pro (A18)                 35t/s      18t/s        yes        ---        ---        ---
Galaxy S25 Ultra (8 Elite)          30t/s      15t/s       5t/s        yes        ---        ---
Pixel 9 Pro (Tensor G4)             18t/s       8t/s        yes        ---        ---        ---
17. Power Consumption Analysis
Watts Per Inference by GPU
Power draw varies dramatically between idle, light inference, and full-load training. These numbers represent sustained inference workloads.
| GPU | Idle Power | Inference Power | Training Power | Annual Cost (Inference 24/7) | Annual Cost (8hrs/day) |
|---|---|---|---|---|---|
| RTX 4070 | 15W | 120W | 200W | $158 | $53 |
| RTX 4070 Ti Super | 20W | 170W | 285W | $223 | $74 |
| RTX 4080 Super | 25W | 200W | 320W | $263 | $88 |
| RTX 4090 | 30W | 280W | 450W | $368 | $123 |
| A100 (80GB) | 50W | 250W | 400W | $329 | $110 |
| H100 (80GB) | 60W | 350W | 700W | $460 | $153 |
| Apple M3 | 5W | 25W | 35W | $33 | $11 |
| Apple M4 Max | 8W | 45W | 70W | $59 | $20 |
Assumes $0.15/kWh electricity rate.
Power Cost Calculator
"""
power_cost.py — Calculate electricity costs for AI hardware.
No dependencies — pure Python.
"""
def annual_power_cost(
power_watts: float,
hours_per_day: float = 24,
electricity_rate: float = 0.15,
pue: float = 1.0,
) -> float:
"""Calculate annual electricity cost."""
daily_kwh = (power_watts * pue * hours_per_day) / 1000
return daily_kwh * 365 * electricity_rate
def compare_power_costs(electricity_rate: float = 0.15):
"""Compare power costs across hardware for different usage patterns."""
hardware = [
("RTX 4070 (inference)", 120),
("RTX 4090 (inference)", 280),
("RTX 4090 (training)", 450),
("A100 (inference)", 250),
("H100 (inference)", 350),
("H100 (training)", 700),
("Apple M4 Max (inference)", 45),
("Apple M4 Max (training)", 70),
("Jetson Orin Nano", 10),
("Raspberry Pi 5", 8),
]
usage_patterns = [
("Hobby (2h/day)", 2),
("Dev (8h/day)", 8),
("Production (24/7)", 24),
]
print(f"{'Hardware':35s}", end="")
for label, _ in usage_patterns:
print(f" {label:>18s}", end="")
print()
print("-" * (35 + 19 * len(usage_patterns)))
for name, watts in hardware:
print(f"{name:35s}", end="")
for _, hours in usage_patterns:
cost = annual_power_cost(watts, hours, electricity_rate)
print(f" ${cost:>15,.0f}/yr", end="")
print()
def when_power_matters():
"""Show when power cost becomes a significant factor in TCO."""
print("\n" + "=" * 70)
print(" WHEN DOES POWER COST MATTER?")
print("=" * 70)
scenarios = [
{
"name": "Home developer (RTX 4090)",
"gpu_cost": 1500,
"power_watts": 280,
"hours_day": 4,
"rate": 0.15,
},
{
"name": "Small startup (4x RTX 4090 server)",
"gpu_cost": 6000,
"power_watts": 1120,
"hours_day": 16,
"rate": 0.12,
},
{
"name": "Data center (8x H100)",
"gpu_cost": 256000,
"power_watts": 5600,
"hours_day": 24,
"rate": 0.08,
},
]
for s in scenarios:
annual_power = annual_power_cost(
s["power_watts"], s["hours_day"], s["rate"], pue=1.3
)
three_year_power = annual_power * 3
hardware_cost = s["gpu_cost"]
power_pct = (three_year_power / (hardware_cost + three_year_power)) * 100
print(f"\n {s['name']}")
print(f" Hardware cost: ${hardware_cost:>10,.0f}")
print(f" 3-year power cost: ${three_year_power:>10,.0f}")
print(f" Power as % of 3yr TCO: {power_pct:>9.1f}%")
if power_pct > 30:
print(f" --> Power is a MAJOR cost factor. Optimize for efficiency.")
elif power_pct > 15:
print(f" --> Power is significant. Consider it in purchasing decisions.")
else:
print(f" --> Power cost is minor. Focus on GPU performance instead.")
if __name__ == "__main__":
print("=" * 90)
print(" ANNUAL ELECTRICITY COST BY HARDWARE AND USAGE")
print(f" Rate: $0.15/kWh")
print("=" * 90)
compare_power_costs(0.15)
print("\n\n--- With cheap industrial power ($0.06/kWh) ---\n")
compare_power_costs(0.06)
when_power_matters()
When Power Cost Matters: Rules of Thumb
| Situation | Power as % of TCO | Action |
|---|---|---|
| Home developer, 4 hrs/day | 5–10% | Ignore power cost. Buy the fastest GPU you can afford. |
| Always-on inference server, 24/7 | 15–30% | Power matters. Consider RTX 4070 over 4090 for inference (better perf/watt). |
| Data center, 100+ GPUs | 30–50% | Power is a major expense. Optimize PUE, consider liquid cooling, use efficient GPUs (H200 > H100). |
| Edge/mobile | <1% | Irrelevant for cost. Matters for battery life and thermal throttling. |
Key insight: For most individual developers, electricity costs are noise — a few hundred dollars per year. For data centers running hundreds of GPUs 24/7, power can equal or exceed hardware amortization over 3 years.
18. Hardware Decision Tree
Instead of reading tables, answer a few questions and get a recommendation.
"""
hardware_selector.py — Interactive hardware recommendation engine.
No dependencies — pure Python.
Usage:
python hardware_selector.py
# Or call programmatically:
from hardware_selector import recommend_hardware
result = recommend_hardware(budget=2000, use_case="inference", location="home")
"""
from dataclasses import dataclass
@dataclass
class Recommendation:
primary: str
alternative: str
estimated_cost: str
reasoning: list
warnings: list
def recommend_hardware(
budget: int,
use_case: str,
location: str,
model_size: str = "7B",
priority: str = "balanced",
) -> Recommendation:
"""
Recommend hardware based on constraints.
Args:
budget: Maximum spend in USD (0 = cloud only)
use_case: "training", "inference", "both", "development", "edge"
location: "home", "office", "datacenter", "mobile"
model_size: "3B", "7B", "13B", "34B", "70B", "180B"
priority: "speed", "cost", "efficiency", "portability", "balanced"
Returns:
Recommendation with primary choice, alternative, reasoning, and warnings.
"""
rec = Recommendation(
primary="", alternative="", estimated_cost="",
reasoning=[], warnings=[],
)
# Parse model size to determine VRAM needs
size_to_vram = {
"3B": 2, "7B": 4, "13B": 8, "34B": 20, "70B": 40, "180B": 100,
}
needed_vram = size_to_vram.get(model_size, 4)
# --- Edge / Mobile ---
if use_case == "edge" or location == "mobile":
if model_size in ("3B", "7B"):
rec.primary = "NVIDIA Jetson Orin Nano (8GB)"
rec.alternative = "Raspberry Pi 5 + Coral TPU"
rec.estimated_cost = "$200–$300"
rec.reasoning.append(
f"{model_size} models fit on Jetson with good performance"
)
elif model_size == "13B":
rec.primary = "NVIDIA Jetson AGX Orin (64GB)"
rec.alternative = "Cloud API with local cache"
rec.estimated_cost = "$900–$2,000"
rec.reasoning.append("13B requires significant edge compute")
else:
rec.primary = "Cloud API (too large for edge)"
rec.alternative = "Quantize to smaller model"
rec.estimated_cost = "Variable"
rec.warnings.append(
f"{model_size} is too large for edge devices. "
f"Consider distillation to 7B or smaller."
)
return rec
# --- Portability Priority ---
if priority == "portability" or location == "mobile":
if budget >= 3500 and needed_vram <= 40:
rec.primary = "MacBook Pro M4 Max (64GB)"
rec.alternative = "MacBook Pro M4 Pro (36GB)"
rec.estimated_cost = "$3,500–$4,000"
rec.reasoning.append("Unified memory handles models up to 34B")
rec.reasoning.append("Silent, portable, 15hr battery")
elif budget >= 2500:
rec.primary = "MacBook Pro M4 Pro (36GB)"
rec.alternative = "MacBook Pro M4 (24GB)"
rec.estimated_cost = "$2,500–$3,000"
rec.reasoning.append("Good balance of portability and capability")
else:
rec.primary = "MacBook Air M3 (16GB)"
rec.alternative = "Framework Laptop + eGPU"
rec.estimated_cost = "$1,500–$1,800"
rec.reasoning.append("Handles 7B models, extremely portable")
if model_size not in ("3B", "7B"):
rec.warnings.append(
f"16GB limits you to 7B models. "
f"Budget more for {model_size}."
)
return rec
# --- Training Focus ---
if use_case == "training":
if location == "datacenter" or budget >= 30000:
rec.primary = "Cloud H100 instances (on-demand)"
rec.alternative = "On-prem H100 if >8000 hrs/year"
rec.estimated_cost = "$3–$4/hr cloud, $32K purchase"
rec.reasoning.append("H100 is the training standard")
rec.reasoning.append(
"Cloud is cheaper unless you run >8000 hrs/year"
)
elif budget >= 1500:
rec.primary = "RTX 4090 (24GB)"
rec.alternative = "RTX 4080 Super (16GB)"
rec.estimated_cost = "$1,500 GPU + $1,000 system"
rec.reasoning.append("Best consumer GPU for training")
rec.reasoning.append("Handles 7B–13B training, 34B with LoRA")
if model_size in ("70B", "180B"):
rec.warnings.append(
f"Cannot train {model_size} locally. Use cloud or LoRA/QLoRA."
)
elif budget >= 600:
rec.primary = "RTX 4070 (12GB)"
rec.alternative = "RTX 4070 Ti Super (16GB) for $300 more"
rec.estimated_cost = "$600 GPU + $500 system"
rec.reasoning.append("Budget training card, handles 7B with QLoRA")
if model_size not in ("3B", "7B"):
rec.warnings.append(
f"12GB VRAM limits training to 7B. "
f"Use cloud for {model_size}."
)
else:
rec.primary = "Cloud GPU (AWS/GCP spot instances)"
rec.alternative = "Google Colab Pro ($10/month)"
rec.estimated_cost = "$0.30–$1.00/hr"
rec.reasoning.append("Budget too low for dedicated training hardware")
return rec
# --- Inference Focus ---
if use_case == "inference":
if location == "datacenter":
if model_size in ("70B", "180B"):
rec.primary = "A100 or H100 cluster (cloud)"
rec.alternative = "On-prem L40 cluster for cost savings"
rec.estimated_cost = "$2–$4/hr per GPU"
else:
rec.primary = "L4 or L40 (inference-optimized)"
rec.alternative = "A100 for flexibility"
rec.estimated_cost = "$1–$2/hr"
rec.reasoning.append("Inference-optimized GPUs save 30–40% vs training GPUs")
elif budget >= 1500:
rec.primary = "RTX 4090 (24GB)"
rec.alternative = "RTX 4070 (better perf/watt for inference)"
rec.estimated_cost = "$1,500"
rec.reasoning.append("RTX 4070 is often better for inference-only")
rec.warnings.append(
"RTX 4090 is overkill for inference-only workloads. "
"RTX 4070 offers 85% of inference speed at 40% of the price."
)
elif budget >= 600:
rec.primary = "RTX 4070 (12GB)"
rec.alternative = "MacBook Pro M4 (24GB) if portability matters"
rec.estimated_cost = "$600"
rec.reasoning.append("Sweet spot for local inference up to 13B")
else:
rec.primary = "MacBook Air M3 (16GB) or Cloud API"
rec.alternative = "Used RTX 3060 12GB (~$250)"
rec.estimated_cost = "$250–$1,500"
rec.reasoning.append("Limited budget: M3 for portability, used GPU for speed")
return rec
# --- Development (Both training and inference) ---
if budget >= 3000:
rec.primary = "RTX 4090 desktop + MacBook Air M3 for mobility"
rec.alternative = "MacBook Pro M4 Max (64GB) for all-in-one"
rec.estimated_cost = "$3,000–$4,000"
rec.reasoning.append("Desktop for heavy compute, laptop for coding anywhere")
elif budget >= 1500:
rec.primary = "MacBook Pro M4 Pro (36GB)"
rec.alternative = "RTX 4070 desktop ($1,100)"
rec.estimated_cost = "$1,500–$2,500"
rec.reasoning.append("Good balance for development workflow")
elif budget >= 600:
rec.primary = "RTX 4070 + budget PC"
rec.alternative = "MacBook Air M3 (16GB)"
rec.estimated_cost = "$600–$1,100"
rec.reasoning.append("Best value for serious development")
else:
rec.primary = "Google Colab Pro + any laptop"
rec.alternative = "Used RTX 3060 12GB"
rec.estimated_cost = "$10/month + existing hardware"
rec.reasoning.append("Cloud-first approach on a tight budget")
return rec
def interactive_selector():
"""Run the interactive hardware selector."""
print("=" * 60)
print(" AI HARDWARE SELECTOR")
print("=" * 60)
print("\nAnswer these questions to get a recommendation.\n")
# Budget
print("1. What's your budget?")
print(" a) Under $500")
print(" b) $500–$1,500")
print(" c) $1,500–$3,500")
print(" d) $3,500+")
print(" e) Cloud only (no hardware purchase)")
budget_map = {"a": 300, "b": 1000, "c": 2500, "d": 5000, "e": 0}
budget_choice = input(" Choice [a-e]: ").strip().lower()
budget = budget_map.get(budget_choice, 1000)
# Use case
print("\n2. Primary use case?")
print(" a) Training models")
print(" b) Running inference (serving models)")
print(" c) Both training and inference")
print(" d) Development and experimentation")
print(" e) Edge/IoT deployment")
use_map = {
"a": "training", "b": "inference", "c": "both",
"d": "development", "e": "edge",
}
use_choice = input(" Choice [a-e]: ").strip().lower()
use_case = use_map.get(use_choice, "development")
# Location
print("\n3. Where will it run?")
print(" a) Home office")
print(" b) Office/lab")
print(" c) Data center")
print(" d) Mobile/portable")
loc_map = {
"a": "home", "b": "office", "c": "datacenter", "d": "mobile",
}
loc_choice = input(" Choice [a-d]: ").strip().lower()
location = loc_map.get(loc_choice, "home")
# Model size
print("\n4. Largest model you need to run?")
print(" a) 3B (small, fast)")
print(" b) 7B (standard)")
print(" c) 13B (capable)")
print(" d) 34B (very capable)")
print(" e) 70B (frontier-class)")
print(" f) 180B+ (largest)")
size_map = {
"a": "3B", "b": "7B", "c": "13B",
"d": "34B", "e": "70B", "f": "180B",
}
size_choice = input(" Choice [a-f]: ").strip().lower()
model_size = size_map.get(size_choice, "7B")
# Priority
print("\n5. Top priority?")
print(" a) Speed (fastest possible)")
print(" b) Cost (cheapest that works)")
print(" c) Efficiency (best perf/watt)")
print(" d) Portability (laptop/mobile)")
print(" e) Balanced")
pri_map = {
"a": "speed", "b": "cost", "c": "efficiency",
"d": "portability", "e": "balanced",
}
pri_choice = input(" Choice [a-e]: ").strip().lower()
priority = pri_map.get(pri_choice, "balanced")
# Get recommendation
rec = recommend_hardware(budget, use_case, location, model_size, priority)
print("\n" + "=" * 60)
print(" RECOMMENDATION")
print("=" * 60)
print(f"\n Primary: {rec.primary}")
print(f" Alternative: {rec.alternative}")
print(f" Est. Cost: {rec.estimated_cost}")
print(f"\n Reasoning:")
for r in rec.reasoning:
print(f" - {r}")
if rec.warnings:
print(f"\n Warnings:")
for w in rec.warnings:
print(f" ! {w}")
print("\n" + "=" * 60)
if __name__ == "__main__":
interactive_selector()
Programmatic usage (no interaction needed):
from hardware_selector import recommend_hardware
# Startup with $2K budget doing inference
rec = recommend_hardware(budget=2000, use_case="inference", location="home", model_size="13B")
print(f"Buy: {rec.primary}")
print(f"Or: {rec.alternative}")
for w in rec.warnings:
print(f"Warning: {w}")
# Data center training
rec = recommend_hardware(budget=50000, use_case="training", location="datacenter", model_size="70B")
print(f"Buy: {rec.primary}")
19. Common Hardware Mistakes
These are real mistakes people make when buying AI hardware. Each one wastes money or performance.
Mistake 1: “Bought RTX 4090 for inference-only workload”
The problem: The RTX 4090 is a training beast with ~82.6 TFLOPS FP32, but inference doesn’t need that much compute. Inference is memory-bandwidth-bound, not compute-bound.
The numbers:
- RTX 4090: $1,500, 280W inference, ~90 tok/s on 7B
- RTX 4070: $600, 120W inference, ~55 tok/s on 7B
- Price/performance: the 4090 costs 2.5x as much for ~1.6x the speed — roughly 1.5x the hardware cost per token/second
What to do instead: For inference-only, buy the RTX 4070 (or two RTX 4070s for $1,200 with 2x throughput). The 4090 only makes sense if you also train models.
def is_4090_worth_it(training_hours_per_month: int, inference_hours_per_month: int) -> str:
"""Determine if RTX 4090 is worth it over RTX 4070."""
# 4090 advantage: 1.5x training speed, 1.6x inference speed
# 4090 cost: 2.5x price, 2.3x power
training_time_saved = training_hours_per_month * 0.33 # 33% faster
value_of_time = 50 # $/hr for your time
monthly_time_savings = training_time_saved * value_of_time
price_diff = 1500 - 600 # $900 more
monthly_power_diff = ((280 - 120) / 1000) * inference_hours_per_month * 0.15
    net_monthly_savings = monthly_time_savings - monthly_power_diff
    if net_monthly_savings <= 0:  # never pays back (also avoids a divide-by-zero)
        return ("RTX 4070 is better. You don't train enough to justify the 4090. "
                f"Training savings: ${monthly_time_savings:.0f}/mo, "
                f"Extra power: ${monthly_power_diff:.0f}/mo")
    months_to_payback = price_diff / net_monthly_savings
    if months_to_payback > 24:
return (f"RTX 4070 is better. Payback is {months_to_payback:.0f} months "
f"— longer than the GPU's useful life.")
else:
return (f"RTX 4090 pays for itself in {months_to_payback:.0f} months. "
f"Worth it if you train regularly.")
# Examples
print(is_4090_worth_it(training_hours_per_month=0, inference_hours_per_month=100))
# -> RTX 4070 is better. You don't train enough.
print(is_4090_worth_it(training_hours_per_month=40, inference_hours_per_month=100))
# -> RTX 4090 pays for itself in ~1 month (valuing your time at $50/hr).
Mistake 2: “Running FP32 on a GPU with tensor cores”
The problem: Modern NVIDIA GPUs (RTX 30xx, 40xx, A100, H100) have tensor cores that accelerate FP16 and BF16 operations by 2–4x. Running FP32 wastes half or more of your GPU’s capability.
The numbers:
- RTX 4090 FP32: ~82.6 TFLOPS
- RTX 4090 FP16 (tensor cores): ~165 TFLOPS — 2x faster, same GPU
- H100 FP32: ~67 TFLOPS
- H100 FP16 (tensor cores): ~989 TFLOPS — ~15x faster!
What to do instead: Always use mixed precision or FP16/BF16 for training and inference. PyTorch makes this easy:
"""
Correct: Using mixed precision to exploit tensor cores.
This example shows the difference between FP32 and FP16 training.
"""
import torch
# WRONG: FP32 training (wastes tensor cores)
def train_fp32(model, data, optimizer):
"""This ignores tensor cores entirely."""
for batch in data:
optimizer.zero_grad()
loss = model(batch)
loss.backward()
optimizer.step()
# RIGHT: Mixed precision training (uses tensor cores)
def train_mixed_precision(model, data, optimizer):
"""2-4x faster on GPUs with tensor cores."""
scaler = torch.amp.GradScaler("cuda")
for batch in data:
optimizer.zero_grad()
with torch.amp.autocast("cuda"): # Automatically uses FP16 where safe
loss = model(batch)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
# RIGHT: FP16 inference (maximum speed)
def inference_fp16(model, input_data):
"""Inference in FP16 — no accuracy loss for most models."""
model = model.half() # Convert model to FP16
with torch.no_grad():
with torch.amp.autocast("cuda"):
output = model(input_data)
return output
# Check if your GPU has tensor cores
def check_tensor_cores():
"""Check if current GPU supports tensor core acceleration."""
if not torch.cuda.is_available():
print("No CUDA GPU available.")
return False
capability = torch.cuda.get_device_capability()
gpu_name = torch.cuda.get_device_name()
# Tensor cores: compute capability >= 7.0 (Volta and newer)
has_tensor = capability[0] >= 7
print(f"GPU: {gpu_name}")
print(f"Compute capability: {capability[0]}.{capability[1]}")
print(f"Tensor cores: {'Yes' if has_tensor else 'No'}")
if has_tensor:
print("-> USE mixed precision (torch.amp.autocast) for 2-4x speedup!")
else:
print("-> FP32 is your only option. Consider upgrading GPU.")
return has_tensor
Mistake 3: “Forgot to account for electricity in TCO”
The problem: People compare GPU purchase prices without factoring in electricity. For a GPU running 24/7, power and cooling range from under 10% of 3-year cost (expensive data center cards) to nearly half of it (consumer cards).
The numbers (3-year TCO for 24/7 operation):
| GPU | Purchase | 3yr Electricity | 3yr Cooling (PUE 1.3) | Total 3yr TCO | Electricity % |
|---|---|---|---|---|---|
| RTX 4070 | $600 | $789 | $237 | $1,626 | 49% |
| RTX 4090 | $1,500 | $1,774 | $532 | $3,806 | 47% |
| A100 | $15,000 | $1,577 | $473 | $17,050 | 9% |
| H100 | $32,000 | $2,759 | $828 | $35,587 | 8% |
Key insight: For expensive data center GPUs, electricity is a small percentage because the hardware itself costs so much. For consumer GPUs running 24/7, electricity can approach or exceed the purchase price over 3 years.
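A minimal sketch of the arithmetic behind this table, assuming 24/7 operation, $0.15/kWh, PUE 1.3 (so cooling adds ~30% on top of IT power), and typical sustained draws of roughly 200W / 450W / 400W / 700W for the four cards. Small rounding differences from the table are expected:
def three_year_tco(purchase_usd: float, watts: float,
                   rate_kwh: float = 0.15, pue: float = 1.3) -> dict:
    """Rough 3-year TCO for a GPU running 24/7: purchase + electricity + cooling."""
    kwh_3yr = (watts / 1000) * 24 * 365 * 3
    electricity = kwh_3yr * rate_kwh
    cooling = electricity * (pue - 1.0)   # PUE 1.3 -> cooling adds 30% of IT power
    total = purchase_usd + electricity + cooling
    return {
        "electricity": round(electricity),
        "cooling": round(cooling),
        "total": round(total),
        "electricity_pct": round(100 * electricity / total),
    }
# Assumed sustained draws: RTX 4070 ~200W, RTX 4090 ~450W, A100 ~400W, H100 ~700W
for name, price, watts in [("RTX 4070", 600, 200), ("RTX 4090", 1500, 450),
                           ("A100", 15000, 400), ("H100", 32000, 700)]:
    print(name, three_year_tco(price, watts))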
Mistake 4: “Bought maximum RAM without checking bandwidth”
The problem: More RAM lets you load bigger models, but if memory bandwidth is low, the GPU starves waiting for data. This matters more for inference than training.
Example: A100 40GB (1,555 GB/s bandwidth) vs A100 80GB (2,039 GB/s bandwidth). The 80GB version is not just more memory — it has 31% more bandwidth. For inference on large models, the 80GB version can be 20–30% faster even when the model fits in 40GB.
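Why bandwidth caps inference speed: for single-stream decoding, every weight must be read from memory once per generated token, so tokens/second cannot exceed bandwidth divided by the model's in-memory size. A rough sketch using the bandwidth figures above and an illustrative 70B model quantized to 4-bit (real throughput is lower once KV cache reads and kernel overhead are included):
def bandwidth_bound_tokens_per_sec(bandwidth_gb_s: float,
                                   params_billions: float,
                                   bytes_per_param: float = 2.0) -> float:
    """Upper bound on single-stream decode speed: every weight is read once per token."""
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb
# A100 40GB (1,555 GB/s) vs A100 80GB (2,039 GB/s), 70B model at 4-bit (~35 GB of weights)
for name, bw in [("A100 40GB", 1555), ("A100 80GB", 2039)]:
    ceiling = bandwidth_bound_tokens_per_sec(bw, 70, bytes_per_param=0.5)
    print(f"{name}: <= {ceiling:.0f} tok/s (theoretical ceiling)")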
Mistake 5: “Using consumer GPU in a data center rack”
The problem: RTX 4090 is not designed for 24/7 data center operation. It has:
- An open-air cooler designed for a spacious PC case with airflow, not a front-to-back server chassis
- No ECC memory (silent data corruption risk over months)
- Consumer warranty that may be voided by data center use
- Power connectors not designed for hot-swap
What to use instead: L40 or A6000 for data center inference. They cost more but have proper cooling, ECC memory, and data center support.
Mistake 6: “Ignoring quantization, buying more VRAM instead”
The problem: A 70B model at FP16 needs ~140GB VRAM. People buy H200 (141GB, $38,000) when they could quantize to 4-bit and fit it in an A100 80GB ($15,000) or even an RTX 4090 pair (2x24GB = 48GB for $3,000).
The math:
def model_memory_by_quantization(params_billions: float) -> dict:
"""Show memory requirements at different quantization levels."""
results = {}
# Bytes per parameter at each precision
precisions = {
"FP32 (full)": 4.0,
"FP16 (half)": 2.0,
"INT8": 1.0,
"INT4 (AWQ/GGUF)": 0.5,
"INT3 (aggressive)": 0.375,
}
for name, bytes_per_param in precisions.items():
size_gb = (params_billions * 1e9 * bytes_per_param) / (1024 ** 3)
# Add ~20% overhead for KV cache and runtime
total_gb = size_gb * 1.2
results[name] = {"model_gb": round(size_gb, 1), "total_gb": round(total_gb, 1)}
return results
def print_quantization_comparison(params_billions: float):
"""Show how quantization changes hardware requirements."""
results = model_memory_by_quantization(params_billions)
print(f"\n Memory requirements for {params_billions}B parameter model:")
print(f" {'Precision':25s} {'Model':>8s} {'+ Overhead':>10s} {'Fits In':>30s}")
print(" " + "-" * 75)
gpu_options = [
("RTX 4070 (12GB)", 12),
("RTX 4090 (24GB)", 24),
("M4 Max (64GB)", 48),
("A100 (80GB)", 80),
("H200 (141GB)", 141),
]
for name, info in results.items():
fits = [g[0] for g in gpu_options if g[1] >= info["total_gb"]]
fits_str = ", ".join(fits[:2]) if fits else "Multi-GPU required"
print(f" {name:25s} {info['model_gb']:>6.1f}GB {info['total_gb']:>8.1f}GB "
f" {fits_str}")
# Show for common model sizes
for size in [7, 13, 34, 70]:
print_quantization_comparison(size)
Output:
Memory requirements for 7B parameter model:
Precision Model + Overhead Fits In
---------------------------------------------------------------------------
FP32 (full) 26.1GB 31.3GB M4 Max (64GB), A100 (80GB)
FP16 (half) 13.0GB 15.6GB RTX 4090 (24GB), M4 Max (64GB)
INT8 6.5GB 7.8GB RTX 4070 (12GB), RTX 4090 (24GB)
INT4 (AWQ/GGUF) 3.3GB 3.9GB RTX 4070 (12GB), RTX 4090 (24GB)
INT3 (aggressive) 2.4GB 2.9GB RTX 4070 (12GB), RTX 4090 (24GB)
Memory requirements for 70B parameter model:
Precision Model + Overhead Fits In
---------------------------------------------------------------------------
FP32 (full) 260.8GB 312.9GB Multi-GPU required
 FP16 (half) 130.4GB 156.4GB Multi-GPU required
INT8 65.2GB 78.2GB A100 (80GB), H200 (141GB)
INT4 (AWQ/GGUF) 32.6GB 39.1GB M4 Max (64GB), A100 (80GB)
INT3 (aggressive) 24.4GB 29.3GB M4 Max (64GB), A100 (80GB)
Bottom line: Always quantize before buying more VRAM. AWQ 4-bit quantization has negligible quality loss for inference and cuts memory requirements by 4x.
Validation Checklist
How do you know you got this right?
Performance Checks
- Benchmarked your hardware using the detection script (Section 13) and recorded actual TFLOPS, memory bandwidth, and VRAM
- Know your VRAM limit and maximum model size at each precision level (FP16, int8, int4)
- Measured real inference latency (tokens/second) on your target model, not just theoretical TFLOPS
Implementation Checks
- Hardware selected using the decision matrix (Section 9) based on your actual workload (training vs inference, batch vs real-time)
- Power consumption and annual electricity cost calculated for your setup (use the formula: watts/1000 * hours/day * 365 * $/kWh)
- Break-even analysis completed: on-premise vs cloud, with your actual GPU-hours/year usage
- Thermal solution verified: passive cooling sufficient (M-series), or active cooling adequate under sustained load (RTX series)
- Quantization tested before buying more VRAM: confirmed AWQ int4 quality is acceptable for your use case
- Memory headroom verified: model + KV cache + OS overhead fits within 60-70% of total device RAM
- Cloud provider pricing compared across at least 2 providers (AWS, Lambda, Runpod) for your workload profile
Integration Checks
- Hardware supports your framework stack (CUDA for PyTorch/TensorFlow, Metal/MLX for Apple Silicon)
- Model serving architecture planned: single-user development vs multi-user API (determines GPU count and type)
- Upgrade path identified: know what hardware to move to when you outgrow current setup
Common Failure Modes
- Buying RTX 4090 for inference-only: Overspend of $900+ vs RTX 4070, which delivers ~60% of the speed at 40% of the cost. Fix: match GPU to workload type.
- Using cloud for steady-state 24/7 workload: Break-even with owned hardware is typically month 3. Fix: run break-even analysis before committing to cloud.
- Ignoring power draw in TCO: 450W GPU running 24/7 costs $591/year in electricity alone. Fix: include power in all hardware cost comparisons.
- Assuming M-series can’t train: It can fine-tune via LoRA; just slower than discrete GPUs. Fix: use MLX for local fine-tuning on M-series before dismissing it.
Sign-Off Criteria
- Total cost of ownership calculated for 3-year and 5-year horizons (hardware + power + cooling + maintenance)
- Hardware decision documented with rationale (why X over Y, with cost and performance justification)
- Verified model fits in memory on target hardware by running actual inference, not just calculating theoretical fit
- Scaling plan defined: what happens when you need 2x, 5x, or 10x current capacity
- Power and cooling infrastructure confirmed adequate for chosen hardware (especially for multi-GPU setups)
20. AI Infrastructure: Networking Between Chips
Individual chips are fast. The bottleneck in large-scale AI is connecting them. When you train a 405B-parameter model across 16,384 GPUs, the network between those GPUs determines whether your cluster runs at 80% efficiency or 30%. Broadcom, NVIDIA, and increasingly the Ultra Ethernet Consortium are fighting over this layer.
Why Networking Matters for AI
Large model training and high-throughput inference require constant communication between accelerators. Every forward pass of a distributed model sends gradients, activations, and KV cache data across the network. If the network is slower than the compute, GPUs sit idle waiting for data. This is called the communication bottleneck, and it is the single biggest efficiency problem in large AI clusters.
The math: An H100 GPU produces ~3.9 TB/s of internal memory bandwidth. If it is connected to other GPUs via a 400 Gbps Ethernet link (50 GB/s), the network is ~78x slower than the GPU’s internal bus. The GPU spends most of its time waiting.
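To make the gap concrete, here is a back-of-envelope sketch. It assumes a 70B-parameter model with BF16 gradients, the per-link bandwidth figures from the table below, and a naive single-transfer model that ignores collective-communication algorithms and compute/communication overlap:
def transfer_time_seconds(payload_gb: float, link_gb_per_s: float) -> float:
    """Time to move one payload across a single link, ignoring protocol overhead."""
    return payload_gb / link_gb_per_s
# BF16 gradients for a 70B-parameter model: ~140 GB per synchronization step
gradients_gb = 70 * 2
links_gb_per_s = {
    "NVLink 4 (900 GB/s)": 900,
    "InfiniBand NDR 400G (50 GB/s)": 50,
    "Ethernet 800G (100 GB/s)": 100,
}
for name, bw in links_gb_per_s.items():
    t = transfer_time_seconds(gradients_gb, bw)
    print(f"{name}: ~{t:.2f} s to move one full set of gradients")
Real clusters hide much of this by overlapping communication with compute and sharding the exchange across many links, but the per-link ratios are why the interconnect hierarchy below exists.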
The Three Interconnect Technologies
| Technology | Bandwidth Per Link | Latency | Range | Vendor Lock-In | Cost |
|---|---|---|---|---|---|
| NVLink/NVSwitch | 900 GB/s per GPU (NVLink 4); 1.8 TB/s (NVLink 5) | <1 us | Within a node or NVLink domain (72 GPUs max) | NVIDIA only | Included in GPU price |
| InfiniBand NDR | 400 Gbps (50 GB/s) | 1-2 us | Rack to data center | NVIDIA (Mellanox) | $5,000-$15,000/port |
| Ethernet 800G | 800 Gbps (100 GB/s) | 2-5 us | Data center to global | Multi-vendor (Broadcom, Cisco, Arista) | $2,000-$8,000/port |
How they work together (two-level hierarchy):
- Level 1 (intra-node): NVLink/NVSwitch connects GPUs within a single server or NVLink domain (up to 72 GPUs). Sub-microsecond latency, TB/s aggregate bandwidth. This is the fast lane.
- Level 2 (inter-node): Ethernet or InfiniBand connects NVLink domains across racks. Microsecond latency, 400G-800G per NIC. This is the highway between buildings.
Broadcom’s Role: The Networking Fabric Provider
Broadcom does not make GPUs or AI accelerators (those are NVIDIA, Google, Meta, AMD). Broadcom makes the networking silicon that connects them, and the custom ASIC design platform that hyperscalers use to build their own chips.
Two distinct businesses:
- Ethernet Switch ASICs — Broadcom’s Tomahawk series dominates data center switching:
- Tomahawk 6 (2025): 102.4 Tbps total switching capacity, the highest-bandwidth switch chip ever built
- Used in switches from Arista, Cisco, and others that form the backbone of AI data centers
- Supports 800 Gbps per port, 128 ports per switch
- XPU Custom Silicon Platform — Broadcom designs custom AI accelerators for hyperscalers:
- Google TPU: Broadcom has co-designed Google’s Tensor Processing Units since 2015, with a supply agreement extending through 2031
- Meta MTIA: Extended partnership announced April 2026 for multiple generations of Meta Training and Inference Accelerators, starting with the first 2nm-process custom AI silicon, scaling to multi-gigawatt deployment by 2029
- Additional customers: Anthropic, OpenAI, ByteDance, and others
- Revenue: $8.4 billion in AI semiconductor revenue in Q1 FY2026 (106% YoY growth)
Ethernet vs InfiniBand: The 2026 Landscape
NVIDIA has historically dominated AI networking with InfiniBand (acquired via Mellanox in 2020). Broadcom is leading the charge to replace InfiniBand with Ethernet, which would break NVIDIA’s networking monopoly.
Why Ethernet is winning:
- Ultra Ethernet Consortium (UEC) 1.0 specification released June 2025, adding InfiniBand-like features (adaptive routing, congestion control, hardware packet reordering) to Ethernet
- Cost: Ethernet switches and NICs are 40-60% cheaper than InfiniBand equivalents
- Multi-vendor: Broadcom, Cisco, Arista, AMD all ship Ethernet silicon; InfiniBand is NVIDIA-only
- Scale: IP routing enables larger fabric scales than InfiniBand subnets
- Operational tooling: Enterprise networking teams already know Ethernet
Where InfiniBand still wins:
- Lowest latency (1-2 us vs 2-5 us for Ethernet)
- Mature RDMA implementation (RoCEv2 on Ethernet is catching up but still requires tuning)
- Proven at extreme scale (NVIDIA’s own DGX SuperPOD clusters)
Current recommendation: For new enterprise and cloud AI clusters of 64+ GPUs, RoCEv2 over 800G Ethernet with Broadcom Tomahawk switches is the default choice. InfiniBand remains relevant for latency-critical training workloads at NVIDIA-exclusive sites.
What This Means for Harness Builders
If you are building an AI agent harness that calls cloud APIs, networking infrastructure is invisible to you — the cloud provider handles it. But understanding this layer matters for:
- Cost estimation: Networking is 15-25% of a large AI cluster’s total cost. When cloud providers price inference endpoints, networking costs are baked in.
- Latency budgets: Inter-node communication adds 2-10 ms to distributed inference. If your harness chains multiple model calls, this compounds.
- Provider selection: Hyperscalers building their own chips (Google TPU, Meta MTIA, Amazon Trainium) with Broadcom networking will offer cheaper inference than NVIDIA-GPU-only providers, because they avoid NVIDIA’s GPU and InfiniBand markup.
- Edge vs cloud decisions: The networking layer is what makes cloud inference expensive at scale. If your model fits on a single device, you bypass all of this.
21. Qualcomm Edge AI and Hexagon NPU
Qualcomm is the dominant player in mobile and IoT AI inference. If Apple owns the premium phone AI experience (Neural Engine + CoreML), Qualcomm owns the rest: Android phones, IoT devices, automotive systems, and XR headsets. Their AI stack runs on billions of devices.
Architecture: Qualcomm AI Engine
Qualcomm’s AI approach is heterogeneous computing — distributing AI workloads across multiple processors on a single chip:
| Component | Role | Best For |
|---|---|---|
| Hexagon NPU | Dedicated neural processing unit with tensor cores | Sustained inference, LLMs, image models |
| Adreno GPU | Graphics processor with compute shaders | Parallel inference, image generation |
| Kryo/Oryon CPU | General-purpose cores | Control flow, pre/post-processing, small models |
| Sensing Hub | Low-power always-on processor | Wake words, ambient sensing, always-on detection |
The Qualcomm AI Engine orchestrates workload placement across these processors. A single inference request might use the NPU for the main model, the CPU for tokenization, and the GPU for image post-processing.
Hexagon NPU: Specifications by Generation
| Chip | NPU TOPS | Process | Key Features | Devices |
|---|---|---|---|---|
| Snapdragon 8 Gen 3 | 45 TOPS | 4nm | Dual Hexagon cores, INT4/INT8/FP16 | Galaxy S24 Ultra, OnePlus 12 |
| Snapdragon 8 Elite | 75 TOPS | 3nm | Enhanced tensor cores, 3x faster than 8 Gen 2 | Galaxy S25 Ultra, OnePlus 13 |
| Snapdragon X Elite | 45 TOPS | 4nm | Laptop-class, 12-core Oryon CPU | Windows laptops (Surface, Lenovo, Dell) |
What 75 TOPS means in practice: TOPS (Tera Operations Per Second) measures raw INT8 throughput. For comparison, Apple A18 Pro delivers 35 TOPS from its Neural Engine. But TOPS alone does not determine real-world performance — memory bandwidth, software optimization, and model compatibility matter as much.
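A back-of-envelope way to see why: a transformer performs roughly 2 x parameter-count operations per generated token, so you can compare the compute-bound ceiling implied by the TOPS figure with the memory-bound ceiling implied by RAM bandwidth. This is a sketch; the ~50 GB/s effective LPDDR bandwidth is an illustrative assumption, not a published spec:
def tok_per_sec_ceilings(params_billions: float, tops: float,
                         mem_bandwidth_gb_s: float, bytes_per_param: float = 0.5):
    """Two rough ceilings for on-device decode: compute-bound vs memory-bound."""
    ops_per_token = 2 * params_billions * 1e9        # ~2 ops per parameter per token
    compute_bound = (tops * 1e12) / ops_per_token
    weights_gb = params_billions * bytes_per_param    # 4-bit weights
    memory_bound = mem_bandwidth_gb_s / weights_gb    # every weight read once per token
    return compute_bound, memory_bound
# Snapdragon 8 Elite: 75 TOPS NPU; assume ~50 GB/s effective LPDDR bandwidth (illustrative)
compute_cap, memory_cap = tok_per_sec_ceilings(3, tops=75, mem_bandwidth_gb_s=50)
print(f"3B model: compute ceiling ~{compute_cap:,.0f} tok/s, memory ceiling ~{memory_cap:.0f} tok/s")
The measured ~15 tok/s sits far below the compute ceiling and close to the memory ceiling — bandwidth and software optimization, not TOPS, set the practical limit.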
On-Device LLM Performance
Running LLMs directly on a phone, with no cloud connection:
| Model | Parameters | Quantization | Snapdragon 8 Elite | Notes |
|---|---|---|---|---|
| Llama 3.2 3B Instruct | 3B | W4A16 | ~10 tok/s | Usable for chat, voice commands |
| Llama 3.1 8B Instruct | 8B | W4A16 | ~5 tok/s | Slower but more capable, 2048 context |
| Small vision models | 1-3B | INT8 | 15-30 tok/s | Real-time image understanding |
Comparison with Apple:
- iPhone 16 Pro (A18 Pro): ~18 tok/s on 3B models, ~35 tok/s on 1.5B models
- Galaxy S25 Ultra (8 Elite): ~15 tok/s on 3B, ~5 tok/s on 7B (can run larger models due to 16GB RAM vs 8GB)
The trade-off: Apple is faster on small models; Qualcomm can run bigger models because Android flagships have more RAM (12-16GB vs 8GB).
Qualcomm AI Hub: Developer Workflow
Qualcomm AI Hub is the equivalent of Apple’s CoreML Tools — it converts, optimizes, and deploys models to Qualcomm hardware. The workflow:
- Start with a trained model (PyTorch, ONNX, TensorFlow)
- Export and optimize via AI Hub (quantization, graph optimization, NPU code generation)
- Compile to QNN context binary (precompiled, device-specific format)
- Deploy using Qualcomm Genie runtime (for LLMs) or QNN SDK (for other models)
"""
Qualcomm AI Hub: Export a model for on-device inference.
Requires: pip install qai-hub-models
Qualcomm AI Hub account (free)
This compiles a Llama model for Snapdragon 8 Elite NPU execution.
"""
# Export Llama 3.1 8B for Snapdragon (single command)
# python -m qai_hub_models.models.llama_v3_1_8b_instruct.export
# Programmatic usage:
import qai_hub_models
# List available pre-optimized models
# Categories: image classification, object detection, LLMs,
# image generation, speech recognition, and more
# The export process handles:
# 1. Model download from HuggingFace
# 2. Quantization (W4A16 for LLMs, INT8 for vision)
# 3. Graph optimization for Hexagon NPU
# 4. Compilation to QNN context binary
# 5. Performance profiling on target device
# Output: a .bin file ready for on-device deployment
# Compilation typically completes in minutes, not hours
Developer experience: Qualcomm AI Hub abstracts the complexity of NPU compilation behind a single export command. It supports converting PyTorch or ONNX models to any on-device runtime: LiteRT (Google), ONNX Runtime, or Qualcomm’s native QNN stack. The model zoo includes 175+ pre-optimized models.
Qualcomm Insight Platform
The Qualcomm Insight Platform is a separate product focused on edge AI for video intelligence and security. It is a SaaS platform that runs AI models on Qualcomm-powered cameras and edge boxes for real-time video analytics — object detection, person tracking, anomaly detection. It uses an LLM-based conversational engine for querying video data.
This is relevant for IoT/edge deployments but not for building a typical AI agent harness.
When to Use Qualcomm for AI
| Scenario | Use Qualcomm? | Why |
|---|---|---|
| Android app with on-device AI | Yes | Hexagon NPU is the best Android AI accelerator |
| IoT/edge device (cameras, sensors) | Yes | Low power, good NPU, large ecosystem |
| Windows laptop AI | Maybe | Snapdragon X Elite runs models well, but Intel/AMD have competitive NPUs |
| Cloud inference | No | Use NVIDIA GPUs or cloud TPUs |
| Training models | No | NPUs are inference-only |
| Cross-platform agent harness | Indirect | Your harness calls APIs; the NPU accelerates the on-device runtime beneath |
Qualcomm vs Apple Neural Engine: Summary
| Aspect | Qualcomm (Snapdragon 8 Elite) | Apple (A18 Pro) |
|---|---|---|
| NPU TOPS | 75 TOPS | 35 TOPS |
| Max device RAM | 16 GB | 8 GB |
| Largest on-device model | 8B (quantized) | 3B (quantized) |
| Developer tools | AI Hub, QNN SDK | CoreML Tools, MLX |
| Framework | QNN, ONNX Runtime, LiteRT | CoreML, MLX |
| Ecosystem | Android, IoT, automotive, XR | iPhone, iPad, Mac |
| Advantage | More RAM, larger models, open ecosystem | Faster per-TOPS, tighter integration, better optimization |
22. OpenVINO: Intel’s Inference Optimization Toolkit
OpenVINO (Open Visual Inference and Neural network Optimization) is Intel’s open-source toolkit for optimizing and deploying AI inference on Intel hardware. If you have Intel CPUs, integrated GPUs, or Intel NPUs, OpenVINO can make your models run 2-5x faster than naive PyTorch or TensorFlow inference.
What It Does
OpenVINO sits between your trained model and Intel hardware. It takes a model from any major framework, converts it to an optimized intermediate representation, applies hardware-specific optimizations (quantization, kernel fusion, graph optimization), and runs inference using the best available Intel hardware.
Trained Model (PyTorch/ONNX/TF) --> OpenVINO Converter --> Optimized IR --> Intel Hardware
                                         |                                      |
                                         +-- Quantization (NNCF)                +-- CPU / GPU / NPU
                                         +-- Graph optimization
                                         +-- Kernel fusion
Supported Hardware
| Intel Hardware | What It Is | OpenVINO Support | Best For |
|---|---|---|---|
| Intel CPUs (Core, Xeon) | General-purpose processors | Full (primary target) | Server inference, any workload |
| Intel Arc GPUs | Discrete graphics cards | Full | Parallel inference, image models |
| Intel integrated GPUs | Built into Core processors | Full | Laptop/desktop inference |
| Intel NPU (Meteor Lake+) | Dedicated neural accelerator | Full | Always-on AI, efficient inference |
| Intel Gaudi | AI training/inference accelerator | Separate SDK | Data center training (not OpenVINO) |
Quick Start: Model Conversion and Inference
"""
openvino_quickstart.py -- Convert and run a model with OpenVINO.
Requires: pip install openvino nncf
pip install torch torchvision (for model download)
Works on any machine with an Intel CPU (no GPU required).
"""
import openvino as ov
import numpy as np
# --- Step 1: Convert a PyTorch model to OpenVINO ---
def convert_pytorch_model():
"""Convert a PyTorch model to OpenVINO IR format."""
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
# Load a pretrained model
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
model.eval()
# Create example input
example_input = torch.randn(1, 3, 224, 224)
# Convert to OpenVINO (one line)
ov_model = ov.convert_model(model, example_input=example_input)
# Save for later use (optional — avoids re-conversion)
ov.save_model(ov_model, "mobilenet_v2.xml")
return ov_model
# --- Step 2: Run inference ---
def run_inference(model_path: str = "mobilenet_v2.xml"):
"""Load and run an OpenVINO model."""
# Initialize the runtime
core = ov.Core()
# List available devices
print(f"Available devices: {core.available_devices}")
# Example output: ['CPU', 'GPU', 'NPU']
# Compile model for a specific device
# "CPU" = Intel CPU, "GPU" = Intel integrated/Arc GPU, "NPU" = Intel NPU
# "AUTO" = let OpenVINO pick the best device
compiled_model = core.compile_model(model_path, "AUTO")
# Run inference
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
result = compiled_model([input_data])
# Get output
output = result[0]
predicted_class = np.argmax(output)
print(f"Predicted class: {predicted_class}")
return predicted_class
# --- Step 3: Optimize with quantization ---
def quantize_model(model_path: str = "mobilenet_v2.xml"):
"""Apply INT8 quantization using NNCF for ~2x speedup."""
import nncf
core = ov.Core()
ov_model = core.read_model(model_path)
# Post-training quantization (no retraining needed)
# Requires a small calibration dataset (100-300 samples)
def calibration_data():
for _ in range(100):
yield [np.random.randn(1, 3, 224, 224).astype(np.float32)]
quantized_model = nncf.quantize(
ov_model,
nncf.Dataset(calibration_data()),
)
ov.save_model(quantized_model, "mobilenet_v2_int8.xml")
print("Quantized model saved. Expected ~2x speedup on Intel CPUs.")
if __name__ == "__main__":
print("Converting PyTorch model to OpenVINO...")
convert_pytorch_model()
print("\nRunning inference...")
run_inference()
print("\nQuantizing model...")
quantize_model()
LLM Inference with OpenVINO GenAI
OpenVINO has expanded beyond computer vision to support generative AI workloads:
"""
openvino_llm.py -- Run an LLM with OpenVINO on Intel hardware.
Requires: pip install openvino-genai optimum[openvino]
Convert a HuggingFace model first:
optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct \
--weight-format int4 llama-1b-ov
"""
import openvino_genai as ov_genai
def run_llm(model_dir: str = "llama-1b-ov"):
"""Run LLM inference on Intel CPU/GPU."""
# Load the model (automatically selects best device)
pipe = ov_genai.LLMPipeline(model_dir, "CPU")
# Generate text
    result = pipe.generate(
        "Explain what a KV cache is in one paragraph.",
        max_new_tokens=128,
        do_sample=True,       # sampling must be enabled for temperature to take effect
        temperature=0.7,
    )
print(result)
if __name__ == "__main__":
run_llm()
Key GenAI features in OpenVINO 2026.0:
- Mixture of Experts (MoE) model support (GPT-OSS-20B, Qwen3-30B)
- Speculative decoding with EAGLE-3 on CPU, GPU, and NPU
- Text-to-video pipeline (LTX-Video model)
- Whisper speech-to-text with word-level timestamps
- INT4 data-aware weight compression for MoE models
OpenVINO vs CoreML vs TensorRT
| Aspect | OpenVINO | CoreML | TensorRT |
|---|---|---|---|
| Vendor | Intel (open-source) | Apple (proprietary) | NVIDIA (proprietary) |
| Target hardware | Intel CPU, GPU, NPU | Apple Neural Engine, GPU, CPU | NVIDIA GPUs only |
| Input formats | PyTorch, ONNX, TF, PaddlePaddle, JAX | PyTorch, ONNX, TF (via coremltools) | ONNX, PyTorch (via torch-tensorrt) |
| Quantization | INT8, INT4, FP8 (via NNCF) | INT8, palettization, pruning | FP8, INT8, INT4 |
| LLM support | Yes (OpenVINO GenAI) | Yes (CoreML for Apple Intelligence) | Yes (TensorRT-LLM) |
| Typical speedup | 2-5x over PyTorch on Intel CPUs | 3-10x on Neural Engine | 2-6x on NVIDIA GPUs |
| Open source | Yes (Apache 2.0) | No | No (limited source available) |
| Cross-platform | Linux, Windows, macOS (Intel only) | macOS, iOS only | Linux, Windows (NVIDIA only) |
ONNX Ecosystem Integration
OpenVINO fits into the broader ONNX ecosystem as one of several execution providers:
PyTorch Model
|
v
ONNX Format (universal interchange)
|
+-- ONNX Runtime + OpenVINO EP --> Intel hardware
+-- ONNX Runtime + TensorRT EP --> NVIDIA hardware
+-- ONNX Runtime + CoreML EP --> Apple hardware
+-- ONNX Runtime + QNN EP --> Qualcomm hardware
+-- ONNX Runtime + DirectML EP --> Windows GPUs
This means you can export your model to ONNX once and run it on any hardware via the appropriate execution provider. OpenVINO can be used either standalone (direct API) or as an ONNX Runtime execution provider.
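A minimal sketch of that pattern with ONNX Runtime, assuming you have installed the Intel build (pip install onnxruntime-openvino) and already have an exported model.onnx on disk — both are assumptions for illustration:
"""
onnx_openvino_ep.py -- Run an ONNX model through the OpenVINO execution provider.
Requires: pip install onnxruntime-openvino numpy
Falls back to the plain CPU provider if the OpenVINO EP is not available.
"""
import numpy as np
import onnxruntime as ort
preferred = ["OpenVINOExecutionProvider", "CPUExecutionProvider"]
available = ort.get_available_providers()
providers = [p for p in preferred if p in available]
session = ort.InferenceSession("model.onnx", providers=providers)
print("Active providers:", session.get_providers())
# Dummy image-shaped input; adjust to your model's actual input shape
input_name = session.get_inputs()[0].name
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print("Output shape:", outputs[0].shape)
The same exported model file can then be pointed at a different execution provider on other hardware without re-exporting from the original framework.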
When to Use OpenVINO
| Scenario | Use OpenVINO? | Why |
|---|---|---|
| Server inference on Intel Xeon CPUs | Yes | Primary use case, significant speedup over raw PyTorch |
| Laptop inference on Intel Core | Yes | Good acceleration, especially with integrated GPU and NPU |
| Edge devices with Intel chips | Yes | Supports NPU for efficient always-on inference |
| NVIDIA GPU inference | No | Use TensorRT or vLLM instead |
| Apple Silicon inference | No | Use CoreML or MLX instead |
| Qualcomm device inference | No | Use QNN SDK or AI Hub instead |
| Cross-platform deployment | Maybe | Use ONNX Runtime with OpenVINO EP for Intel, other EPs for other hardware |
| Building an AI agent harness | Unlikely | Your harness likely calls cloud APIs; OpenVINO matters if you self-host inference on Intel hardware |
Practical Relevance for Harness Builders
OpenVINO is most relevant if you are:
- Self-hosting inference on Intel server hardware (common in enterprise environments where GPU procurement is slow or restricted)
- Running models on Intel laptops for local development without an NVIDIA GPU
- Deploying edge AI on Intel-based IoT devices (Intel NUC, industrial PCs)
- Using ONNX Runtime as your inference backend and want Intel-optimized execution
If your harness calls cloud inference APIs (OpenAI, Anthropic, Google), OpenVINO is irrelevant — the cloud provider handles hardware optimization. If you run models locally on Apple Silicon, use MLX or CoreML instead.
See Also
- Doc 01 (Foundation Models) — Model size depends on hardware; SLM selection is hardware-aware
- Doc 02 (KV Cache Optimization) — Hardware choice (GPU VRAM vs unified memory) affects cache strategy
- Doc 13 (Cost Management) — Hardware cost (amortization, electricity, maintenance) factors into total cost of ownership
- Doc 23 (Apple Intelligence & CoreML) — Apple’s inference optimization stack, comparison point for OpenVINO and Qualcomm
- Doc 25 (Edge & Physical AI) — Edge deployment patterns where Qualcomm NPU and OpenVINO are relevant
- Doc 26 (TensorFlow & Frameworks) — Framework ecosystem context; OpenVINO, CoreML, TensorRT as deployment targets
- Doc 28 (Unified Memory & Hardware Economics) — Deep dive into why Apple Silicon unified memory changes the economics