
Hardware Landscape

GPU vs CPU comparison, NVIDIA (H100, RTX), Apple M-series, mobile chips, Broadcom AI networking, Qualcomm Hexagon NPU, Intel OpenVINO — hardware detection scripts, benchmarks, cost-per-TFLOP analysis, and a hardware selector tool.

Overview

Choosing hardware for AI is a cost-performance-power trade-off. You need to match your workload (training, inference, local, cloud) to the right chip. This guide covers what’s available, why you’d buy it, and how much it costs.

TL;DR: For local development use Apple M-series or RTX 4070. For production use cloud GPUs (H100/H200). For edge inference use mobile chips or M1/M2.


1. CPU vs GPU vs AI Chips: Fundamentals

| Hardware | What It Does | Best For | Cost | Power |
|---|---|---|---|---|
| CPU | Sequential execution, smart branching, all general tasks | Everything (training, inference, serving, glue code) | $50–$500 | 10–150W |
| GPU | Parallel processing, 10,000+ threads, linear algebra | Training, batch inference, matrix ops | $200–$12,000 | 200–600W |
| TPU | Custom-built tensor operations | Google Cloud training/inference only | Cloud only | 250–500W |
| Neural Engine | Optimized for 8-bit/16-bit inference | On-device AI (Apple, Qualcomm) | Built-in | 1–10W |
| FPGA | Programmable hardware | Custom inference, latency-critical | $500–$5,000 | 50–300W |

Why the differences?

  • CPUs are like smart generalists. They handle branching, complex logic, and sequential work. One core can do one thing at a time, but it’s flexible.
  • GPUs are like dumb sprinters. They have 10,000+ cores that run the same instruction on different data. Perfect for matrix multiplication (what neural networks do), terrible at decision-making.
  • Neural Engines are specialized accelerators (NPUs) optimized for inference at 8-bit or 16-bit precision. They use less power and die area than a GPU but can’t train models.
  • TPUs are Google’s custom silicon—not available to the public except via Google Cloud.

Practical implication: If you’re training, you need a GPU (or TPU cloud). If you’re running inference on a server, GPU or CPU works (GPU is faster for batch, CPU is fine for single requests). If you’re running on a phone or laptop, use the Neural Engine or Apple M-series.
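That routing decision can be expressed as a tiny helper. A sketch assuming PyTorch (its `cuda` and `mps` availability checks are real APIs; the `pick_device` helper itself is ours), falling back to plain CPU if torch isn't installed:

```python
def pick_device() -> str:
    """Pick the fastest available backend: NVIDIA GPU > Apple GPU > CPU."""
    try:
        import torch  # optional dependency; CPU fallback works without it
        if torch.cuda.is_available():
            return "cuda"  # NVIDIA GPU: training and batch inference
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"   # Apple Silicon GPU via Metal
    except ImportError:
        pass
    return "cpu"           # always available, slowest option

print(pick_device())
```

The same preference order applies when choosing where to run inference: GPU for batch work, CPU for occasional single requests.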


2. NVIDIA Ecosystem: The Default GPU Choice

NVIDIA dominates because they own CUDA (the software that lets you use GPUs), have the best drivers, and have been optimizing for AI for 15 years. Here’s what’s available:

Data Center & Training

| GPU | VRAM | Price | TFLOPS (FP32) | Best For | Cloud Availability |
|---|---|---|---|---|---|
| H200 | 141GB | $38,000 | 67 | Large models, training | AWS, GCP, Azure (early) |
| H100 | 80GB | $32,000 | 67 | Training, large inference | AWS, GCP, Azure |
| A100 | 40/80GB | $10,000–$18,000 | 19.5 | Training, batch inference | AWS, GCP, Azure, on-prem |
| A6000 | 48GB | $6,500 | 38.7 | Research, production inference | AWS, on-prem |

What these numbers mean:

  • VRAM: Bigger = larger models fit in memory. H200’s 141GB holds massive models without offloading.
  • TFLOPS: Floating-point operations per second. FP32 is shown here, but practical ML workloads use TF32 or bfloat16 (2× throughput). Higher = faster, but not everything scales linearly (memory bandwidth matters too).
  • Price per TFLOP: H100 = ~$478/TFLOP, A100 = ~$513–$923/TFLOP. H100 is expensive but new, so cloud providers absorb the cost.
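Those dollars-per-TFLOP figures are just list price divided by FP32 throughput. A quick sketch using the prices and TFLOPS quoted in the tables in this section:

```python
gpus = {
    # name: (list price USD, FP32 TFLOPS) -- figures from the tables in this guide
    "H100":     (32_000, 67.0),
    "A100":     (10_000, 19.5),
    "RTX 4090": (1_500, 82.6),
    "RTX 4070": (600, 29.0),
}

def price_per_tflop(price: float, tflops: float) -> float:
    return price / tflops

for name, (price, tflops) in gpus.items():
    # H100 ≈ $478/TFLOP, RTX 4090 ≈ $18/TFLOP
    print(f"{name:10s} ${price_per_tflop(price, tflops):,.0f}/TFLOP")
```

Consumer cards win dramatically on raw FP32 per dollar; data center parts are priced for VRAM capacity, interconnect, and reliability rather than FLOPS.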

Consumer/Enthusiast GPUs

| GPU | VRAM | Price | TFLOPS (FP32) | Best For |
|---|---|---|---|---|
| RTX 4090 | 24GB | $1,500 | 82.6 | Local training, research |
| RTX 4080 Super | 16GB | $1,200 | 52 | High-end gaming + some training |
| RTX 4070 Ti Super | 16GB | $900 | 44 | Good training, better inference |
| RTX 4070 | 12GB | $600 | 29 | Solid all-rounder |
| RTX 4070 Mobile | 8GB | $1,500–$2,500 (laptop) | 21 | Laptop training |
| L40 | 48GB | $10,000 | 90.5 | Inference-optimized, data center |
| L4 | 24GB | $3,000 | 30.3 | Edge inference, data center |

Decision points:

  • If you’re buying one GPU for local work, RTX 4070 is the sweet spot: $600, handles 7B–13B models, good for most projects.
  • If budget allows, RTX 4090 is best for research (~82.6 TFLOPS FP32), but requires good cooling and an 850W+ PSU.
  • If you only care about inference (not training), L40 or L4 are more cost-effective in data centers.

AMD Alternative

| GPU | VRAM | Price | Best For |
|---|---|---|---|
| RX 7900 XTX | 24GB | $700 | Budget alternative to RTX 4080 |
| MI300X | 192GB | $20,000 | Cloud training (AMD alternative to H100) |

Trade-off: AMD is cheaper but ROCm (AMD’s CUDA equivalent) is less mature. Libraries like PyTorch support it, but fewer optimizations exist. Use if you must save money.

Power & Thermal

| GPU | Power Draw | PSU Required | Cooling Notes |
|---|---|---|---|
| RTX 4090 | 450W | 850W+ | Needs aftermarket cooling, loud at full load |
| RTX 4080 | 320W | 1000W | Standard tower cooler sufficient |
| RTX 4070 | 200W | 750W | Quiet operation possible |
| H100 | 700W | Data center PSU | Requires liquid cooling in data centers |
| A100 | 400W | Data center PSU | Requires good ventilation |

Cost to run 24/7: RTX 4090 at $0.15/kWh = 450W × 8,760 hours × $0.15 = ~$590/year. M-series laptops: ~$20/year.
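The electricity figure generalizes: cost is just watts, hours, and your utility rate. A sketch reproducing the numbers above (the 15W laptop average is an assumption covering idle time, not a measured spec):

```python
def annual_power_cost(watts: float, price_per_kwh: float = 0.15,
                      hours_per_year: int = 8_760) -> float:
    """Electricity cost (USD) of running a device continuously for a year."""
    return watts / 1_000 * hours_per_year * price_per_kwh

print(round(annual_power_cost(450)))  # RTX 4090 at full load 24/7 → 591
print(round(annual_power_cost(15)))   # laptop averaged over idle/load → 20
```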


3. Apple Silicon (M-series): The Unified Memory Advantage

Apple’s M-series chips are the secret weapon for local development and edge inference. The magic word: unified memory.

The Unified Memory Difference

Traditional GPU (NVIDIA):

CPU → PCIe → GPU Memory
Data copy: CPU has data → send to GPU (slow)
Compute: GPU does work
Result copy: GPU memory → send back to CPU (slow)

Apple M-series (Unified Memory):

CPU + GPU share the same memory
No copying. CPU and GPU access the same data instantly.

Performance impact: 20–40% faster for many workloads because there is no data-copy overhead. NVIDIA addresses this with NVLink, but only on data center GPUs; consumer NVIDIA cards still have this limitation.

Apple M-Series Lineup

| Chip | Cores | Unified Memory | Price (laptop) | Best For |
|---|---|---|---|---|
| M3 | 8-core CPU, 10-core GPU | 8/16/24GB | $1,500–$2,000 | Local dev, 7B models |
| M3 Max | 14/16-core CPU, 30/40-core GPU | 48GB | $3,500 | Serious local training, large models |
| M4 | 10-core CPU, 10-core GPU | 16/24GB | $1,600–$2,100 | Faster than M3 (especially CPU) |
| M4 Pro | 12/14-core CPU, 16/20-core GPU | 36GB | $2,500 | Best price-to-performance |
| M4 Max | 14/16-core CPU, 32/40-core GPU | 96GB | $3,500–$4,000 | High-end local work |
| M2 Ultra (Mac Studio) | 24-core CPU, 60/76-core GPU | 192GB | $7,000 | Enterprise-class local |

Real-World Examples

  • MacBook Air M3 with 16GB: Runs 7B models (Llama 2) at ~15 tokens/sec locally. Great for development.
  • MacBook Pro M3 Max with 48GB: Runs 13B models (Mistral, Llama 13B) at ~5–10 tokens/sec. Can fine-tune small adapters.
  • Mac Studio M2 Ultra with 192GB: Runs 70B models (Llama 70B) at ~1–2 tokens/sec. Can train small models.
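A back-of-envelope rule behind these examples: model memory ≈ parameter count × bytes per weight, plus overhead for activations and KV cache. The 1.2× overhead factor below is a rough assumption, not a measured constant:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int = 16,
                    overhead: float = 1.2) -> float:
    """Approximate memory needed to run a model, in GB."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"{model_memory_gb(7):.1f} GB")      # 7B at FP16: ~16.8 GB
print(f"{model_memory_gb(7, 4):.1f} GB")   # 7B at 4-bit: ~4.2 GB, fits a 16GB machine
print(f"{model_memory_gb(70, 4):.1f} GB")  # 70B at 4-bit: ~42 GB, needs M3 Max-class memory
```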

Cost-Effectiveness

RTX 4090 + high-end PC: $2,500 total, 450W power, needs cooling setup. MacBook Pro M3 Max: $3,500, 35W typical power, completely portable.

If you value silence, portability, and efficiency, M-series wins. If you need maximum raw compute per dollar, NVIDIA wins.


4. Intel Arc: The Underdog GPU

Intel is trying to challenge NVIDIA with the Arc series. Results are mixed.

| GPU | VRAM | Price | TFLOPS (FP32) | Status |
|---|---|---|---|---|
| Arc A770 | 8/16GB | $300–$400 | 19.7 | Competitive with RTX 4070, cheaper |
| Arc A750 | 8GB | $200 | 17.2 | Entry-level alternative |
| Flex 170 | 16GB | $2,500 | 17.6 | Data center inference |

Pros: Cheaper than NVIDIA, decent performance, integrated into some laptops. Cons: Driver support is immature (crashes, performance variance), fewer library optimizations, harder to debug issues.

When to buy: If you’re desperate for cheap GPU compute and can tolerate driver instability. Otherwise, RTX 4070 at $600 is safer.

Driver maturity timeline: Intel has been improving this, but NVIDIA is still the safe choice for production.


5. Consumer GPUs for Local AI: Decision Guide

The Choice Matrix

| Budget | Primary Use | Best GPU | Price | Notes |
|---|---|---|---|---|
| $0–$500 | Dev/inference | MacBook Air M3 or RTX 4070 | $1,500–$2,000 (laptop) or $600 (card) | M3 is portable; RTX 4070 is powerful |
| $500–$1,200 | Training + inference | RTX 4080 Super or Arc A770 | $1,200 or $400 | NVIDIA for safety; Arc for budget |
| $1,500–$3,000 | High-end research | RTX 4090 or MacBook Pro M3 Max | $1,500 or $3,500 | RTX 4090 = power; M3 Max = mobility |
| $3,000+ | Enterprise/lab | Mac Studio M2 Ultra or RTX 4090 cluster | $7,000 or $1,500×N | Unified memory vs raw speed |

My Recommendation for 2026

For local development: MacBook Air M3 16GB ($1,800). Unified memory, zero config, great for 7B models.

If you need raw speed: RTX 4070 ($600) in a desktop PC ($500 for case/PSU/mobo). Total ~$1,100. Beats M3 in training speed, costs less.

If you have budget: RTX 4090 ($1,500). Best single GPU for research. Needs good cooling and an 850W+ PSU.

For inference only: L40 ($10,000, enterprise) or RTX 4070 if building your own.


6. Mobile & Edge Chips: On-Device AI

The Hardware

| Chip | Device | AI Performance | Power | Use Case |
|---|---|---|---|---|
| Apple A17 Pro | iPhone 15 Pro | 16 TOPS | 2–3W active | On-device vision, speech |
| Qualcomm Snapdragon 8 Gen 3 | Android flagship | 10 TOPS | 2–4W active | On-device AI, gaming |
| Google Tensor G4 | Pixel 9 | 8 TOPS | 2–3W active | Tensor optimization for Pixel apps |
| MediaTek Dimensity 9300 | Mid-range Android | 6 TOPS | 1–2W active | Budget on-device AI |

Performance vs Servers

  • NVIDIA H100: ~67 TFLOPS (FP32); ~989 TFLOPS (FP16 Tensor Core, more practical for ML)
  • iPhone A17 Pro: 16 TOPS (INT8 Neural Engine throughput)

Comparing like precisions (the H100's low-precision tensor throughput runs into the thousands of TOPS), your phone is roughly two orders of magnitude slower. But here’s the trade-off:

| Metric | Phone | Server GPU |
|---|---|---|
| Latency | 50–100ms | 10–50ms (batch) |
| Power | 2–3W | 400–700W |
| Privacy | On-device, no upload | Shared infrastructure |
| Cost per inference | $0.0001 (amortized) | $0.001–$0.01 |

Real-World Usage

  • On-device models: Whisper (speech), Vision Transformer (image), small LLMs (3B or 7B with quantization)
  • Typical latency: 200–500ms for inference on 3B models
  • Battery impact: Minimal for occasional use, noticeable for continuous

Use mobile AI for:

  • Privacy-first features (voice command, on-device translation)
  • Reducing server load
  • Features that work offline

7. Specialized Hardware: Enterprise & Research

When Available

These are cloud/enterprise only. You can’t buy them for your home.

| Hardware | Provider | Cost | Best For |
|---|---|---|---|
| TPU v4e | Google Cloud | ~$2–$5/hour per accelerator | Training, huge models |
| AWS Trainium | AWS | Custom pricing | Training optimization, lower cost than GPU |
| AWS Inferentia | AWS | Custom pricing | High-throughput inference |
| Graphcore IPU | Graphcore (cloud partners) | Custom pricing | Custom AI workloads, research |
| Cerebras CS-3 | Cerebras (cloud) | Custom pricing | Largest single-chip training (wafer-scale memory) |

When to Use

TPU: If you’re training huge models (100B+) on Google Cloud. Google designs TPUs specifically for tensor operations, and they’re cheaper than H100s for sustained heavy training.

AWS Trainium: If training cost is your main concern. Generally cheaper per hour than GPUs for the same training job.

Others: Research only. Not production-ready or cost-effective for most teams.


8. Power and Thermal Considerations

Desktop PC Power Budget

| GPU | Power Draw | Recommended PSU | Cooling Difficulty | Noise Level |
|---|---|---|---|---|
| RTX 4090 | 450W | 850W+ | High (needs good air or water) | Loud at full load |
| RTX 4080 | 320W | 1000W | Medium (good tower cooler) | Moderate |
| RTX 4070 | 200W | 750W | Low (standard cooler) | Quiet |
| RTX 4070 Mobile | 140W | Laptop PSU | Built-in | Laptop fan noise |

Real Operating Conditions

H100 in a data center: 700W + air/liquid cooling + rack space + facilities cost (~$20K/year total ownership for one GPU).

RTX 4090 on a desk: 450W continuous. At full load 24/7: 450W × 8,760 hours × $0.15/kWh = $590/year in electricity. Most people don’t run it 24/7, so ~$200–$300/year is realistic.

MacBook M3: ~35W typical under load, peaking near 70W. Battery: ~15–20 hours per charge. At $0.15/kWh: ~$20/year if plugged in constantly, because average draw (including idle) is far below peak.

Data Center Considerations

If you’re running GPUs in a data center:

  • Cooling: Proper airflow required. H100s need 200+ CFM per card.
  • Power distribution: Dedicated circuits, UPS backup.
  • Space: 2U rack space per 2–4 GPUs.
  • Cost: Rack space $500–$2,000/month, plus power, plus labor.

Bottom line: If you need sustained compute, cloud is often cheaper than owning hardware due to shared infrastructure costs.


9. Decision Matrix: What Hardware to Buy

Scenario 1: Solo Developer Learning AI

| Decision | Choice | Why |
|---|---|---|
| Budget: $1,500–$2,000 | MacBook Air M3 16GB | Portable, unified memory, sufficient for 7B models, good battery |
| Alternative (if desktop preferred) | RTX 4070 + PC | $1,100 total, faster training, more room for growth |
| Timeline: Immediate | Buy now | Both will be viable for years |

Scenario 2: AI Research Team

| Decision | Choice | Why |
|---|---|---|
| Local GPUs: Yes | 2–4 RTX 4090s | $6K total, 5–10× faster than M-series |
| Cloud complement: Yes | AWS with H100s on-demand | For massive experiments; keep on-prem for iteration |
| Storage: Local NVMe RAID | 4TB RAID 10 | Working dataset cache, faster than cloud storage |

Scenario 3: Production Inference API

| Decision | Choice | Why |
|---|---|---|
| Where to run: AWS/GCP cloud | A100 or H100 clusters | Elasticity, no hardware to own, pay only for requests |
| GPU count: 4–8 | Batch inference on multiple GPUs | Higher throughput per dollar |
| Load balancing: Kubernetes + vLLM | Auto-scale, queue requests | Efficient, fault-tolerant |
| On-prem alternative: Only if >10K req/sec | Buy A100s, need IT team | Once you exceed cloud cost, on-prem makes sense |

Scenario 4: Budget Startup

| Decision | Choice | Why |
|---|---|---|
| GPU for training | RTX 4070 | $600, good for quick iteration, 12GB VRAM |
| Dev environment | MacBook M3 + RTX 4070 desktop | Portable dev on M3, heavy compute on RTX |
| Production inference | AWS Lambda + GPU (part-time) or EC2 with L4 | No upfront cost, scale with usage |

Scenario 5: Edge Deployment

| Decision | Choice | Why |
|---|---|---|
| Phone/tablet | Existing hardware (A17/Snapdragon) | No extra purchase, on-device AI free |
| Custom inference device | Raspberry Pi 5 + M.2 accelerator or NVIDIA Jetson Orin | $200–$600, runs 3B models at 50–100ms |
| Low-power IoT | Google Coral TPU or NVIDIA Jetson Nano | <$100, runs <100MB models, very fast |
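Taken together, the five scenarios boil down to a workload-plus-budget lookup. A minimal selector sketch (the function name and thresholds are our own summary of the tables above, not an official tool):

```python
def select_hardware(workload: str, budget_usd: float) -> str:
    """Crude hardware pick condensed from the decision matrix scenarios."""
    if workload == "edge":
        return "Built-in NPU (A17/Snapdragon) or Jetson/Coral for custom devices"
    if workload == "inference-prod":
        return "Cloud A100/H100 on-demand (own hardware only at sustained high load)"
    if workload == "training":
        if budget_usd >= 3_000:
            return "RTX 4090 workstation, plus cloud H100 for big jobs"
        if budget_usd >= 1_000:
            return "RTX 4070/4080 desktop"
        return "Cloud GPUs by the hour"
    # default: local development and prototyping
    return "MacBook (M-series) or RTX 4070 desktop"

print(select_hardware("training", 1_200))  # → RTX 4070/4080 desktop
```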

10. Cloud vs On-Premise Economics

Cost Model

Cloud (AWS Example)

Training on H100: $3.00/hour per GPU

  • 100-hour training job: 100 × $3 = $300
  • No upfront cost, no hardware to manage

Production inference on A100: $2.00/hour per GPU

  • 1M inferences/month at 10 req/sec average
  • 1 GPU handles ~200 req/sec = 0.05 GPUs needed
  • 30 days × 24 hours × 0.05 GPU = 36 GPU-hours = $72/month
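That sizing arithmetic in code form (the ~200 req/sec per-GPU capacity is this example's assumption; measure your own model's throughput before relying on it):

```python
def monthly_inference_cost(avg_rps: float, gpu_capacity_rps: float,
                           gpu_hourly_rate: float,
                           hours_per_month: int = 720) -> float:
    """Fractional-GPU cost model: pay for the share of a GPU your traffic uses."""
    gpus_needed = avg_rps / gpu_capacity_rps
    return gpus_needed * hours_per_month * gpu_hourly_rate

# 10 req/sec average, A100 handling ~200 req/sec, $2.00/hour:
print(f"${monthly_inference_cost(10, 200, 2.00):.0f}/month")  # → $72/month
```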

On-Premise (Break-Even Analysis)

RTX 4090 for training:

  • Hardware cost: $1,500
  • Power: $590/year
  • Cooling/space: $500/year (rough estimate for home)
  • 3-year amortization: ($1,500 + $590×3 + $500×3) / 3 = $1,590/year or ~$0.18/hour

H100 in data center:

  • Hardware cost: $32,000
  • Power: 700W × 8,760 hours × $0.12/kWh = ~$736/year
  • Space/cooling/labor: $15,000/year
  • 3-year amortization: ($32,000 + $736×3 + $15,000×3) / 3 = ~$26,400/year or ~$3.01/hour

Break-even:

  • Cloud H100 at $3/hour vs on-prem at ~$3/hour is near parity at full utilization
  • If you run >8,000 hours/year (1 GPU, 24/7), on-prem is cheaper
  • If you run <4,000 hours/year, cloud is cheaper (flexibility matters)
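Break-even falls out directly: amortized yearly ownership cost divided by the cloud hourly rate gives the hours per year at which owning wins. A sketch with rough H100 estimates ($32K hardware, ~$736/year power at $0.12/kWh, $15K/year space and labor):

```python
def annualized_cost(hardware: float, yearly_power: float,
                    yearly_overhead: float, amortize_years: int = 3) -> float:
    """Yearly cost of owning hardware, amortized over its useful life."""
    return hardware / amortize_years + yearly_power + yearly_overhead

def breakeven_hours(own_yearly: float, cloud_hourly: float) -> float:
    """GPU-hours per year above which owning beats renting."""
    return own_yearly / cloud_hourly

h100_yearly = annualized_cost(32_000, 736, 15_000)  # ≈ $26,400/year
print(round(breakeven_hours(h100_yearly, 3.00)))    # → 8801 hours/year vs $3/hr cloud
```

With a break-even near 8,800 hours against an 8,760-hour year, a single H100 has to run essentially 24/7 before ownership pays off, which is the "near parity" conclusion above.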

Decision Rule

| Annual GPU Hours | Best Option |
|---|---|
| <2,000 hours | Pure cloud (AWS on-demand) |
| 2,000–8,000 hours | Hybrid (cloud for spikes, local for baseline) |
| >8,000 hours | On-prem (one GPU) |
| >50,000 hours | On-prem cluster (multiple GPUs) |

A typical hybrid setup:

  • Local: RTX 4070 or M-series for development and prototyping
  • Cloud: AWS H100 for large training jobs (spin up, train, spin down)
  • Cost: development is local (low); big experiments run in the cloud (cheaper per compute hour due to scale)

11. Unified Memory Advantage Deep Dive

Why It Matters

NVIDIA GPU Memory Architecture (PCIe bottleneck):

Typical PCIe 4.0 bandwidth: 32 GB/sec
Training 70B model with 2 GPUs needs ~140 GB data
Moving data GPU→GPU: 140 GB / 32 GB/sec = 4.4 seconds per iteration
(This is why NVLink exists on H100s, but not on consumer GPUs)

Apple Unified Memory (no PCIe):

Memory bandwidth: 100+ GB/sec (system memory)
CPU and GPU access same data: zero copy overhead
For inference: 20–40% faster because no data copy

Practical Example: 7B Model Inference

NVIDIA RTX 4090:

  1. Load 7B model from storage to CPU memory: 14GB
  2. Copy to GPU memory: 14GB / 32 GB/sec PCIe = 0.44 seconds
  3. Inference: 15 tokens/sec
  4. Copy results back: negligible

Apple M3 (16GB unified):

  1. Load 7B model: 14GB (already in unified memory)
  2. Inference: 15 tokens/sec
  3. No copy overhead

Result: for a single inference run, skipping the copy saves under half a second, a modest win. The 20–40% figure applies to workflows that repeatedly move data between CPU and GPU. For models that don’t fit in memory (and need offloading), NVIDIA is faster.
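The copy-overhead arithmetic here is simply size divided by link bandwidth (a sketch; real PCIe transfers sustain somewhat less than the theoretical peak):

```python
def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move data across a link at a given sustained bandwidth."""
    return size_gb / bandwidth_gb_s

print(f"{transfer_seconds(14, 32):.2f}s")   # 7B model over PCIe 4.0 x16 → 0.44s
print(f"{transfer_seconds(140, 32):.1f}s")  # 70B sharded across 2 GPUs → 4.4s
print(f"{transfer_seconds(14, 100):.2f}s")  # same 7B load at unified-memory bandwidth
```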

When Unified Memory Doesn’t Matter

  • Large models (70B+): Don’t fit in M3 Max 48GB, need offloading anyway (loses advantage)
  • Batch training: NVIDIA’s CUDA libraries are optimized for batching; Apple’s are not
  • Server inference: discrete GPU VRAM (HBM/GDDR) has far higher bandwidth than unified system memory, so unified memory is no advantage there

Why NVIDIA Doesn’t Have This (Consumer)

NVIDIA’s architecture separates CPU and GPU: different dies, different instruction sets, and merging them would mean redesigning everything. A100/H100 have NVLink (connecting GPUs at high bandwidth), but consumer GPUs use PCIe, which is comparatively slow.

Apple unified CPU + GPU because they control the whole stack (chip design, software). NVIDIA can’t do this without breaking 20 years of CUDA.


12. Future Hardware: 2026 and Beyond

Expected Releases

| Vendor | Hardware | Expected | What’s New |
|---|---|---|---|
| NVIDIA | Blackwell (H100 successor) | Q2 2025 (shipping in volume in 2026) | 2× performance, better power efficiency, NVLink 5 |
| NVIDIA | RTX 5000 series | Q4 2025 | Consumer Blackwell, ~3× faster than RTX 4090 |
| Apple | M5 chip | Spring 2026 | Likely 20% faster than M4, more GPU cores |
| Intel | Arc B-series (Battlemage) | Q2–Q4 2025 | Driver improvements, better performance/watt |
| AMD | RDNA 4 | Q1–Q2 2026 | Competitor to RTX 5000 series |
| Cerebras | Wafer-Scale Engine 4 | 2026 | Wafer-scale, not PCIe; massive on-chip memory, research only |
| Google | TPU v5e | Now available | Better cost per training TFLOP |

What Will Actually Matter

  1. Power efficiency: As electricity costs rise, watts-per-TFLOP becomes critical
  2. HBM memory: Data center Blackwell uses HBM3e (much faster, higher bandwidth than GDDR)
  3. Unified memory adoption: May see more ARM-based chips with unified memory
  4. Sparse compute: Models with fewer parameters become standard (efficiency wins)
  5. On-device AI: Phones get better Neural Engines; less need to send data to servers

Safe Bets for Buying Now

  • RTX 4070: Will work for years. If new cards are 3x faster, so what—4070 still runs 7B models fine.
  • M3/M4: Will be supported for development for 5+ years minimum (Apple’s track record).
  • Cloud compute: Always flexible. Doesn’t matter if you’re using H100 or Blackwell; pay per hour.

Quick Reference: Hardware by Use Case

Local Development (Laptop)

  • Best: MacBook Pro M4 16GB ($2,500)
  • Runner-up: MacBook Air M3 16GB ($1,800)
  • Why: Unified memory, portable, zero setup

Local Development (Desktop)

  • Best: RTX 4070 + PC ($1,100 total)
  • Runner-up: RTX 4090 if you have $2,000+ budget
  • Why: Fastest, most expandable

Training (Home Lab)

  • Best: RTX 4090 ($1,500) or RTX 4080 ($1,200)
  • Setup: i9 CPU, 64GB RAM, 1000W PSU, good cooling
  • Cost: $3,000–$4,000 total for GPU + system

Training (Cloud)

  • Best: AWS with on-demand H100s or Trainium
  • Cost: $3–$10/hour per GPU depending on instance type
  • Recommendation: Always start here. Buy hardware only if you exceed cloud cost.

Production Inference

  • Scale: <10K req/sec: AWS A100 or H100 on-demand
  • Scale: 10K–100K req/sec: Dedicated instances (cheaper per request)
  • Scale: >100K req/sec: Own cluster (break-even on hardware)

Edge (Phone/Tablet)

  • Use built-in Neural Engine: A17, Snapdragon 8, Tensor 4
  • Cost: $0 (already in device)
  • Typical latency: 100–500ms for 3B models

Edge (Custom Device)

  • Best: Google Coral TPU ($50–$100) or NVIDIA Jetson Nano ($100–$200)
  • For: Running pre-trained 100MB–1GB models offline
  • Latency: 50–200ms

Summary: The Cost-Performance Frontier

As of April 2026:

Best value: RTX 4070 ($600 GPU + $500 system = $1,100 total). Handles 7B–13B models for training and inference. Most people should buy this.

Best mobility: MacBook Air M3 ($1,800). Unified memory, silent, 15–20 hour battery, sufficient for most dev work.

Best raw power: RTX 4090 ($1,500) for single GPU. Needs good cooling and power supply.

Best for production: AWS with H100 or A100 on-demand. Pay per use, elasticity, no hardware to manage.

Best for edge: Use existing phone chips (A17, Snapdragon, Tensor). Or Raspberry Pi 5 + Coral TPU (~$200) for custom devices.

Future-proof: Whatever you buy in 2026 will be obsolete in 3–5 years. Don’t overspend on hardware you’ll replace. Buy what solves today’s problem, assume you’ll upgrade.


13. Hardware Detection Script

Before choosing models or optimizations, know what you have. This script detects your hardware and recommends what models you can run.

"""
hardware_detect.py — Detect AI-relevant hardware and recommend model sizes.

Works on Linux (NVIDIA/AMD GPUs), macOS (Apple Silicon), and Windows.
Requires: psutil (pip install psutil)
Optional: torch, pynvml (for GPU details)
"""

import platform
import subprocess
import shutil
import json
from dataclasses import dataclass, field


@dataclass
class GPUInfo:
    name: str = "Unknown"
    vram_gb: float = 0.0
    cuda_version: str = "N/A"
    driver_version: str = "N/A"
    compute_capability: str = "N/A"


@dataclass
class CPUInfo:
    name: str = "Unknown"
    cores_physical: int = 0
    cores_logical: int = 0
    architecture: str = "Unknown"


@dataclass
class SystemInfo:
    cpu: CPUInfo = field(default_factory=CPUInfo)
    gpus: list = field(default_factory=list)
    ram_gb: float = 0.0
    os_name: str = "Unknown"
    has_neural_engine: bool = False
    neural_engine_cores: int = 0
    unified_memory: bool = False
    apple_chip: str = ""


def detect_cpu() -> CPUInfo:
    """Detect CPU type, cores, and architecture."""
    import psutil

    cpu = CPUInfo()
    cpu.cores_physical = psutil.cpu_count(logical=False) or 0
    cpu.cores_logical = psutil.cpu_count(logical=True) or 0
    cpu.architecture = platform.machine()

    system = platform.system()
    if system == "Darwin":
        try:
            result = subprocess.run(
                ["sysctl", "-n", "machdep.cpu.brand_string"],
                capture_output=True, text=True, timeout=5
            )
            cpu.name = result.stdout.strip() or "Apple Silicon"
        except (subprocess.TimeoutExpired, FileNotFoundError):
            cpu.name = "Apple Silicon (detection failed)"
    elif system == "Linux":
        try:
            with open("/proc/cpuinfo", "r") as f:
                for line in f:
                    if "model name" in line:
                        cpu.name = line.split(":")[1].strip()
                        break
        except FileNotFoundError:
            cpu.name = "Unknown Linux CPU"
    elif system == "Windows":
        cpu.name = platform.processor() or "Unknown Windows CPU"

    return cpu


def detect_nvidia_gpu() -> list[GPUInfo]:
    """Detect NVIDIA GPUs using nvidia-smi (no Python deps needed)."""
    gpus = []

    if not shutil.which("nvidia-smi"):
        return gpus

    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=name,memory.total,driver_version",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True, text=True, timeout=10,
        )
        if result.returncode != 0:
            return gpus

        for line in result.stdout.strip().split("\n"):
            parts = [p.strip() for p in line.split(",")]
            if len(parts) >= 3:
                gpu = GPUInfo()
                gpu.name = parts[0]
                gpu.vram_gb = round(float(parts[1]) / 1024, 1)
                gpu.driver_version = parts[2]
                gpus.append(gpu)

        # Get compute capability for each GPU
        cuda_result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
        if cuda_result.returncode == 0:
            caps = cuda_result.stdout.strip().split("\n")
            for i, cap in enumerate(caps):
                if i < len(gpus):
                    gpus[i].compute_capability = cap.strip()

        # Get CUDA toolkit version
        cuda_ver = subprocess.run(
            ["nvcc", "--version"],
            capture_output=True, text=True, timeout=10,
        )
        if cuda_ver.returncode == 0:
            for line in cuda_ver.stdout.split("\n"):
                if "release" in line.lower():
                    version = line.split("release")[-1].split(",")[0].strip()
                    for gpu in gpus:
                        gpu.cuda_version = version

    except (subprocess.TimeoutExpired, FileNotFoundError):
        pass

    return gpus


def detect_apple_silicon() -> dict:
    """Detect Apple Silicon details including Neural Engine."""
    info = {
        "chip": "",
        "neural_engine": False,
        "neural_engine_cores": 0,
        "unified_memory": False,
        "gpu_cores": 0,
    }

    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return info

    info["unified_memory"] = True

    # Every Apple Silicon chip ships with a Neural Engine, so no probe is needed
    info["neural_engine"] = True

    # Detect chip name from system_profiler
    try:
        result = subprocess.run(
            ["system_profiler", "SPHardwareDataType", "-json"],
            capture_output=True, text=True, timeout=15,
        )
        if result.returncode == 0:
            data = json.loads(result.stdout)
            hw = data.get("SPHardwareDataType", [{}])[0]
            chip_name = hw.get("chip_type", "")
            info["chip"] = chip_name

            # Neural Engine core counts by generation
            ne_cores = {
                "M1": 16, "M2": 16, "M3": 16, "M4": 16,
                "M1 Pro": 16, "M1 Max": 16, "M1 Ultra": 32,
                "M2 Pro": 16, "M2 Max": 16, "M2 Ultra": 32,
                "M3 Pro": 16, "M3 Max": 16,
                "M4 Pro": 16, "M4 Max": 16,
            }
            for chip, cores in ne_cores.items():
                if chip in chip_name:
                    info["neural_engine_cores"] = cores
                    break
            else:
                if "Apple" in chip_name:
                    info["neural_engine_cores"] = 16  # default

            # GPU core count is reported by SPDisplaysDataType, not SPHardwareDataType
            gpu_result = subprocess.run(
                ["system_profiler", "SPDisplaysDataType", "-json"],
                capture_output=True, text=True, timeout=15,
            )
            if gpu_result.returncode == 0:
                displays = json.loads(gpu_result.stdout).get(
                    "SPDisplaysDataType", [{}]
                )
                cores = str(displays[0].get("sppci_cores", "0"))
                info["gpu_cores"] = int(cores) if cores.isdigit() else 0
    except (subprocess.TimeoutExpired, FileNotFoundError, json.JSONDecodeError):
        pass

    return info


def detect_ram_gb() -> float:
    """Detect total system RAM in GB."""
    import psutil
    return round(psutil.virtual_memory().total / (1024 ** 3), 1)


def recommend_model_size(system: SystemInfo) -> dict:
    """Recommend maximum model size based on detected hardware."""
    recommendations = {
        "max_model_params": "",
        "quantization": "",
        "framework": "",
        "reasoning": [],
    }

    # Determine available memory for models
    available_vram = 0.0
    has_gpu = False

    if system.gpus:
        has_gpu = True
        available_vram = max(gpu.vram_gb for gpu in system.gpus)
    elif system.unified_memory:
        # Apple Silicon: ~75% of RAM usable for models
        available_vram = system.ram_gb * 0.75

    # Model size estimates (quantized with AWQ/GGUF Q4):
    # 7B  = ~4GB,  13B = ~8GB,  34B = ~20GB,
    # 70B = ~40GB, 180B = ~100GB
    if available_vram >= 100:
        recommendations["max_model_params"] = "180B"
        recommendations["quantization"] = "AWQ 4-bit or FP16 for 70B"
        recommendations["reasoning"].append(
            f"{available_vram:.0f}GB available — can run 180B quantized or 70B at FP16"
        )
    elif available_vram >= 40:
        recommendations["max_model_params"] = "70B"
        recommendations["quantization"] = "AWQ 4-bit recommended"
        recommendations["reasoning"].append(
            f"{available_vram:.0f}GB available — 70B fits with 4-bit quantization"
        )
    elif available_vram >= 20:
        recommendations["max_model_params"] = "34B"
        recommendations["quantization"] = "AWQ 4-bit or GGUF Q4_K_M"
        recommendations["reasoning"].append(
            f"{available_vram:.0f}GB available — 34B fits comfortably quantized"
        )
    elif available_vram >= 8:
        recommendations["max_model_params"] = "13B"
        recommendations["quantization"] = "GGUF Q4_K_M recommended"
        recommendations["reasoning"].append(
            f"{available_vram:.0f}GB available — 13B fits with quantization"
        )
    elif available_vram >= 4:
        recommendations["max_model_params"] = "7B"
        recommendations["quantization"] = "GGUF Q4_K_M required"
        recommendations["reasoning"].append(
            f"{available_vram:.0f}GB available — 7B at 4-bit quantization"
        )
    else:
        recommendations["max_model_params"] = "3B or smaller"
        recommendations["quantization"] = "GGUF Q4_0 (most aggressive)"
        recommendations["reasoning"].append(
            f"Only {available_vram:.0f}GB available — limited to small models"
        )

    # Framework recommendation
    if system.unified_memory:
        recommendations["framework"] = "llama.cpp (Metal) or MLX"
        recommendations["reasoning"].append(
            "Apple Silicon detected — use MLX or llama.cpp with Metal acceleration"
        )
    elif has_gpu and any("NVIDIA" in g.name or "GeForce" in g.name or "RTX" in g.name
                         for g in system.gpus):
        recommendations["framework"] = "vLLM, TGI, or llama.cpp (CUDA)"
        recommendations["reasoning"].append(
            "NVIDIA GPU detected — use CUDA-accelerated inference"
        )
    elif has_gpu:
        recommendations["framework"] = "llama.cpp (ROCm or Vulkan)"
        recommendations["reasoning"].append(
            "Non-NVIDIA GPU — use llama.cpp with ROCm or Vulkan backend"
        )
    else:
        recommendations["framework"] = "llama.cpp (CPU mode)"
        recommendations["reasoning"].append(
            "No GPU detected — CPU inference only, expect slow performance"
        )

    return recommendations


def detect_all() -> SystemInfo:
    """Run all detection and return a SystemInfo object."""
    system = SystemInfo()
    system.os_name = f"{platform.system()} {platform.release()}"
    system.cpu = detect_cpu()
    system.ram_gb = detect_ram_gb()
    system.gpus = detect_nvidia_gpu()

    apple = detect_apple_silicon()
    system.has_neural_engine = apple["neural_engine"]
    system.neural_engine_cores = apple["neural_engine_cores"]
    system.unified_memory = apple["unified_memory"]
    system.apple_chip = apple["chip"]

    return system


def print_report(system: SystemInfo):
    """Print a formatted hardware report with recommendations."""
    print("=" * 60)
    print("  AI HARDWARE DETECTION REPORT")
    print("=" * 60)

    print(f"\n--- Operating System ---")
    print(f"  OS:           {system.os_name}")

    print(f"\n--- CPU ---")
    print(f"  Model:        {system.cpu.name}")
    print(f"  Architecture: {system.cpu.architecture}")
    print(f"  Cores:        {system.cpu.cores_physical} physical, "
          f"{system.cpu.cores_logical} logical")

    print(f"\n--- Memory ---")
    print(f"  Total RAM:    {system.ram_gb} GB")
    if system.unified_memory:
        print(f"  Type:         Unified Memory (shared CPU/GPU)")
    else:
        print(f"  Type:         System RAM (separate from GPU VRAM)")

    if system.gpus:
        print(f"\n--- GPU(s) ---")
        for i, gpu in enumerate(system.gpus):
            print(f"  GPU {i}:        {gpu.name}")
            print(f"    VRAM:       {gpu.vram_gb} GB")
            print(f"    CUDA:       {gpu.cuda_version}")
            print(f"    Driver:     {gpu.driver_version}")
            print(f"    Compute:    {gpu.compute_capability}")
    else:
        print(f"\n--- GPU ---")
        print(f"  No NVIDIA GPU detected")
        if system.apple_chip:
            print(f"  Apple chip:   {system.apple_chip} (integrated GPU)")

    if system.has_neural_engine:
        print(f"\n--- Neural Engine ---")
        print(f"  Present:      Yes")
        print(f"  Cores:        {system.neural_engine_cores}")

    # Recommendations
    recs = recommend_model_size(system)
    print(f"\n--- Recommendations ---")
    print(f"  Max model:    {recs['max_model_params']} parameters")
    print(f"  Quantization: {recs['quantization']}")
    print(f"  Framework:    {recs['framework']}")
    for reason in recs["reasoning"]:
        print(f"  * {reason}")

    print("\n" + "=" * 60)


if __name__ == "__main__":
    system = detect_all()
    print_report(system)

Example output on a MacBook Pro M4 Max with 64GB:

============================================================
  AI HARDWARE DETECTION REPORT
============================================================

--- Operating System ---
  OS:           Darwin 25.3.0

--- CPU ---
  Model:        Apple M4 Max
  Architecture: arm64
  Cores:        14 physical, 14 logical

--- Memory ---
  Total RAM:    64.0 GB
  Type:         Unified Memory (shared CPU/GPU)

--- GPU ---
  No NVIDIA GPU detected
  Apple chip:   Apple M4 Max (integrated GPU)

--- Neural Engine ---
  Present:      Yes
  Cores:        16

--- Recommendations ---
  Max model:    34B parameters
  Quantization: AWQ 4-bit or GGUF Q4_K_M
  Framework:    llama.cpp (Metal) or MLX
  * 48GB available — 34B fits comfortably quantized
  * Apple Silicon detected — use MLX or llama.cpp with Metal acceleration
============================================================

14. Inference Benchmark Script

Numbers in spec sheets are theoretical. This script measures actual performance on your hardware: tokens per second, latency, and memory usage.

"""
benchmark_inference.py — Measure real inference performance on your hardware.

Requires: llama-cpp-python (pip install llama-cpp-python)
          psutil (pip install psutil)

Usage:
    python benchmark_inference.py --model path/to/model.gguf
    python benchmark_inference.py --model path/to/model.gguf --prompt "Explain gravity"
    python benchmark_inference.py --model path/to/model.gguf --runs 5
"""

import argparse
import time
import os
import statistics
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    model_name: str
    model_size_gb: float
    prompt_tokens: int
    generated_tokens: int
    time_to_first_token_ms: float
    tokens_per_second: float
    total_time_seconds: float
    peak_memory_gb: float
    hardware: str


def get_memory_usage_gb() -> float:
    """Get current process memory usage in GB."""
    import psutil
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 ** 3)


def get_model_size_gb(model_path: str) -> float:
    """Get model file size in GB."""
    return os.path.getsize(model_path) / (1024 ** 3)


def get_hardware_name() -> str:
    """Get a short hardware description."""
    import platform
    system = platform.system()
    machine = platform.machine()

    if system == "Darwin" and machine == "arm64":
        import subprocess
        try:
            result = subprocess.run(
                ["sysctl", "-n", "machdep.cpu.brand_string"],
                capture_output=True, text=True, timeout=5,
            )
            return result.stdout.strip()
        except Exception:
            return "Apple Silicon"

    import shutil
    if shutil.which("nvidia-smi"):
        import subprocess
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            gpus = result.stdout.strip().split("\n")
            return gpus[0] if gpus else "NVIDIA GPU"
        except Exception:
            return "NVIDIA GPU"

    return f"{system} {machine} (CPU only)"


def run_single_benchmark(
    model_path: str,
    prompt: str,
    max_tokens: int = 128,
    n_ctx: int = 2048,
    n_gpu_layers: int = -1,
) -> BenchmarkResult:
    """Run a single inference benchmark."""
    from llama_cpp import Llama

    hardware = get_hardware_name()
    model_size = get_model_size_gb(model_path)
    model_name = os.path.basename(model_path)

    # Measure memory before loading
    mem_before = get_memory_usage_gb()

    # Load model (this is not part of inference timing)
    print(f"  Loading model: {model_name} ({model_size:.1f} GB)...")
    load_start = time.perf_counter()
    llm = Llama(
        model_path=model_path,
        n_ctx=n_ctx,
        n_gpu_layers=n_gpu_layers,
        verbose=False,
    )
    load_time = time.perf_counter() - load_start
    print(f"  Model loaded in {load_time:.1f}s")

    # Measure memory after loading
    mem_after_load = get_memory_usage_gb()

    # Run inference
    print(f"  Running inference (max {max_tokens} tokens)...")
    start_time = time.perf_counter()

    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9,
        echo=False,
    )

    end_time = time.perf_counter()
    total_time = end_time - start_time

    # Extract results
    generated_text = output["choices"][0]["text"]
    tokens_generated = output["usage"]["completion_tokens"]
    prompt_tokens = output["usage"]["prompt_tokens"]

    # Peak memory
    mem_peak = get_memory_usage_gb()

    # Calculate metrics
    tokens_per_second = tokens_generated / total_time if total_time > 0 else 0

    # Approximate per-token latency. The simple llama-cpp-python API doesn't
    # expose time-to-first-token directly, so we report average milliseconds
    # per generated token as a rough proxy.
    ttft_ms = (total_time / tokens_generated * 1000) if tokens_generated > 0 else 0

    result = BenchmarkResult(
        model_name=model_name,
        model_size_gb=model_size,
        prompt_tokens=prompt_tokens,
        generated_tokens=tokens_generated,
        time_to_first_token_ms=ttft_ms,
        tokens_per_second=tokens_per_second,
        total_time_seconds=total_time,
        peak_memory_gb=mem_peak,
        hardware=hardware,
    )

    # Clean up
    del llm

    return result


def run_benchmark(
    model_path: str,
    prompt: str = "Explain the theory of relativity in simple terms.",
    max_tokens: int = 128,
    runs: int = 3,
    n_gpu_layers: int = -1,
):
    """Run multiple benchmark iterations and report statistics."""
    print("=" * 60)
    print("  INFERENCE BENCHMARK")
    print("=" * 60)

    if not os.path.exists(model_path):
        print(f"\nError: Model file not found: {model_path}")
        return

    results = []
    for i in range(runs):
        print(f"\n--- Run {i + 1}/{runs} ---")
        result = run_single_benchmark(
            model_path=model_path,
            prompt=prompt,
            max_tokens=max_tokens,
            n_gpu_layers=n_gpu_layers,
        )
        results.append(result)
        print(f"  Tokens/sec: {result.tokens_per_second:.1f}")
        print(f"  Total time: {result.total_time_seconds:.2f}s")
        print(f"  Tokens generated: {result.generated_tokens}")

    # Statistics
    tps_values = [r.tokens_per_second for r in results]
    latency_values = [r.total_time_seconds for r in results]
    memory_values = [r.peak_memory_gb for r in results]

    print("\n" + "=" * 60)
    print("  RESULTS SUMMARY")
    print("=" * 60)
    print(f"\n  Hardware:       {results[0].hardware}")
    print(f"  Model:          {results[0].model_name}")
    print(f"  Model size:     {results[0].model_size_gb:.1f} GB")
    print(f"  Runs:           {runs}")
    print(f"\n  Tokens/sec:     {statistics.mean(tps_values):.1f} "
          f"(min={min(tps_values):.1f}, max={max(tps_values):.1f})")
    if runs > 1:
        print(f"  Std dev:        {statistics.stdev(tps_values):.1f} tok/s")
    print(f"  Avg latency:    {statistics.mean(latency_values):.2f}s "
          f"for {max_tokens} tokens")
    print(f"  Peak memory:    {max(memory_values):.1f} GB")

    # Compare to reference numbers
    print(f"\n  --- Reference Comparison ---")
    print_reference_comparison(results[0])
    print("\n" + "=" * 60)


# Reference benchmarks: approximate tokens/sec for common hardware + model combos
REFERENCE_BENCHMARKS = {
    "7B-Q4": {
        "RTX 4090":           90,
        "RTX 4070":           55,
        "RTX 4070 Ti Super":  65,
        "M3 (16GB)":          15,
        "M3 Max (48GB)":      25,
        "M4 Pro (36GB)":      30,
        "M4 Max (64GB)":      35,
        "A100 (80GB)":        120,
        "H100 (80GB)":        180,
        "CPU only (8-core)":  5,
    },
    "13B-Q4": {
        "RTX 4090":           55,
        "RTX 4070":           30,
        "M3 Max (48GB)":      12,
        "M4 Max (64GB)":      20,
        "A100 (80GB)":        70,
        "H100 (80GB)":        110,
        "CPU only (8-core)":  2,
    },
    "34B-Q4": {
        "RTX 4090":           25,
        "M4 Max (64GB)":      10,
        "A100 (80GB)":        40,
        "H100 (80GB)":        65,
    },
    "70B-Q4": {
        "RTX 4090":           8,
        "M2 Ultra (192GB)":   5,
        "A100 (80GB)":        20,
        "H100 (80GB)":        35,
    },
}


def print_reference_comparison(result: BenchmarkResult):
    """Print how the result compares to known reference benchmarks."""
    # Determine model size category
    size_gb = result.model_size_gb
    if size_gb < 6:
        category = "7B-Q4"
    elif size_gb < 10:
        category = "13B-Q4"
    elif size_gb < 25:
        category = "34B-Q4"
    else:
        category = "70B-Q4"

    refs = REFERENCE_BENCHMARKS.get(category, {})
    if not refs:
        print("  No reference data for this model size.")
        return

    print(f"  Category: {category} (based on {size_gb:.1f}GB file size)")
    print(f"  Your result: {result.tokens_per_second:.1f} tok/s")
    print(f"  Reference numbers for {category}:")
    for hw, tps in sorted(refs.items(), key=lambda x: x[1], reverse=True):
        marker = ""
        if result.tokens_per_second > 0:
            ratio = result.tokens_per_second / tps
            if 0.8 <= ratio <= 1.2:
                marker = " <-- similar to your hardware"
        print(f"    {hw:25s}  {tps:>6} tok/s{marker}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark LLM inference")
    parser.add_argument("--model", required=True, help="Path to GGUF model file")
    parser.add_argument("--prompt", default="Explain the theory of relativity.",
                        help="Prompt to use")
    parser.add_argument("--max-tokens", type=int, default=128,
                        help="Max tokens to generate")
    parser.add_argument("--runs", type=int, default=3,
                        help="Number of benchmark runs")
    parser.add_argument("--cpu-only", action="store_true",
                        help="Force CPU-only inference")
    args = parser.parse_args()

    n_gpu = 0 if args.cpu_only else -1
    run_benchmark(
        model_path=args.model,
        prompt=args.prompt,
        max_tokens=args.max_tokens,
        runs=args.runs,
        n_gpu_layers=n_gpu,
    )

Usage:

# Basic benchmark
python benchmark_inference.py --model models/llama-7b-q4.gguf

# Custom prompt, more runs
python benchmark_inference.py --model models/mistral-7b-q4.gguf --runs 5 --prompt "Write a Python function"

# Force CPU-only (to compare GPU vs CPU)
python benchmark_inference.py --model models/llama-7b-q4.gguf --cpu-only

15. Cost-Per-TFLOP Analysis

Raw specs are meaningless without cost context. This section breaks down the actual cost to compute on each GPU.

Consumer GPU Cost Per TFLOP

| GPU | TFLOPS (FP16) | Purchase Price | $/TFLOP (Purchase) | Effective $/TFLOP (3yr amortized) |
|---|---|---|---|---|
| RTX 4070 | 58 | $600 | $10.34 | $14.06 (incl. power) |
| RTX 4070 Ti Super | 74 | $900 | $12.16 | $16.22 |
| RTX 4080 Super | 94 | $1,200 | $12.77 | $16.74 |
| RTX 4090 | 164 | $1,500 | $9.15 | $13.83 (power-hungry) |
| RX 7900 XTX | 61 | $700 | $11.48 | $15.06 |

Data Center GPU Cost Per TFLOP

| GPU | TFLOPS (FP16) | Purchase Price | $/TFLOP (Purchase) | Cloud $/hr | Cloud $/TFLOP-hr |
|---|---|---|---|---|---|
| A100 (80GB) | 312 | $15,000 | $48.08 | $2.00 | $0.0064 |
| H100 (80GB) | 990 | $32,000 | $32.32 | $3.00 | $0.0030 |
| H200 (141GB) | 990 | $38,000 | $38.38 | $4.50 | $0.0045 |
| L40 | 181 | $10,000 | $55.25 | $1.50 | $0.0083 |
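The $/TFLOP columns are simple arithmetic: purchase price divided by FP16 throughput, and cloud hourly rate divided by throughput for the per-hour figure. A quick sanity check, using numbers taken from the tables above:

```python
# (FP16 TFLOPS, purchase price $, cloud $/hr) from the tables above
gpus = {
    "RTX 4090":    (164,  1_500, None),   # no standard cloud rate
    "A100 (80GB)": (312, 15_000, 2.00),
    "H100 (80GB)": (990, 32_000, 3.00),
}

for name, (tflops, price, cloud_hr) in gpus.items():
    line = f"{name:12s} ${price / tflops:6.2f}/TFLOP purchase"
    if cloud_hr is not None:
        line += f", ${cloud_hr / tflops:.4f}/TFLOP-hr cloud"
    print(line)
```

Note that on purchase price alone the RTX 4090 beats every data-center card per TFLOP — the data-center premium buys VRAM capacity, NVLink, and sustained-load reliability, not raw throughput per dollar.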

Cloud vs On-Prem Break-Even Calculator

"""
cost_breakeven.py — Calculate cloud vs on-prem break-even point.

No dependencies required — pure Python.
"""

from dataclasses import dataclass


@dataclass
class GPUSpec:
    name: str
    purchase_price: float      # USD
    tflops_fp16: float
    power_watts: float
    cloud_hourly: float        # $/hr for equivalent cloud instance


# Common GPU specs (April 2026 pricing)
GPU_CATALOG = {
    "rtx_4070": GPUSpec("RTX 4070", 600, 58, 200, 0.50),
    "rtx_4080": GPUSpec("RTX 4080 Super", 1200, 94, 320, 0.80),
    "rtx_4090": GPUSpec("RTX 4090", 1500, 164, 450, 1.20),
    "a100_80": GPUSpec("A100 80GB", 15000, 312, 400, 2.00),
    "h100": GPUSpec("H100 80GB", 32000, 990, 700, 3.00),
    "h200": GPUSpec("H200 141GB", 38000, 990, 4.50, 4.50),
    "l40": GPUSpec("L40 48GB", 10000, 181, 300, 1.50),
}


def calculate_onprem_hourly(
    gpu: GPUSpec,
    electricity_per_kwh: float = 0.15,
    cooling_overhead: float = 1.3,     # PUE (power usage effectiveness)
    annual_maintenance: float = 500,   # IT labor, replacements
    amortization_years: int = 3,
) -> float:
    """Calculate effective hourly cost of on-prem GPU."""
    hours_per_year = 8760

    # Hardware amortization
    hardware_hourly = gpu.purchase_price / (amortization_years * hours_per_year)

    # Power cost (GPU + cooling overhead)
    power_hourly = (gpu.power_watts / 1000) * electricity_per_kwh * cooling_overhead

    # Maintenance
    maintenance_hourly = annual_maintenance / hours_per_year

    return hardware_hourly + power_hourly + maintenance_hourly


def find_breakeven_hours(
    gpu: GPUSpec,
    electricity_per_kwh: float = 0.15,
    amortization_years: int = 3,
) -> float:
    """Find annual hours where on-prem cost equals cloud cost."""
    # On-prem fixed costs (annual)
    annual_hardware = gpu.purchase_price / amortization_years
    annual_maintenance = 500

    # On-prem variable costs (per hour)
    power_per_hour = (gpu.power_watts / 1000) * electricity_per_kwh * 1.3

    # Cloud variable cost (per hour)
    cloud_per_hour = gpu.cloud_hourly

    # Break-even: annual_fixed + hours * power_per_hour = hours * cloud_per_hour
    # hours * (cloud - power) = annual_fixed
    # hours = annual_fixed / (cloud - power)
    cost_diff = cloud_per_hour - power_per_hour
    if cost_diff <= 0:
        return float("inf")  # Cloud is cheaper per hour — on-prem never breaks even

    annual_fixed = annual_hardware + annual_maintenance
    return annual_fixed / cost_diff


def print_analysis(electricity_per_kwh: float = 0.15):
    """Print full cost analysis for all GPUs."""
    print("=" * 80)
    print("  CLOUD vs ON-PREM COST ANALYSIS")
    print(f"  Electricity rate: ${electricity_per_kwh}/kWh | "
          f"Amortization: 3 years | PUE: 1.3")
    print("=" * 80)

    print(f"\n{'GPU':20s} {'Cloud $/hr':>10s} {'On-Prem $/hr':>12s} "
          f"{'Break-Even':>12s} {'Annual Power':>12s}")
    print("-" * 70)

    for key, gpu in GPU_CATALOG.items():
        onprem_hourly = calculate_onprem_hourly(gpu, electricity_per_kwh)
        breakeven = find_breakeven_hours(gpu, electricity_per_kwh)
        annual_power = (gpu.power_watts / 1000) * 8760 * electricity_per_kwh

        breakeven_str = f"{breakeven:.0f} hrs/yr" if breakeven < 50000 else "Never"

        print(f"{gpu.name:20s} ${gpu.cloud_hourly:>8.2f} ${onprem_hourly:>10.3f} "
              f"{breakeven_str:>12s} ${annual_power:>10.0f}")

    print(f"\n  Break-even = annual hours where on-prem becomes cheaper than cloud")
    print(f"  On-prem cost includes hardware amortization, power, cooling (PUE), "
          f"and $500/yr maintenance")

    # Scenario analysis
    print(f"\n{'':=<80}")
    print("  SCENARIO ANALYSIS: RTX 4090")
    print(f"{'':=<80}")
    gpu = GPU_CATALOG["rtx_4090"]
    scenarios = [
        ("Hobby (4 hrs/week)", 4 * 52),
        ("Part-time (20 hrs/week)", 20 * 52),
        ("Full-time (40 hrs/week)", 40 * 52),
        ("Always-on (24/7)", 8760),
    ]

    for label, hours in scenarios:
        cloud_cost = hours * gpu.cloud_hourly
        onprem_cost = (
            gpu.purchase_price / 3  # amortization
            + (gpu.power_watts / 1000) * hours * electricity_per_kwh * 1.3
            + 500  # maintenance
        )
        cheaper = "On-prem" if onprem_cost < cloud_cost else "Cloud"
        savings = abs(cloud_cost - onprem_cost)
        print(f"  {label:30s} Cloud: ${cloud_cost:>8,.0f}/yr  "
              f"On-prem: ${onprem_cost:>8,.0f}/yr  "
              f"Winner: {cheaper} (saves ${savings:,.0f})")


if __name__ == "__main__":
    print_analysis(electricity_per_kwh=0.15)
    print("\n--- With cheap electricity ($0.08/kWh) ---\n")
    print_analysis(electricity_per_kwh=0.08)

Example output:

================================================================================
  CLOUD vs ON-PREM COST ANALYSIS
  Electricity rate: $0.15/kWh | Amortization: 3 years | PUE: 1.3
================================================================================

GPU                  Cloud $/hr On-Prem $/hr   Break-Even  Annual Power
----------------------------------------------------------------------
RTX 4070             $    0.50 $     0.119  1518 hrs/yr $       263
RTX 4090             $    1.20 $     0.202   899 hrs/yr $       592
A100 80GB            $    2.00 $     0.706  2862 hrs/yr $       526
H100 80GB            $    3.00 $     1.411  3900 hrs/yr $       920

  SCENARIO ANALYSIS: RTX 4090
  Hobby (4 hrs/week)             Cloud: $     250/yr  On-prem: $   1,018/yr  Winner: Cloud (saves $768)
  Part-time (20 hrs/week)        Cloud: $   1,248/yr  On-prem: $   1,091/yr  Winner: On-prem (saves $157)
  Full-time (40 hrs/week)        Cloud: $   2,496/yr  On-prem: $   1,183/yr  Winner: On-prem (saves $1,313)
  Always-on (24/7)               Cloud: $  10,512/yr  On-prem: $   1,769/yr  Winner: On-prem (saves $8,743)

16. Mobile & Edge Hardware: Expanded Comparison

Detailed Mobile SoC Comparison (2026)

| Chip | Device | CPU Cores | GPU | NPU TOPS | RAM | Process | Release |
|---|---|---|---|---|---|---|---|
| Apple A18 Pro | iPhone 16 Pro | 6 (2P+4E) | 6-core | 35 TOPS | 8GB | 3nm | Sep 2024 |
| Apple A17 Pro | iPhone 15 Pro | 6 (2P+4E) | 6-core | 16 TOPS | 8GB | 3nm | Sep 2023 |
| Snapdragon 8 Gen 3 | Galaxy S24 Ultra, etc. | 8 (1+5+2) | Adreno 750 | 45 TOPS | 8–16GB | 4nm | Nov 2023 |
| Snapdragon 8 Elite | Galaxy S25 Ultra, etc. | 8 (2+6) | Adreno 830 | 75 TOPS | 12–16GB | 3nm | Oct 2024 |
| Google Tensor G4 | Pixel 9 | 8 (1+3+4) | Mali-G715 | 8 TOPS | 12GB | 4nm | Aug 2024 |
| MediaTek Dimensity 9300 | OnePlus 12, etc. | 8 (1+3+4) | Immortalis-G720 | 37 TOPS | 8–16GB | 4nm | Nov 2023 |
| Samsung Exynos 2400 | Galaxy S24 (select) | 10 (1+2+3+4) | Xclipse 940 | 14.7 TOPS | 8–12GB | 4nm | Jan 2024 |

Edge Compute Devices for AI

| Device | Processor | AI Performance | RAM | Power | Price | Best For |
|---|---|---|---|---|---|---|
| Raspberry Pi 5 | Cortex-A76 (4-core) | ~2 TOPS (CPU) | 4–8GB | 5–12W | $60–$80 | Prototyping, IoT |
| RPi 5 + Coral M.2 TPU | Cortex-A76 + Edge TPU | 4 TOPS (TPU) + 2 (CPU) | 4–8GB | 8–15W | $100–$140 | Edge inference |
| NVIDIA Jetson Orin Nano | Cortex-A78AE + Ampere GPU | 40 TOPS | 4–8GB | 7–15W | $200–$300 | Robotics, CV |
| NVIDIA Jetson AGX Orin | Cortex-A78AE + Ampere GPU | 275 TOPS | 32–64GB | 15–60W | $900–$2,000 | High-end edge |
| Intel NUC (Arc GPU) | i7 + Arc A770M | ~13 TFLOPS FP16 | 16–32GB | 35–100W | $800–$1,200 | Compact workstation |
| Orange Pi 5 Plus | RK3588 (Mali-G610) | ~6 TOPS (NPU) | 4–32GB | 5–20W | $90–$200 | Budget edge AI |
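Raw TOPS hides efficiency differences between these devices. A rough sketch that ranks a few of the options above by TOPS per watt and TOPS per dollar, using midpoints of the table's power and price ranges (approximate spec-sheet figures, not measurements — the Coral entry combines TPU and CPU TOPS):

```python
# (total TOPS, midpoint watts, midpoint price $) from the table above
edge = {
    "Raspberry Pi 5":    (2,    8.5,   70),
    "RPi 5 + Coral TPU": (6,   11.5,  120),
    "Jetson Orin Nano":  (40,  11.0,  250),
    "Jetson AGX Orin":   (275, 37.5, 1450),
}

# Rank by TOPS per dollar, descending
for name, (tops, watts, price) in sorted(
        edge.items(), key=lambda kv: kv[1][0] / kv[1][2], reverse=True):
    print(f"{name:20s} {tops / watts:5.1f} TOPS/W  "
          f"{tops / price * 1000:5.1f} TOPS per $1k")
```

The Jetsons dominate on both axes; the Raspberry Pi options win only on absolute price and ecosystem simplicity.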

What Can Actually Run Where (Practical Model Sizes)

"""
edge_model_fit.py — Check which models fit on which edge devices.

No dependencies — pure Python reference table.
"""

EDGE_DEVICES = {
    "Raspberry Pi 5 (8GB)": {
        "ram_gb": 8, "usable_gb": 5, "compute": "CPU",
        "expected_tok_s": {"3B-Q4": 1.5, "1.5B-Q4": 3},
    },
    "RPi 5 + Coral TPU": {
        "ram_gb": 8, "usable_gb": 5, "compute": "TPU+CPU",
        "expected_tok_s": {"3B-Q4": 2, "1.5B-Q4": 5},
    },
    "Jetson Orin Nano (8GB)": {
        "ram_gb": 8, "usable_gb": 6, "compute": "GPU",
        "expected_tok_s": {"7B-Q4": 8, "3B-Q4": 20, "1.5B-Q4": 35},
    },
    "Jetson AGX Orin (64GB)": {
        "ram_gb": 64, "usable_gb": 55, "compute": "GPU",
        "expected_tok_s": {"34B-Q4": 5, "13B-Q4": 15, "7B-Q4": 40},
    },
    "iPhone 15 Pro (A17)": {
        "ram_gb": 8, "usable_gb": 4, "compute": "Neural Engine",
        "expected_tok_s": {"3B-Q4": 12, "1.5B-Q4": 25},
    },
    "iPhone 16 Pro (A18)": {
        "ram_gb": 8, "usable_gb": 4, "compute": "Neural Engine",
        "expected_tok_s": {"3B-Q4": 18, "1.5B-Q4": 35},
    },
    "Galaxy S25 Ultra (8 Elite)": {
        "ram_gb": 16, "usable_gb": 8, "compute": "NPU",
        "expected_tok_s": {"7B-Q4": 5, "3B-Q4": 15, "1.5B-Q4": 30},
    },
    "Pixel 9 Pro (Tensor G4)": {
        "ram_gb": 12, "usable_gb": 5, "compute": "TPU",
        "expected_tok_s": {"3B-Q4": 8, "1.5B-Q4": 18},
    },
}

MODEL_SIZES_GB = {
    "1.5B-Q4": 1.0,
    "3B-Q4": 2.0,
    "7B-Q4": 4.0,
    "13B-Q4": 8.0,
    "34B-Q4": 20.0,
    "70B-Q4": 40.0,
}


def check_compatibility():
    """Print device/model compatibility matrix."""
    models = list(MODEL_SIZES_GB.keys())

    print(f"\n{'Device':30s}", end="")
    for m in models:
        print(f" {m:>10s}", end="")
    print()
    print("-" * (30 + 11 * len(models)))

    for device_name, specs in EDGE_DEVICES.items():
        print(f"{device_name:30s}", end="")
        for model in models:
            size = MODEL_SIZES_GB[model]
            # strict < : the model file needs headroom for KV cache and runtime
            if size < specs["usable_gb"]:
                tok_s = specs["expected_tok_s"].get(model, "?")
                if isinstance(tok_s, (int, float)):
                    print(f" {tok_s:>7.0f}t/s", end="")
                else:
                    print(f"     {'yes':>5s}", end="")
            else:
                print(f"     {'---':>5s}", end="")
        print()


if __name__ == "__main__":
    print("=" * 96)
    print("  EDGE DEVICE / MODEL COMPATIBILITY MATRIX")
    print("  Values show expected tokens/second. '---' = does not fit in memory.")
    print("=" * 96)
    check_compatibility()

Output:

Device                         1.5B-Q4     3B-Q4     7B-Q4    13B-Q4    34B-Q4    70B-Q4
------------------------------------------------------------------------------------------
Raspberry Pi 5 (8GB)                3t/s      2t/s      yes       ---       ---       ---
RPi 5 + Coral TPU                   5t/s      2t/s      yes       ---       ---       ---
Jetson Orin Nano (8GB)             35t/s     20t/s      8t/s      ---       ---       ---
Jetson AGX Orin (64GB)               yes       yes     40t/s     15t/s      5t/s       yes
iPhone 15 Pro (A17)                25t/s     12t/s      ---       ---       ---       ---
iPhone 16 Pro (A18)                35t/s     18t/s      ---       ---       ---       ---
Galaxy S25 Ultra (8 Elite)         30t/s     15t/s      5t/s      ---       ---       ---
Pixel 9 Pro (Tensor G4)            18t/s      8t/s      ---       ---       ---       ---

17. Power Consumption Analysis

Watts Per Inference by GPU

Power draw varies dramatically between idle, light inference, and full-load training. These numbers represent sustained inference workloads.

| GPU | Idle Power | Inference Power | Training Power | Annual Cost (Inference 24/7) | Annual Cost (8 hrs/day) |
|---|---|---|---|---|---|
| RTX 4070 | 15W | 120W | 200W | $158 | $53 |
| RTX 4070 Ti Super | 20W | 170W | 285W | $223 | $74 |
| RTX 4080 Super | 25W | 200W | 320W | $263 | $88 |
| RTX 4090 | 30W | 280W | 450W | $368 | $123 |
| A100 (80GB) | 50W | 250W | 400W | $329 | $110 |
| H100 (80GB) | 60W | 350W | 700W | $460 | $153 |
| Apple M3 | 5W | 25W | 35W | $33 | $11 |
| Apple M4 Max | 8W | 45W | 70W | $59 | $20 |

Assumes $0.15/kWh electricity rate.
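The annual-cost columns are plain arithmetic: watts / 1000 * hours per day * 365 * rate. A quick check against the RTX 4090 row:

```python
# Reproduce the RTX 4090 annual-cost cells: kW * hours/year * $/kWh
watts, rate = 280, 0.15  # inference draw from the table, $/kWh

always_on = watts / 1000 * 24 * 365 * rate
eight_hours = watts / 1000 * 8 * 365 * rate
print(f"24/7: ${always_on:.0f}/yr, 8 hrs/day: ${eight_hours:.0f}/yr")
# -> 24/7: $368/yr, 8 hrs/day: $123/yr
```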

Power Cost Calculator

"""
power_cost.py — Calculate electricity costs for AI hardware.

No dependencies — pure Python.
"""


def annual_power_cost(
    power_watts: float,
    hours_per_day: float = 24,
    electricity_rate: float = 0.15,
    pue: float = 1.0,
) -> float:
    """Calculate annual electricity cost."""
    daily_kwh = (power_watts * pue * hours_per_day) / 1000
    return daily_kwh * 365 * electricity_rate


def compare_power_costs(electricity_rate: float = 0.15):
    """Compare power costs across hardware for different usage patterns."""

    hardware = [
        ("RTX 4070 (inference)",    120),
        ("RTX 4090 (inference)",    280),
        ("RTX 4090 (training)",     450),
        ("A100 (inference)",        250),
        ("H100 (inference)",        350),
        ("H100 (training)",         700),
        ("Apple M4 Max (inference)", 45),
        ("Apple M4 Max (training)",  70),
        ("Jetson Orin Nano",         10),
        ("Raspberry Pi 5",            8),
    ]

    usage_patterns = [
        ("Hobby (2h/day)", 2),
        ("Dev (8h/day)", 8),
        ("Production (24/7)", 24),
    ]

    print(f"{'Hardware':35s}", end="")
    for label, _ in usage_patterns:
        print(f" {label:>18s}", end="")
    print()
    print("-" * (35 + 19 * len(usage_patterns)))

    for name, watts in hardware:
        print(f"{name:35s}", end="")
        for _, hours in usage_patterns:
            cost = annual_power_cost(watts, hours, electricity_rate)
            print(f" ${cost:>15,.0f}/yr", end="")
        print()


def when_power_matters():
    """Show when power cost becomes a significant factor in TCO."""
    print("\n" + "=" * 70)
    print("  WHEN DOES POWER COST MATTER?")
    print("=" * 70)

    scenarios = [
        {
            "name": "Home developer (RTX 4090)",
            "gpu_cost": 1500,
            "power_watts": 280,
            "hours_day": 4,
            "rate": 0.15,
        },
        {
            "name": "Small startup (4x RTX 4090 server)",
            "gpu_cost": 6000,
            "power_watts": 1120,
            "hours_day": 16,
            "rate": 0.12,
        },
        {
            "name": "Data center (8x H100)",
            "gpu_cost": 256000,
            "power_watts": 5600,
            "hours_day": 24,
            "rate": 0.08,
        },
    ]

    for s in scenarios:
        annual_power = annual_power_cost(
            s["power_watts"], s["hours_day"], s["rate"], pue=1.3
        )
        three_year_power = annual_power * 3
        hardware_cost = s["gpu_cost"]
        power_pct = (three_year_power / (hardware_cost + three_year_power)) * 100

        print(f"\n  {s['name']}")
        print(f"    Hardware cost:          ${hardware_cost:>10,.0f}")
        print(f"    3-year power cost:      ${three_year_power:>10,.0f}")
        print(f"    Power as % of 3yr TCO:  {power_pct:>9.1f}%")

        if power_pct > 30:
            print(f"    --> Power is a MAJOR cost factor. Optimize for efficiency.")
        elif power_pct > 15:
            print(f"    --> Power is significant. Consider it in purchasing decisions.")
        else:
            print(f"    --> Power cost is minor. Focus on GPU performance instead.")


if __name__ == "__main__":
    print("=" * 90)
    print("  ANNUAL ELECTRICITY COST BY HARDWARE AND USAGE")
    print(f"  Rate: $0.15/kWh")
    print("=" * 90)
    compare_power_costs(0.15)

    print("\n\n--- With cheap industrial power ($0.06/kWh) ---\n")
    compare_power_costs(0.06)

    when_power_matters()

When Power Cost Matters: Rules of Thumb

| Situation | Power as % of TCO | Action |
|---|---|---|
| Home developer, 4 hrs/day | 5–10% | Ignore power cost. Buy the fastest GPU you can afford. |
| Always-on inference server, 24/7 | 15–30% | Power matters. Consider RTX 4070 over 4090 for inference (better perf/watt). |
| Data center, 100+ GPUs | 30–50% | Power is a major expense. Optimize PUE, consider liquid cooling, use efficient GPUs (H200 > H100). |
| Edge/mobile | <1% | Irrelevant for cost. Matters for battery life and thermal throttling. |

Key insight: For most individual developers, electricity costs are noise — a few hundred dollars per year. For data centers running hundreds of GPUs 24/7, power can equal or exceed hardware amortization over 3 years.


18. Hardware Decision Tree

Instead of reading tables, answer a few questions and get a recommendation.

"""
hardware_selector.py — Interactive hardware recommendation engine.

No dependencies — pure Python.

Usage:
    python hardware_selector.py
    # Or call programmatically:
    from hardware_selector import recommend_hardware
    result = recommend_hardware(budget=2000, use_case="inference", location="home")
"""

from dataclasses import dataclass


@dataclass
class Recommendation:
    primary: str
    alternative: str
    estimated_cost: str
    reasoning: list
    warnings: list


def recommend_hardware(
    budget: int,
    use_case: str,
    location: str,
    model_size: str = "7B",
    priority: str = "balanced",
) -> Recommendation:
    """
    Recommend hardware based on constraints.

    Args:
        budget: Maximum spend in USD (0 = cloud only)
        use_case: "training", "inference", "both", "development", "edge"
        location: "home", "office", "datacenter", "mobile"
        model_size: "3B", "7B", "13B", "34B", "70B", "180B"
        priority: "speed", "cost", "efficiency", "portability", "balanced"

    Returns:
        Recommendation with primary choice, alternative, reasoning, and warnings.
    """
    rec = Recommendation(
        primary="", alternative="", estimated_cost="",
        reasoning=[], warnings=[],
    )

    # Parse model size to determine VRAM needs
    size_to_vram = {
        "3B": 2, "7B": 4, "13B": 8, "34B": 20, "70B": 40, "180B": 100,
    }
    needed_vram = size_to_vram.get(model_size, 4)

    # --- Edge / Mobile ---
    if use_case == "edge" or location == "mobile":
        if model_size in ("3B", "7B"):
            rec.primary = "NVIDIA Jetson Orin Nano (8GB)"
            rec.alternative = "Raspberry Pi 5 + Coral TPU"
            rec.estimated_cost = "$200–$300"
            rec.reasoning.append(
                f"{model_size} models fit on Jetson with good performance"
            )
        elif model_size == "13B":
            rec.primary = "NVIDIA Jetson AGX Orin (64GB)"
            rec.alternative = "Cloud API with local cache"
            rec.estimated_cost = "$900–$2,000"
            rec.reasoning.append("13B requires significant edge compute")
        else:
            rec.primary = "Cloud API (too large for edge)"
            rec.alternative = "Quantize to smaller model"
            rec.estimated_cost = "Variable"
            rec.warnings.append(
                f"{model_size} is too large for edge devices. "
                f"Consider distillation to 7B or smaller."
            )
        return rec

    # --- Portability Priority ---
    if priority == "portability" or location == "mobile":
        if budget >= 3500 and needed_vram <= 40:
            rec.primary = "MacBook Pro M4 Max (64GB)"
            rec.alternative = "MacBook Pro M4 Pro (36GB)"
            rec.estimated_cost = "$3,500–$4,000"
            rec.reasoning.append("Unified memory handles models up to 34B")
            rec.reasoning.append("Silent, portable, 15hr battery")
        elif budget >= 2500:
            rec.primary = "MacBook Pro M4 Pro (36GB)"
            rec.alternative = "MacBook Pro M4 (24GB)"
            rec.estimated_cost = "$2,500–$3,000"
            rec.reasoning.append("Good balance of portability and capability")
        else:
            rec.primary = "MacBook Air M3 (16GB)"
            rec.alternative = "Framework Laptop + eGPU"
            rec.estimated_cost = "$1,500–$1,800"
            rec.reasoning.append("Handles 7B models, extremely portable")
            if model_size not in ("3B", "7B"):
                rec.warnings.append(
                    f"16GB limits you to 7B models. "
                    f"Budget more for {model_size}."
                )
        return rec

    # --- Training Focus ---
    if use_case == "training":
        if location == "datacenter" or budget >= 30000:
            rec.primary = "Cloud H100 instances (on-demand)"
            rec.alternative = "On-prem H100 if >8000 hrs/year"
            rec.estimated_cost = "$3–$4/hr cloud, $32K purchase"
            rec.reasoning.append("H100 is the training standard")
            rec.reasoning.append(
                "Cloud is cheaper unless you run >8000 hrs/year"
            )
        elif budget >= 1500:
            rec.primary = "RTX 4090 (24GB)"
            rec.alternative = "RTX 4080 Super (16GB)"
            rec.estimated_cost = "$1,500 GPU + $1,000 system"
            rec.reasoning.append("Best consumer GPU for training")
            rec.reasoning.append("Handles 7B–13B training, 34B with LoRA")
            if model_size in ("70B", "180B"):
                rec.warnings.append(
                    f"Cannot train {model_size} locally. Use cloud or LoRA/QLoRA."
                )
        elif budget >= 600:
            rec.primary = "RTX 4070 (12GB)"
            rec.alternative = "RTX 4070 Ti Super (16GB) for $300 more"
            rec.estimated_cost = "$600 GPU + $500 system"
            rec.reasoning.append("Budget training card, handles 7B with QLoRA")
            if model_size not in ("3B", "7B"):
                rec.warnings.append(
                    f"12GB VRAM limits training to 7B. "
                    f"Use cloud for {model_size}."
                )
        else:
            rec.primary = "Cloud GPU (AWS/GCP spot instances)"
            rec.alternative = "Google Colab Pro ($10/month)"
            rec.estimated_cost = "$0.30–$1.00/hr"
            rec.reasoning.append("Budget too low for dedicated training hardware")

        return rec

    # --- Inference Focus ---
    if use_case == "inference":
        if location == "datacenter":
            if model_size in ("70B", "180B"):
                rec.primary = "A100 or H100 cluster (cloud)"
                rec.alternative = "On-prem L40 cluster for cost savings"
                rec.estimated_cost = "$2–$4/hr per GPU"
            else:
                rec.primary = "L4 or L40 (inference-optimized)"
                rec.alternative = "A100 for flexibility"
                rec.estimated_cost = "$1–$2/hr"
            rec.reasoning.append("Inference-optimized GPUs save 30–40% vs training GPUs")
        elif budget >= 1500:
            rec.primary = "RTX 4090 (24GB)"
            rec.alternative = "RTX 4070 (better perf/watt for inference)"
            rec.estimated_cost = "$1,500"
            rec.reasoning.append("RTX 4070 is often better for inference-only")
            rec.warnings.append(
                "RTX 4090 is overkill for inference-only workloads. "
                "RTX 4070 offers 85% of inference speed at 40% of the price."
            )
        elif budget >= 600:
            rec.primary = "RTX 4070 (12GB)"
            rec.alternative = "MacBook Pro M4 (24GB) if portability matters"
            rec.estimated_cost = "$600"
            rec.reasoning.append("Sweet spot for local inference up to 13B")
        else:
            rec.primary = "MacBook Air M3 (16GB) or Cloud API"
            rec.alternative = "Used RTX 3060 12GB (~$250)"
            rec.estimated_cost = "$250–$1,500"
            rec.reasoning.append("Limited budget: M3 for portability, used GPU for speed")

        return rec

    # --- Development (Both training and inference) ---
    if budget >= 3000:
        rec.primary = "RTX 4090 desktop + MacBook Air M3 for mobility"
        rec.alternative = "MacBook Pro M4 Max (64GB) for all-in-one"
        rec.estimated_cost = "$3,000–$4,000"
        rec.reasoning.append("Desktop for heavy compute, laptop for coding anywhere")
    elif budget >= 1500:
        rec.primary = "MacBook Pro M4 Pro (36GB)"
        rec.alternative = "RTX 4070 desktop ($1,100)"
        rec.estimated_cost = "$1,500–$2,500"
        rec.reasoning.append("Good balance for development workflow")
    elif budget >= 600:
        rec.primary = "RTX 4070 + budget PC"
        rec.alternative = "MacBook Air M3 (16GB)"
        rec.estimated_cost = "$600–$1,100"
        rec.reasoning.append("Best value for serious development")
    else:
        rec.primary = "Google Colab Pro + any laptop"
        rec.alternative = "Used RTX 3060 12GB"
        rec.estimated_cost = "$10/month + existing hardware"
        rec.reasoning.append("Cloud-first approach on a tight budget")

    return rec


def interactive_selector():
    """Run the interactive hardware selector."""
    print("=" * 60)
    print("  AI HARDWARE SELECTOR")
    print("=" * 60)

    print("\nAnswer these questions to get a recommendation.\n")

    # Budget
    print("1. What's your budget?")
    print("   a) Under $500")
    print("   b) $500–$1,500")
    print("   c) $1,500–$3,500")
    print("   d) $3,500+")
    print("   e) Cloud only (no hardware purchase)")
    budget_map = {"a": 300, "b": 1000, "c": 2500, "d": 5000, "e": 0}
    budget_choice = input("   Choice [a-e]: ").strip().lower()
    budget = budget_map.get(budget_choice, 1000)

    # Use case
    print("\n2. Primary use case?")
    print("   a) Training models")
    print("   b) Running inference (serving models)")
    print("   c) Both training and inference")
    print("   d) Development and experimentation")
    print("   e) Edge/IoT deployment")
    use_map = {
        "a": "training", "b": "inference", "c": "both",
        "d": "development", "e": "edge",
    }
    use_choice = input("   Choice [a-e]: ").strip().lower()
    use_case = use_map.get(use_choice, "development")

    # Location
    print("\n3. Where will it run?")
    print("   a) Home office")
    print("   b) Office/lab")
    print("   c) Data center")
    print("   d) Mobile/portable")
    loc_map = {
        "a": "home", "b": "office", "c": "datacenter", "d": "mobile",
    }
    loc_choice = input("   Choice [a-d]: ").strip().lower()
    location = loc_map.get(loc_choice, "home")

    # Model size
    print("\n4. Largest model you need to run?")
    print("   a) 3B (small, fast)")
    print("   b) 7B (standard)")
    print("   c) 13B (capable)")
    print("   d) 34B (very capable)")
    print("   e) 70B (frontier-class)")
    print("   f) 180B+ (largest)")
    size_map = {
        "a": "3B", "b": "7B", "c": "13B",
        "d": "34B", "e": "70B", "f": "180B",
    }
    size_choice = input("   Choice [a-f]: ").strip().lower()
    model_size = size_map.get(size_choice, "7B")

    # Priority
    print("\n5. Top priority?")
    print("   a) Speed (fastest possible)")
    print("   b) Cost (cheapest that works)")
    print("   c) Efficiency (best perf/watt)")
    print("   d) Portability (laptop/mobile)")
    print("   e) Balanced")
    pri_map = {
        "a": "speed", "b": "cost", "c": "efficiency",
        "d": "portability", "e": "balanced",
    }
    pri_choice = input("   Choice [a-e]: ").strip().lower()
    priority = pri_map.get(pri_choice, "balanced")

    # Get recommendation
    rec = recommend_hardware(budget, use_case, location, model_size, priority)

    print("\n" + "=" * 60)
    print("  RECOMMENDATION")
    print("=" * 60)
    print(f"\n  Primary:     {rec.primary}")
    print(f"  Alternative: {rec.alternative}")
    print(f"  Est. Cost:   {rec.estimated_cost}")
    print(f"\n  Reasoning:")
    for r in rec.reasoning:
        print(f"    - {r}")
    if rec.warnings:
        print(f"\n  Warnings:")
        for w in rec.warnings:
            print(f"    ! {w}")
    print("\n" + "=" * 60)


if __name__ == "__main__":
    interactive_selector()

Programmatic usage (no interaction needed):

from hardware_selector import recommend_hardware

# Startup with $2K budget doing inference
rec = recommend_hardware(budget=2000, use_case="inference", location="home", model_size="13B")
print(f"Buy: {rec.primary}")
print(f"Or:  {rec.alternative}")
for w in rec.warnings:
    print(f"Warning: {w}")

# Data center training
rec = recommend_hardware(budget=50000, use_case="training", location="datacenter", model_size="70B")
print(f"Buy: {rec.primary}")

19. Common Hardware Mistakes

These are real mistakes people make when buying AI hardware. Each one wastes money or performance.

Mistake 1: “Bought RTX 4090 for inference-only workload”

The problem: The RTX 4090 is a training beast with ~82.6 TFLOPS FP32, but inference doesn’t need that much compute. Inference is memory-bandwidth-bound, not compute-bound.

The numbers:

  • RTX 4090: $1,500, 280W inference, ~90 tok/s on 7B
  • RTX 4070: $600, 120W inference, ~55 tok/s on 7B
  • Cost per token: the 4090 costs 2.5x as much for 1.6x the speed

What to do instead: For inference-only, buy the RTX 4070 (or two RTX 4070s for $1,200 with 2x throughput). The 4090 only makes sense if you also train models.

def is_4090_worth_it(training_hours_per_month: int, inference_hours_per_month: int) -> str:
    """Determine if RTX 4090 is worth it over RTX 4070."""
    # 4090 advantage: ~1.5x training speed, ~1.6x inference speed
    # 4090 cost: 2.5x price, ~2.3x power draw

    training_time_saved = training_hours_per_month * 0.33  # 1 - 1/1.5 = ~33% less time
    value_of_time = 50  # $/hr for your time
    monthly_time_savings = training_time_saved * value_of_time

    price_diff = 1500 - 600  # $900 more up front
    # Extra inference power: 160W more at $0.15/kWh
    monthly_power_diff = ((280 - 120) / 1000) * inference_hours_per_month * 0.15

    # Guard against division by zero when there are no net savings
    net_monthly_savings = monthly_time_savings - monthly_power_diff
    if net_monthly_savings <= 0:
        return ("RTX 4070 is better. You don't train enough to justify the 4090. "
                f"Training savings: ${monthly_time_savings:.0f}/mo, "
                f"Extra power: ${monthly_power_diff:.0f}/mo")

    months_to_payback = price_diff / net_monthly_savings
    if months_to_payback > 24:
        return (f"RTX 4070 is better. Payback is {months_to_payback:.0f} months "
                f"— longer than the GPU's useful life.")
    return (f"RTX 4090 pays for itself in {months_to_payback:.0f} months. "
            f"Worth it if you train regularly.")


# Examples
print(is_4090_worth_it(training_hours_per_month=0, inference_hours_per_month=100))
# -> RTX 4070 is better. You don't train enough.

print(is_4090_worth_it(training_hours_per_month=40, inference_hours_per_month=100))
# -> RTX 4090 pays for itself in ~1 month.

Mistake 2: “Running FP32 on a GPU with tensor cores”

The problem: Modern NVIDIA GPUs (RTX 30xx, 40xx, A100, H100) have tensor cores that accelerate FP16 and BF16 operations by 2–4x. Running FP32 wastes half or more of your GPU’s capability.

The numbers:

  • RTX 4090 FP32: ~82.6 TFLOPS
  • RTX 4090 FP16 (tensor cores): ~165 TFLOPS — 2x faster, same GPU
  • H100 FP32: ~67 TFLOPS
  • H100 FP16 (tensor cores): ~989 TFLOPS — ~15x faster!

What to do instead: Always use mixed precision or FP16/BF16 for training and inference. PyTorch makes this easy:

"""
Correct: Using mixed precision to exploit tensor cores.
This example shows the difference between FP32 and FP16 training.
"""
import torch

# WRONG: FP32 training (wastes tensor cores)
def train_fp32(model, data, optimizer):
    """This ignores tensor cores entirely."""
    for batch in data:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()


# RIGHT: Mixed precision training (uses tensor cores)
def train_mixed_precision(model, data, optimizer):
    """2-4x faster on GPUs with tensor cores."""
    scaler = torch.amp.GradScaler("cuda")
    for batch in data:
        optimizer.zero_grad()
        with torch.amp.autocast("cuda"):  # Automatically uses FP16 where safe
            loss = model(batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()


# RIGHT: FP16 inference (maximum speed)
def inference_fp16(model, input_data):
    """Inference in FP16 — no accuracy loss for most models."""
    model = model.half()  # Convert model to FP16
    with torch.no_grad():
        with torch.amp.autocast("cuda"):
            output = model(input_data)
    return output


# Check if your GPU has tensor cores
def check_tensor_cores():
    """Check if current GPU supports tensor core acceleration."""
    if not torch.cuda.is_available():
        print("No CUDA GPU available.")
        return False

    capability = torch.cuda.get_device_capability()
    gpu_name = torch.cuda.get_device_name()

    # Tensor cores: compute capability >= 7.0 (Volta and newer)
    has_tensor = capability[0] >= 7
    print(f"GPU: {gpu_name}")
    print(f"Compute capability: {capability[0]}.{capability[1]}")
    print(f"Tensor cores: {'Yes' if has_tensor else 'No'}")

    if has_tensor:
        print("-> USE mixed precision (torch.amp.autocast) for 2-4x speedup!")
    else:
        print("-> FP32 is your only option. Consider upgrading GPU.")

    return has_tensor

Mistake 3: “Forgot to account for electricity in TCO”

The problem: People compare GPU purchase prices without factoring in electricity. For data center deployments, power can be 30–50% of total cost over 3 years.

The numbers (3-year TCO for 24/7 operation):

| GPU | Purchase | 3yr Electricity | 3yr Cooling (PUE 1.3) | Total 3yr TCO | Electricity % |
|---|---|---|---|---|---|
| RTX 4070 | $600 | $789 | $237 | $1,626 | 49% |
| RTX 4090 | $1,500 | $1,774 | $532 | $3,806 | 47% |
| A100 | $15,000 | $1,577 | $473 | $17,050 | 9% |
| H100 | $32,000 | $2,759 | $828 | $35,587 | 8% |

Key insight: For expensive data center GPUs, electricity is a small percentage because the hardware itself costs so much. For consumer GPUs running 24/7, electricity can approach or exceed the purchase price over 3 years.
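The table above can be reproduced with a small TCO helper. This is a sketch under what appear to be the table's assumptions (24/7 operation, $0.15/kWh, PUE 1.3, and average draws of roughly 200W for the RTX 4070 and 450W for the RTX 4090); the function name and those draw figures are ours, and small rounding differences vs the table are expected.

```python
def tco_3yr(purchase_usd: float, avg_watts: float,
            rate_per_kwh: float = 0.15, pue: float = 1.3) -> dict:
    """3-year total cost of ownership for a GPU running 24/7."""
    hours = 24 * 365 * 3
    electricity = (avg_watts / 1000) * hours * rate_per_kwh
    cooling = electricity * (pue - 1.0)  # extra facility power (PUE overhead)
    total = purchase_usd + electricity + cooling
    return {
        "electricity": round(electricity),
        "cooling": round(cooling),
        "total": round(total),
        "electricity_pct": round(100 * electricity / total),
    }


# RTX 4070 at ~200W average draw, RTX 4090 at ~450W
print(tco_3yr(600, 200))
print(tco_3yr(1500, 450))
```

Swap in your own wattage, electricity rate, and PUE to compare candidates before buying.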

Mistake 4: “Bought maximum RAM without checking bandwidth”

The problem: More RAM lets you load bigger models, but if memory bandwidth is low, the GPU starves waiting for data. This matters more for inference than training.

Example: A100 40GB (1,555 GB/s bandwidth) vs A100 80GB (2,039 GB/s bandwidth). The 80GB version is not just more memory — it has 31% more bandwidth. For inference on large models, the 80GB version can be 20–30% faster even when the model fits in 40GB.
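A back-of-the-envelope way to see this: token-by-token decoding is memory-bound, so an upper bound on tokens/second is bandwidth divided by the bytes read per generated token (roughly the whole model). This sketch (function name ours) shows why the 80GB A100's extra bandwidth translates directly into speed:

```python
def decode_tok_s_upper_bound(bandwidth_gb_s: float, model_gb: float) -> float:
    """Memory-bound ceiling: each generated token streams all weights once."""
    return bandwidth_gb_s / model_gb


# A 30GB model (e.g. a quantized 70B) on both A100 variants
a100_40 = decode_tok_s_upper_bound(1555, 30)  # ~51.8 tok/s ceiling
a100_80 = decode_tok_s_upper_bound(2039, 30)  # ~68.0 tok/s ceiling
print(f"Bandwidth advantage: {a100_80 / a100_40 - 1:.0%}")  # ~31%
```

Real systems land below these ceilings, but the ratio between two GPUs' ceilings is a good predictor of their relative inference speed.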

Mistake 5: “Using consumer GPU in a data center rack”

The problem: RTX 4090 is not designed for 24/7 data center operation. It has:

  • An open-air cooler that dumps heat into the chassis, designed for a roomy PC case with airflow, not a dense rack
  • No ECC memory (silent data corruption risk over months)
  • Consumer warranty that may be voided by data center use
  • Power connectors not designed for hot-swap

What to use instead: L40 or A6000 for data center inference. They cost more but have proper cooling, ECC memory, and data center support.

Mistake 6: “Ignoring quantization, buying more VRAM instead”

The problem: A 70B model at FP16 needs ~140GB VRAM. People buy H200 (141GB, $38,000) when they could quantize to 4-bit and fit it in an A100 80GB ($15,000) or even an RTX 4090 pair (2x24GB = 48GB for $3,000).

The math:

def model_memory_by_quantization(params_billions: float) -> dict:
    """Show memory requirements at different quantization levels."""
    results = {}
    # Bytes per parameter at each precision
    precisions = {
        "FP32 (full)":    4.0,
        "FP16 (half)":    2.0,
        "INT8":           1.0,
        "INT4 (AWQ/GGUF)": 0.5,
        "INT3 (aggressive)": 0.375,
    }

    for name, bytes_per_param in precisions.items():
        size_gb = (params_billions * 1e9 * bytes_per_param) / (1024 ** 3)
        # Add ~20% overhead for KV cache and runtime
        total_gb = size_gb * 1.2
        results[name] = {"model_gb": round(size_gb, 1), "total_gb": round(total_gb, 1)}

    return results


def print_quantization_comparison(params_billions: float):
    """Show how quantization changes hardware requirements."""
    results = model_memory_by_quantization(params_billions)

    print(f"\n  Memory requirements for {params_billions}B parameter model:")
    print(f"  {'Precision':25s} {'Model':>8s} {'+ Overhead':>10s} {'Fits In':>30s}")
    print("  " + "-" * 75)

    gpu_options = [
        ("RTX 4070 (12GB)", 12),
        ("RTX 4090 (24GB)", 24),
        ("M4 Max (64GB)", 48),
        ("A100 (80GB)", 80),
        ("H200 (141GB)", 141),
    ]

    for name, info in results.items():
        fits = [g[0] for g in gpu_options if g[1] >= info["total_gb"]]
        fits_str = ", ".join(fits[:2]) if fits else "Multi-GPU required"
        print(f"  {name:25s} {info['model_gb']:>6.1f}GB {info['total_gb']:>8.1f}GB "
              f"  {fits_str}")


# Show for common model sizes
for size in [7, 13, 34, 70]:
    print_quantization_comparison(size)

Output:

  Memory requirements for 7B parameter model:
  Precision                   Model  + Overhead                        Fits In
  ---------------------------------------------------------------------------
  FP32 (full)                 26.1GB    31.3GB   M4 Max (64GB), A100 (80GB)
  FP16 (half)                 13.0GB    15.6GB   RTX 4090 (24GB), M4 Max (64GB)
  INT8                         6.5GB     7.8GB   RTX 4070 (12GB), RTX 4090 (24GB)
  INT4 (AWQ/GGUF)              3.3GB     3.9GB   RTX 4070 (12GB), RTX 4090 (24GB)
  INT3 (aggressive)            2.4GB     2.9GB   RTX 4070 (12GB), RTX 4090 (24GB)

  Memory requirements for 70B parameter model:
  Precision                   Model  + Overhead                        Fits In
  ---------------------------------------------------------------------------
  FP32 (full)                260.8GB   312.9GB   Multi-GPU required
  FP16 (half)                130.4GB   156.4GB   Multi-GPU required
  INT8                        65.2GB    78.2GB   A100 (80GB), H200 (141GB)
  INT4 (AWQ/GGUF)             32.6GB    39.1GB   M4 Max (64GB), A100 (80GB)
  INT3 (aggressive)           24.4GB    29.3GB   M4 Max (64GB), A100 (80GB)

Bottom line: Always quantize before buying more VRAM. AWQ 4-bit quantization has negligible quality loss for inference and cuts memory requirements by 4x.


Validation Checklist

How do you know you got this right?

Performance Checks

  • Benchmarked your hardware using the detection script (Section 13) and recorded actual TFLOPS, memory bandwidth, and VRAM
  • Know your VRAM limit and maximum model size at each precision level (FP16, int8, int4)
  • Measured real inference latency (tokens/second) on your target model, not just theoretical TFLOPS

Implementation Checks

  • Hardware selected using the decision matrix (Section 9) based on your actual workload (training vs inference, batch vs real-time)
  • Power consumption and annual electricity cost calculated for your setup (use the formula: watts/1000 * hours/day * 365 * $/kWh)
  • Break-even analysis completed: on-premise vs cloud, with your actual GPU-hours/year usage
  • Thermal solution verified: passive cooling sufficient (M-series), or active cooling adequate under sustained load (RTX series)
  • Quantization tested before buying more VRAM: confirmed AWQ int4 quality is acceptable for your use case
  • Memory headroom verified: model + KV cache + OS overhead fits within 60-70% of total device RAM
  • Cloud provider pricing compared across at least 2 providers (AWS, Lambda, Runpod) for your workload profile

Integration Checks

  • Hardware supports your framework stack (CUDA for PyTorch/TensorFlow, Metal/MLX for Apple Silicon)
  • Model serving architecture planned: single-user development vs multi-user API (determines GPU count and type)
  • Upgrade path identified: know what hardware to move to when you outgrow current setup

Common Failure Modes

  • Buying RTX 4090 for inference-only: Overspend of $900+ vs RTX 4070, which delivers ~60% of the speed at 40% of the cost. Fix: match GPU to workload type.
  • Using cloud for steady-state 24/7 workload: Break-even with owned hardware is typically month 3. Fix: run break-even analysis before committing to cloud.
  • Ignoring power draw in TCO: 450W GPU running 24/7 costs $591/year in electricity alone. Fix: include power in all hardware cost comparisons.
  • Assuming M-series can’t train: It can fine-tune via LoRA; just slower than discrete GPUs. Fix: use MLX for local fine-tuning on M-series before dismissing it.

Sign-Off Criteria

  • Total cost of ownership calculated for 3-year and 5-year horizons (hardware + power + cooling + maintenance)
  • Hardware decision documented with rationale (why X over Y, with cost and performance justification)
  • Verified model fits in memory on target hardware by running actual inference, not just calculating theoretical fit
  • Scaling plan defined: what happens when you need 2x, 5x, or 10x current capacity
  • Power and cooling infrastructure confirmed adequate for chosen hardware (especially for multi-GPU setups)

20. AI Infrastructure: Networking Between Chips

Individual chips are fast. The bottleneck in large-scale AI is connecting them. When you train a 405B-parameter model across 16,384 GPUs, the network between those GPUs determines whether your cluster runs at 80% efficiency or 30%. Broadcom, NVIDIA, and increasingly the Ultra Ethernet Consortium are fighting over this layer.

Why Networking Matters for AI

Large model training and high-throughput inference require constant communication between accelerators. Every forward pass of a distributed model sends gradients, activations, and KV cache data across the network. If the network is slower than the compute, GPUs sit idle waiting for data. This is called the communication bottleneck, and it is the single biggest efficiency problem in large AI clusters.

The math: An H100 GPU has ~3.9 TB/s of internal HBM memory bandwidth (NVL variant; the SXM part is ~3.35 TB/s). If it is connected to other GPUs via a 400 Gbps Ethernet link (50 GB/s), the network is ~78x slower than the GPU’s internal bus. The GPU spends most of its time waiting.
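The 78x figure is straight unit conversion, a sanity check you can adapt for any link speed (numbers from the paragraph above):

```python
hbm_gb_s = 3900              # ~3.9 TB/s HBM bandwidth, in GB/s
link_gbps = 400              # inter-node Ethernet link, gigabits per second
link_gb_s = link_gbps / 8    # 50 GB/s, gigabytes per second

ratio = hbm_gb_s / link_gb_s
print(f"Network is {ratio:.0f}x slower than HBM")  # 78x
```

At 800 Gbps the gap halves to ~39x, which is why per-link bandwidth keeps climbing.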

The Three Interconnect Technologies

| Technology | Bandwidth Per Link | Latency | Range | Vendor Lock-In | Cost |
|---|---|---|---|---|---|
| NVLink/NVSwitch | 900 GB/s (NVLink 5) | <1 us | Within a node (72 GPUs max) | NVIDIA only | Included in GPU price |
| InfiniBand NDR | 400 Gbps (50 GB/s) | 1-2 us | Rack to data center | NVIDIA (Mellanox) | $5,000-$15,000/port |
| Ethernet 800G | 800 Gbps (100 GB/s) | 2-5 us | Data center to global | Multi-vendor (Broadcom, Cisco, Arista) | $2,000-$8,000/port |

How they work together (two-level hierarchy):

  • Level 1 (intra-node): NVLink/NVSwitch connects GPUs within a single server or NVLink domain (up to 72 GPUs). Sub-microsecond latency, TB/s aggregate bandwidth. This is the fast lane.
  • Level 2 (inter-node): Ethernet or InfiniBand connects NVLink domains across racks. Microsecond latency, 400G-800G per NIC. This is the highway between buildings.

Broadcom’s Role: The Networking Fabric Provider

Broadcom does not make GPUs or AI accelerators (those are NVIDIA, Google, Meta, AMD). Broadcom makes the networking silicon that connects them, and the custom ASIC design platform that hyperscalers use to build their own chips.

Two distinct businesses:

  1. Ethernet Switch ASICs — Broadcom’s Tomahawk series dominates data center switching:

    • Tomahawk 6 (2025): 102.4 Tbps total switching capacity, the highest-bandwidth switch chip ever built
    • Used in switches from Arista, Cisco, and others that form the backbone of AI data centers
    • Supports 800 Gbps per port, 128 ports per switch
  2. XPU Custom Silicon Platform — Broadcom designs custom AI accelerators for hyperscalers:

    • Google TPU: Broadcom has co-designed Google’s Tensor Processing Units since 2015, with a supply agreement extending through 2031
    • Meta MTIA: Extended partnership announced April 2026 for multiple generations of Meta Training and Inference Accelerators, starting with the first 2nm-process custom AI silicon, scaling to multi-gigawatt deployment by 2029
    • Additional customers: Anthropic, OpenAI, ByteDance, and others
    • Revenue: $8.4 billion in AI semiconductor revenue in Q1 FY2026 (106% YoY growth)

Ethernet vs InfiniBand: The 2026 Landscape

NVIDIA has historically dominated AI networking with InfiniBand (acquired via Mellanox in 2020). Broadcom is leading the charge to replace InfiniBand with Ethernet, which would break NVIDIA’s networking monopoly.

Why Ethernet is winning:

  • Ultra Ethernet Consortium (UEC) 1.0 specification released June 2025, adding InfiniBand-like features (adaptive routing, congestion control, hardware packet reordering) to Ethernet
  • Cost: Ethernet switches and NICs are 40-60% cheaper than InfiniBand equivalents
  • Multi-vendor: Broadcom, Cisco, Arista, AMD all ship Ethernet silicon; InfiniBand is NVIDIA-only
  • Scale: IP routing enables larger fabric scales than InfiniBand subnets
  • Operational tooling: Enterprise networking teams already know Ethernet

Where InfiniBand still wins:

  • Lowest latency (1-2 us vs 2-5 us for Ethernet)
  • Mature RDMA implementation (RoCEv2 on Ethernet is catching up but still requires tuning)
  • Proven at extreme scale (NVIDIA’s own DGX SuperPOD clusters)

Current recommendation: For new enterprise and cloud AI clusters of 64+ GPUs, RoCEv2 over 800G Ethernet with Broadcom Tomahawk switches is the default choice. InfiniBand remains relevant for latency-critical training workloads at NVIDIA-exclusive sites.

What This Means for Harness Builders

If you are building an AI agent harness that calls cloud APIs, networking infrastructure is invisible to you — the cloud provider handles it. But understanding this layer matters for:

  • Cost estimation: Networking is 15-25% of a large AI cluster’s total cost. When cloud providers price inference endpoints, networking costs are baked in.
  • Latency budgets: Inter-node communication adds 2-10 ms to distributed inference. If your harness chains multiple model calls, this compounds.
  • Provider selection: Hyperscalers building their own chips (Google TPU, Meta MTIA, Amazon Trainium) with Broadcom networking will offer cheaper inference than NVIDIA-GPU-only providers, because they avoid NVIDIA’s GPU and InfiniBand markup.
  • Edge vs cloud decisions: The networking layer is what makes cloud inference expensive at scale. If your model fits on a single device, you bypass all of this.
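For the latency-budget point in particular, the compounding is easy to model. A minimal sketch, assuming purely sequential calls (function name and numbers are ours):

```python
def chain_latency_ms(n_calls: int, model_ms: float, network_ms: float) -> float:
    """Total latency for a sequential chain of model calls.

    network_ms covers the per-call inter-node / provider overhead that
    distributed inference adds on top of model compute time.
    """
    return n_calls * (model_ms + network_ms)


# 5 chained calls, 400ms model time, 8ms distributed-inference overhead each
print(chain_latency_ms(5, 400, 8))  # 2040.0 ms, of which 40ms is pure network
```

A few ms per call is invisible in a single request but becomes user-visible once a harness chains five or ten model calls.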

21. Qualcomm Edge AI and Hexagon NPU

Qualcomm is the dominant player in mobile and IoT AI inference. If Apple owns the premium phone AI experience (Neural Engine + CoreML), Qualcomm owns the rest: Android phones, IoT devices, automotive systems, and XR headsets. Their AI stack runs on billions of devices.

Architecture: Qualcomm AI Engine

Qualcomm’s AI approach is heterogeneous computing — distributing AI workloads across multiple processors on a single chip:

| Component | Role | Best For |
|---|---|---|
| Hexagon NPU | Dedicated neural processing unit with tensor cores | Sustained inference, LLMs, image models |
| Adreno GPU | Graphics processor with compute shaders | Parallel inference, image generation |
| Kryo/Oryon CPU | General-purpose cores | Control flow, pre/post-processing, small models |
| Sensing Hub | Low-power always-on processor | Wake words, ambient sensing, always-on detection |

The Qualcomm AI Engine orchestrates workload placement across these processors. A single inference request might use the NPU for the main model, the CPU for tokenization, and the GPU for image post-processing.

Hexagon NPU: Specifications by Generation

| Chip | NPU TOPS | Process | Key Features | Devices |
|---|---|---|---|---|
| Snapdragon 8 Gen 3 | 45 TOPS | 4nm | Dual Hexagon cores, INT4/INT8/FP16 | Galaxy S24 Ultra, OnePlus 12 |
| Snapdragon 8 Elite | 75 TOPS | 3nm | Enhanced tensor cores, 3x faster than 8 Gen 2 | Galaxy S25 Ultra, OnePlus 13 |
| Snapdragon X Elite | 45 TOPS | 4nm | Laptop-class, 12-core Oryon CPU | Windows laptops (Surface, Lenovo, Dell) |

What 75 TOPS means in practice: TOPS (Tera Operations Per Second) measures raw INT8 throughput. For comparison, Apple A18 Pro delivers 35 TOPS from its Neural Engine. But TOPS alone does not determine real-world performance — memory bandwidth, software optimization, and model compatibility matter as much.

On-Device LLM Performance

Running LLMs directly on a phone, with no cloud connection:

| Model | Parameters | Quantization | Snapdragon 8 Elite | Notes |
|---|---|---|---|---|
| Llama 3.2 3B Instruct | 3B | W4A16 | ~10 tok/s | Usable for chat, voice commands |
| Llama 3.1 8B Instruct | 8B | W4A16 | ~5 tok/s | Slower but more capable, 2048 context |
| Small vision models | 1-3B | INT8 | 15-30 tok/s | Real-time image understanding |

Comparison with Apple:

  • iPhone 16 Pro (A18 Pro): ~18 tok/s on 3B models, ~35 tok/s on 1.5B models
  • Galaxy S25 Ultra (8 Elite): ~15 tok/s on 3B, ~5 tok/s on 8B (can run larger models due to 16GB RAM vs 8GB)

The trade-off: Apple is faster on small models; Qualcomm can run bigger models because Android flagships have more RAM (12-16GB vs 8GB).

Qualcomm AI Hub: Developer Workflow

Qualcomm AI Hub is the equivalent of Apple’s CoreML Tools — it converts, optimizes, and deploys models to Qualcomm hardware. The workflow:

  1. Start with a trained model (PyTorch, ONNX, TensorFlow)
  2. Export and optimize via AI Hub (quantization, graph optimization, NPU code generation)
  3. Compile to QNN context binary (precompiled, device-specific format)
  4. Deploy using Qualcomm Genie runtime (for LLMs) or QNN SDK (for other models)
"""
Qualcomm AI Hub: Export a model for on-device inference.

Requires: pip install qai-hub-models
          Qualcomm AI Hub account (free)

This compiles a Llama model for Snapdragon 8 Elite NPU execution.
"""

# Export Llama 3.1 8B for Snapdragon (single command)
# python -m qai_hub_models.models.llama_v3_1_8b_instruct.export

# Programmatic usage:
import qai_hub_models

# List available pre-optimized models
# Categories: image classification, object detection, LLMs,
#             image generation, speech recognition, and more

# The export process handles:
# 1. Model download from HuggingFace
# 2. Quantization (W4A16 for LLMs, INT8 for vision)
# 3. Graph optimization for Hexagon NPU
# 4. Compilation to QNN context binary
# 5. Performance profiling on target device

# Output: a .bin file ready for on-device deployment
# Compilation typically completes in minutes, not hours

Developer experience: Qualcomm AI Hub abstracts the complexity of NPU compilation behind a single export command. It supports converting PyTorch or ONNX models to any on-device runtime: LiteRT (Google), ONNX Runtime, or Qualcomm’s native QNN stack. The model zoo includes 175+ pre-optimized models.

Qualcomm Insight Platform

The Qualcomm Insight Platform is a separate product focused on edge AI for video intelligence and security. It is a SaaS platform that runs AI models on Qualcomm-powered cameras and edge boxes for real-time video analytics — object detection, person tracking, anomaly detection. It uses an LLM-based conversational engine for querying video data.

This is relevant for IoT/edge deployments but not for building a typical AI agent harness.

When to Use Qualcomm for AI

| Scenario | Use Qualcomm? | Why |
|---|---|---|
| Android app with on-device AI | Yes | Hexagon NPU is the best Android AI accelerator |
| IoT/edge device (cameras, sensors) | Yes | Low power, good NPU, large ecosystem |
| Windows laptop AI | Maybe | Snapdragon X Elite runs models well, but Intel/AMD have competitive NPUs |
| Cloud inference | No | Use NVIDIA GPUs or cloud TPUs |
| Training models | No | NPUs are inference-only |
| Cross-platform agent harness | Indirect | Your harness calls APIs; the NPU accelerates the on-device runtime beneath |

Qualcomm vs Apple Neural Engine: Summary

| Aspect | Qualcomm (Snapdragon 8 Elite) | Apple (A18 Pro) |
|---|---|---|
| NPU TOPS | 75 TOPS | 35 TOPS |
| Max device RAM | 16 GB | 8 GB |
| Largest on-device model | 8B (quantized) | 3B (quantized) |
| Developer tools | AI Hub, QNN SDK | CoreML Tools, MLX |
| Framework | QNN, ONNX Runtime, LiteRT | CoreML, MLX |
| Ecosystem | Android, IoT, automotive, XR | iPhone, iPad, Mac |
| Advantage | More RAM, larger models, open ecosystem | Faster per-TOPS, tighter integration, better optimization |

22. OpenVINO: Intel’s Inference Optimization Toolkit

OpenVINO (Open Visual Inference and Neural network Optimization) is Intel’s open-source toolkit for optimizing and deploying AI inference on Intel hardware. If you have Intel CPUs, integrated GPUs, or Intel NPUs, OpenVINO can make your models run 2-5x faster than naive PyTorch or TensorFlow inference.

What It Does

OpenVINO sits between your trained model and Intel hardware. It takes a model from any major framework, converts it to an optimized intermediate representation, applies hardware-specific optimizations (quantization, kernel fusion, graph optimization), and runs inference using the best available Intel hardware.

Trained Model (PyTorch/ONNX/TF) --> OpenVINO Converter --> Optimized IR --> Intel Hardware
                                         |                                    |
                                    Quantization (NNCF)               CPU / GPU / NPU
                                    Graph optimization
                                    Kernel fusion

Supported Hardware

| Intel Hardware | What It Is | OpenVINO Support | Best For |
|---|---|---|---|
| Intel CPUs (Core, Xeon) | General-purpose processors | Full (primary target) | Server inference, any workload |
| Intel Arc GPUs | Discrete graphics cards | Full | Parallel inference, image models |
| Intel integrated GPUs | Built into Core processors | Full | Laptop/desktop inference |
| Intel NPU (Meteor Lake+) | Dedicated neural accelerator | Full | Always-on AI, efficient inference |
| Intel Gaudi | AI training/inference accelerator | Separate SDK | Data center training (not OpenVINO) |

Quick Start: Model Conversion and Inference

"""
openvino_quickstart.py -- Convert and run a model with OpenVINO.

Requires: pip install openvino nncf
          pip install torch torchvision  (for model download)

Works on any machine with an Intel CPU (no GPU required).
"""

import openvino as ov
import numpy as np


# --- Step 1: Convert a PyTorch model to OpenVINO ---

def convert_pytorch_model():
    """Convert a PyTorch model to OpenVINO IR format."""
    import torch
    from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

    # Load a pretrained model
    model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
    model.eval()

    # Create example input
    example_input = torch.randn(1, 3, 224, 224)

    # Convert to OpenVINO (one line)
    ov_model = ov.convert_model(model, example_input=example_input)

    # Save for later use (optional — avoids re-conversion)
    ov.save_model(ov_model, "mobilenet_v2.xml")

    return ov_model


# --- Step 2: Run inference ---

def run_inference(model_path: str = "mobilenet_v2.xml"):
    """Load and run an OpenVINO model."""
    # Initialize the runtime
    core = ov.Core()

    # List available devices
    print(f"Available devices: {core.available_devices}")
    # Example output: ['CPU', 'GPU', 'NPU']

    # Compile model for a specific device
    # "CPU" = Intel CPU, "GPU" = Intel integrated/Arc GPU, "NPU" = Intel NPU
    # "AUTO" = let OpenVINO pick the best device
    compiled_model = core.compile_model(model_path, "AUTO")

    # Run inference
    input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
    result = compiled_model([input_data])

    # Get output
    output = result[0]
    predicted_class = np.argmax(output)
    print(f"Predicted class: {predicted_class}")

    return predicted_class


# --- Step 3: Optimize with quantization ---

def quantize_model(model_path: str = "mobilenet_v2.xml"):
    """Apply INT8 quantization using NNCF for ~2x speedup."""
    import nncf

    core = ov.Core()
    ov_model = core.read_model(model_path)

    # Post-training quantization (no retraining needed)
    # Requires a small calibration dataset (100-300 samples)
    def calibration_data():
        for _ in range(100):
            yield [np.random.randn(1, 3, 224, 224).astype(np.float32)]

    quantized_model = nncf.quantize(
        ov_model,
        nncf.Dataset(calibration_data()),
    )

    ov.save_model(quantized_model, "mobilenet_v2_int8.xml")
    print("Quantized model saved. Expected ~2x speedup on Intel CPUs.")


if __name__ == "__main__":
    print("Converting PyTorch model to OpenVINO...")
    convert_pytorch_model()

    print("\nRunning inference...")
    run_inference()

    print("\nQuantizing model...")
    quantize_model()
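To verify the "~2x speedup" claim on your own hardware, time the FP32 and INT8 models with the same harness. The helper below is framework-agnostic (it times any callable), so it works for compiled OpenVINO models as well as plain PyTorch modules; the warm-up runs matter because the first inference can include one-time graph compilation and cache population.

```python
import time


def benchmark(fn, *args, warmup: int = 3, iters: int = 20) -> float:
    """Return mean latency in milliseconds over `iters` timed calls."""
    for _ in range(warmup):  # absorb one-time compilation/caching costs
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1000


# Usage with the quickstart models (assumes both .xml files exist):
#   core = ov.Core()
#   fp32 = core.compile_model("mobilenet_v2.xml", "CPU")
#   int8 = core.compile_model("mobilenet_v2_int8.xml", "CPU")
#   x = np.random.randn(1, 3, 224, 224).astype(np.float32)
#   print(f"FP32: {benchmark(fp32, [x]):.1f} ms")
#   print(f"INT8: {benchmark(int8, [x]):.1f} ms")
```

Actual speedups vary with CPU generation (VNNI/AMX instruction support matters a lot for INT8), so treat the 2x figure as a starting expectation, not a guarantee.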

LLM Inference with OpenVINO GenAI

OpenVINO has expanded beyond computer vision to support generative AI workloads:

"""
openvino_llm.py -- Run an LLM with OpenVINO on Intel hardware.

Requires: pip install openvino-genai optimum[openvino]

Convert a HuggingFace model first:
    optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct \
        --weight-format int4 llama-1b-ov
"""

import openvino_genai as ov_genai


def run_llm(model_dir: str = "llama-1b-ov"):
    """Run LLM inference on Intel CPU/GPU."""
    # Load the model (automatically selects best device)
    pipe = ov_genai.LLMPipeline(model_dir, "CPU")

    # Generate text
    result = pipe.generate(
        "Explain what a KV cache is in one paragraph.",
        max_new_tokens=128,
        temperature=0.7,
    )
    print(result)


if __name__ == "__main__":
    run_llm()

Key GenAI features in OpenVINO 2026.0:

  • Mixture of Experts (MoE) model support (GPT-OSS-20B, Qwen3-30B)
  • Speculative decoding with EAGLE-3 on CPU, GPU, and NPU
  • Text-to-video pipeline (LTX-Video model)
  • Whisper speech-to-text with word-level timestamps
  • INT4 data-aware weight compression for MoE models

OpenVINO vs CoreML vs TensorRT

| Aspect | OpenVINO | CoreML | TensorRT |
|---|---|---|---|
| Vendor | Intel (open-source) | Apple (proprietary) | NVIDIA (proprietary) |
| Target hardware | Intel CPU, GPU, NPU | Apple Neural Engine, GPU, CPU | NVIDIA GPUs only |
| Input formats | PyTorch, ONNX, TF, PaddlePaddle, JAX | PyTorch, ONNX, TF (via coremltools) | ONNX, PyTorch (via torch-tensorrt) |
| Quantization | INT8, INT4, FP8, 4-bit LUT (NNCF) | INT8, palettization, pruning | FP8, INT8, INT4 |
| LLM support | Yes (OpenVINO GenAI) | Yes (CoreML for Apple Intelligence) | Yes (TensorRT-LLM) |
| Typical speedup | 2-5x over PyTorch on Intel CPUs | 3-10x on Neural Engine | 2-6x on NVIDIA GPUs |
| Open source | Yes (Apache 2.0) | No | No (limited source available) |
| Cross-platform | Linux, Windows, macOS (Intel only) | macOS, iOS only | Linux, Windows (NVIDIA only) |

ONNX Ecosystem Integration

OpenVINO fits into the broader ONNX ecosystem as one of several execution providers:

PyTorch Model
     |
     v
ONNX Format (universal interchange)
     |
     +-- ONNX Runtime + OpenVINO EP  --> Intel hardware
     +-- ONNX Runtime + TensorRT EP  --> NVIDIA hardware
     +-- ONNX Runtime + CoreML EP    --> Apple hardware
     +-- ONNX Runtime + QNN EP       --> Qualcomm hardware
     +-- ONNX Runtime + DirectML EP  --> Windows GPUs

This means you can export your model to ONNX once and run it on any hardware via the appropriate execution provider. OpenVINO can be used either standalone (direct API) or as an ONNX Runtime execution provider.
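With ONNX Runtime specifically, the diagram above translates into a providers list passed at session creation: ORT tries each execution provider in order and falls back down the list, so `CPUExecutionProvider` always goes last. A sketch of that mapping (the EP strings are real ONNX Runtime identifiers; the vendor keys and helper function are a hypothetical convenience, not an ORT API):

```python
# Execution-provider priority per hardware vendor. ONNX Runtime falls back
# down the list at session creation, so CPU is always the last resort.
_EP_PRIORITY = {
    "intel":    ["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    "nvidia":   ["TensorrtExecutionProvider", "CUDAExecutionProvider",
                 "CPUExecutionProvider"],
    "apple":    ["CoreMLExecutionProvider", "CPUExecutionProvider"],
    "qualcomm": ["QNNExecutionProvider", "CPUExecutionProvider"],
    "windows":  ["DmlExecutionProvider", "CPUExecutionProvider"],
}


def providers_for(vendor: str) -> list[str]:
    """Return the execution-provider priority list for a hardware vendor."""
    return _EP_PRIORITY.get(vendor.lower(), ["CPUExecutionProvider"])


# Then, with onnxruntime installed:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx",
#                                  providers=providers_for("intel"))
```

The payoff of this design is that the model file itself never changes; only the providers list does, which keeps hardware selection a deployment-time decision rather than a build-time one.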

When to Use OpenVINO

| Scenario | Use OpenVINO? | Why |
|---|---|---|
| Server inference on Intel Xeon CPUs | Yes | Primary use case, significant speedup over raw PyTorch |
| Laptop inference on Intel Core | Yes | Good acceleration, especially with integrated GPU and NPU |
| Edge devices with Intel chips | Yes | Supports NPU for efficient always-on inference |
| NVIDIA GPU inference | No | Use TensorRT or vLLM instead |
| Apple Silicon inference | No | Use CoreML or MLX instead |
| Qualcomm device inference | No | Use QNN SDK or AI Hub instead |
| Cross-platform deployment | Maybe | Use ONNX Runtime with OpenVINO EP for Intel, other EPs for other hardware |
| Building an AI agent harness | Unlikely | Your harness likely calls cloud APIs; OpenVINO matters if you self-host inference on Intel hardware |
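The table above collapses into a small lookup, in the spirit of the hardware selector tool this guide describes. The sketch below simply encodes the table's recommendations; the hardware keys are hypothetical labels, and the backends are the toolkits discussed in this section:

```python
def inference_backend(hardware: str) -> str:
    """Map self-hosted inference hardware to the recommended toolkit."""
    recommendations = {
        "intel_cpu":     "OpenVINO",
        "intel_npu":     "OpenVINO",
        "nvidia_gpu":    "TensorRT or vLLM",
        "apple_silicon": "CoreML or MLX",
        "qualcomm_npu":  "QNN SDK / AI Hub",
    }
    # Anything unrecognized: ONNX Runtime runs everywhere via its CPU EP.
    return recommendations.get(hardware, "ONNX Runtime (portable fallback)")


print(inference_backend("intel_cpu"))   # OpenVINO
print(inference_backend("nvidia_gpu"))  # TensorRT or vLLM
```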

Practical Relevance for Harness Builders

OpenVINO is most relevant if you are:

  • Self-hosting inference on Intel server hardware (common in enterprise environments where GPU procurement is slow or restricted)
  • Running models on Intel laptops for local development without an NVIDIA GPU
  • Deploying edge AI on Intel-based IoT devices (Intel NUC, industrial PCs)
  • Using ONNX Runtime as your inference backend and want Intel-optimized execution

If your harness calls cloud inference APIs (OpenAI, Anthropic, Google), OpenVINO is irrelevant — the cloud provider handles hardware optimization. If you run models locally on Apple Silicon, use MLX or CoreML instead.


See Also

  • Doc 01 (Foundation Models) — Model size depends on hardware; SLM selection is hardware-aware
  • Doc 02 (KV Cache Optimization) — Hardware choice (GPU VRAM vs unified memory) affects cache strategy
  • Doc 13 (Cost Management) — Hardware cost (amortization, electricity, maintenance) factors into total cost of ownership
  • Doc 23 (Apple Intelligence & CoreML) — Apple’s inference optimization stack, comparison point for OpenVINO and Qualcomm
  • Doc 25 (Edge & Physical AI) — Edge deployment patterns where Qualcomm NPU and OpenVINO are relevant
  • Doc 26 (TensorFlow & Frameworks) — Framework ecosystem context; OpenVINO, CoreML, TensorRT as deployment targets
  • Doc 28 (Unified Memory & Hardware Economics) — Deep dive into why Apple Silicon unified memory changes the economics