MAKE TECH
Local AI estimator • practical heuristics, not benchmark theatre

What AI can you actually run locally on your hardware?

This page aims for the useful middle ground between “yes, technically” and “yes, you would actually want to use it”. It uses conservative, approximate engineering judgement to estimate what kinds of local AI workloads make sense on Windows PCs, Linux PCs, Macs, and the Raspberry Pi 5 16GB.

How the platforms differ

  • Windows / Linux: usually constrained mainly by NVIDIA GPU VRAM.
  • Mac: constrained mainly by Apple unified memory, which is shared by CPU and GPU.
  • Pi 5 16GB: CPU + shared RAM edge device, with much harsher limits for generative workloads.

What changes the answer

  • Model size and architecture
  • Quantization level
  • Context length or generation length
  • Framework / runtime overhead
  • Available VRAM or unified memory
  • Whether the speed is acceptable to a real human
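The first two factors on the list above, model size and quantization, dominate the raw memory footprint. A minimal sketch of that weight-only estimate (the function name and the rounding are illustrative, not part of any real runtime's API):

```python
def estimate_weights_gb(params_billions: float, quant_bits: float) -> float:
    """Rough weight-only footprint in GiB: parameter count x bytes per weight.

    Ignores runtime overhead, activations, and the attention cache,
    which is why the bands on this page leave headroom on top of this.
    """
    return params_billions * 1e9 * (quant_bits / 8) / (1024 ** 3)

# A 7B-class model at 4-bit quantization: ~3.26 GiB of weights alone.
print(round(estimate_weights_gb(7, 4), 2))  # 3.26
```

This is why the same 7B model can be a comfortable fit at 4-bit and a non-starter at 16-bit: the quantization level scales the whole first term.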

Practical framing

This is not a benchmark database

It is an explainer and estimator. The memory bands are indicative. A model that loads is not automatically a model that feels good to use.

Balanced platform view

With the same NVIDIA GPU, Windows and Linux often support similar model sizes. Linux often has stronger local AI tooling culture; Windows is often more convenient for mixed general desktop use.

Mac note

Unified memory can sometimes let larger models load than a small-VRAM discrete GPU would allow, but speed and software compatibility can differ.

Pi note

Tiny LLMs, lightweight speech models, and some local transcription can be feasible. Serious image generation is strained; serious video generation is usually a bad fit.

Controls on this page

  • Platform: switches the memory model and wording
  • Category: representative examples, not an exhaustive list
  • Practicality: filter by how tolerable the experience is likely to be
  • GPU VRAM target: leave headroom for overhead (currently 12 GB VRAM)
  • Sort: because “smallest that works” is not the same as “most capable”

Comfortable

Likely sensible for normal local use with workable speed and some headroom.

Borderline

Usable with compromises such as shorter context, lower resolution, or patience.

Experimental

Technically possible, but often slow, awkward, or compromised enough to be a hobby project.

Unrealistic

Either very poor on the chosen hardware or simply not a sensible local target.
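One way to sketch how a memory requirement maps onto these four bands. The ratio thresholds below are illustrative assumptions for the sake of the example, not this page's exact rules:

```python
def practicality_band(required_gb: float, available_gb: float) -> str:
    """Map a memory requirement onto the page's four practicality bands.

    Thresholds are assumed for illustration; the page's real heuristics
    also weigh speed, context length, and workload type.
    """
    if available_gb <= 0 or required_gb > available_gb:
        return "Unrealistic"
    ratio = required_gb / available_gb
    if ratio <= 0.6:
        return "Comfortable"   # workable speed and some headroom
    if ratio <= 0.85:
        return "Borderline"    # usable with compromises
    return "Experimental"      # fits on paper, little headroom left
```

Note that a requirement equal to the available memory still lands in “Experimental” at best: fitting exactly leaves nothing for the runtime, context, or outputs.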

Representative model fit

Showing heuristic results for the current platform and memory target.

Raspberry Pi 5 16GB: what is realistically local?

Treat the Pi 5 as a low-power local inference box, not as a desktop AI workstation. It can do useful work, but the envelope is much smaller.

  • Realistic: tiny LLMs, some 3B-class LLMs, fast local TTS, small Whisper transcription, and lightweight CPU-only inference.
  • Maybe workable: 7B-class quantized LLM experiments for short prompts, if you accept modest speed.
  • Heavily constrained: image generation. It is closer to “proof that it can run” than “pleasant daily workflow”.
  • Generally not recommended: serious local video generation. “Can run” and “runs well” are very different sentences here.

Practical rule: on a Pi, transcription and compact utility inference are credible; large generative media pipelines are mostly the wrong job for the device.
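The Pi framing above reduces to a simple shared-memory budget check. The reservation figure here is an assumed placeholder for the OS, desktop, and runtime, not a measurement:

```python
# Rough Pi 5 16GB budget check (illustrative numbers, not measurements).
TOTAL_GB = 16.0
OS_AND_RUNTIME_GB = 2.5  # assumed reservation: OS, desktop, inference runtime

def fits_on_pi(model_gb: float, working_gb: float) -> bool:
    """Does a model plus its working memory fit in the remaining shared RAM?"""
    return model_gb + working_gb <= TOTAL_GB - OS_AND_RUNTIME_GB

print(fits_on_pi(3.5, 0.5))   # 7B-class at 4-bit: fits, with headroom
print(fits_on_pi(14.0, 2.0))  # larger or less-quantized model: does not
```

Fitting in memory is necessary but not sufficient on the Pi: CPU-only generation speed is what usually pushes a workload from “Borderline” to “Experimental”.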

Interpretation notes

  • Windows vs Linux: same NVIDIA GPU usually means a similar raw model ceiling. Differences are more about tooling convenience than magic performance law.
  • Mac: unified memory can let larger models load than a small-VRAM GPU, but it still shares that pool with the rest of the system.
  • Headroom matters: raw model size equal to VRAM does not mean comfortable usage. Runtime, context, attention cache, and outputs still need room.
  • Media generation scales badly: image resolution, video frame count, and audio duration can move a workload from plausible to painful surprisingly fast.

This page deliberately penalises “just fits on paper” scenarios. The aim is practicality, not heroic loading screenshots.
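The “headroom matters” point is concrete for LLMs: the attention (KV) cache grows linearly with context length and is paid on top of the weights. A sketch, assuming a Llama-2-7B-like shape without grouped-query attention:

```python
def kv_cache_gib(layers: int, ctx_tokens: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: two tensors (K and V) per layer, one slot per token."""
    return 2 * layers * ctx_tokens * kv_heads * head_dim * bytes_per_elem / (1024 ** 3)

# Llama-2-7B-style shape (32 layers, 32 KV heads, head_dim 128), fp16 cache:
print(round(kv_cache_gib(32, 4096, 32, 128), 2))  # 2.0 GiB at 4096-token context
```

So a 4-bit 7B model whose weights are about 3.5 GiB can still need roughly 5.5 GiB once a long context is in play, which is exactly the gap between “loads” and “feels good to use”.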

Built-in source notes for the curated examples

The dataset is intentionally curated rather than exhaustive. It uses representative families that are commonly discussed for local use and keeps the memory bands indicative.

The practical bands in this page are more conservative than raw model-card claims. That is deliberate.