You want to run large language models locally on a Mac Studio. Good idea - unified memory is genuinely useful for LLMs. But the specs matter, and there are some hard truths about what “works” versus what feels responsive. More importantly: the right Mac depends entirely on which model you want to run.

Memory requirements: which model fits your Mac?

Different models have wildly different memory demands. Here’s what you actually need for the top free models:

Small models (great for any Mac)

  • LLaMA 3 7B: ~4–5 GB (Q4), ~8 GB (Q5)
  • Mistral 7B: ~4–5 GB (Q4), ~8 GB (Q5)
  • Phi 2: ~2.5 GB (Q4)

Reality: Runs on M2 Max easily. 15–40 tok/s depending on chip.

Medium models (the practical sweetspot)

  • LLaMA 3 13B: ~8–10 GB (Q4), ~13 GB (Q5)
  • Deepseek Chat 7B / 13B: ~4–10 GB
  • Gemma 7B / 13B: ~4–10 GB

Reality: Comfortable on M2 Max, excellent on M2 Ultra. 20–40 tok/s is standard.

Large dense models (needs real hardware)

  • LLaMA 3 70B: ~38–42 GB (Q4), ~65 GB (Q5)
  • Mixtral 8x7B MoE: ~14–16 GB (Q4), ~24 GB (Q5)

Reality: Mixtral works well on M2 Ultra. 70B needs M3 Ultra or a GPU rig.

Mixture-of-experts (tricky)

  • Qwen3.6-35B-A3B (3B active): ~19–22 GB (Q4), ~36 GB (Q8)
  • Mixtral 8x22B: ~24–28 GB (Q4), ~40 GB (Q5)

Reality: You must load all weights even though only part activates. Needs M2 Ultra minimum.

The key gotcha: active parameters ≠ memory footprint. Qwen 35B with 3B active still needs 35B worth of weights in memory.

Add overhead for KV cache (0.5–4 GB depending on context window), and you need realistic headroom: at least 24–28 GB total for any 30B+ model.

The Apple Silicon advantage and its limits

Mac Studio sits in an interesting middle ground.

What’s good:

  • Unified memory architecture (GPU and CPU share the same pool)
  • No VRAM bottleneck
  • Memory bandwidth is genuinely strong (800+ GB/s on Ultra variants)

What’s limiting:

  • Not CUDA - you’re stuck with Metal backend or custom implementations
  • Raw compute is lower than high-end GPUs
  • Generation speed is memory-bound, and that bandwidth matters more than raw TFLOPs

This means Mac Studio doesn’t compete with a gaming GPU rig on tokens/second, but it punches above its weight on simplicity and price per GB of usable memory.

Mac Studio configs and real-world performance

Here’s what actually runs well on each Mac, with realistic token speeds.

M2 Max (32–64 GB) - Budget entry point

What runs well:

  • LLaMA 3 7B–13B: ✅ excellent (25–40 tok/s)
  • Mistral 7B: ✅ excellent (25–40 tok/s)
  • Mixtral 8x7B: ⚠️ tight but viable (12–18 tok/s at 32GB)
  • Qwen 35B / 70B: ❌ frustrating (8–15 tok/s, memory pressure)

Cost: £1,400–£2,500 new

Verdict: Perfect for small–medium models. Skip if you want 30B+ class models.

M2 Ultra (64–128 GB) - The sweet spot

What runs well:

  • LLaMA 3 7B–13B: ✅ excellent (30–45 tok/s)
  • Mixtral 8x7B: ✅ very good (18–28 tok/s)
  • Qwen 35B / Mixtral 8x22B: ✅ responsive (15–25 tok/s at 64GB, 18–28 tok/s at 128GB)
  • LLaMA 70B: ⚠️ barely viable (8–12 tok/s at 128GB only)

Specs: 20 cores CPU, 16 cores GPU, ~800 GB/s bandwidth

Cost: £3,000–£5,500 new; ~£2,400–£3,200 refurbished

When to buy: You want proper performance on 30B+ models without overspending. Excellent refurbished market (these age well).

M3 Ultra (96–256 GB) - No compromises

What runs well:

  • All medium models: 🚀 excellent (35–50+ tok/s)
  • Qwen 35B / Mixtral variants: 🚀 very smooth (20–30 tok/s)
  • LLaMA 70B: ✅ usable (15–22 tok/s at 128GB, faster at 192GB+)
  • Multiple concurrent models: ✅ comfortable headroom

Cost: £4,200–£7,500+

When to buy: You’re serious about local AI. You want to run large models, experiment with different ones, or handle long contexts (32K+).

M4 Max (36–128 GB) - The confusing option

Critical limitation: Only 410–546 GB/s bandwidth vs 800+ on Ultra chips.

What runs okay:

  • LLaMA 3 7B–13B: ✅ fine (25–35 tok/s)
  • Mixtral 8x7B: ⚠️ barely (12–20 tok/s)
  • Qwen 35B: ❌ only if maxed to 128GB, then ~15–22 tok/s (but M2 Ultra does same for less)

Cost: ~£2,000–£4,000

Verdict: Newer chip, but memory bandwidth is the bottleneck for LLMs. Only makes sense if you’re committed to small models or willing to accept compromises.

Realistic performance expectations

If you actually buy one of these, what will generation feel like?

Token speed thresholds:

  • 8–12 tok/s: noticeably laggy, typing pauses
  • 15–18 tok/s: acceptable, feels responsive enough
  • 20–25 tok/s: very good, near-ChatGPT-like
  • 30+ tok/s: excellent, zero perceptible lag

Context window penalty: Your context length matters a lot. Longer contexts = larger KV cache = slower generation.

  • 4K context: baseline
  • 16K context: -10–25% speed
  • 32K context: -25–40% speed
  • 100K+: can halve throughput

So “20 tok/s at 4K context” becomes “maybe 12 tok/s at 32K context” on the same hardware.

Picking your Mac based on your model choice

You want to run 7B–13B models (LLaMA 3, Mistral, etc.)

M2 Max with 32–64 GB (~£1.5k–£2.5k)

  • 25–40 tok/s is standard
  • Great value, these are excellent utility models
  • Plenty of power

You want to run 30B–35B models (Qwen, Mixtral, etc.)

M2 Ultra with 64 GB minimum (~£3k, or £2.4k refurbished)

  • 15–25 tok/s depending on model
  • Comfortable performance for interactive use
  • Good refurbished availability

M2 Ultra with 128 GB (~£4.5k) if you want headroom

  • Smoother under load, room for larger context windows
  • Best price/performance for serious local AI work

You want to run 70B models or experiment with multiple large models

M3 Ultra with 128 GB (~£5.5k)

  • 15–22 tok/s on 70B models
  • Newest architecture, better efficiency
  • Actually comfortable for production-style use

You want maximum flexibility (mix of models, large contexts, concurrency)

M3 Ultra with 192 GB or 256 GB (~£7k+)

  • Handles everything without compromise
  • Can run multiple models simultaneously
  • Proper AI workstation feel

The honest comparison table

Mac Config Price LLaMA 13B Mixtral 8x7B Qwen 35B 70B Models Vibes
M2 Max 32GB £1.4k 30–40 ✅ not viable ❌ not viable ❌ no ❌ Budget entry
M2 Max 64GB £2k 30–40 ✅ 12–18 ⚠️ struggles ❌ no ❌ Good for small
M2 Ultra 64GB £3k 35–45 ✅ 18–28 ✅ 15–22 ✅ 8–12 ⚠️ Sweet spot
M2 Ultra 128GB £4.5k 35–45 ✅ 20–30 ✅ 18–25 ✅ 10–15 ⚠️ Comfortable
M3 Ultra 128GB £5.5k 40–50 🚀 22–32 🚀 20–28 🚀 15–22 ✅ Excellent
M3 Ultra 256GB £8k+ 40–50 🚀 22–32 🚀 20–28 🚀 18–25 ✅ Workstation

My recommendation: If you’re spending £3k–£4k anyway, M2 Ultra with 64–128 GB is the inflection point where the experience stops feeling constrained. You get real performance without paying for the latest chip.

Context window penalty (often overlooked)

Larger context windows mean larger KV cache, which means slower generation. This hits harder on Mac than GPU rigs.

Typical slowdown:

  • 4K context: baseline speed
  • 16K context: -10–25% slower
  • 32K context: -25–40% slower
  • 100K+ context: can halve throughput

Real example: If your model does 25 tok/s at 4K context, expect ~12–15 tok/s at 32K context on the same Mac.

Practical implication: For local use, aim for 4K–16K context windows. You keep speed without losing useful context for code, research, or chat. Only go beyond 32K on M3 Ultra with 128 GB+ if you truly need it.

Alternative: GPU rigs aren’t cheaper

Since you’re spending £3k–£5k anyway, you should know: a comparable GPU setup isn’t obviously better.

A single RTX 4090:

  • £1,800–£2,200
  • 24 GB VRAM
  • 40–70 tok/s on Qwen 35B
  • But: CUDA setup overhead, electricity cost, cooling

RTX 4090 vs M2 Ultra 64 GB:

  • RTX is 2–3× faster
  • Mac is simpler, cheaper running costs, unified memory
  • You’re paying for convenience more than performance

Both are legitimate choices. Mac Studio is the right call if you value simplicity.

Since you’re choosing both a Mac and a model, here’s what actually matters:

Model Size Memory (Q4) tok/s (M2 Ultra) tok/s (M3 Ultra) Best for
LLaMA 3 7B 7B 4–5 GB 30–40 40–50 Fast, responsive, coding
Mistral 7B 7B 4–5 GB 30–40 40–50 Similar to LLaMA, French origin
Mixtral 8x7B MoE 56B 14–16 GB 18–28 22–32 Better quality than 7B, still fast
Deepseek Coder 6.7B–33B 4–18 GB 25–35 35–45 Code generation specialty
LLaMA 3 13B 13B 8–10 GB 25–35 35–45 Better reasoning than 7B
Gemma 13B 13B 8–10 GB 25–35 35–45 Smaller, efficient alternative
Qwen 35B MoE MoE 35B 19–22 GB 15–22 20–28 Strong reasoning, sparse
Mixtral 8x22B MoE 141B 24–28 GB 12–18 16–24 Highest quality dense-equivalent
LLaMA 3 70B 70B 38–42 GB 8–12 15–22 Expert-level, needs headroom

Patterns:

  • 7B models: fit anywhere, 30–40+ tok/s even on M2 Max
  • 13B models: good sweet spot, 25–35 tok/s on M2 Ultra
  • MoE models: efficient for their capability, 15–28 tok/s depending on size
  • 70B+: only practical on M3 Ultra or GPU

Real-world advice

If you just want something that works: Grab LLaMA 3 7B or Mistral 7B. They run everywhere, generate good output, and hit 30+ tok/s. You don’t need to spend £3k.

If you want better quality without overspending: LLaMA 3 13B or Mixtral 8x7B on an M2 Ultra 64GB (£3k). This is where quality meets responsive performance.

If you’re serious about local AI: M2 Ultra 128 GB or M3 Ultra 128 GB. You can keep multiple models loaded, experiment freely, and handle longer contexts without compromise.

If GPU is an option: Price out a RTX 4090 rig anyway. For £3k–£4k, a 4090 will give you 2–3× the speed. Mac Studio is the right choice only if you value simplicity + unified memory + lower power consumption.


Tools to actually run these models

  • Ollama - Easiest entry point for Mac, supports most popular models
  • MLX - Apple’s own ML framework, fastest on Apple Silicon
  • llama.cpp - Portable, works everywhere, widely used
  • LM Studio - Desktop UI, good for beginners
  • Open Web UI - Modern web interface for local models, great with Ollama
  • GPT4All - Easy desktop app with built-in models, minimal setup
  • Text Generation WebUI - Feature-rich web interface with chat, notebook, and API modes
  • LocalAI - OpenAI-compatible API for local inference, easy integration
  • vLLM - Production inference engine if you need batching/concurrency
  • Hugging Face Transformers - The foundational library for running any model programmatically

Model sources

Reference