The Thing About Local LLMs

Overview

Someone responded to yesterday’s post about the Pi coding agent, asking if I was using it with local models. I’ve done a lot of testing with them, and I can say with great confidence that the hype about using them is mostly just that.

At the same time, Pi is a truly remarkable tool that is not only your best option for working with local LLMs, but also - in my opinion - your best option for working with any LLM.

The Reality of Local Models

Unless you’re rich and can afford a beefy PC with multiple top-of-the-line GPUs, there are very few edge cases where running a local LLM makes sense.

The videos claiming you can run them on a normal laptop or a Raspberry Pi are only technically correct. Practically they are all lies. They have all had the delay between prompt and response edited out, and are probably running models that aren’t even remotely comparable to what you get from cloud providers.

You can run tiny models and get… bearable response times, but they’re 💩 for coding.

My wife has a gaming PC with one good GPU, and in theory we can load a Qwen3-Coder-Next 80B model into its memory and get output similar to Anthropic’s Sonnet 4.6 model. I believe that, but what’s unclear is how bearable the delay will be.

She and her research partner Rhyannon (👋) have some things they want to investigate that involve or are facilitated by using local models, so we’re going to be seeing what that looks like and how usable it is.

The Mathematical Reality of No

So far, my experience has been that even on a maxed-out Mac laptop the only practical use cases are background processes where you don’t actually care when they finish. Apple’s claims about their “Neural Engine” are just hype. Yes, it’s probably better than CPUs without something like that, but good models are large - like really large - and if you want to use a model interactively you need to load the whole thing into the GPU’s VRAM.

The back-of-the-napkin equation is pretty straightforward: M = Np * Bp, where

Symbol   Description
M        Memory in GB
Np       Number of parameters in billions
Bp       Bytes per parameter

Some pre-calculated examples for models you might want to use, at 2 bytes per parameter (16-bit weights):

Model         Model VRAM Size   Activation Mem   Total Needed
qwen3:4b      8GB               ~3.2GB           11.2GB
qwen3:8b      16GB              ~6.4GB           22.4GB
llama3.2:3b   6GB               ~2.4GB           8.4GB
llama3.2:1b   2GB               ~0.8GB           2.8GB

On top of that we need to consider “activation memory”: temporary data used during inference or training. That’s generally another ~20-40%. In the table above I’ve used 40% to make sure we have “enough”. Training (not something we’re discussing) requires 2-4x the parameter memory.

That extra 20-40% is required for each agent and subagent hitting the system at the same time, because each one has its own unique blob of text that inference needs to be run on. As soon as you exceed the VRAM of your GPU, you start taking massive performance penalties.
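Here is a rough sketch of that napkin math in Python. The function name, parameter names, and defaults are mine and purely illustrative. The 2 bytes per parameter default is what the numbers in the table work out to, and multiplying the activation overhead by the number of concurrent contexts is just my reading of the "each agent and subagent" point above, not a measured figure.

```python
def estimate_vram_gb(params_billions, bytes_per_param=2.0,
                     activation_overhead=0.40, concurrent_contexts=1):
    """Back-of-the-napkin VRAM estimate in GB.

    weights    = Np * Bp  (billions of parameters * bytes per parameter)
    activation = weights * overhead, paid once per concurrent context
    """
    weights_gb = params_billions * bytes_per_param
    activation_gb = weights_gb * activation_overhead * concurrent_contexts
    return weights_gb + activation_gb


if __name__ == "__main__":
    # Parameter counts (in billions) for the models in the table.
    models = {"qwen3:4b": 4, "qwen3:8b": 8, "llama3.2:3b": 3, "llama3.2:1b": 1}
    for name, billions in models.items():
        print(f"{name}: {estimate_vram_gb(billions):.1f} GB")

    # Same 8B model, but with an agent plus two subagents running at once.
    print(f"qwen3:8b x3 contexts: "
          f"{estimate_vram_gb(8, concurrent_contexts=3):.1f} GB")
```

With the defaults, the 8B model comes out at 22.4 GB, and under this assumption three concurrent contexts would push it to roughly 35.2 GB.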

Quantization of models helps lower the VRAM requirements, but the fact of the matter is that you need way more VRAM dedicated to the GPU than any laptop can provide.
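To put rough numbers on what quantization buys you, here is a small sketch using the same M = Np * Bp equation, with bytes per parameter taken as bits / 8. The function name is hypothetical, and this is an approximation: real quantized formats carry some extra metadata, and activation memory still comes on top of these figures.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for the weights alone, in GB.

    Bytes per parameter is bits / 8, so 16-bit = 2, 8-bit = 1, 4-bit = 0.5.
    Ignores activation memory and any per-format overhead.
    """
    return params_billions * bits_per_param / 8


if __name__ == "__main__":
    # 8B and 80B parameter models at common bit widths.
    for billions in (8, 80):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(billions, bits)
            print(f"{billions}B @ {bits}-bit: {gb:.1f} GB")
```

By this estimate, even at 4 bits per parameter an 80B model needs about 40 GB for the weights before any activation memory is counted.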

Pi Changes Things

I think Pi isn’t just the right paradigm / approach for coding agents. I also think that its minimalism is required for doing anything interactive with a local LLM. The things it can do to accommodate restrictive hardware end up working out a lot like accommodations for people with physical disabilities. Yes, it helps them, but it also helps everyone else.

All the other coding agents are financially motivated to use more of your tokens. Anthropic doesn’t care how wasteful Claude is of tokens as it does its work. The more it wastes the more you end up paying. Pi enables a lot of important things that minimize token usage:

  • trivially small & modifiable “system prompt”
  • encourages putting code in extensions not skills
  • ability to override everything with custom code
  • the ability to treat interactions with the LLM as a tree

The System Prompt

It’s been said that the thing that truly differentiates the Agentic platforms is not the models. It’s all the infrastructure and hidden prompts the vendors have built around their models.

The system prompt is one of those. It’s included at the start of every session and sucks down thousands of tokens. Claude’s changes between releases without warning, includes content dynamically generated by the agent (presumably more token usage) and there’s no easy way to see what it is. I’ve tried overriding it with the --system-prompt-file a_file.md flag and --system-prompt "some text" but it failed to follow the instructions I put in the prompt in both cases.

Pi’s system prompt is known, tiny, and modifiable.

Extensions over Skills

Agentic Skills are ultimately just Markdown documents that have to be read in and reasoned about before the system makes some non-deterministic choices about what to do and what tools to do it with. You can’t actually control its actions. It’s like giving instructions to a grade-schooler. It doesn’t matter how precise those instructions are, you can never be 100% sure they’ll be interpreted correctly, or consistently every time they’re read. Also, that reading uses tokens.

Pi extensions are code. You invoke them via a prompt just like skills, but once an extension is invoked it’s just a matter of waiting for the code to finish its run and either hand something back to Pi for further processing, or tell it that it’s done. It doesn’t matter how complex the extension is, it doesn’t use any tokens unless you’re explicitly asking it to interact with the model.

Overriding Everything

It’s not so much that you can override everything, because most things in Claude Code - for example - seem to be overridable with custom versions. It’s that you can override everything with deterministic code instead of throwing a block of Markdown text at it and hoping it interprets it correctly. Additionally, you don’t have to pay the token cost of processing those Markdown instructions because - with extensions - they don’t exist.

Conversation Trees

Models have no memory. The ever-growing history of your conversation with an LLM in each session keeps getting sent to the LLM with each successive prompt. It has to read in the entire thing from scratch with every new prompt. In addition to just being wasteful, this can be problematic when the human or the agent has gone down side quests that either didn’t pan out or weren’t relevant to the primary task at hand.

A good example of this is the human asking the agent to explain something it just did. It’s useful to you, but each successive prompt to the LLM will have that useless blob of text to process.
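A quick sketch of why that matters. It assumes a deliberately simplified model where every turn adds some number of tokens to the history and the whole history is processed again on each turn, ignoring any prompt caching a provider or inference engine might do. The function name and the token counts are made up for illustration.

```python
def total_tokens_processed(turn_sizes, system_prompt_tokens=0):
    """Total tokens the model reads across a session.

    turn_sizes: tokens each turn adds to the history (prompt plus reply).
    On every turn the model is sent the full history so far.
    """
    total = 0
    history = system_prompt_tokens
    for size in turn_sizes:
        history += size   # the history grows by this turn's tokens
        total += history  # and the whole history is processed again
    return total


if __name__ == "__main__":
    # Five ordinary turns of 200 tokens each.
    focused = [200, 200, 200, 200, 200]
    # Same session, but turn two is a 1500-token "explain what you did" detour.
    with_detour = [200, 1500, 200, 200, 200, 200]

    print("focused:    ", total_tokens_processed(focused))
    print("with detour:", total_tokens_processed(with_detour))
```

The detour is paid for once when it happens and then again on every turn that follows it, which is why one long explanation early in a session costs far more than its own length.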

Pi’s sessions are stored as trees. This means that, in the example above, you could run /tree and jump the context back to just before you asked it to explain. More significantly, if you and the agent had been exploring some way of solving a coding problem only to find that your thinking was bad, or that it wasn’t possible, you could jump back so that those “bad” ideas were no longer things that had to be considered on each successive prompt.
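To make the idea concrete, here is a minimal sketch of a conversation stored as a tree. This is not Pi's actual implementation or data format; the Node class and context_for function are hypothetical names I'm using to illustrate the concept. The context sent to the model is just the path from the root to whichever node you are currently on, so anything on an abandoned branch is never sent again.

```python
class Node:
    """One message in the conversation, linked to the message before it."""

    def __init__(self, text, parent=None):
        self.text = text
        self.parent = parent
        self.children = []

    def add(self, text):
        """Append a new message under this one and return it."""
        child = Node(text, parent=self)
        self.children.append(child)
        return child


def context_for(node):
    """Messages from the root down to `node`: what the model would see."""
    path = []
    while node is not None:
        path.append(node.text)
        node = node.parent
    return list(reversed(path))


if __name__ == "__main__":
    root = Node("user: fix the parser bug")
    fix = root.add("agent: patched the tokenizer")

    # Side quest: ask for an explanation.
    question = fix.add("user: explain what you just did")
    question.add("agent: (long explanation)")

    # Jump back to just before the question and carry on from there.
    next_step = fix.add("user: now add a regression test")

    for line in context_for(next_step):
        print(line)
```

The explanation branch still exists in the tree, so nothing is lost, but it is no longer part of the context for the new branch.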