
# Local AI Coding Environment

A fully offline, privacy-first AI coding setup for macOS on Apple Silicon. Uses llama.cpp to run Qwen 2.5 Coder models locally, with Aider and OpenCode as terminal-based coding agents: no API keys, no cloud, no costs.

## Hardware Requirements

- Mac with Apple Silicon (M1/M2/M3/M4)
- 32GB RAM recommended (the 32B model uses ~20GB)
- ~25GB free disk space for models

## What Gets Installed

| Component | Purpose |
| --- | --- |
| llama.cpp | Local model inference with Metal GPU acceleration |
| Qwen 2.5 Coder 32B (Q4_K_M) | Main chat/coding model (~20GB) |
| Qwen 2.5 Coder 1.5B (Q4_K_M) | Fast autocomplete model (~1.2GB) |
| Aider | Terminal coding agent (Claude Code alternative) |
| jq | JSON processing for the pipe command |

Both models are served via llama.cpp's built-in OpenAI-compatible API, making them work with any tool that supports the OpenAI API format.
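Because the endpoints speak the OpenAI wire format, any HTTP client works once the servers are running (see `llama-start` below). A minimal sketch with curl; the model name is essentially a label here, since llama.cpp serves whichever model it was started with:

```bash
# Ask the chat model (port 8080) a question via the OpenAI-compatible endpoint.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder-32b",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}]
      }' | jq -r '.choices[0].message.content'
```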

## Installation

```bash
chmod +x setup.sh
./setup.sh
```

The script is idempotent — safe to run multiple times. The first run downloads ~21GB of model weights from HuggingFace.
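The download step itself lives inside setup.sh; purely as an illustration of the kind of fetch involved (the repository and filename here are placeholders, not necessarily what the script uses):

```bash
# Illustrative only: fetch a Q4_K_M GGUF build from Hugging Face into the models directory.
# setup.sh may use a different repository, filename, or download tool.
mkdir -p ~/.local/share/llama-models
curl -L -o ~/.local/share/llama-models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  "https://huggingface.co/<repo>/resolve/main/<file>.gguf"
```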

After installation, restart your shell or run:

```bash
source ~/.zshrc
```
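The restart is needed so your shell picks up the new launchers; presumably setup.sh appends something along these lines to `~/.zshrc` (an assumption, not checked against the script):

```bash
# Assumed addition to ~/.zshrc so the launchers in ~/.local/bin are found
export PATH="$HOME/.local/bin:$PATH"
```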

## Commands

### `llama-start`

Starts both llama.cpp servers in the foreground; press Ctrl+C to stop both. A sketch of the underlying commands follows the list.

- Chat model (32B): http://127.0.0.1:8080
- Autocomplete model (1.5B): http://127.0.0.1:8081
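A simplified sketch of what this amounts to (the model paths are assumptions based on the install layout; the real launchers also pass the flags listed under "llama.cpp Server Flags" below):

```bash
# Simplified: the real llama-start passes more flags (see "llama.cpp Server Flags").
llama-server -m ~/.local/share/llama-models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 &
llama-server -m ~/.local/share/llama-models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8081 &
wait   # keep both in the foreground; Ctrl+C stops them
```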

### `llama-stop`

Kills all running llama-server processes.
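Roughly equivalent to the following, assuming the script matches by process name:

```bash
# Terminate every running llama-server instance
pkill -f llama-server
```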

### `ai-code [directory]`

The main coding agent. Auto-starts the chat server if it's not running. Initializes a git repo if one doesn't exist, then launches Aider with full file-editing capabilities.

```bash
cd ~/projects/my-app
ai-code .

# or from anywhere
ai-code ~/projects/my-app
```

Inside Aider you can ask it to edit files, run commands, and refactor code across your project. Changes are auto-committed to git so you can always roll back.
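As a rough sketch of the moving parts (the environment variables and Aider flag below are the standard way to point Aider at an OpenAI-compatible endpoint, not necessarily the script's exact contents):

```bash
# Sketch of what ai-code roughly does; the real launcher may differ.
cd "${1:-.}"
[ -d .git ] || git init          # Aider auto-commits, so it needs a repo

# Auto-start the chat server if it is not already answering.
if ! curl -sf http://127.0.0.1:8080/health >/dev/null; then
  llama-start &
  sleep 5                        # give the model time to begin loading
fi

# Point Aider at the local OpenAI-compatible endpoint.
export OPENAI_API_BASE="http://127.0.0.1:8080/v1"
export OPENAI_API_KEY="not-needed"
aider --model openai/qwen2.5-coder-32b
```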

ai-ask "question"#

Quick Q&A mode — no file editing, just chat. Useful for coding questions without modifying your project.

ai-ask "how do I handle errors in rust"

ai-pipe "prompt"#

Pipe code through the model via stdin. Useful for one-shot transforms in scripts.

```bash
cat main.py | ai-pipe "add type hints"
git diff | ai-pipe "write a commit message"
cat api.go | ai-pipe "find bugs"
```
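This is what jq is installed for; a minimal sketch of how such a command can be built on the chat endpoint (details of the real script may differ):

```bash
#!/usr/bin/env bash
# ai-pipe (sketch): combine an instruction with stdin and send it to the local chat model.
prompt="$1"
input="$(cat)"

jq -n --arg p "$prompt" --arg i "$input" \
  '{model: "qwen2.5-coder-32b", messages: [{role: "user", content: ($p + "\n\n" + $i)}]}' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" -d @- |
  jq -r '.choices[0].message.content'
```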

## Using OpenCode Instead of Aider

OpenCode is another terminal coding agent with a polished TUI. It connects to the same llama.cpp backend.

### Install OpenCode

```bash
brew install anomalyco/tap/opencode
```

### Configure

Create `~/.config/opencode/opencode.json`:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "llama-cpp/qwen2.5-coder-32b",
  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "not-needed"
      },
      "models": {
        "qwen2.5-coder-32b": {
          "name": "Qwen 2.5 Coder 32B",
          "tools": true
        }
      }
    }
  }
}
```

### Run

```bash
llama-start        # start the server (or ai-code auto-starts it)
opencode           # launch OpenCode in your project directory
```

Use `/models` inside OpenCode to select the Qwen model, and Tab to switch between Plan and Build modes.

## Configuration Files

| File | Purpose |
| --- | --- |
| `~/.aider/aider.conf.yml` | Aider settings (model, git, UI) |
| `~/.aider/.env` | API base URL and key for Aider |
| `~/.config/opencode/opencode.json` | OpenCode provider config |
| `~/.local/share/llama-models/` | Downloaded GGUF model files |
| `~/.local/bin/` | Launcher scripts |

## llama.cpp Server Flags

The chat server launches with these defaults:

| Flag | Value | Purpose |
| --- | --- | --- |
| `--ctx-size` | 16384 | Context window (increase to 32768 if tools misbehave) |
| `--n-gpu-layers` | 99 | Offload all layers to Metal GPU |
| `--flash-attn` | | Enable flash attention for speed |
| `--mlock` | | Lock model in RAM to prevent swapping |
| `--threads` | auto | Uses performance core count |
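Put together, the chat server launch looks roughly like this (the model path is illustrative; `--threads` is omitted because the launcher fills in the core count at runtime):

```bash
llama-server \
  -m ~/.local/share/llama-models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --flash-attn \
  --mlock
```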

## Troubleshooting

**Model loading is slow on first run:** The first inference after starting the server takes 10–30 seconds while the model loads into memory. Subsequent requests are fast.

**Running out of RAM / swapping:** The 32B Q4 model needs ~20GB. Close memory-heavy apps. You can also try the smaller `qwen2.5-coder-14b-instruct-q4_k_m.gguf` instead.

**OpenCode tools not working:** Increase `--ctx-size` to 32768 in the `llama-chat-server` script. Tool calling needs more context to work reliably.

**Slow generation speed:** Expect ~15–25 tokens/sec on the 32B model with an M4. This is normal for a model this size running locally. The 1.5B autocomplete model runs much faster.

**Server won't start:** Check whether another process is using port 8080 or 8081 with `lsof -i :8080`. Use `llama-stop` to kill stale processes.

## Performance Expectations

| Model | Speed | Use Case |
| --- | --- | --- |
| Qwen 2.5 Coder 32B | ~15–25 tok/s | Chat, code generation, refactoring |
| Qwen 2.5 Coder 1.5B | ~100+ tok/s | Autocomplete, quick suggestions |

Both models run entirely on-device using Metal acceleration. No network connection required after initial setup.

## Uninstall

```bash
# Remove models (~21GB)
rm -rf ~/.local/share/llama-models

# Remove launcher scripts
rm ~/.local/bin/{ai-code,ai-ask,ai-pipe,llama-start,llama-stop,llama-chat-server,llama-complete-server}

# Remove configs
rm -rf ~/.aider
rm -f ~/.config/opencode/opencode.json

# Remove Ollama auto-start (if set)
launchctl unload ~/Library/LaunchAgents/com.ollama.serve.plist
rm ~/Library/LaunchAgents/com.ollama.serve.plist

# Uninstall packages
pipx uninstall aider-chat
brew uninstall llama.cpp jq
```