# Local AI Coding Environment
A fully offline, privacy-first AI coding setup for macOS Apple Silicon. Uses llama.cpp to run Qwen 2.5 Coder models locally, with Aider and OpenCode as terminal-based coding agents — no API keys, no cloud, no costs.
## Hardware Requirements
- Mac with Apple Silicon (M1/M2/M3/M4)
- 32GB RAM recommended (the 32B model uses ~20GB)
- ~25GB free disk space for models
## What Gets Installed
| Component | Purpose |
|---|---|
| llama.cpp | Local model inference with Metal GPU acceleration |
| Qwen 2.5 Coder 32B (Q4_K_M) | Main chat/coding model (~20GB) |
| Qwen 2.5 Coder 1.5B (Q4_K_M) | Fast autocomplete model (~1.2GB) |
| Aider | Terminal coding agent (Claude Code alternative) |
| jq | JSON processing for the pipe command |
Both models are served via llama.cpp's built-in OpenAI-compatible API, making them work with any tool that supports the OpenAI API format.
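As a quick sanity check once the servers are running, you can hit the chat endpoint directly with curl. This is only an illustration of the OpenAI-compatible request shape; since llama-server loads a single model, the `model` field can typically be omitted:

```sh
# Ask the local chat server a question and print just the reply text.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a one-line Python hello world."}]}' \
  | jq -r '.choices[0].message.content'
```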
## Installation

```sh
chmod +x setup.sh
./setup.sh
```
The script is idempotent — safe to run multiple times. The first run downloads ~21GB of model weights from HuggingFace.
After installation, restart your shell or run:
```sh
source ~/.zshrc
```
## Commands

### llama-start
Starts both llama.cpp servers in the foreground. Press Ctrl+C to stop both.
- Chat model (32B): http://127.0.0.1:8080
- Autocomplete model (1.5B): http://127.0.0.1:8081
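To confirm both servers are up without sending a full prompt, you can poll llama-server's health endpoint (a simple GET; it returns an error status while the model is still loading):

```sh
curl -s http://127.0.0.1:8080/health   # chat model (32B)
curl -s http://127.0.0.1:8081/health   # autocomplete model (1.5B)
```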
### llama-stop
Kills all running llama-server processes.
### ai-code [directory]
The main coding agent. Auto-starts the chat server if it's not running. Initializes a git repo if one doesn't exist, then launches Aider with full file-editing capabilities.
```sh
cd ~/projects/my-app
ai-code .

# or from anywhere
ai-code ~/projects/my-app
```
Inside Aider you can ask it to edit files, run commands, and refactor code across your project. Changes are auto-committed to git so you can always roll back.
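Conceptually, ai-code is a thin wrapper around Aider pointed at the local server. A minimal sketch of the idea (the model alias and exact flags are illustrative, not the installed script, and server auto-start is omitted):

```sh
#!/bin/sh
# Sketch: enter the target project, make sure it is a git repo,
# then launch Aider against the local OpenAI-compatible endpoint.
cd "${1:-.}"
git rev-parse --git-dir >/dev/null 2>&1 || git init -q

OPENAI_API_BASE=http://127.0.0.1:8080/v1 \
OPENAI_API_KEY=not-needed \
aider --model openai/qwen2.5-coder-32b
```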
### ai-ask "question"
Quick Q&A mode — no file editing, just chat. Useful for coding questions without modifying your project.
ai-ask "how do I handle errors in rust"
### ai-pipe "prompt"
Pipe code through the model via stdin. Useful for one-shot transforms in scripts.
```sh
cat main.py | ai-pipe "add type hints"
git diff | ai-pipe "write a commit message"
cat api.go | ai-pipe "find bugs"
```
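This is where jq earns its place in the install list: an ai-pipe-style script just wraps stdin and the prompt into a chat-completions request and extracts the reply. A rough sketch (the installed script may differ):

```sh
#!/bin/sh
# ai-pipe sketch: $1 is the instruction, stdin is the code to transform.
prompt="$1"
input=$(cat)

jq -n --arg p "$prompt" --arg code "$input" \
  '{messages: [{role: "user", content: ($p + "\n\n" + $code)}], temperature: 0.2}' \
| curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" -d @- \
| jq -r '.choices[0].message.content'
```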
## Using OpenCode Instead of Aider
OpenCode is another terminal coding agent with a polished TUI. It connects to the same llama.cpp backend.
### Install OpenCode

```sh
brew install anomalyco/tap/opencode
```
### Configure
Create ~/.config/opencode/opencode.json:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "llama-cpp/qwen2.5-coder-32b",
  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "not-needed"
      },
      "models": {
        "qwen2.5-coder-32b": {
          "name": "Qwen 2.5 Coder 32B",
          "tools": true
        }
      }
    }
  }
}
```
### Run

```sh
llama-start   # start the server (or ai-code auto-starts it)
opencode      # launch OpenCode in your project directory
```
Use /models inside OpenCode to select the Qwen model, and Tab to switch between Plan and Build modes.
## Configuration Files

| File | Purpose |
|---|---|
| ~/.aider/aider.conf.yml | Aider settings (model, git, UI) |
| ~/.aider/.env | API base URL and key for Aider |
| ~/.config/opencode/opencode.json | OpenCode provider config |
| ~/.local/share/llama-models/ | Downloaded GGUF model files |
| ~/.local/bin/ | Launcher scripts |
## llama.cpp Server Flags
The chat server launches with these defaults:
| Flag | Value | Purpose |
|---|---|---|
| --ctx-size | 16384 | Context window (increase to 32768 if tools misbehave) |
| --n-gpu-layers | 99 | Offload all layers to Metal GPU |
| --flash-attn | — | Enable flash attention for speed |
| --mlock | — | Lock model in RAM to prevent swapping |
| --threads | auto | Uses performance core count |
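Put together, the chat server launch looks roughly like the following. The model filename and port here are illustrative, inferred from the defaults above; the actual llama-chat-server script may differ:

```sh
# Approximate command wrapped by llama-chat-server (sketch, not the installed script).
llama-server \
  -m ~/.local/share/llama-models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --flash-attn \
  --mlock
```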
## Troubleshooting
Model loading is slow on first run: The first inference after starting the server takes 10–30 seconds while the model loads into memory. Subsequent requests are fast.
Running out of RAM / swapping: The 32B Q4 model needs ~20GB. Close memory-heavy apps. You can also try the smaller qwen2.5-coder-14b-instruct-q4_k_m.gguf instead.
OpenCode tools not working: Increase --ctx-size to 32768 in the llama-chat-server script. Tool-calling needs more context to work reliably.
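One way to make that change, assuming the flag appears literally in the launcher script at ~/.local/bin/llama-chat-server:

```sh
# Bump the context window, then restart the servers so the change takes effect.
sed -i '' 's/--ctx-size 16384/--ctx-size 32768/' ~/.local/bin/llama-chat-server
llama-stop && llama-start
```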
Slow generation speed: Expect ~15–25 tokens/sec from the 32B model on an M4. This is normal for a model of this size running locally. The 1.5B autocomplete model is much faster.
Server won't start: Check if another process is using port 8080 or 8081 with lsof -i :8080. Use llama-stop to kill stale processes.
## Performance Expectations
| Model | Speed | Use Case |
|---|---|---|
| Qwen 2.5 Coder 32B | ~15–25 tok/s | Chat, code generation, refactoring |
| Qwen 2.5 Coder 1.5B | ~100+ tok/s | Autocomplete, quick suggestions |
Both models run entirely on-device using Metal acceleration. No network connection required after initial setup.
## Uninstall

```sh
# Remove models (~21GB)
rm -rf ~/.local/share/llama-models

# Remove launcher scripts
rm ~/.local/bin/{ai-code,ai-ask,ai-pipe,llama-start,llama-stop,llama-chat-server,llama-complete-server}

# Remove configs
rm -rf ~/.aider
rm -f ~/.config/opencode/opencode.json

# Remove Ollama auto-start (if set)
launchctl unload ~/Library/LaunchAgents/com.ollama.serve.plist
rm ~/Library/LaunchAgents/com.ollama.serve.plist

# Uninstall packages
pipx uninstall aider-chat
brew uninstall llama.cpp jq
```