# Local AI Coding Environment
A fully offline, privacy-first AI coding setup for macOS Apple Silicon. Uses llama.cpp to run Qwen 2.5 Coder models locally, with Aider and OpenCode as terminal-based coding agents — no API keys, no cloud, no costs.
## Hardware Requirements
- Mac with Apple Silicon (M1/M2/M3/M4)
- 32GB RAM recommended (the 32B model uses ~20GB)
- ~25GB free disk space for models
## What Gets Installed
| Component | Purpose |
|---|---|
| llama.cpp | Local model inference with Metal GPU acceleration |
| Qwen 2.5 Coder 32B (Q4_K_M) | Main chat/coding model (~20GB) |
| Qwen 2.5 Coder 1.5B (Q4_K_M) | Fast autocomplete model (~1.2GB) |
| Aider | Terminal coding agent (Claude Code alternative) |
| jq | JSON processing for the pipe command |
Both models are served via llama.cpp's built-in OpenAI-compatible API, making them work with any tool that supports the OpenAI API format.
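As a quick sanity check once the servers are running, you can hit the chat endpoint directly with curl. This is only an illustration of the OpenAI-compatible request shape; since llama-server loads a single model, the `model` field can typically be omitted:

```sh
# Ask the local chat server a question and print just the reply text.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a one-line Python hello world."}]}' \
  | jq -r '.choices[0].message.content'
```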
## Installation

```sh
chmod +x setup.sh
./setup.sh
```
The script is idempotent — safe to run multiple times. The first run downloads ~21GB of model weights from HuggingFace.
After installation, restart your shell or run:
```sh
source ~/.zshrc
```
## Commands

### llama-start
Starts both llama.cpp servers in the foreground. Press Ctrl+C to stop both.
- Chat model (32B): http://127.0.0.1:8080
- Autocomplete model (1.5B): http://127.0.0.1:8081
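To confirm both servers are up without sending a full prompt, you can poll llama-server's health endpoint (a simple GET; it returns an error status while the model is still loading):

```sh
curl -s http://127.0.0.1:8080/health   # chat model (32B)
curl -s http://127.0.0.1:8081/health   # autocomplete model (1.5B)
```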
### llama-stop
Kills all running llama-server processes.
### ai-code [directory]
The main coding agent. Auto-starts the chat server if it's not running. Initializes a git repo if one doesn't exist, then launches Aider with full file-editing capabilities.
```sh
cd ~/projects/my-app
ai-code .

# or from anywhere
ai-code ~/projects/my-app
```
Inside Aider you can ask it to edit files, run commands, and refactor code across your project. Changes are auto-committed to git so you can always roll back.
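Conceptually, ai-code is a thin wrapper around Aider pointed at the local server. A minimal sketch of the idea (the model alias and exact flags are illustrative, not the installed script, and server auto-start is omitted):

```sh
#!/bin/sh
# Sketch: enter the target project, make sure it is a git repo,
# then launch Aider against the local OpenAI-compatible endpoint.
cd "${1:-.}"
git rev-parse --git-dir >/dev/null 2>&1 || git init -q

OPENAI_API_BASE=http://127.0.0.1:8080/v1 \
OPENAI_API_KEY=not-needed \
aider --model openai/qwen2.5-coder-32b
```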
### ai-ask "question"
Quick Q&A mode — no file editing, just chat. Useful for coding questions without modifying your project.
ai-ask "how do I handle errors in rust"
### ai-pipe "prompt"
Pipe code through the model via stdin. Useful for one-shot transforms in scripts.
```sh
cat main.py | ai-pipe "add type hints"
git diff | ai-pipe "write a commit message"
cat api.go | ai-pipe "find bugs"
```
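This is where jq earns its place in the install list: an ai-pipe-style script just wraps stdin and the prompt into a chat-completions request and extracts the reply. A rough sketch (the installed script may differ):

```sh
#!/bin/sh
# ai-pipe sketch: $1 is the instruction, stdin is the code to transform.
prompt="$1"
input=$(cat)

jq -n --arg p "$prompt" --arg code "$input" \
  '{messages: [{role: "user", content: ($p + "\n\n" + $code)}], temperature: 0.2}' \
| curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" -d @- \
| jq -r '.choices[0].message.content'
```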
## Using OpenCode Instead of Aider
OpenCode is another terminal coding agent with a polished TUI. It connects to the same llama.cpp backend.
### Install OpenCode

```sh
brew install anomalyco/tap/opencode
```
### Configure
Create ~/.config/opencode/opencode.json:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "llama-cpp/qwen2.5-coder-32b",
  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "not-needed"
      },
      "models": {
        "qwen2.5-coder-32b": {
          "name": "Qwen 2.5 Coder 32B",
          "tools": true
        }
      }
    }
  }
}
```
### Run

```sh
llama-start   # start the server (or ai-code auto-starts it)
opencode      # launch OpenCode in your project directory
```
Use /models inside OpenCode to select the Qwen model, and Tab to switch between Plan and Build modes.
## Configuration Files

| File | Purpose |
|---|---|
| ~/.aider/aider.conf.yml | Aider settings (model, git, UI) |
| ~/.aider/.env | API base URL and key for Aider |
| ~/.config/opencode/opencode.json | OpenCode provider config |
| ~/.local/share/llama-models/ | Downloaded GGUF model files |
| ~/.local/bin/ | Launcher scripts |
## llama.cpp Server Flags
The chat server launches with these defaults:
| Flag | Value | Purpose |
|---|---|---|
| --ctx-size | 16384 | Context window (increase to 32768 if tools misbehave) |
| --n-gpu-layers | 99 | Offload all layers to Metal GPU |
| --flash-attn | — | Enable flash attention for speed |
| --mlock | — | Lock model in RAM to prevent swapping |
| --threads | auto | Uses performance core count |
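Put together, the chat server launch looks roughly like the following. The model filename and port here are illustrative, inferred from the defaults above; the actual llama-chat-server script may differ:

```sh
# Approximate command wrapped by llama-chat-server (sketch, not the installed script).
llama-server \
  -m ~/.local/share/llama-models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --flash-attn \
  --mlock
```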
## Troubleshooting
Model loading is slow on first run: The first inference after starting the server takes 10–30 seconds while the model loads into memory. Subsequent requests are fast.
Running out of RAM / swapping: The 32B Q4 model needs ~20GB. Close memory-heavy apps. You can also try the smaller qwen2.5-coder-14b-instruct-q4_k_m.gguf instead.
OpenCode tools not working: Increase --ctx-size to 32768 in the llama-chat-server script. Tool-calling needs more context to work reliably.
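One way to make that change, assuming the flag appears literally in the launcher script at ~/.local/bin/llama-chat-server:

```sh
# Bump the context window, then restart the servers so the change takes effect.
sed -i '' 's/--ctx-size 16384/--ctx-size 32768/' ~/.local/bin/llama-chat-server
llama-stop && llama-start
```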
Slow generation speed: Expect ~15–25 tokens/sec from the 32B model on an M4. This is normal for a model of this size running locally. The 1.5B autocomplete model is much faster.
Server won't start: Check if another process is using port 8080 or 8081 with lsof -i :8080. Use llama-stop to kill stale processes.
## Performance Expectations
| Model | Speed | Use Case |
|---|---|---|
| Qwen 2.5 Coder 32B | ~15–25 tok/s | Chat, code generation, refactoring |
| Qwen 2.5 Coder 1.5B | ~100+ tok/s | Autocomplete, quick suggestions |
Both models run entirely on-device using Metal acceleration. No network connection required after initial setup.
## Uninstall

```sh
# Remove models (~21GB)
rm -rf ~/.local/share/llama-models

# Remove launcher scripts
rm ~/.local/bin/{ai-code,ai-ask,ai-pipe,llama-start,llama-stop,llama-chat-server,llama-complete-server}

# Remove configs
rm -rf ~/.aider
rm -f ~/.config/opencode/opencode.json

# Remove Ollama auto-start (if set)
launchctl unload ~/Library/LaunchAgents/com.ollama.serve.plist
rm ~/Library/LaunchAgents/com.ollama.serve.plist

# Uninstall packages
pipx uninstall aider-chat
brew uninstall llama.cpp jq
```