llama.cpp Server Mode: Local OpenAI-Compatible API
llama.cpp includes a built-in HTTP server that exposes an OpenAI-compatible API. Unlike Ollama, it gives you fine-grained control over inference parameters through command-line flags and request fields. Unlike ExLlamaV2, it runs on NVIDIA, AMD, Apple Silicon, and CPU: you pick the backend at build time, and the model files and server commands stay the same.
Build llama.cpp
Linux (CUDA):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
Linux (ROCm/AMD):
# Recent llama.cpp releases use GGML_HIP; older ones used GGML_HIPBLAS
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)
macOS (Metal/Apple Silicon):
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
Windows (CUDA):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Or download pre-built binaries from the llama.cpp GitHub releases page — look for llama-*-bin-win-cuda-*.zip for Windows with CUDA.
Download a Model
# Download from HuggingFace using huggingface-cli
pip install huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--local-dir ./models
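The same download can be scripted from Python. The sketch below assumes the `huggingface_hub` package installed above and uses its `hf_hub_download` function. The helper names `download_model` and `expected_model_path` are illustrative and not part of any library.

```python
from pathlib import Path

# Repository and file name taken from the CLI command above.
REPO_ID = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"
FILENAME = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"


def expected_model_path(local_dir="./models", filename=FILENAME):
    """Path where the file should land when local_dir is used."""
    return Path(local_dir) / filename


def download_model(local_dir="./models", repo_id=REPO_ID, filename=FILENAME):
    """Download a single GGUF file and return its local path as a string."""
    # Imported lazily so this module still loads without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

Calling `download_model()` should fetch the file into `./models`, so the path passed to `-m` in the next section matches `expected_model_path()`.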
Start the Server
./build/bin/llama-server \
-m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--port 8080 \
-ngl 99 \
--ctx-size 8192 \
--parallel 1
Key flags:
| Flag | Value | Effect |
|---|---|---|
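| -m | path to .gguf | Model file to load |
| -ngl | 99 | Number of layers offloaded to the GPU (a value above the model's layer count offloads all of them) |
| --port | 8080 | Port the server listens on |
| --ctx-size | 8192 | Context window, in tokens |
| --parallel | 1 | Number of requests processed concurrently |
| --host | 0.0.0.0 | Listen on all interfaces to allow network access (the default, 127.0.0.1, is local only) |
| --api-key | secret | Optional key that clients must send as a Bearer token |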
Once the model has loaded, the server logs the address it is listening on, by default http://127.0.0.1:8080. The exact wording of the log line varies between versions.
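If you launch the server from a script, the flags in the table map directly onto an argument list. This is a minimal sketch that assumes the binary path used above. `build_server_command` and `start_server` are hypothetical helper names, and only flags from the table are used.

```python
import subprocess


def build_server_command(model_path, port=8080, n_gpu_layers=99, ctx_size=8192,
                         parallel=1, host=None, api_key=None,
                         binary="./build/bin/llama-server"):
    """Return the llama-server argument list for the given settings."""
    cmd = [
        binary,
        "-m", str(model_path),
        "--port", str(port),
        "-ngl", str(n_gpu_layers),
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel),
    ]
    # Optional flags are added only when a value is supplied.
    if host is not None:
        cmd += ["--host", host]
    if api_key is not None:
        cmd += ["--api-key", api_key]
    return cmd


def start_server(model_path, **settings):
    """Start llama-server as a child process and return the Popen handle."""
    return subprocess.Popen(build_server_command(model_path, **settings))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.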
Test the API
# Health check
curl http://localhost:8080/health
# OpenAI-compatible chat
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "What is Q4_K_M?"}]
}'
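Large models can take a while to load, so a script that sends requests right after startup may fail. The sketch below polls the `/health` endpoint shown above, on the assumption that it returns HTTP 200 once the model is ready and a different status (or no response) before that. The function names and the retry settings are illustrative.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. still loading).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: the server is not up yet.
        return None


def wait_until_healthy(url="http://localhost:8080/health", attempts=30,
                       delay=1.0, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200; return False if all attempts fail."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.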
Python with OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="none", # No auth by default
)
response = client.chat.completions.create(
model="llama3.1", # Model name is ignored, uses whatever is loaded
messages=[
{"role": "system", "content": "You are a local AI expert."},
{"role": "user", "content": "Compare ExLlamaV2 vs llama.cpp"}
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
Streaming:
stream = client.chat.completions.create(
model="llama3.1",
messages=[{"role": "user", "content": "Write a haiku about VRAM"}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Multi-GPU
# --tensor-split 1,1 divides the model equally across 2 GPUs
./build/bin/llama-server \
-m ./models/Meta-Llama-3.1-70B-Q4_K_M.gguf \
-ngl 99 \
--tensor-split 1,1 \
--ctx-size 4096 \
--port 8080
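The values passed to `--tensor-split` are proportions, so `1,1` gives each GPU half of the model. For GPUs with different amounts of memory, one plausible heuristic is to split in proportion to VRAM. That heuristic is an assumption made here, not a recommendation from llama.cpp, and `tensor_split_arg` is a hypothetical helper.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_gib):
    """Turn whole-number VRAM sizes (GiB) into a reduced --tensor-split value."""
    if not vram_gib or any(v <= 0 for v in vram_gib):
        raise ValueError("expected at least one positive VRAM size")
    # Divide by the greatest common divisor so 24,24 becomes 1,1.
    divisor = reduce(gcd, vram_gib)
    return ",".join(str(v // divisor) for v in vram_gib)


print(tensor_split_arg([24, 24]))  # two equal GPUs
print(tensor_split_arg([24, 12]))  # first GPU has twice the memory
```

In practice you may want to give the first GPU a slightly smaller share, since it can also hold other buffers. Treat the result as a starting point and adjust after checking actual memory use.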
Running as a Background Service
Linux (systemd):
sudo nano /etc/systemd/system/llamacpp.service
[Unit]
Description=llama.cpp Server
After=network.target
[Service]
Type=simple
User=YOUR_USERNAME
ExecStart=/path/to/llama.cpp/build/bin/llama-server \
-m /path/to/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-ngl 99 \
--ctx-size 8192 \
--port 8080
Restart=on-failure
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable llamacpp
sudo systemctl start llamacpp
Windows (run on startup):
Create a batch file and add it to the Startup folder (shell:startup):
@echo off
cd /d C:\path\to\llama.cpp
build\bin\llama-server.exe -m models\model.gguf -ngl 99 --port 8080
llama.cpp vs Ollama Server Comparison
| Feature | llama.cpp server | Ollama |
|---|---|---|
| Setup complexity | Medium | Easy |
| Parameter control | Full | Limited |
| AMD/Apple support | Excellent | Good |
| Model switching | Manual restart | Automatic |
| OpenAI compatibility | Yes | Yes |
| Speed (NVIDIA) | Good | Similar |
| Background service | Manual setup | Automatic |
Next Steps
- ExLlamaV2 Setup — faster on NVIDIA
- Build a Local MoE Pipeline — multi-model orchestration using the API
- Ollama API Guide — if you prefer Ollama's simplicity