ExLlamaV2 Setup: 20-30% Faster Than Ollama on NVIDIA
ExLlamaV2 is a Python inference library built specifically for NVIDIA GPUs. Its custom CUDA kernels are optimised for Llama-architecture models, and on generation throughput it typically outperforms llama.cpp and Ollama by 20-30%. If you have an NVIDIA GPU and want maximum tokens per second, this is the backend to use.
Requirements
- NVIDIA GPU (RTX 20-series or newer)
- CUDA 12.1+ installed
- Python 3.8+
- 6GB+ VRAM
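As a rough sanity check on the VRAM figure, weight memory scales with parameter count times bits per weight. The sketch below is back-of-envelope arithmetic only: the function name and example figures are illustrative assumptions, and it ignores the KV cache and runtime overhead, which add to the total.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Illustrative: an 8B-parameter model quantized to 4.0 bits per weight
    print(f"{estimate_weight_gb(8, 4.0):.1f} GB of weights")
```

Whatever the estimate comes to, leave room on top of it for context: longer max_seq_len values grow the cache.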
Install CUDA
Check if CUDA is installed:
nvcc --version
nvidia-smi
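If nvcc is missing, download the CUDA Toolkit from developer.nvidia.com; CUDA 12.1 or newer is recommended. Note that nvidia-smi ships with the GPU driver, so it can succeed even when the toolkit itself is not installed.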
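The same check can be scripted. This is a minimal sketch that assumes nvcc prints its usual "release X.Y" line; the helper names are hypothetical and the 12.1 threshold simply mirrors the recommendation above.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_toolkit_ok(minimum: Tuple[int, int] = (12, 1)) -> bool:
    """Return True if nvcc is on PATH and reports at least `minimum`."""
    if shutil.which("nvcc") is None:
        return False
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    version = parse_nvcc_release(result.stdout)
    return version is not None and version >= minimum


if __name__ == "__main__":
    print("CUDA toolkit >= 12.1:", cuda_toolkit_ok())
```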
Install ExLlamaV2
pip install exllamav2
If you get build errors, install from source:
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -e .
Verify installation:
python -c "import exllamav2; print('ExLlamaV2 installed')"
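If you would rather see which version was installed, the standard library can report it without importing the package (importing may trigger a CUDA extension build on some installs). A small sketch; the helper name is an assumption for illustration.

```python
import importlib.metadata
import importlib.util
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it cannot be found."""
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a stdlib module)
        return "unknown"


if __name__ == "__main__":
    print("exllamav2:", installed_version("exllamav2") or "not installed")
```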
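Download an EXL2 Model
ExLlamaV2 does not load GGUF files. It reads a model directory in its own EXL2 format (GPTQ and unquantized FP16 directories also load). EXL2 repos on HuggingFace usually publish each bitrate as a separate branch, so you pick a branch rather than a single file:
# Example: Llama 3.1 8B Instruct at roughly 4 bits per weight
# Search: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-exl2
# Check the repo's branch list for the available bitrates
Or use the huggingface_hub library:
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='bartowski/Meta-Llama-3.1-8B-Instruct-exl2',
    revision='4_25',  # branch name encodes bits per weight; confirm it on the repo page
    local_dir='./models/Meta-Llama-3.1-8B-Instruct-exl2'
)
"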
Basic Inference Test
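# Run from a clone of the exllamav2 repo (test_inference.py is not installed by pip)
# -m takes an EXL2 model directory, not a .gguf file
python test_inference.py \
-m ./models/Meta-Llama-3.1-8B-Instruct-exl2 \
-p "What is quantization in AI?" \
-t 200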
This outputs the response and reports tokens per second at the end.
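If the test fails while loading, a common cause is pointing -m at something other than a model directory. The sketch below is a pre-flight check that assumes the usual layout of a config.json plus one or more .safetensors weight files; the function names are hypothetical and the check is deliberately loose.

```python
import os
from typing import Iterable


def looks_like_model_dir(filenames: Iterable[str]) -> bool:
    """True if the listing has a config.json and at least one .safetensors file."""
    names = list(filenames)
    has_config = "config.json" in names
    has_weights = any(name.endswith(".safetensors") for name in names)
    return has_config and has_weights


def check_model_dir(path: str) -> bool:
    """Apply the listing check to a path on disk."""
    return os.path.isdir(path) and looks_like_model_dir(os.listdir(path))


if __name__ == "__main__":
    print(check_model_dir("./models/Meta-Llama-3.1-8B-Instruct-exl2"))
```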
Interactive Chat
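python examples/chat.py \
-m ./models/Meta-Llama-3.1-8B-Instruct-exl2 \
-mode llama3
Mode options include llama, llama3, chatml, mistral and gemma; match the mode to your model's chat format. Llama 3.1 Instruct uses llama3, while llama is the older Llama 2 chat format. Running the script with -modes lists the modes your checkout supports.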
Python API
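from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler
# Path to an EXL2 model directory (ExLlamaV2 does not load .gguf files)
MODEL_DIR = "./models/Meta-Llama-3.1-8B-Instruct-exl2"
# Load model
config = ExLlamaV2Config(MODEL_DIR)
config.max_seq_len = 8192
model = ExLlamaV2(config)
# lazy=True defers cache allocation so load_autosplit can place it as layers are loaded
cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
)
# Sampling parameters go in a Settings object, not directly into generate()
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9
# Generate
output = generator.generate(
    prompt="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nExplain VRAM<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    max_new_tokens=512,
    gen_settings=settings,
    encode_special_tokens=True,  # treat the <|...|> markers as special tokens, not literal text
    stop_conditions=[tokenizer.eos_token_id, tokenizer.single_id("<|eot_id|>")],
)
print(output)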
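The prompt string passed to generate() follows the Llama 3 chat layout: a begin-of-text marker, one header-plus-content block per message, then an open assistant header. Writing that by hand gets error-prone for multi-turn chats, so here is a small sketch of a builder. The function name is hypothetical and the layout is inferred from the prompt shown above, not taken from the library.

```python
from typing import Dict, List


def build_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages in the Llama 3 layout, ending with an open assistant turn."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Leave the assistant header open so the model writes the reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(build_llama3_prompt([{"role": "user", "content": "Explain VRAM"}]))
```

The returned string can be passed as the prompt argument in place of the hand-written one.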
Multi-GPU Setup
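For a dual-GPU machine (NVLink or plain PCIe), split the model across both cards:
# -m takes an EXL2 model directory, not a .gguf file
python test_inference.py \
-m ./models/Meta-Llama-3.1-70B-exl2 \
-gs 24,24 # 24GB per GPU
The -gs flag specifies the VRAM allocation per GPU in GB, in device order. Use -gs auto to let the loader work out the split itself.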
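When the GPUs also drive a display or run other processes, it can help to hand ExLlamaV2 slightly less than each card's full capacity. The sketch below builds a -gs value from per-GPU VRAM minus a fixed headroom. The helper and the headroom idea are assumptions for illustration, not part of ExLlamaV2.

```python
from typing import Sequence


def gpu_split_arg(vram_gb: Sequence[float], headroom_gb: float = 0.0) -> str:
    """Build a comma-separated -gs value, reserving `headroom_gb` on each GPU."""
    allocations = []
    for capacity in vram_gb:
        usable = capacity - headroom_gb
        if usable <= 0:
            raise ValueError(f"headroom {headroom_gb} GB leaves nothing on a {capacity} GB GPU")
        allocations.append(f"{usable:g}")  # :g drops a trailing .0
    return ",".join(allocations)


if __name__ == "__main__":
    # Two 24 GB cards, keeping 2 GB free on each
    print("-gs", gpu_split_arg([24, 24], headroom_gb=2))
```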
Benchmarking vs Ollama
Run this to get a direct comparison:
# ExLlamaV2 benchmark (tokens per second is printed after the response)
# -m takes an EXL2 model directory, not a .gguf file
python test_inference.py \
-m ./models/Meta-Llama-3.1-8B-Instruct-exl2 \
-p "Write a detailed explanation of how transformers work" \
-t 300
# Ollama benchmark (in a separate terminal with Ollama running)
# --verbose prints the eval rate in tokens/s; timing the whole command would also count model load
ollama run --verbose llama3.1:8b "Write a detailed explanation of how transformers work"
Expected results on RTX 4090:
- ExLlamaV2: ~128 tok/s
- Ollama: ~98 tok/s
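To turn two throughput readings into a relative speedup, divide one by the other. The sketch below uses the figures quoted above as sample inputs; the function names are illustrative. With 128 and 98 tok/s the gain works out to about 30.6%, at the top of the 20-30% range.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput from a token count and the elapsed generation time."""
    return tokens / seconds


def speedup_percent(candidate_tps: float, baseline_tps: float) -> float:
    """How much faster the candidate is than the baseline, in percent."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


if __name__ == "__main__":
    # Figures quoted in the text for an RTX 4090
    print(f"{speedup_percent(128, 98):.1f}% faster")
```

Run each benchmark a few times and compare averages, since single runs vary with prompt caching and GPU clocks.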
TabbyAPI — ExLlamaV2 with OpenAI API
If you want ExLlamaV2 speed with an OpenAI-compatible API server:
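git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
# Configure
cp config_sample.yml config.yml
# Edit config.yml: set model_dir (the parent folder) and model_name (the EXL2 model folder inside it)
# Install dependencies and start the server. Recent releases provide a start script
# (start.bat on Windows); older checkouts used pip install -r requirements.txt and python main.py
./start.sh
# API available at http://localhost:5000/v1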
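Then use it with the OpenAI SDK the same way you would with Ollama. Unless authentication is disabled in config.yml, TabbyAPI checks requests against the api_key it writes to api_tokens.yml on first launch, so use that value:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")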
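For a quick smoke test without the SDK, the same endpoint can be called with the standard library. This is a sketch under assumptions: it targets the OpenAI-style /chat/completions route, sends the key as a Bearer token the way the OpenAI SDK does, and uses a placeholder model name ("my-model") that you should replace with the model your server has loaded.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       user_message: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # "my-model" is a placeholder; the key must be one your server accepts
    req = build_chat_request("http://localhost:5000/v1", "tabby", "my-model", "Explain VRAM")
    print(send(req))
```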
When to Use ExLlamaV2 vs Ollama
| Use case | Recommended |
|---|---|
| Maximum throughput | ExLlamaV2 |
| Easy setup, just works | Ollama |
| AMD GPU | Ollama (llama.cpp ROCm) |
| Apple Silicon | Ollama (Metal) |
| OpenAI-compatible API | Ollama or TabbyAPI |
| Production serving | TabbyAPI or TensorRT-LLM |
Next Steps
- llama.cpp Server Mode — OpenAI API without TabbyAPI
- Inference Profiler — compare configs before committing
- Speed Estimator — predict tok/s for your GPU