DefiledAI Tools
INFERENCE PROFILER
Throughput, TTFT, bandwidth utilisation, and CPU offload analysis for any GPU + model combination.
Configuration
Parameters (B)
3B
7B
8B
13B
27B
70B
72B
Quantization
F16
Q8_0
Q6_K
Q5_K_M
Q4_K_M
Q3_K_M
IQ3_M
Q2_K
GPU Count
1×
2×
4×
8×
GPU
RTX 4090 (24GB · 1008GB/s)
RTX 4080 Super (16GB · 736GB/s)
RTX 4080 (16GB · 717GB/s)
RTX 4070 Ti Super (16GB · 672GB/s)
RTX 4070 Ti (12GB · 504GB/s)
RTX 4070 Super (12GB · 504GB/s)
RTX 4070 (12GB · 504GB/s)
RTX 4060 Ti 16GB (16GB · 288GB/s)
RTX 4060 Ti 8GB (8GB · 288GB/s)
RTX 3090 Ti (24GB · 1008GB/s)
RTX 3090 (24GB · 936GB/s)
RTX 3080 Ti (12GB · 912GB/s)
RTX 3080 12GB (12GB · 912GB/s)
RTX 3080 10GB (10GB · 760GB/s)
RTX 3070 Ti (8GB · 608GB/s)
RTX 3070 (8GB · 448GB/s)
RTX 3060 Ti (8GB · 448GB/s)
RTX 3060 12GB (12GB · 360GB/s)
RTX 2080 Ti (11GB · 616GB/s)
RX 7900 XTX (24GB · 960GB/s)
RX 7900 XT (20GB · 800GB/s)
RX 7800 XT (16GB · 624GB/s)
RX 6900 XT (16GB · 512GB/s)
RX 6800 XT (16GB · 512GB/s)
M3 Max 40-core (48GB · 400GB/s)
M3 Pro 18-core (36GB · 150GB/s)
M2 Ultra (192GB · 800GB/s)
2x RTX 3090 NVLink (48GB · 1872GB/s)
2x RTX 4090 (48GB · 2016GB/s)
A100 40GB (40GB · 1555GB/s)
A100 80GB (80GB · 2000GB/s)
H100 SXM (80GB · 3350GB/s)
Backend
ExLlamaV2
llama.cpp (CUDA)
Ollama
TensorRT-LLM
llama.cpp (ROCm)
llama.cpp (Metal)
Context (tokens)
512
1K
2K
4K
8K
16K
32K
64K
Batch Size
1
2
4
8
16
Profile Results
Throughput: 3 t/s
TTFT: 20.16 s
Model size: 42.0GB
KV cache (4K ctx): 0.0GB
Total VRAM needed: 42.5GB
CPU offload: 20.4GB
BW utilisation: 0%
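The memory figures above can be roughly reproduced by hand. The page does not state its formulas or which options were selected, so the sketch below is an assumption rather than the profiler's method: it treats model size as parameters times bits per weight, and CPU offload as whatever exceeds usable VRAM. The 4.8 bits per weight (an approximate figure for Q4_K_M) and the 22.1GB of usable VRAM (42.5GB needed minus 20.4GB offloaded) are inferred values, and the function names are hypothetical.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8.0


def cpu_offload_gb(total_needed_gb: float, usable_vram_gb: float) -> float:
    """Memory that does not fit in VRAM and spills to system RAM (never negative)."""
    return max(0.0, total_needed_gb - usable_vram_gb)


# Assumed inputs: a 70B model at ~4.8 bits per weight (hypothetical Q4_K_M average).
size = model_size_gb(70, 4.8)

# Assumed usable VRAM of 22.1GB, inferred from the results (42.5 - 20.4).
offload = cpu_offload_gb(42.5, 22.1)

print(f"model size ~{size:.1f}GB, CPU offload ~{offload:.1f}GB")
```

Under these assumptions the numbers line up with the reported 42.0GB model size and 20.4GB offload, which suggests, but does not confirm, a 70B model on a single 24GB card.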
Analysis
⚠ 20.4GB overflows to RAM — ~3 tok/s (CPU-limited)
✗ Too slow for interactive use
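One way to see why the offload dominates: token generation is usually limited by memory bandwidth, because each new token reads every weight once. The sketch below is a simplified model under that assumption, not the profiler's actual calculation. The 1008GB/s GPU bandwidth is the RTX 4090 entry from the list above (the selected GPU is not shown, so this is a guess), and the 60GB/s system RAM bandwidth is a hypothetical value.

```python
def decode_tokens_per_second(
    gpu_gb: float, cpu_gb: float, gpu_bw_gbs: float, cpu_bw_gbs: float
) -> float:
    """Estimate decode speed when each token reads all resident weights once.

    Time per token is the GPU portion read at GPU bandwidth plus the
    offloaded portion read at system RAM bandwidth.
    """
    if gpu_bw_gbs <= 0 or cpu_bw_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    seconds_per_token = gpu_gb / gpu_bw_gbs + cpu_gb / cpu_bw_gbs
    if seconds_per_token <= 0:
        raise ValueError("at least one memory portion must be non-zero")
    return 1.0 / seconds_per_token


# 22.1GB on the GPU and 20.4GB in RAM come from the results above.
# 1008GB/s (assumed RTX 4090) and 60GB/s (hypothetical RAM) are assumptions.
tps = decode_tokens_per_second(22.1, 20.4, gpu_bw_gbs=1008.0, cpu_bw_gbs=60.0)
print(f"estimated decode speed: {tps:.1f} tok/s")
```

In this model the 20.4GB held in RAM accounts for most of the time per token, so a faster GPU changes the estimate very little. Reducing the offload, for example with a smaller quantization or a second GPU, is what moves the number.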