Downloading and Running Your First Model
You have Ollama installed. Now you need to pick the right model for your hardware. This guide covers how to choose, download, and get the most out of your first local model.
The Only Number That Matters: VRAM
Your GPU's VRAM determines which models you can run at full speed. The model file, plus some headroom for context, must fit in VRAM. If it doesn't, Ollama offloads layers to system RAM and the CPU, and inference often becomes 10–50× slower.
Check your VRAM:
# NVIDIA
nvidia-smi --query-gpu=memory.total --format=csv
# AMD
rocm-smi --showmeminfo vram
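# macOS (Intel with a discrete GPU)
system_profiler SPDisplaysDataType | grep VRAM
# macOS (Apple Silicon): memory is unified, so check total RAM instead
sysctl -n hw.memsize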
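Once you have the VRAM figure, comparing it against a model's file size gives a rough idea of whether offloading is likely. The sketch below is only an approximation: the helper name `fits_in_vram` and the 1024 MB reserve for context and runtime buffers are assumptions made for illustration, not values taken from Ollama.

```shell
# Rough fit check: does a model file plus a safety reserve fit in VRAM?
# fits_in_vram is a hypothetical helper; the 1024 MB default reserve is an
# assumed allowance for context (KV cache) and runtime buffers.
# Arguments: MODEL_MB VRAM_MB [RESERVE_MB]
fits_in_vram() {
    model_mb=$1
    vram_mb=$2
    reserve_mb=${3:-1024}
    if [ $((model_mb + reserve_mb)) -le "$vram_mb" ]; then
        echo "fits"
    else
        echo "offload"
    fi
}

# Example: a ~4.7GB model on a 6GB card (sizes in MB)
fits_in_vram 4700 6144
```

On NVIDIA, `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` prints the total in MiB as a bare number, which can be passed as the second argument.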
Choosing Your Model
| Your VRAM | Best First Model | Quality Level |
|---|---|---|
| 4GB | Phi-3 Mini 3.8B | Good for simple tasks |
| 6GB | Llama 3.1 8B Q4_K_M | Strong all-rounder |
| 8GB | Mistral 7B Q8_0 | Near-lossless 7B |
| 10–12GB | Gemma 2 9B Q8_0 | Excellent reasoning |
| 16GB | Gemma 2 27B Q4_K_M | Strong capability (the file is ~17GB, so expect some CPU offload) |
| 24GB | Gemma 2 27B Q6_K | Close to lossless 27B (Q8_0 is roughly 29GB and does not fit) |
| 48GB | Llama 3.1 70B Q4_K_M | Frontier-class local AI |
Recommendation for most people: Start with llama3.1:8b if you have 6GB+ VRAM. It is widely used, well covered by community guides, and performs well on coding, writing, and general tasks.
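The table above can be read as a simple threshold lookup. The following is a minimal sketch, assuming whole gigabytes of VRAM as input; `suggest_model` is a hypothetical helper, and it prints each family's default Ollama tag rather than the specific quantization listed in the table.

```shell
# Map whole gigabytes of VRAM to the model family from the table above.
# suggest_model is a hypothetical helper; it prints the family's default
# Ollama tag and ignores the quantization column.
suggest_model() {
    vram_gb=$1
    if   [ "$vram_gb" -ge 48 ]; then echo "llama3.1:70b"
    elif [ "$vram_gb" -ge 16 ]; then echo "gemma2:27b"
    elif [ "$vram_gb" -ge 10 ]; then echo "gemma2:9b"
    elif [ "$vram_gb" -ge 8  ]; then echo "mistral:7b"
    elif [ "$vram_gb" -ge 6  ]; then echo "llama3.1:8b"
    elif [ "$vram_gb" -ge 4  ]; then echo "phi3:mini"
    else echo "none"
    fi
}

# Example: a 12GB card
suggest_model 12
```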
Downloading Your First Model
# Download without starting a chat (progress is shown in the terminal)
ollama pull llama3.1:8b
# Pull and run immediately
ollama run llama3.1:8b
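A tag without a quantization suffix, such as llama3.1:8b, resolves to one fixed default build (Q4_K_M for this model). Ollama does not pick a quantization based on your VRAM, but the default is a sensible choice, so you don't need to specify a quant manually for your first model.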
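To confirm which quantization a downloaded tag actually uses, you can read it from `ollama show`. This is a sketch under an assumption about the output layout: `quant_of` is a hypothetical helper that looks for a line whose first field is `quantization`, and that layout may differ between Ollama versions.

```shell
# Print the quantization reported by `ollama show` for a downloaded model.
# quant_of is a hypothetical helper; it assumes the output contains a line
# whose first field is "quantization", which may vary between versions.
quant_of() {
    ollama show "$1" | awk '$1 == "quantization" { print $2 }'
}

# Usage (the model must already be downloaded):
#   quant_of llama3.1:8b
```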
Model sizes to expect:
| Model | Download Size |
|---|---|
| Phi-3 Mini 3.8B | ~2.3GB |
| Llama 3.1 8B | ~4.7GB |
| Mistral 7B | ~4.1GB |
| Gemma 2 9B | ~5.4GB |
| Gemma 2 27B | ~17GB |
| Llama 3.1 70B | ~40GB |
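These sizes follow roughly from parameter count times bits per weight. The sketch below shows that arithmetic; `estimate_gb` is a hypothetical helper, and real files differ somewhat because some tensors are stored at other precisions.

```shell
# Approximate file size in GB: parameters (billions) x bits per weight / 8.
# estimate_gb is a hypothetical helper; treat the result as a ballpark only.
# Arguments: PARAMS_IN_BILLIONS BITS_PER_WEIGHT
estimate_gb() {
    awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

estimate_gb 8 4.5    # 8B at Q4_0 (4.5 bits per weight)
estimate_gb 8 8.5    # 8B at Q8_0 (8.5 bits per weight)
```

Q4_K_M averages slightly more bits per weight than Q4_0, which is why the default 8B download is a little larger than the first estimate.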
Running Your First Conversation
After ollama run llama3.1:8b, you'll see a >>> prompt. Type your message and press Enter.
>>> What is quantization in the context of AI models?
To exit: type /bye or press Ctrl+D.
Useful slash commands during a session:
/help Show all commands
/clear Clear conversation history
/show info Show model details
/set verbose Show timing and token stats
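You can also skip the interactive session and pass a prompt directly on the command line. A minimal sketch, where `ask_once` is a hypothetical wrapper around `ollama run` and `--verbose` prints the same timing and token stats as /set verbose:

```shell
# Send a single prompt without entering the interactive session.
# ask_once is a hypothetical wrapper; the first argument is the model tag
# and the remaining arguments are joined into one prompt string.
ask_once() {
    model=$1
    shift
    ollama run --verbose "$model" "$*"
}

# Usage:
#   ask_once llama3.1:8b What is quantization in the context of AI models?
```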
Checking Performance
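After your first response, check where the model is loaded. Note that ollama ps does not report speed; for tokens per second, use /set verbose in the session and read the eval rate.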
# In a second terminal while model is running
ollama ps
This shows:
- Model name
- Size in memory
- Processor (for example 100% GPU, 100% CPU, or a CPU/GPU split)
- Time until unload
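The Processor column is the one to watch. The sketch below interprets the values it usually shows; `describe_processor` is a hypothetical helper, and the exact strings are an assumption based on typical `ollama ps` output.

```shell
# Interpret the PROCESSOR column of `ollama ps`.
# describe_processor is a hypothetical helper; it assumes values such as
# "100% GPU", "100% CPU", or a split like "48%/52% CPU/GPU".
describe_processor() {
    case "$1" in
        "100% GPU") echo "fully on GPU" ;;
        "100% CPU") echo "fully on CPU" ;;
        *CPU/GPU*)  echo "split between CPU and GPU" ;;
        *)          echo "unknown" ;;
    esac
}

describe_processor "48%/52% CPU/GPU"
```

A split means part of the model was offloaded to system RAM, which is the slow case described earlier.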
Target speeds (rough figures that vary with the GPU and quantization):
- 7–8B model on 6–8GB GPU: 60–130 tok/s, which feels instant
- 27B model on 24GB GPU: 40–60 tok/s, comfortably usable
- 70B model on 48GB: 18–25 tok/s, slower but very capable
If you're seeing under 5 tok/s, the model is probably running on CPU. Check the troubleshooting section in the Ollama install guide.
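If you are reading raw numbers instead of the verbose eval rate, tokens per second is just tokens generated divided by generation time. The sketch assumes the `eval_count` (tokens) and `eval_duration` (nanoseconds) fields that the Ollama API reports in its final response; `tok_per_sec` is a hypothetical helper.

```shell
# Compute tokens per second from a token count and a duration in nanoseconds.
# tok_per_sec is a hypothetical helper; the inputs are assumed to be the
# eval_count and eval_duration values from an Ollama API response.
tok_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) { print "n/a"; exit }
        printf "%.1f\n", n * 1000000000 / ns
    }'
}

# Example: 120 tokens generated in 2 seconds
tok_per_sec 120 2000000000
```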
Specific Quants for More Control
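Ollama's default quant is a good balance, but you can request a specific one through the tag. The quant is part of the tag name after a hyphen, not a second colon:
# Near-lossless quality (about 8.5GB for 8B, so it needs more than 8GB VRAM)
ollama run llama3.1:8b-instruct-q8_0
# Smaller/faster (about 4.7GB, needs ~5GB VRAM)
ollama run llama3.1:8b-instruct-q4_0
# Default (Q4_K_M, the best balance)
ollama run llama3.1:8b
# Exact tag names are listed on each model's Tags page at ollama.com/library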
Use the Quant Picker tool if you're unsure which format to use.
Trying Different Models
Once you're comfortable, try other models:
# Great for coding
ollama run qwen2.5-coder:7b
# Strong reasoning
ollama run deepseek-r1:7b
# Fast and capable
ollama run mistral:7b
# Uncensored assistant
ollama run dolphin-mistral:7b
See what else is available:
ollama list # What you've downloaded
# Browse more at: ollama.com/library
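If you want to fetch several of these in one go without re-downloading what you already have, a small loop over `ollama list` works. This is a sketch: `pull_missing` is a hypothetical helper, and it assumes the model name is the first column of `ollama list` and that you pass full tags (name:tag).

```shell
# Pull each listed model only if `ollama list` does not already show it.
# pull_missing is a hypothetical helper; it matches full tags exactly
# against the first column of `ollama list`.
pull_missing() {
    for model in "$@"; do
        if ollama list | awk 'NR > 1 { print $1 }' | grep -Fxq "$model"; then
            echo "have $model"
        else
            ollama pull "$model"
        fi
    done
}

# Usage:
#   pull_missing qwen2.5-coder:7b deepseek-r1:7b mistral:7b
```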
Managing Disk Space
Models accumulate quickly. Manage them:
# See what you have and sizes
ollama list
# Remove a model
ollama rm phi3:mini
# Models are stored at:
# macOS: ~/.ollama/models
# Linux: /usr/share/ollama/.ollama/models (service install) or ~/.ollama/models
# Windows: C:\Users\<you>\.ollama\models
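To see how much space your models take in total, you can add up the size column. A minimal sketch, assuming `ollama list` prints the size as a number followed by a unit (MB or GB) in the third and fourth columns; `total_gb` is a hypothetical helper that reads that output from stdin.

```shell
# Sum the SIZE column of `ollama list` output read from stdin.
# total_gb is a hypothetical helper; it assumes sizes appear as a number
# followed by a unit (MB or GB) in the third and fourth columns.
total_gb() {
    awk 'NR > 1 {
        size = $3
        if ($4 == "MB") size = size / 1000
        total += size
    } END { printf "%.1f GB\n", total }'
}

# Usage:
#   ollama list | total_gb
```

Running `du -sh` on the models directory gives the on-disk figure directly if you prefer not to parse the listing.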
Next Steps
- Open WebUI Setup — get a proper chat interface instead of the terminal
- Ollama API Guide — use your model from any app
- Understanding VRAM — go deeper on memory management
- Abliterated Models Guide — run uncensored variants