Downloading and Running Your First Model

How to choose the right model for your GPU, download it with Ollama, and get the best performance from your first local AI setup.

2026-05-30 · 4 min read
Tags: ollama, beginner, models, vram, first-model

You have Ollama installed. Now you need to pick the right model for your hardware. This guide covers how to choose, download, and get the most out of your first local model.

The Only Number That Matters: VRAM

Your GPU's VRAM determines which models you can run. The model file, plus some headroom for the context cache, must fit in VRAM. If it doesn't, Ollama offloads some layers to system RAM and the CPU, and inference can become 10–50× slower.

Check your VRAM:

# NVIDIA
nvidia-smi --query-gpu=memory.total --format=csv

# AMD
rocm-smi --showmeminfo vram

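# macOS (Intel Macs with a discrete GPU)
system_profiler SPDisplaysDataType | grep VRAM

# macOS (Apple Silicon has no separate VRAM; the GPU shares system memory)
system_profiler SPHardwareDataType | grep Memory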
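The fit rule above can be turned into a quick check once you know your VRAM. This is a minimal sketch, not an Ollama feature: fits_in_vram is a hypothetical helper, and the default 1GB of headroom for the context cache is an assumed rule of thumb that grows with longer contexts.

```shell
# Hypothetical helper: does a model file fit in VRAM with some headroom?
#   $1 = model file size in GB
#   $2 = total VRAM in GB
#   $3 = headroom in GB for the context cache (assumed default: 1)
fits_in_vram() {
  awk -v m="$1" -v v="$2" -v h="${3:-1}" \
    'BEGIN { if (m + h <= v) print "fits"; else print "offloads" }'
}

fits_in_vram 4.7 6    # 4.7GB download on a 6GB card   -> fits
fits_in_vram 17 16    # 17GB download on a 16GB card   -> offloads
```

On NVIDIA cards, nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits prints the total in MiB, which you can divide by 1024 before passing it in.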

Choosing Your Model

Your VRAM   Best First Model        Quality Level
4GB         Phi-3 Mini 3.8B         Good for simple tasks
6GB         Llama 3.1 8B Q4_K_M     Strong all-rounder
8GB         Mistral 7B Q8_0         Near-lossless 7B
10–12GB     Gemma 2 9B Q8_0         Excellent reasoning
16GB        Gemma 2 27B Q4_K_M      Strong capability (about 17GB, so a few layers spill to RAM)
24GB        Gemma 2 27B Q6_K        Near-lossless 27B
48GB        Llama 3.1 70B Q4_K_M    Frontier-class local AI

Recommendation for most people: start with llama3.1:8b if you have 6GB or more of VRAM. It is widely tested, has a large community, and performs well on coding, writing, and general tasks.
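If you prefer a lookup to reading the table, the same mapping can be sketched as a small function. suggest_model is a hypothetical helper, it takes your card's nominal VRAM in whole gigabytes, and it returns only the model family from the table above, leaving the quant level to you.

```shell
# Hypothetical helper: map whole gigabytes of VRAM to a model family,
# following the rows of the table above.
suggest_model() {
  vram_gb=$1
  if   [ "$vram_gb" -ge 48 ]; then echo "Llama 3.1 70B"
  elif [ "$vram_gb" -ge 16 ]; then echo "Gemma 2 27B"
  elif [ "$vram_gb" -ge 10 ]; then echo "Gemma 2 9B"
  elif [ "$vram_gb" -ge 8 ];  then echo "Mistral 7B"
  elif [ "$vram_gb" -ge 6 ];  then echo "Llama 3.1 8B"
  elif [ "$vram_gb" -ge 4 ];  then echo "Phi-3 Mini 3.8B"
  else echo "none listed"
  fi
}

suggest_model 8     # -> Mistral 7B
suggest_model 12    # -> Gemma 2 9B
```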

Downloading Your First Model

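# Download only (shows progress in the foreground, does not start a chat)
ollama pull llama3.1:8b

# Download if needed, then start an interactive chat
ollama run llama3.1:8b

A bare tag such as llama3.1:8b pulls the library's default quantization (Q4_K_M for this model). Ollama does not pick a quant based on your VRAM, but the default is a sensible choice for a first model, so you don't need to specify one manually.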

Model sizes to expect:

Model              Download Size
Phi-3 Mini 3.8B    ~2.3GB
Llama 3.1 8B       ~4.7GB
Mistral 7B         ~4.1GB
Gemma 2 9B         ~5.4GB
Gemma 2 27B        ~17GB
Llama 3.1 70B      ~40GB

Running Your First Conversation

After ollama run llama3.1:8b, you'll see a >>> prompt. Type your message and press Enter.

>>> What is quantization in the context of AI models?

To exit: type /bye or press Ctrl+D.

Useful slash commands during a session:

/help          Show all commands
/clear         Clear conversation history
/show info     Show model details
/set verbose   Show timing and token stats

Checking Performance

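After your first response, check whether the model is running on the GPU. The command below reports where the model is loaded, not tokens per second; for speed figures, use /set verbose inside the chat.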

# In a second terminal while model is running
ollama ps

This shows:

  • Model name
  • Size in memory
  • Processor (GPU, CPU, or a CPU/GPU split when layers were offloaded)
  • Time until unload
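To spot offloading without reading the table by eye, you can filter that output. This is a sketch under assumptions: check_offload is a hypothetical helper, the column layout (two or more spaces between columns, a PROCESSOR value of "100% GPU" when fully loaded) is assumed from typical output, and the sample rows, including their IDs and sizes, are made up for illustration.

```shell
# Hypothetical helper: read `ollama ps`-style text on stdin and print every
# model whose PROCESSOR column (4th) is anything other than "100% GPU".
# Assumes columns are separated by two or more spaces.
check_offload() {
  awk -F'  +' 'NR > 1 && NF >= 4 && $4 != "100% GPU" { print $1 ": " $4 }'
}

# Made-up sample in the assumed format (not captured from a real run).
sample='NAME           ID              SIZE      PROCESSOR          UNTIL
llama3.1:8b    aaaaaaaaaaaa    6.7 GB    100% GPU           4 minutes from now
gemma2:27b     bbbbbbbbbbbb    18 GB     31%/69% CPU/GPU    4 minutes from now'

printf '%s\n' "$sample" | check_offload
```

With a model loaded, the real equivalent would be ollama ps | check_offload. No output means everything listed is fully on the GPU.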

Rough target speeds (actual numbers vary with the GPU and context length):

  • 7–8B model on 6–8GB GPU: 60–130 tok/s — feels instant
  • 27B model on 24GB GPU: 40–60 tok/s — usable
  • 70B model on 48GB: 18–25 tok/s — slower but very capable

If you're seeing under 5 tok/s, the model is probably running on CPU. Check the troubleshooting section in the Ollama install guide.
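If you want to compute the rate yourself, it is just tokens generated divided by seconds spent generating. The sketch below assumes you read a token count and a duration off the stats that /set verbose prints; both helper names are hypothetical, and the 5 tok/s cutoff is the one from the paragraph above.

```shell
# Hypothetical helper: tokens per second from a token count and a duration in seconds.
tokens_per_second() {
  awk -v n="$1" -v s="$2" \
    'BEGIN { if (s <= 0) print "n/a"; else printf "%.1f\n", n / s }'
}

# Hypothetical helper: apply the 5 tok/s rule of thumb from the text above.
speed_verdict() {
  awk -v r="$1" 'BEGIN { if (r < 5) print "likely CPU"; else print "likely GPU" }'
}

rate=$(tokens_per_second 282 4.5)          # 282 tokens in 4.5 seconds
echo "$rate tok/s: $(speed_verdict "$rate")"
```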

Specific Quants for More Control

Ollama's default quant is a good balance, but you can request a specific one by using its full tag:

# Near-lossless quality (about 8.5GB file, so it needs more than 8GB of VRAM)
ollama run llama3.1:8b-instruct-q8_0

# Slightly smaller and faster (about 4.7GB file)
ollama run llama3.1:8b-instruct-q4_0

# Default (Q4_K_M, best balance)
ollama run llama3.1:8b

Use the Quant Picker tool if you're unsure which format to use.
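A rough way to compare quants before downloading is to estimate file size as parameters times bits per weight, divided by 8. This is a back-of-the-envelope sketch: quant_size_gb is a hypothetical helper, Q8_0 and Q4_0 store about 8.5 and 4.5 bits per weight respectively, and real files differ slightly because some tensors are kept at higher precision.

```shell
# Hypothetical helper: approximate file size in GB.
#   $1 = parameter count in billions
#   $2 = average bits per weight for the quant format
quant_size_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

quant_size_gb 8 8.5    # 8B model at Q8_0  -> 8.5
quant_size_gb 8 4.5    # 8B model at Q4_0  -> 4.5
```

Compare the result against your VRAM, leaving some room for the context cache.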

Trying Different Models

Once you're comfortable, try other models:

# Great for coding
ollama run qwen2.5-coder:7b

# Strong reasoning
ollama run deepseek-r1:7b

# Fast and capable
ollama run mistral:7b

# Uncensored assistant
ollama run dolphin-mistral:7b

See what you already have and what else is available:

ollama list         # What you've downloaded
# Browse more at: ollama.com/library

Managing Disk Space

Models accumulate quickly, and each one takes several gigabytes. List and remove them with:

# See what you have and sizes
ollama list

# Remove a model
ollama rm phi3:mini

# Models are stored at:
# macOS: ~/.ollama/models
# Linux: ~/.ollama/models (or /usr/share/ollama/.ollama/models when Ollama runs as a systemd service)
# Windows: C:\Users\<you>\.ollama\models
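To see how much space your models take in total, you can add up the SIZE column. This is a sketch under assumptions: total_model_gb is a hypothetical helper, the layout (two or more spaces between columns, sizes written like "4.7 GB" or "637 MB") is assumed from typical output, and the sample rows and IDs are made up.

```shell
# Hypothetical helper: total the SIZE column (3rd) of `ollama list`-style
# text on stdin and print the sum in GB.
# Assumes columns are separated by two or more spaces.
total_model_gb() {
  awk -F'  +' 'NR > 1 && NF >= 3 {
    split($3, s, " ")                        # "4.7 GB" -> s[1]=4.7, s[2]=GB
    if (s[2] == "GB") total += s[1]
    else if (s[2] == "MB") total += s[1] / 1000
  } END { printf "%.1f\n", total }'
}

# Made-up sample in the assumed format (not captured from a real run).
sample='NAME           ID              SIZE      MODIFIED
llama3.1:8b    aaaaaaaaaaaa    4.7 GB    2 days ago
phi3:mini      bbbbbbbbbbbb    2.3 GB    5 weeks ago'

printf '%s\n' "$sample" | total_model_gb    # -> 7.0
```

On a real install the equivalent would be ollama list | total_model_gb.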

Next Steps