Calculate if your GPU can run Large Language Models locally (Inference Only)
Unlike gaming, where FPS is the key metric, local AI depends heavily on video RAM (VRAM). For fast inference, the full set of model weights must fit into GPU memory, and your context length adds to the bill: the KV cache also lives in VRAM and grows with longer contexts.
If the model is larger than your VRAM, layers offload to your system RAM (CPU), causing speeds to drop from around 50 tokens/sec to 1-3 tokens/sec.
Most local models use the GGUF format with quantization, which shrinks the weights without losing much intelligence. As a rough guide, a 4-bit quant (Q4_K_M) of an 8B model is about 5 GB, versus roughly 16 GB at full FP16 precision.
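For a ballpark feel of how quantization changes the footprint, here is a minimal Python sketch that estimates weight size for a few common GGUF quant levels. The bits-per-weight values are rough averages assumed for each format (real files carry extra metadata and mixed-precision layers), so treat the output as an estimate, not an exact file size.

```python
# Rough GGUF weight-size estimator; bits-per-weight are assumed averages, not exact.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Estimate the size of the model weights in GiB for a given quant level."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1024**3

if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"8B model at {quant}: ~{model_size_gb(8, quant):.1f} GB")
```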
You cannot fill your VRAM to 100% with just the model weights. We add a safety margin to account for overhead such as the KV cache (which grows with context length), the VRAM your OS and display already occupy, and the inference framework's own buffers. This margin helps ensure your model won't crash mid-sentence with an out-of-memory error.
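To make the budget concrete, here is a minimal fits-in-VRAM sketch. The KV cache formula assumes a Llama-3-8B-style layout (32 layers, 8 KV heads, head dim 128, FP16 cache), and the 1.5 GB overhead figure is an assumption, not a value taken from any specific tool.

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: K and V stored per layer, per KV head, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len
    return total_bytes / 1024**3

def fits_in_vram(weights_gb: float, context_len: int, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """Check whether weights + KV cache + a safety margin fit in the given VRAM."""
    required = weights_gb + kv_cache_gb(context_len) + overhead_gb
    print(f"Required: ~{required:.1f} GB of {vram_gb:.0f} GB VRAM")
    return required <= vram_gb

if __name__ == "__main__":
    # Example: a ~5 GB Q4_K_M 8B model with an 8K context on an 8 GB card.
    print("Fits:", fits_in_vram(weights_gb=5.0, context_len=8192, vram_gb=8.0))
```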
Instead of chasing specific models, look for these VRAM targets based on your goals:
* 8-12 GB VRAM: Great for 7B-8B models (classic examples: RTX 3060 / 4060 series). This is the minimum for a comfortable experience.
* 16 GB VRAM: The sweet spot. Allows higher-precision quants of small models, or mid-sized models like Yi-34B at aggressive quantization (classic examples: RTX 4060 Ti 16GB / 4070 Ti Super).
* 24 GB VRAM: The consumer ceiling. Required for heavily quantized 70B models and heavy workflows (classic examples: RTX 3090 / 4090).
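To turn these tiers into a quick rule of thumb, the sketch below inverts the earlier size estimate: given a VRAM budget, it guesses the largest parameter count that comfortably fits at a Q4-class quantization, with a couple of gigabytes reserved for KV cache and overhead. The constants are assumptions, not measurements; fitting a 70B model into 24 GB relies on much lower-bit quants than the 4.8 bits assumed here.

```python
def max_params_billion(vram_gb: float, bits_per_weight: float = 4.8,
                       reserved_gb: float = 2.5) -> float:
    """Largest model (billions of parameters) that roughly fits at Q4-class quantization."""
    usable_bytes = max(vram_gb - reserved_gb, 0) * 1024**3
    return usable_bytes * 8 / bits_per_weight / 1e9

if __name__ == "__main__":
    for vram in (8, 16, 24):
        print(f"{vram} GB VRAM -> up to ~{max_params_billion(vram):.0f}B parameters at Q4")
```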
Can you run a model that is larger than your VRAM? Yes, but it will be extremely slow; this is called CPU offloading.
When layers spill over into system RAM, generation speed can drop from around 50 tokens/second to 1-3 tokens/second. That works for testing, but not for daily use.
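A common middle ground is partial offloading: keep as many layers as possible on the GPU and let the rest run from system RAM (llama.cpp-based tools expose this as a "GPU layers" setting). The sketch below estimates that split; dividing the model size evenly across layers is a simplification that ignores embeddings and the output layer.

```python
def gpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                    reserved_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit on the GPU; the rest run on the CPU."""
    per_layer_gb = model_gb / n_layers          # crude: assumes equal-sized layers
    usable_gb = max(vram_gb - reserved_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

if __name__ == "__main__":
    # Example: a ~20 GB quantized 34B model (assumed 60 layers) on a 16 GB card.
    on_gpu = gpu_layer_split(model_gb=20.0, n_layers=60, vram_gb=16.0)
    print(f"Keep {on_gpu}/60 layers on the GPU, offload the rest to system RAM")
```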
NVIDIA (Recommended): Supported by almost all AI software out of the box via CUDA. It is the "plug and play" option.
AMD (Budget / Advanced): Often offers more VRAM per dollar (e.g., the RX 6800 XT 16GB is cheaper than an RTX 4060 Ti). However, setting up ROCm on Windows can be tricky, so AMD cards are best used with tools like LM Studio or Ollama, which handle the backend (e.g., Vulkan) for you.
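Whichever vendor you choose, it is worth confirming how much VRAM your software actually sees. Here is a small check using PyTorch, which reports ROCm-supported AMD GPUs through the same torch.cuda interface; it assumes you have a GPU-enabled PyTorch build installed.

```python
import torch

# Report the VRAM visible to PyTorch (CUDA on NVIDIA, HIP/ROCm on supported AMD cards).
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No GPU visible to PyTorch; inference would fall back to the CPU.")
```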
Once you have the hardware, the easiest way to start is to install a local runner such as Ollama or LM Studio, download a GGUF model that fits your VRAM budget, and start chatting.
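As a first smoke test, you can also talk to the runner programmatically. The sketch below assumes Ollama is running locally on its default port (11434) and that a model named "llama3" has already been pulled; swap in whatever model you actually downloaded.

```python
import json
import urllib.request

# Minimal request against Ollama's local HTTP API (default port 11434).
payload = json.dumps({
    "model": "llama3",   # assumes this model has already been pulled
    "prompt": "In one sentence, why does VRAM matter for local LLM inference?",
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```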