QLoRA (Quantized LoRA)

LoRA fine-tuning combined with 4-bit quantization of the base model — dramatically reduces VRAM needs.

QLoRA is LoRA fine-tuning applied to a base model that has been quantized to 4-bit precision (typically NF4 — NormalFloat 4-bit). This combination drastically reduces VRAM consumption during fine-tuning while preserving most of the accuracy of full-precision LoRA.

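The base model weights are stored in 4-bit format and dequantized on the fly to the compute precision during the forward pass (and again in the backward pass, since gradients flow through the frozen weights to reach the adapters). The trainable LoRA adapters remain in higher precision (FP16 or BF16), which keeps gradient computation well-conditioned. Memory savings are substantial: a 70B-parameter model can fit in roughly 48 GB under QLoRA, whereas 16-bit LoRA needs about 140 GB for the frozen base weights alone (70B parameters × 2 bytes), and full fine-tuning, which also stores gradients and optimizer states for every parameter, would require 600+ GB.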
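As a rough illustration of the storage scheme described above, the following pure-Python sketch quantizes each weight row to 4-bit codes with an absmax scale, dequantizes on the fly in the forward pass, and adds a full-precision low-rank adapter. The codebook is a hypothetical stand-in built from normal quantiles and is not the exact NF4 table. The function names (`quantize_block`, `dequantize_block`, `linear_forward`) and the one-block-per-row layout are assumptions made for brevity, not any library's API.

```python
from statistics import NormalDist

# Hypothetical 16-level codebook: evenly spaced quantiles of N(0, 1),
# rescaled to [-1, 1]. It mimics the idea behind NF4 but is NOT the real table.
_normal = NormalDist()
_quantiles = [_normal.inv_cdf((i + 0.5) / 16) for i in range(16)]
_peak = max(abs(q) for q in _quantiles)
CODEBOOK = [q / _peak for q in _quantiles]


def quantize_block(block):
    """Return (absmax scale, 4-bit codes in 0..15) for one block of weights."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(CODEBOOK[i] - w / scale))
        for w in block
    ]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the stored scale and codes."""
    return [CODEBOOK[c] * scale for c in codes]


def linear_forward(x, quantized_rows, lora_a, lora_b, alpha=1.0):
    """Compute y = dequant(W) x + alpha * B (A x) for a single input vector."""
    # Frozen base path: each row is dequantized only when it is needed.
    base = []
    for scale, codes in quantized_rows:
        row = dequantize_block(scale, codes)
        base.append(sum(w * xi for w, xi in zip(row, x)))
    # Trainable adapter path, kept in full precision.
    ax = [sum(a * xi for a, xi in zip(row, x)) for row in lora_a]
    bax = [sum(b * v for b, v in zip(row, ax)) for row in lora_b]
    return [y + alpha * d for y, d in zip(base, bax)]


# Toy example: a 2x4 weight matrix with a rank-1 adapter.
W = [[0.12, -0.40, 0.05, 0.33], [-0.08, 0.21, -0.27, 0.02]]
quantized = [quantize_block(row) for row in W]
lora_a = [[0.01, 0.0, -0.01, 0.02]]  # shape (rank=1, d_in=4)
lora_b = [[0.0], [0.0]]              # shape (d_out=2, rank=1), zero-initialised
x = [1.0, 2.0, -1.0, 0.5]
y = linear_forward(x, quantized, lora_a, lora_b)
print(y)
```

With `lora_b` initialised to zeros, the adapter contributes nothing at first, so the output equals the quantized base model's output; training would then update only `lora_a` and `lora_b` while the 4-bit codes stay fixed.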

QLoRA democratized large-model fine-tuning by making it possible on workstation cards. A 7B-parameter QLoRA fine-tune fits comfortably on a 16 GB card; a 13B model fits on 24 GB. It remains a common default when VRAM is the binding constraint.
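The arithmetic behind these figures can be sketched with a back-of-the-envelope estimate. The helpers below (`weight_memory_gb`, `lora_adapter_params`) are hypothetical names introduced for illustration; they count only the base weights and adapter parameters, and the 4096-wide layer is an assumed example dimension.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """GB (10^9 bytes) needed to hold the base weights alone."""
    return params_billion * bits_per_param / 8


def lora_adapter_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


for size in (7, 13, 70):
    four_bit = weight_memory_gb(size, 4)
    sixteen_bit = weight_memory_gb(size, 16)
    print(f"{size}B: 4-bit {four_bit:.1f} GB, 16-bit {sixteen_bit:.1f} GB")

# Adapter size for one assumed 4096 x 4096 projection at rank 16.
print(lora_adapter_params(4096, 4096, 16))
```

The estimate deliberately leaves out quantization constants, activations, adapter gradients and optimizer states, and framework overhead, all of which depend on batch size and sequence length. That remainder is why a 7B model whose 4-bit weights take about 3.5 GB is still described as fitting "comfortably" on a 16 GB card rather than a much smaller one.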

Related Terms

Concepts directly relevant to QLoRA (Quantized LoRA).

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning method that adds small trainable matrices to a frozen base model.

PEFT (Parameter-Efficient Fine-Tuning)

Umbrella term for techniques that fine-tune large models by updating only a small fraction of parameters.

Quantization

Reducing the numerical precision of model weights to lower memory and compute requirements.

Workloads Where QLoRA (Quantized LoRA) Matters

GPU fit analysis for the workloads this concept directly influences.

Fine Tuning


This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.
