Parameter-Efficient Fine-Tuning (PEFT)
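Full fine-tuning updates every parameter in a pre-trained neural network. For a 7B-parameter model, that means holding the weights, their gradients, and the optimizer states in GPU memory at once. With mixed-precision Adam this comes to roughly 16 bytes per parameter, or on the order of 100 GB, which typically demands several data-center GPUs such as A100s. LoRA (Low-Rank Adaptation) avoids this cost by keeping the original model weights frozen and training only small low-rank adapter matrices added alongside selected layers.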
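To make the idea concrete, here is a minimal pure-Python sketch of a single LoRA-style layer. It is a toy illustration rather than how any library implements it: the class name `LoRALinear`, the helper `matvec`, and the tiny 8x8 identity weight are all assumptions made for this example. The adapter adds a low-rank product B·A, scaled by alpha / r, on top of the frozen weight, and B starts at zero so the layer initially behaves exactly like the base model.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer: frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        # A starts with small random values and B starts at zero, so the
        # adapter contributes nothing until training updates B.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        d_out, d_in, r = len(self.B), len(self.A[0]), len(self.A)
        return r * (d_in + d_out)


d = 8
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
layer = LoRALinear(identity, r=2, alpha=4)
x = [float(i) for i in range(d)]

# At initialization the adapter is a no-op, so both outputs match.
print(layer.forward(x) == matvec(identity, x))
print(layer.num_trainable(), "trainable vs", d * d, "frozen parameters")
```

The saving grows with layer size: a rank-r adapter needs r × (d_in + d_out) parameters, while the frozen matrix has d_in × d_out.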
Training Adapter Layers with PEFT
Here is how to load a pre-trained base model and wrap it with a LoRA configuration using the Hugging Face PEFT library:
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
# 1. Load the pre-trained base model (get_peft_model freezes its weights in step 3)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# 2. Define LoRA adapter configuration
lora_config = LoraConfig(
    r=8,  # Rank of the adapter matrices (typically 8, 16, or 32)
    lora_alpha=32,  # Scaling numerator: the adapter update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # Attention projections that receive adapters
    lora_dropout=0.05,  # Dropout applied to the adapter input during training
    bias="none",  # Leave all bias terms frozen
    task_type="CAUSAL_LM",
)
# 3. Inject adapter layers
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Only a small fraction of the parameters (well under 0.1% with this config) is now trainable
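The trainable fraction can be sanity-checked by hand. The sketch below counts adapter parameters for the configuration above, assuming the Llama 3 8B shapes: 32 decoder layers, a hidden size of 4096, a 1024-wide `v_proj` output because of grouped-query attention, and roughly 8.03 billion base parameters. These figures are assumptions not stated in the text, and the helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed architecture figures for Llama 3 8B (not taken from the text).
NUM_LAYERS = 32
HIDDEN = 4096
KV_DIM = 1024  # grouped-query attention: 8 key/value heads x 128 dims
BASE_PARAMS = 8.03e9  # approximate size of the frozen base model

R = 8  # matches r=8 in the LoraConfig above

# One adapter pair on q_proj (4096 -> 4096) and one on v_proj (4096 -> 1024).
per_layer = lora_param_count(HIDDEN, HIDDEN, R) + lora_param_count(HIDDEN, KV_DIM, R)
trainable = NUM_LAYERS * per_layer
fraction = trainable / (BASE_PARAMS + trainable)

print(f"trainable params: {trainable:,} ({fraction:.3%} of all parameters)")
```

Under these assumptions the adapters add about 3.4 million parameters. Raising `r` or targeting more modules increases the count proportionally.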
Quantized LoRA (QLoRA)
LoRA shrinks the number of trainable weights, but the full base model must still be loaded into GPU memory. QLoRA cuts VRAM usage further by quantizing the frozen base weights to the 4-bit NormalFloat (NF4) data type while the LoRA adapters are kept in higher precision. For a 7B to 8B model, this shrinks the base weights from roughly 14 to 16 GB in 16-bit precision to roughly 4 to 6 GB, which makes fine-tuning feasible on a single consumer GPU such as an RTX 4090 (24 GB) or on a Google Colab T4 (16 GB), given small batch sizes and moderate sequence lengths.
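A sketch of how this might look in code, assuming the `bitsandbytes` and `accelerate` packages are installed alongside `transformers` and `peft`, and that a CUDA GPU is available. The helper names `weight_memory_gb` and `load_qlora_model` are hypothetical. The memory helper is an idealized lower bound that treats every weight as quantized. The heavy imports are deferred into the loader so the arithmetic can be run on any machine.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Idealized weight memory in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


def load_qlora_model(model_id, r=8, lora_alpha=32, compute_dtype="bfloat16"):
    """Load `model_id` with 4-bit NF4 weights and attach LoRA adapters.

    Hypothetical helper for illustration; needs a CUDA GPU plus the torch,
    transformers, peft, bitsandbytes, and accelerate packages.
    """
    # Deferred imports: defining this function requires no GPU libraries.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat
        bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
        bnb_4bit_compute_dtype=getattr(torch, compute_dtype),
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place the layers on the GPU
    )
    # Prepare the quantized model for training before adding adapters.
    base_model = prepare_model_for_kbit_training(base_model)

    lora_config = LoraConfig(
        r=r,
        lora_alpha=lora_alpha,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base_model, lora_config)


for bits in (16, 4):
    print(f"8B weights at {bits}-bit: about {weight_memory_gb(8e9, bits):.1f} GB")
```

Real footprints come out somewhat higher than the 4-bit estimate, since quantization constants add overhead and some layers (such as the embeddings) are typically left in 16-bit. On GPUs without bfloat16 support, such as the T4, passing `compute_dtype="float16"` is the usual choice.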