Fine-Tune an LLM on Custom Data — Complete Guide 2026
Reviewed: June 4, 2026
Last updated: May 2026
Fine-tuning a large language model on your own data can substantially improve performance on domain-specific tasks. In this tutorial, you’ll fine-tune a Llama 3.2 model on a custom dataset using LoRA (Low-Rank Adaptation), a parameter-efficient method that trains small adapter matrices instead of the full weights and is usually the cheapest practical option. We’ll cover data preparation, training, evaluation, and deployment.
Why Fine-Tune?
Before diving in, let’s clarify when fine-tuning makes sense vs. alternatives:
| Approach | Best For | Cost |
|---|---|---|
| Prompt Engineering | General tasks, quick experiments | $ |
| RAG (Retrieval) | Knowledge-heavy, frequently changing data | $$ |
| Fine-Tuning | Consistent style/behavior, domain expertise | $$$ |
Fine-tuning shines when you need the model to consistently follow a specific format, adopt a particular tone, or perform a narrow task with high accuracy.
What We’re Building
A fine-tuned Llama 3.2 1B model that answers customer support questions in your company’s voice, trained on a dataset of 500 example Q&A pairs.
Prerequisites
- Python 3.10+
- GPU with 16GB+ VRAM (or access to a cloud GPU)
- HuggingFace account with API token
- Basic understanding of PyTorch
Step 1: Prepare Your Training Data
Fine-tuning data should be high-quality and representative. For supervised fine-tuning (SFT), use a JSONL file with instruction-response pairs:
{"instruction": "How do I reset my password?", "response": "To reset your password, go to Settings → Security → Reset Password. You'll receive an email with a reset link within 2 minutes."}
{"instruction": "What's your refund policy?", "response": "We offer full refunds within 30 days of purchase. Contact support@company.com with your order number."}
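Because malformed or duplicated records are easy to miss in a hand-built JSONL file, it may be worth validating the data before training. The following is a minimal sketch, not part of the original pipeline: the helper name `validate_jsonl_lines` and the duplicate rule (case-insensitive match on the instruction) are assumptions chosen for illustration.

```python
import json

REQUIRED_FIELDS = ("instruction", "response")


def validate_jsonl_lines(lines):
    """Return (valid_records, problems) for an iterable of JSONL lines.

    Hypothetical helper: checks that each line is a JSON object with
    non-empty string fields and that no instruction appears twice.
    """
    valid, problems, seen = [], [], set()
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(item, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [
            key for key in REQUIRED_FIELDS
            if not isinstance(item.get(key), str) or not item[key].strip()
        ]
        if missing:
            problems.append((lineno, f"missing or empty: {', '.join(missing)}"))
            continue
        dedupe_key = item["instruction"].strip().lower()
        if dedupe_key in seen:
            problems.append((lineno, "duplicate instruction"))
            continue
        seen.add(dedupe_key)
        valid.append(item)
    return valid, problems


# Small in-memory sample; in practice pass an open file handle instead.
sample = [
    '{"instruction": "How do I reset my password?", "response": "Go to Settings."}',
    '{"instruction": "What is your refund policy?"}',
    'not json',
    '{"instruction": "how do i reset my password?", "response": "Duplicate."}',
]
records, issues = validate_jsonl_lines(sample)
print(len(records), "valid;", len(issues), "problems")
for lineno, message in issues:
    print(f"  line {lineno}: {message}")
```

To check a real file, `validate_jsonl_lines(open("support_qa.jsonl"))` would work the same way, since a file handle iterates line by line.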
Create your dataset:
import json
from datasets import Dataset
# Load your JSONL data
data = []
with open("support_qa.jsonl") as f:
    for line in f:
        item = json.loads(line)
        # Join instruction and response into a single training string
        data.append({
            "text": f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
        })
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)
print(f"Train: {len(dataset['train'])}, Test: {len(dataset['test'])}")
Step 2: Set Up LoRA with Unsloth
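We’ll use Unsloth, which its maintainers report as 2-5x faster than standard HuggingFace fine-tuning with up to 70% less VRAM; actual gains depend on your model, GPU, and sequence length: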
pip install unsloth datasets accelerate peft trl
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = 2048,
    dtype = None,  # Auto-detect: BF16 on Ampere+, FP16 otherwise
    load_in_4bit = True,  # 4-bit quantization for lower VRAM
)
# Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)
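To get a feel for what `r = 16` means in practice, the arithmetic below estimates how many trainable parameters the adapters add. LoRA attaches two small matrices to each targeted layer, one of shape `r x in_features` and one of shape `out_features x r`, so each layer contributes `r * (in_features + out_features)` parameters. This is a rough sketch: the layer dimensions are assumptions taken from the published Llama 3.2 1B configuration and should be checked against the model's `config.json`.

```python
# Assumed Llama 3.2 1B dimensions (verify against the model's config.json).
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # intermediate_size (MLP width)
NUM_LAYERS = 16      # num_hidden_layers
KV_DIM = 512         # 8 key/value heads x head_dim 64 (grouped-query attention)

# (in_features, out_features) for each projection named in target_modules
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters in one adapter pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


def total_lora_params(r):
    """Sum adapter parameters over all targeted projections and layers."""
    per_layer = sum(lora_params(i, o, r) for i, o in TARGETS.values())
    return per_layer * NUM_LAYERS


r, lora_alpha = 16, 16
print(f"trainable LoRA parameters: {total_lora_params(r):,}")
print(f"LoRA scaling factor (alpha / r): {lora_alpha / r}")
```

Under these assumed dimensions the adapters come to roughly 11.3M parameters, around 1% of the base model. Because the count is linear in `r`, doubling the rank doubles the trainable parameters.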
Step 3: Train the Model
from trl import SFTTrainer
from transformers import TrainingArguments
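# Note: recent TRL releases move dataset_text_field and max_seq_length into
# SFTConfig and rename tokenizer to processing_class; adjust to your version.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,  # effective batch size = 2 x 4 = 8
        warmup_steps = 5,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        output_dir = "outputs",
        save_strategy = "epoch",
        # To log eval loss each epoch, also set eval_strategy = "epoch"
        # (named evaluation_strategy in older transformers releases)
        report_to = "none",  # or "wandb" for experiment tracking
    ),
)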
# Train!
trainer.train()
# Save the fine-tuned model
model.save_pretrained_merged("model_finetuned", tokenizer, save_method = "merged_16bit")
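The training arguments above interact in ways that are easier to see with numbers. The sketch below works through the arithmetic for this tutorial's setup, assuming 450 training examples (500 minus the 10% test split) and a single GPU. The step counts are approximate, since the trainer's own rounding may differ by a step or two. `linear_lr` mirrors the usual linear-warmup, linear-decay shape and is an illustration, not the library's code.

```python
import math


def training_steps(num_train, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate optimizer-step counts for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_train / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


effective_batch, steps_per_epoch, total_steps = training_steps(
    num_train=450, per_device_batch=2, grad_accum=4, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: ~{steps_per_epoch}, total: ~{total_steps}")

# Sample the schedule at the start, end of warmup, midpoint, and end.
for step in (0, 5, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {linear_lr(step, 2e-4, 5, total_steps):.2e}")
```

With only about 170 optimizer steps in total, `logging_steps = 10` yields fewer than twenty logged loss values, so a noisy loss curve on a dataset this small is expected.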
Step 4: Evaluate the Model
FastLanguageModel.for_inference(model)
from transformers import TextStreamer
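def ask(question):
    # Note: this prompts through the model's chat template, while Step 1 trained
    # on "### Instruction:" text. For consistent results, use one format in both.
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")
    # Stream tokens to stdout as they are generated, omitting the prompt
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(
        inputs, streamer=text_streamer, max_new_tokens=256,
        temperature=0.7, do_sample=True
    )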
# Test it
ask("How do I reset my password?")
ask("What's your refund policy?")
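Spot-checking a couple of questions is a useful smoke test, but a score over the held-out split gives a number you can compare between runs. One simple option is token-overlap F1 between the generated answer and the reference response, sketched below. The metric choice is an assumption, not something the tutorial prescribes. `generate_fn` is a hypothetical callable that returns the model's answer as a string; the `ask` function above streams to stdout and returns nothing, so it would need a small wrapper.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 on lowercased, whitespace-split text."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(generate_fn, examples):
    """Mean token F1 of generate_fn over instruction/response examples."""
    scores = [
        token_f1(generate_fn(ex["instruction"]), ex["response"])
        for ex in examples
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Stand-in for the fine-tuned model so the sketch runs without a GPU.
def stub_generate(question):
    return "We offer full refunds within 30 days."


held_out = [
    {
        "instruction": "What's your refund policy?",
        "response": "We offer full refunds within 30 days of purchase.",
    },
]
print(f"mean token F1: {evaluate(stub_generate, held_out):.3f}")
```

Token overlap rewards matching wording rather than correctness, so it is best read alongside a manual review of a sample of answers.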
Step 5: Deploy with vLLM
Once fine-tuned, serve the model with vLLM for production inference:
pip install vllm
# Serve the model
python -m vllm.entrypoints.openai.api_server \
    --model ./model_finetuned \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto
Query it like OpenAI’s API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="./model_finetuned",
    messages=[{"role": "user", "content": "How do I reset my password?"}]
)
print(response.choices[0].message.content)
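Since the server exposes an ordinary HTTP endpoint, the same request can be made without the `openai` package. This may help in environments where adding a dependency is awkward. The sketch below uses only the standard library. It assumes the server from this step is listening on `localhost:8000`, and the helper names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(question, model="./model_finetuned",
                       base_url="http://localhost:8000/v1",
                       max_tokens=256, temperature=0.7):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,  # must match the name the server was started with
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("How do I reset my password?")
print(req.get_method(), req.full_url)
# With the server running, the reply could be fetched with:
#   print(send(req))
```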
Step 6: Upload to HuggingFace Hub (Optional)
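from huggingface_hub import HfApi
api = HfApi()
# Create the repo first if it doesn't exist yet (needs a token with write access)
api.create_repo("your-username/llama3.2-support-bot", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="model_finetuned",
    repo_id="your-username/llama3.2-support-bot",
    repo_type="model",
)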
Tips for Better Fine-Tuning
- Data quality > quantity: 500 high-quality examples often beat 5,000 noisy ones
- Use chat templates: For instruct models, format training data with the model’s native chat template, and use the same format at inference time
- Monitor for overfitting: If eval loss increases while train loss decreases, reduce epochs
- Start small: Try 1 epoch first, then increase if needed
- 4-bit quantization: Use QLoRA (4-bit) for models up to 7B on consumer GPUs
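As an illustration of the chat-template tip, the sketch below converts an instruction/response record into the role-tagged message list that chat templates expect. The helper name `to_messages` and the optional system prompt are assumptions. Rendering the final training string needs a loaded tokenizer, so that call appears only as a comment.

```python
def to_messages(record, system_prompt=None):
    """Convert an instruction/response record into chat-style messages."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": record["instruction"]})
    messages.append({"role": "assistant", "content": record["response"]})
    return messages


record = {
    "instruction": "How do I reset my password?",
    "response": "Go to Settings, then Security, then Reset Password.",
}
messages = to_messages(record, system_prompt="You are a helpful support agent.")
for message in messages:
    print(f"{message['role']:>9}: {message['content']}")

# With a loaded tokenizer, the training text could then be rendered as:
#   text = tokenizer.apply_chat_template(messages, tokenize=False)
```

Training on text rendered this way keeps the prompt format identical to what a chat-completions endpoint produces at inference time.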
Cost Comparison
| Model Size | Method | VRAM Needed | Time (500 samples) |
|---|---|---|---|
| 1B | QLoRA | 8GB | ~15 min |
| 3B | QLoRA | 12GB | ~30 min |
| 7B | QLoRA | 16GB | ~1 hour |
| 13B | QLoRA | 24GB | ~2 hours |
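For context on the VRAM column, the back-of-the-envelope calculation below estimates the memory taken by the model weights alone at 4-bit and 16-bit precision. It is a rough sketch that uses nominal parameter counts and ignores quantization overhead. The gap between these figures and the table reflects activations, adapter gradients, optimizer state, and framework overhead, none of which this estimate covers.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts; real checkpoints differ slightly.
MODEL_SIZES = {"1B": 1e9, "3B": 3e9, "7B": 7e9, "13B": 13e9}

print(f"{'model':<6}{'4-bit (GB)':>12}{'16-bit (GB)':>13}")
for label, params in MODEL_SIZES.items():
    four_bit = weight_memory_gb(params, 4)
    sixteen_bit = weight_memory_gb(params, 16)
    print(f"{label:<6}{four_bit:>12.1f}{sixteen_bit:>13.1f}")
```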
Key Takeaways
- LoRA/QLoRA makes fine-tuning accessible on consumer hardware
- Unsloth reports 2-5x speedups over standard HuggingFace training; measure on your own hardware
- Data preparation is the most important step — invest time here
- Always evaluate on a held-out test set before deploying
- vLLM gives you OpenAI-compatible serving with minimal setup
