Fine-Tune an LLM on Custom Data — Complete Guide 2026

Reviewed: June 4, 2026

Last updated: May 2026

Fine-tuning a large language model on your own data can dramatically improve performance on domain-specific tasks. In this tutorial, you’ll fine-tune a Llama 3.2 model on a custom dataset using LoRA (Low-Rank Adaptation) — the most cost-effective approach for most use cases. We’ll cover data preparation, training, evaluation, and deployment.

Why Fine-Tune?

Before diving in, let’s clarify when fine-tuning makes sense vs. alternatives:

Approach	Best For	Cost
Prompt Engineering	General tasks, quick experiments	$
RAG (Retrieval)	Knowledge-heavy, frequently changing data	$$
Fine-Tuning	Consistent style/behavior, domain expertise	$$$

Fine-tuning shines when you need the model to consistently follow a specific format, adopt a particular tone, or perform a narrow task with high accuracy.

What We’re Building

A fine-tuned Llama 3.2 1B model that answers customer support questions in your company’s voice, trained on a dataset of 500 example Q&A pairs.

Prerequisites

Python 3.10+
GPU with 16GB+ VRAM (or access to cloud GPU)
HuggingFace account with API token
Basic understanding of PyTorch

Step 1: Prepare Your Training Data

Fine-tuning data should be high-quality and representative. For supervised fine-tuning (SFT), use a JSONL file with instruction-response pairs:

{"instruction": "How do I reset my password?", "response": "To reset your password, go to Settings → Security → Reset Password. You'll receive an email with a reset link within 2 minutes."}
{"instruction": "What's your refund policy?", "response": "We offer full refunds within 30 days of purchase. Contact support@company.com with your order number."}
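A single malformed or duplicated line can quietly degrade a small dataset, so it may be worth checking the file before training. The sketch below assumes the two field names shown above (`instruction`, `response`); the helper name `validate_jsonl_lines` is hypothetical and the checks are a starting point, not a complete quality filter.

```python
import json

REQUIRED_KEYS = ("instruction", "response")  # field names used in this guide


def validate_jsonl_lines(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines."""
    valid, errors, seen = [], [], set()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "not a JSON object"))
            continue
        # A field counts as missing if absent, not a string, or only whitespace.
        missing = [k for k in REQUIRED_KEYS
                   if not isinstance(record.get(k), str) or not record[k].strip()]
        if missing:
            errors.append((number, f"missing or empty: {', '.join(missing)}"))
            continue
        key = record["instruction"].strip().lower()
        if key in seen:
            errors.append((number, "duplicate instruction"))
            continue
        seen.add(key)
        valid.append(record)
    return valid, errors


sample = [
    '{"instruction": "How do I reset my password?", "response": "Go to Settings."}',
    '{"instruction": "How do I reset my password?", "response": "Duplicate."}',
    '{"instruction": "No answer here"}',
    'not json',
]
records, problems = validate_jsonl_lines(sample)
print(len(records), "valid;", len(problems), "rejected")
```

To check a real file, pass the open file object, for example `validate_jsonl_lines(open("support_qa.jsonl", encoding="utf-8"))`.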

Create your dataset:

import json
from datasets import Dataset

# Load your JSONL data
data = []
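with open("support_qa.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        item = json.loads(line)
        # Join instruction and response into a single training string
        data.append({
            "text": f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
        })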

dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)
print(f"Train: {len(dataset['train'])}, Test: {len(dataset['test'])}")
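The tips later in this guide recommend formatting data with the model's native chat template rather than a hand-written prompt. One way to do that, sketched here under the assumption that a Hugging Face tokenizer with a chat template is available (it is loaded in Step 2), is to convert each record into a list of chat messages. The helper names `to_messages` and `format_with_chat_template` are hypothetical.

```python
def to_messages(item):
    """Convert one instruction/response record into chat messages."""
    return [
        {"role": "user", "content": item["instruction"]},
        {"role": "assistant", "content": item["response"]},
    ]


def format_with_chat_template(item, tokenizer):
    """Render a record as training text using the tokenizer's chat template."""
    # tokenize=False returns the formatted string instead of token IDs.
    return tokenizer.apply_chat_template(to_messages(item), tokenize=False)


example = {
    "instruction": "What's your refund policy?",
    "response": "We offer full refunds within 30 days of purchase.",
}
messages = to_messages(example)
print(messages)
```

With this approach the `"text"` field would be built with `format_with_chat_template(item, tokenizer)` after the tokenizer has been loaded, so that training text and the inference prompt in Step 4 share one format.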

Step 2: Set Up LoRA with Unsloth

We’ll use Unsloth, which is advertised as 2-5x faster than standard fine-tuning while using about 70% less VRAM:

pip install unsloth datasets accelerate peft trl

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = 2048,
    dtype = None,  # Auto-detect: BF16 on Ampere+, FP16 otherwise
    load_in_4bit = True,  # 4-bit quantization for lower VRAM
)

# Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                    # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)
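To get a feel for what `r = 16` means, the number of trainable LoRA parameters can be worked out by hand: each adapted linear layer with shape `(d_in, d_out)` gains `r * (d_in + d_out)` weights. The layer shapes below are assumptions based on the published Llama 3.2 1B configuration; check them against `model.config` before relying on the result.

```python
# Assumed Llama 3.2 1B shapes; verify against model.config.
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # intermediate_size (MLP width)
KV_DIM = 512         # num_key_value_heads (8) * head_dim (64)
NUM_LAYERS = 16      # num_hidden_layers

# (in_features, out_features) of each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Trainable parameters added by LoRA: rank * (d_in + d_out) per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


rank, alpha = 16, 16
trainable = lora_param_count(rank)
scaling = alpha / rank  # standard LoRA scales the adapter update by alpha / r
print(f"trainable LoRA parameters: {trainable:,}")
print(f"adapter scaling factor: {scaling}")
```

Under these assumptions the adapters add roughly 11 million trainable weights, a small fraction of the base model, which is why LoRA fits on modest hardware. Halving the rank halves that count.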

Step 3: Train the Model

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        output_dir = "outputs",
        save_strategy = "epoch",
        report_to = "none",  # or "wandb" for experiment tracking
    ),
)

# Train!
trainer.train()

# Save the fine-tuned model
model.save_pretrained_merged("model_finetuned", tokenizer, save_method = "merged_16bit")
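It helps to know roughly how many optimizer steps the settings above produce, since `warmup_steps` and `logging_steps` are expressed in steps. The sketch below assumes 450 training examples (500 pairs minus the 10% test split) and a single GPU. The helper name `training_schedule` is hypothetical, and the trainer's own rounding may differ by a few steps.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Return (effective_batch_size, steps_per_epoch, total_steps), approximately."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs


effective, per_epoch, total = training_schedule(
    num_examples=450, per_device_batch_size=2, grad_accum_steps=4, num_epochs=3
)
print(f"effective batch size: {effective}")
print(f"approx. steps per epoch: {per_epoch}, total: {total}")
```

With about 170 total steps, `warmup_steps = 5` covers only the first few percent of training and `logging_steps = 10` yields a loss reading every ten steps.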

Step 4: Evaluate the Model

FastLanguageModel.for_inference(model)

from transformers import TextStreamer

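def ask(question):
    # Note: Step 1 trained on "### Instruction:" / "### Response:" text, while this
    # prompt uses the chat template. Keep the training and inference formats
    # consistent for best results.
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")

    # Stream tokens to stdout as they are generated, without echoing the prompt
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(
        inputs, streamer=text_streamer, max_new_tokens=256,
        temperature=0.7, do_sample=True
    )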

# Test it
ask("How do I reset my password?")
ask("What's your refund policy?")
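Spot-checking two questions is a useful smoke test, but a score over the whole held-out split makes runs easier to compare. The sketch below uses token-overlap F1 as a rough similarity measure. The `generate` argument is a hypothetical callable that returns the model's answer as a string (the `ask` function above streams to stdout instead, so it would need a variant that returns text), and the metric is an illustrative choice, not the only reasonable one.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (case-insensitive, whitespace split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(examples, generate):
    """Mean token F1 of generate(instruction) against each reference response."""
    scores = [token_f1(generate(ex["instruction"]), ex["response"]) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative data and a stand-in for the fine-tuned model
held_out = [
    {"instruction": "How do I reset my password?",
     "response": "Go to Settings and choose Reset Password."},
]


def canned_generate(question):
    return "Go to Settings and choose Reset Password."


print(f"mean token F1: {evaluate(held_out, canned_generate):.2f}")
```

Token overlap rewards matching wording, not correctness, so it works best alongside a manual read of a sample of answers.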

Step 5: Deploy with vLLM

Once fine-tuned, serve the model with vLLM for production inference:

pip install vllm

# Serve the model
python -m vllm.entrypoints.openai.api_server \
    --model ./model_finetuned \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto

Query it like OpenAI’s API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="./model_finetuned",
    messages=[{"role": "user", "content": "How do I reset my password?"}]
)
print(response.choices[0].message.content)

Step 6: Upload to HuggingFace Hub (Optional)

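from huggingface_hub import HfApi

# Assumes you are already authenticated, e.g. via `huggingface-cli login`
api = HfApi()

# Create the repository first if it does not exist yet
api.create_repo(repo_id="your-username/llama3.2-support-bot", repo_type="model", exist_ok=True)

api.upload_folder(
    folder_path="model_finetuned",
    repo_id="your-username/llama3.2-support-bot",
    repo_type="model",
)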

Tips for Better Fine-Tuning

Data quality > quantity: 500 high-quality examples beat 5,000 noisy ones
Use chat templates: Always format data using the model’s native chat template
Monitor for overfitting: If eval loss increases while train loss decreases, reduce epochs
Start small: Try 1 epoch first, then increase if needed
4-bit quantization: Use QLoRA (4-bit) for models up to 7B on consumer GPUs
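The overfitting tip above can be turned into a small check on per-epoch losses. This is a minimal sketch: the loss values are made up for illustration, and the helper names `is_overfitting` and `best_epoch` are hypothetical. In practice the numbers would come from the trainer's logs, with evaluation enabled during training.

```python
def is_overfitting(train_losses, eval_losses):
    """True if, over the last epoch, train loss fell while eval loss rose."""
    if len(train_losses) < 2 or len(eval_losses) < 2:
        return False  # not enough history to compare
    return train_losses[-1] < train_losses[-2] and eval_losses[-1] > eval_losses[-2]


def best_epoch(eval_losses):
    """Return the 1-based epoch with the lowest eval loss."""
    if not eval_losses:
        raise ValueError("eval_losses is empty")
    return min(range(len(eval_losses)), key=eval_losses.__getitem__) + 1


# Made-up per-epoch losses, for illustration only
train_losses = [1.20, 0.80, 0.50]
eval_losses = [1.10, 0.95, 1.02]

print("overfitting:", is_overfitting(train_losses, eval_losses))
print("best epoch:", best_epoch(eval_losses))
```

In this made-up run the third epoch lowers training loss but raises eval loss, which suggests stopping after two epochs.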

Cost Comparison

Model Size	Method	VRAM Needed	Time (500 samples)
1B	QLoRA	8GB	~15 min
3B	QLoRA	12GB	~30 min
7B	QLoRA	16GB	~1 hour
13B	QLoRA	24GB	~2 hours
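Part of each VRAM figure in the table is simply the quantized weights, which can be estimated as parameters times bits per parameter. The sketch below covers weights only and uses nominal parameter counts (real models differ slightly). Activations, LoRA gradients, optimizer state, and framework overhead come on top, which is why the table's numbers are higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal sizes matching the table's rows
for label, params in [("1B", 1e9), ("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
    four_bit = weight_memory_gb(params, 4)
    sixteen_bit = weight_memory_gb(params, 16)
    print(f"{label}: ~{four_bit:.1f} GB at 4-bit, ~{sixteen_bit:.1f} GB at 16-bit")
```

The same arithmetic shows why 4-bit loading matters: a 7B model drops from about 14 GB of weights at 16-bit to about 3.5 GB at 4-bit.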

Key Takeaways

LoRA/QLoRA makes fine-tuning accessible on consumer hardware
Unsloth provides 2-5x speedups over standard HuggingFace training
Data preparation is the most important step — invest time here
Always evaluate on a held-out test set before deploying
vLLM gives you OpenAI-compatible serving with minimal setup
