Fine-Tune an LLM on Custom Data — Complete Guide 2026

Reviewed: June 4, 2026

Last updated: May 2026

Fine-tuning a large language model on your own data can substantially improve performance on domain-specific tasks. In this tutorial, you’ll fine-tune a Llama 3.2 model on a custom dataset using LoRA (Low-Rank Adaptation) on top of a 4-bit quantized base model, a combination usually called QLoRA and often the most cost-effective approach. We’ll cover data preparation, training, evaluation, and deployment.

Why Fine-Tune?

Before diving in, let’s clarify when fine-tuning makes sense vs. alternatives:

Approach | Best For | Cost
Prompt Engineering | General tasks, quick experiments | $
RAG (Retrieval) | Knowledge-heavy, frequently changing data | $$
Fine-Tuning | Consistent style/behavior, domain expertise | $$$

Fine-tuning shines when you need the model to consistently follow a specific format, adopt a particular tone, or perform a narrow task with high accuracy.

What We’re Building

A fine-tuned Llama 3.2 1B model that answers customer support questions in your company’s voice, trained on a dataset of 500 example Q&A pairs.

Prerequisites

Step 1: Prepare Your Training Data

Fine-tuning data should be high-quality and representative. For supervised fine-tuning (SFT), use a JSONL file with instruction-response pairs:

{"instruction": "How do I reset my password?", "response": "To reset your password, go to Settings → Security → Reset Password. You'll receive an email with a reset link within 2 minutes."}
{"instruction": "What's your refund policy?", "response": "We offer full refunds within 30 days of purchase. Contact support@company.com with your order number."}
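
Since data quality matters this much, it can help to check the file before training. The following is a minimal sketch, not part of the original workflow: it assumes the two field names shown above, and the helper name `validate_jsonl_lines` and the duplicate rule (case-insensitive match on the instruction) are illustrative choices.

```python
import json

REQUIRED_FIELDS = ("instruction", "response")  # field names from the examples above


def validate_jsonl_lines(lines):
    """Return (valid_records, problems) for an iterable of JSONL lines."""
    records, problems, seen = [], [], set()
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            item = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(item, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        # A field counts as missing if it is absent, not a string, or only whitespace
        missing = [k for k in REQUIRED_FIELDS
                   if not isinstance(item.get(k), str) or not item[k].strip()]
        if missing:
            problems.append((lineno, f"missing or empty: {', '.join(missing)}"))
            continue
        key = item["instruction"].strip().lower()
        if key in seen:
            problems.append((lineno, "duplicate instruction"))
            continue
        seen.add(key)
        records.append(item)
    return records, problems


# Hypothetical sample lines; in practice pass an open file handle instead
sample = [
    '{"instruction": "How do I reset my password?", "response": "Go to Settings."}',
    '{"instruction": "how do i reset my password?", "response": "Duplicate."}',
    '{"instruction": "What is your refund policy?"}',
    'not json',
]
records, problems = validate_jsonl_lines(sample)
print(len(records), "valid;", len(problems), "problems")
for lineno, reason in problems:
    print(f"  line {lineno}: {reason}")
```

To run it on the real file, something like `validate_jsonl_lines(open("support_qa.jsonl"))` should work, since a file handle is already an iterable of lines.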

Create your dataset:

import json
from datasets import Dataset

# Load your JSONL data
data = []
with open("support_qa.jsonl") as f:
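    for line in f:
        if not line.strip():
            continue  # skip blank lines
        item = json.loads(line)
        data.append({
            # Join the fields with real newlines (\n) so the prompt layout is preserved
            "text": f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
        })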

dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)
print(f"Train: {len(dataset['train'])}, Test: {len(dataset['test'])}")

Step 2: Set Up LoRA with Unsloth

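We’ll use Unsloth, whose maintainers report roughly 2-5x faster training than a standard Hugging Face fine-tuning setup and up to 70% lower VRAM use (actual gains depend on your GPU and model):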

pip install unsloth datasets accelerate peft trl

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = 2048,
    dtype = None,  # Auto-detect: BF16 on Ampere+, FP16 otherwise
    load_in_4bit = True,  # 4-bit quantization for lower VRAM
)

# Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                    # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)
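
To get a feel for what `r = 16` means in practice, the number of trainable adapter weights can be estimated by hand. This is a rough sketch under stated assumptions: the layer dimensions below are not given in this guide and are taken to be those of Llama 3.2 1B (verify them against `model.config`), and the helper name `lora_trainable_params` is illustrative.

```python
# Assumed Llama 3.2 1B dimensions (not stated in this guide; check model.config)
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # intermediate_size (MLP width)
KV_DIM = 512         # num_key_value_heads (8) * head_dim (64)
NUM_LAYERS = 16      # num_hidden_layers

# (in_features, out_features) for each module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


r, lora_alpha = 16, 16
total = lora_trainable_params(r)
print(f"trainable LoRA parameters: {total:,}")
print(f"update scaling (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions the adapters add roughly 11 million trainable weights, around 1% of the base model. Doubling `r` doubles that count, and with `lora_alpha` equal to `r` the adapter update is applied at a scale of 1.0.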

Step 3: Train the Model

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        output_dir = "outputs",
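        save_strategy = "epoch",
        eval_strategy = "epoch",  # evaluate on eval_dataset each epoch; older transformers releases call this evaluation_strategy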
        report_to = "none",  # or "wandb" for experiment tracking
    ),
)

# Train!
trainer.train()

# Save the fine-tuned model
model.save_pretrained_merged("model_finetuned", tokenizer, save_method = "merged_16bit")
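
The batch settings above determine how many optimizer updates the run actually performs, which is useful for sanity-checking `warmup_steps` and `logging_steps`. A back-of-the-envelope sketch, assuming the 500-example dataset and the 10% test split from Step 1, and a single GPU; the helper name is illustrative, and the exact count reported by the trainer may differ by a step or two depending on how its version handles the final partial batch.

```python
import math


def training_schedule(num_examples, test_size, per_device_batch,
                      grad_accum, epochs, num_devices=1):
    """Estimate split sizes and optimizer steps for a run."""
    num_test = round(num_examples * test_size)
    num_train = num_examples - num_test
    # One optimizer update consumes this many examples
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_train / effective_batch)
    return {
        "train_examples": num_train,
        "test_examples": num_test,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


schedule = training_schedule(
    num_examples=500, test_size=0.1,
    per_device_batch=2, grad_accum=4, epochs=3,
)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

With well under 200 updates in total, `logging_steps = 10` yields only a handful of loss readings per epoch, so a smaller value may be worth considering on a dataset this size.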

Step 4: Evaluate the Model

FastLanguageModel.for_inference(model)

from transformers import TextStreamer

def ask(question):
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(
        inputs, streamer=text_streamer, max_new_tokens=256, 
        temperature=0.7, do_sample=True
    )

# Test it
ask("How do I reset my password?")
ask("What's your refund policy?")
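
One thing to watch: the spot checks above go through the tokenizer's chat template, while the training text in Step 1 uses an "### Instruction / ### Response" layout. A fine-tuned model tends to respond best to the layout it was trained on, so it may be worth probing with that format as well. The helper below is a sketch (the name `build_prompt` is illustrative); the commented lines show how it might be wired to the `model` and `tokenizer` loaded earlier.

```python
def build_prompt(question: str) -> str:
    """Format a question like the Step 1 training text, stopping where the response begins."""
    return f"### Instruction:\n{question}\n\n### Response:\n"


prompt = build_prompt("How do I reset my password?")
print(prompt)

# With the model and tokenizer from the earlier steps loaded, generation might look like:
#   inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
#   output = model.generate(**inputs, max_new_tokens=256)
#   new_tokens = output[0][inputs["input_ids"].shape[1]:]
#   print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

If the two formats give noticeably different answers, an alternative is to build the training text with the chat template too, so that training and inference share one format.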

Step 5: Deploy with vLLM

Once fine-tuned, serve the model with vLLM for production inference:

pip install vllm

# Serve the model
python -m vllm.entrypoints.openai.api_server \
    --model ./model_finetuned \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto

Query it like OpenAI’s API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="./model_finetuned",
    messages=[{"role": "user", "content": "How do I reset my password?"}]
)
print(response.choices[0].message.content)
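
Once the endpoint answers, the held-out pairs from Step 1 can be used for a rough quantitative check rather than eyeballing a few replies. The sketch below scores an answer against its reference with token-overlap F1; this metric is an assumption chosen for simplicity, not something this guide prescribes, and it rewards shared wording rather than factual correctness.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only if both are empty
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "We offer full refunds within 30 days of purchase."
print(token_f1("Full refunds are available within 30 days.", reference))
print(token_f1(reference, reference))
```

In practice, `prediction` would be `response.choices[0].message.content` for each held-out question, with the scores averaged over the test split.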

Step 6: Upload to HuggingFace Hub (Optional)

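from huggingface_hub import HfApi

# Assumes you are already authenticated, e.g. via `huggingface-cli login`
api = HfApi()
repo_id = "your-username/llama3.2-support-bot"
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)  # no-op if the repo exists
api.upload_folder(
    folder_path="model_finetuned",
    repo_id=repo_id,
    repo_type="model",
)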

Tips for Better Fine-Tuning

Cost Comparison

Model Size | Method | VRAM Needed | Time (500 samples)
1B | QLoRA | 8GB | ~15 min
3B | QLoRA | 12GB | ~30 min
7B | QLoRA | 16GB | ~1 hour
13B | QLoRA | 24GB | ~2 hours
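
The VRAM figures above cover more than the model weights. As a rough sketch of the weight portion alone, the calculation below uses nominal parameter counts (1B, 3B, 7B, 13B) and ignores quantization metadata, activations, optimizer state, and framework overhead, all of which add to the total; the function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the raw weights only, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal sizes matching the rows of the table above
for label, n in [("1B", 1e9), ("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
    four_bit = weight_memory_gb(n, 4)
    sixteen_bit = weight_memory_gb(n, 16)
    print(f"{label}: 4-bit ~{four_bit:.1f} GB, 16-bit ~{sixteen_bit:.1f} GB")
```

The gap between these weight-only numbers and the table is the headroom needed for activations, gradients of the adapters, and optimizer state, which grows with sequence length and batch size.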

Key Takeaways
