Tokenization in LLMs: How Text Becomes Numbers (Byte Pair Encoding Explained)

Reviewed: June 4, 2026

Reading time: 6 minutes | AI Fundamentals | DataGate.ch Knowledge Base

Before an AI model can process your text, it must convert that text into numbers. This conversion, called tokenization, is one of the most fundamental yet misunderstood aspects of how LLMs work. Understanding it helps you avoid surprising bugs and control costs.

What Is a Token?

A token is the basic unit of text that an LLM processes. It is neither a character nor a word, but something in between:

  • "hello" → 1 token (common word)
  • "unbelievable" → likely 3 tokens: "un" + "believ" + "able"
  • "12345" → might be 1-3 tokens, depending on the tokenizer and how frequent that digit sequence was in its training data

Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English.
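
The rule of thumb can be turned into a quick estimator. This is a minimal sketch for English text only; the function names are illustrative, and the result is a ballpark figure, not a replacement for counting with the model's real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate from character count (about 4 characters per token)."""
    # Ceiling division, so any non-empty text counts as at least one token.
    return -(-len(text) // 4)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate from word count (about 0.75 words per token)."""
    return round(word_count / 0.75)


sample = "Tokenization turns text into numbers."
print(estimate_tokens(sample))          # 37 characters -> 10
print(estimate_tokens_from_words(750))  # 1000
```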

Byte Pair Encoding (BPE): How It Works

Most modern LLMs (GPT, Llama, Mistral) use Byte Pair Encoding or a close variant. The vocabulary is built as follows:

  1. Start with individual characters (or raw bytes, in byte-level BPE) as the base vocabulary
  2. Count all adjacent symbol pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat steps 2 and 3 until you reach the target vocabulary size (roughly 32K to 200K in current models)

The result: common words and substrings become single tokens, while rare words get split into multiple tokens.
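
The merge loop above can be sketched in a few lines. This is a toy illustration under simplifying assumptions: it works on characters of whole words, breaks ties by first occurrence, and skips the pre-tokenization and byte handling that production tokenizers add. The name `train_bpe` is made up for this example.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: pick the most frequent pair (ties: first one seen wins).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words  # Step 4: repeat on the updated corpus
    return merges


corpus = ["low", "low", "lower", "lowest", "new", "newer"]
print(train_bpe(corpus, 2))  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent substring "low" has become a single symbol, while the rarer endings are still split into characters.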

Why Tokenization Matters

Cost

Most LLM APIs charge per token. A prompt that tokenizes to 2,000 tokens costs more than one that’s 1,000 tokens — even if they contain the same number of words. Optimizing token usage directly reduces costs.
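
As a worked example of the arithmetic, the sketch below compares the two prompts. The price is a hypothetical placeholder, not a real rate for any provider; substitute the figure from your provider's price list.

```python
def estimate_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Cost of a prompt, given a per-million-token price."""
    return token_count / 1_000_000 * price_per_million_tokens


PRICE = 2.50  # hypothetical USD per million input tokens

long_prompt = estimate_cost(2_000, PRICE)
short_prompt = estimate_cost(1_000, PRICE)
print(f"2,000 tokens: ${long_prompt:.4f}")
print(f"1,000 tokens: ${short_prompt:.4f}")
```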

Context Window

Models have token limits (for example 128K, 200K, or 1M tokens). Tokenization determines how much "real content" fits in that window. A 128K token window holds roughly 85,000-100,000 English words.
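
A simple budget check makes the limit concrete. This sketch assumes the prompt and the generated output share one window, which is common but worth confirming for your model; the function name and the default of 128,000 are illustrative.

```python
def fits_in_window(prompt_tokens: int, max_output_tokens: int,
                   context_window: int = 128_000) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    # Assumption: input and output tokens count against the same limit.
    return prompt_tokens + max_output_tokens <= context_window


print(fits_in_window(100_000, 4_000))  # True
print(fits_in_window(126_000, 4_000))  # False
```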

Multilingual Performance

Tokenizers trained primarily on English text are inefficient with other languages. A Chinese sentence might use 3x more tokens than its English equivalent. This affects both cost and quality.
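
One reason for the gap shows up before any merges happen. Assuming a byte-level BPE tokenizer (the article does not say which variant a given model uses), the base units are UTF-8 bytes, and many non-Latin characters take several bytes each. The sketch below measures only that starting point; real token counts depend on the merges the tokenizer learned.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


print(utf8_bytes_per_char("Hello world"))  # 1.0
print(utf8_bytes_per_char("你好世界"))      # 3.0
```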

Tokenization Quirks That Cause Bugs

The leading space problem: " hello" and "hello" may tokenize differently because tokenizers are sensitive to whitespace.

Number inconsistency: "123" might be one token or three separate digit tokens, depending on the tokenizer.
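
Both quirks can be illustrated with a toy pre-tokenizer. The regular expression below is a simplified, ASCII-only stand-in for the splitting rules real tokenizers apply before BPE, not any model's actual pattern, and the two digit policies are examples of what different tokenizers may do.

```python
import re

# Hypothetical pattern: an optional single space attaches to the word,
# number, or punctuation run that follows it.
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks the way a whitespace-aware pre-tokenizer might."""
    return PATTERN.findall(text)


def chunk_digits(number: str, max_run: int) -> list[str]:
    """Split a digit string into runs of at most `max_run` digits."""
    return re.findall(rf"\d{{1,{max_run}}}", number)


print(pre_tokenize("hello"))          # ['hello']
print(pre_tokenize(" hello"))         # [' hello']
print(pre_tokenize("say hello 123"))  # ['say', ' hello', ' 123']
print(chunk_digits("12345", 3))       # ['123', '45']
print(chunk_digits("12345", 1))       # ['1', '2', '3', '4', '5']
```

Because the space travels with the word, "hello" and " hello" are different strings by the time the vocabulary lookup happens, so they can map to different token IDs.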

JSON generation: Braces, quotes, colons, and indentation whitespace all consume tokens, which is why some structured output formats cost more than expected.
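
One low-effort mitigation is to serialize JSON compactly before sending it to a model. The sketch below uses character count as a rough proxy for token count, which is an approximation; the `record` data is made up for illustration.

```python
import json

record = {"id": 1, "tags": ["a", "b"], "active": True}

# Indented output is easier to read but adds newlines and spaces.
pretty = json.dumps(record, indent=2)
# Compact separators drop the space after "," and ":".
compact = json.dumps(record, separators=(",", ":"))

print(compact)  # {"id":1,"tags":["a","b"],"active":true}
print(len(pretty), len(compact))
```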

How to Check Your Token Count

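# Using tiktoken (OpenAI's tokenizer library); install with: pip install tiktoken
import tiktoken

# Look up the encoding that matches the target model
enc = tiktoken.encoding_for_model("gpt-4")

# encode() returns a list of integer token IDs
tokens = enc.encode("Hello, world!")
print(f"Tokens: {len(tokens)}")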

Bottom Line

Tokenization is the invisible layer between human text and AI understanding. Understanding it helps you control costs, avoid bugs, and optimize your prompts. Every AI developer should know how their model’s tokenizer works.
