| Term |
Definition |
Example |
| Artificial Intelligence (AI) |
The broad field of creating systems that can perform tasks typically requiring human intelligence — reasoning, perception, language, decision-making. |
Chatbots, self-driving cars, recommendation engines |
| Machine Learning (ML) |
A subset of AI where systems learn patterns from data rather than following explicit rules. The model improves with more data. |
Spam filters, fraud detection, image recognition |
| Deep Learning |
ML using neural networks with many layers („deep“ networks). Powers modern AI breakthroughs in vision, language, and audio. |
GPT-4, Stable Diffusion, Whisper |
| Neural Network |
A computational model inspired by the brain. Consists of interconnected nodes (neurons) organized in layers that process information. |
The foundational architecture behind all modern AI |
| Transformer |
The neural network architecture (introduced 2017) that powers all modern LLMs. Uses „self-attention“ to understand context across long sequences. |
GPT, Claude, Gemini, Llama — all transformer-based |
| Attention Mechanism |
The core innovation of transformers. Allows the model to focus on relevant parts of the input when generating each output token. |
When translating „the cat sat on the mat,“ the model attends to „cat“ when generating the subject |
| Parameters |
The learned weights in a neural network. More parameters generally means more capacity to learn patterns (but also more compute needed). |
GPT-4: ~1.8T params; Llama 4 Scout: 17B active (109B total MoE) |
| Term |
Definition |
Example |
| Large Language Model (LLM) |
A transformer model trained on vast text data to understand and generate human language. „Large“ = billions of parameters. |
GPT-4.1, Claude 3.7, Llama 4, Gemini 2.5, Mistral Large 3 |
| Token |
The basic unit of text an LLM processes. ~4 characters in English. One token ≈ ¾ of a word on average. |
„Hello, world!“ = 4 tokens |
| Context Window |
The maximum number of tokens an LLM can process in a single request (input + output combined). |
Claude 3.7: 200K tokens (~150K words); GPT-4.1: 1M tokens |
| Prompt |
The input text that tells the LLM what to do. Can include instructions, examples, and context. |
„Summarize this article in 3 bullet points“ |
| Prompt Engineering |
The craft of designing effective prompts to get the best outputs from an LLM. Includes techniques like few-shot, chain-of-thought, and role prompting. |
Adding „Let’s think step by step“ to improve reasoning |
| Temperature |
A parameter (0-2) controlling randomness. Lower = more deterministic/focused; higher = more creative/random. |
0.0 for code/legal; 0.7 for chat; 1.2+ for creative writing |
| Hallucination |
When an LLM generates plausible-sounding factually incorrect information. The most significant reliability challenge. |
LLM confidently states a false date or invents a fake source |
| RAG (Retrieval-Augmented Generation) |
An architecture that augments an LLM with a retrieval system (usually vector search) so it can ground answers in factual, up-to-date documents. |
Enterprise Q&A over internal documentation |
| Embedding |
A numerical vector representation of text that captures semantic meaning. Similar texts have similar vectors. |
„king“ – „man“ + „woman“ ≈ „queen“ (classic word2vec example) |
| Vector Database |
A database optimized for storing and searching vector embeddings. Powers semantic search and RAG systems. |
Pinecone, Weaviate, Milvus, Qdrant, pgvector |
| Fine-tuning |
Further pre-training a base model on a specialized dataset to improve performance on specific tasks or domains. |
Training Llama on medical papers for a healthcare chatbot |
| LoRA (Low-Rank Adaptation) |
An efficient fine-tuning method that updates only small „adapter“ matrices instead of all model parameters. 100x cheaper than full fine-tuning. |
Fine-tuning a 70B model on a single GPU instead of 8 |
| QLoRA |
Quantized LoRA — combines 4-bit quantization with LoRA for even more efficient fine-tuning. |
Fine-tuning a 7B model on a consumer GPU |
| GGUF |
A file format for quantized models, optimized for local CPU/GPU inference via llama.cpp. |
llama-4-scout-17b-16e-instruct-Q4_K_M.gguf |
| Quantization |
Reducing model weight precision (e.g., 16-bit → 4-bit) to reduce memory and speed up inference with minimal quality loss. |
Q4_K_M: 4-bit quantization, ~95% of full precision quality |
| MoE (Mixture of Experts) |
Architecture where different „expert“ sub-networks handle different inputs. Only a subset activates per token, reducing compute. |
Llama 4 Scout: 16 experts, 2 active per token |
| Term |
Definition |
Example |
| AI Agent |
An LLM-based system that can take actions (call tools, write code, make API calls) to accomplish goals autonomously. |
A coding agent that reads a GitHub issue, writes code, and opens a PR |
| Tool Use / Function Calling |
The ability of an LLM to invoke external functions (APIs, databases, code execution) as part of its reasoning process. |
Agent calls weather API to answer „What’s the weather in Zurich?“ |
| Chain-of-Thought (CoT) |
Prompting technique that asks the LLM to reason step-by-step before giving the final answer. Dramatically improves complex reasoning. |
„Let’s solve this math problem step by step…“ |
| ReAct (Reasoning + Acting) |
An agent pattern that alternates between reasoning (thinking) and acting (using tools) in a loop until the task is complete. |
Think → Search → Think → Read → Think → Answer |
| Multi-Agent System |
Multiple AI agents working together, each with different roles, expertise, or perspectives, coordinated by an orchestration layer. |
Research agent + Writing agent + Review agent collaborating on a report |
| MCP (Model Context Protocol) |
An open standard (by Anthropic, 2024) for connecting LLMs to external tools and data sources. Replaces ad-hoc integrations. |
Connecting Claude to your database, file system, or APIs via MCP servers |
| A2A (Agent-to-Agent) |
Google’s protocol for agents to communicate and collaborate across different platforms and frameworks. |
A Google agent delegating a subtask to an Anthropic agent |
| Term |
Definition |
Example |
| Pre-training |
The initial training phase where a model learns general language patterns from massive text corpora (books, web, code). |
GPT-4 pre-trained on ~13T tokens of internet text |
| RLHF (RL from Human Feedback) |
Training technique where humans rank model outputs, and a reward model is trained to align the LLM with human preferences. |
ChatGPT’s helpfulness and safety alignment |
| DPO (Direct Preference Optimization) |
A simpler alternative to RLHF that directly optimizes the model on preference data without a separate reward model. |
Used to fine-tune Llama 3 and Mistral models |
| Benchmark |
A standardized test to evaluate model performance on specific tasks (reasoning, coding, math, etc.). |
MMLU (knowledge), HumanEval (coding), GSM8K (math) |
| Scaling Laws |
Empirical relationships showing that model performance improves predictably with more compute, data, and parameters. |
Chinchilla optimal: 20 tokens per parameter |
| Term |
Definition |
Example |
| Inference |
The process of running a trained model to generate outputs (as opposed to training). |
Sending a prompt to GPT-4 and receiving a response |
| vLLM |
A high-performance LLM serving engine that optimizes inference throughput using PagedAttention. |
Serving Llama 4 at 10x the throughput of naive inference |
| KV Cache |
A cache of key-value attention states from previous tokens, avoiding redundant computation during generation. |
Reduces generation time by 50-80% for long sequences |
| Batch Inference |
Processing multiple requests simultaneously to maximize GPU utilization and throughput. |
Processing 100 classification requests in one GPU pass |
| GPU VRAM |
Video RAM on a graphics card. Determines the maximum model size that can fit on a single GPU. |
NVIDIA A100: 80GB; RTX 5090: 32GB; M4 Ultra: 192GB unified |
| Term |
Definition |
Example |
| Alignment |
The process of ensuring AI systems behave in accordance with human values, intentions, and safety requirements. |
Training Claude to refuse harmful requests |
| Jailbreak |
A technique to bypass an AI model’s safety guardrails through carefully crafted prompts. |
„DAN“ (Do Anything Now) prompts that bypass content filters |
| Guardrails |
Safety mechanisms that constrain AI behavior — input filtering, output validation, content policies. |
Refusing to generate PII, hate speech, or dangerous instructions |
| Red Teaming |
Adversarial testing where experts try to find vulnerabilities, biases, or harmful outputs in an AI system. |
Hiring security researchers to probe a new LLM before release |