Chapter 02 · The mechanics

How artificial intelligence actually works.

Strip away the magic and modern AI is a stack of well-understood ideas executed at staggering scale. Here is the whole pipeline, end to end, in plain English.

Tokenization

Text, images, audio, and video are sliced into discrete units called tokens, each represented by an integer ID. A modern model processes up to tens of trillions of tokens during pre-training. Autoregressive visual models tokenize images too: LlamaGen predicts image tokens one at a time, while VAR predicts them one resolution scale at a time.
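
As a rough illustration of what slicing text into tokens means, here is a toy byte-pair-style tokenizer. The function names and the tiny corpus are invented for this sketch; production tokenizers work on bytes, use vocabularies of tens or hundreds of thousands of entries, and differ in many details.

```python
from collections import Counter

def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_merges(text, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges

def build_vocab(text, merges):
    """Assign an integer ID to every character and every merged symbol."""
    vocab = {}
    for symbol in sorted(set(text)) + [a + b for a, b in merges]:
        vocab.setdefault(symbol, len(vocab))
    return vocab

def encode(text, merges, vocab):
    tokens = list(text)
    for pair in merges:  # replay merges in the order they were learned
        tokens = apply_merge(tokens, pair)
    return [vocab[t] for t in tokens]  # raises KeyError for characters never seen in training

def decode(ids, vocab):
    by_id = {i: s for s, i in vocab.items()}
    return "".join(by_id[i] for i in ids)

corpus = "low lower lowest"  # made-up training text
merges = train_merges(corpus, num_merges=3)
vocab = build_vocab(corpus, merges)
ids = encode("lowest", merges, vocab)
print(ids, decode(ids, vocab))
```

After three merges the six characters of "lowest" compress into fewer tokens, which is the whole point: frequent fragments become single IDs.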

Embeddings

Tokens are mapped into high-dimensional vector spaces where semantic relationships are represented geometrically. Related concepts, like 'quarks' and 'leptons', or visual patches of a cat and the word 'feline', sit close to each other in these multi-thousand-dimensional spaces.
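
To make "close to each other" concrete, the sketch below measures closeness with cosine similarity. The three-dimensional vectors are hand-picked for illustration only; real embeddings are learned and have thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not taken from any real model.
EMBEDDINGS = {
    "quark":  [0.9, 0.1, 0.0],
    "lepton": [0.8, 0.2, 0.1],
    "cat":    [0.0, 0.9, 0.4],
    "feline": [0.1, 0.8, 0.5],
}

def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]))

print(nearest("quark"), nearest("cat"))
```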

Attention Mechanics

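Standard self-attention computes an n × n matrix of scores for a sequence of n tokens, allowing every token to weigh every other token, so its cost grows quadratically with context length. To scale, modern architectures use Multi-Query Attention (MQA), which shares a single key/value head across all query heads, or Grouped-Query Attention (GQA), which shares one key/value head per group of query heads. Both dramatically reduce the memory footprint of the Key-Value (KV) cache.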
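
The KV cache saving is simple arithmetic. The sketch below assumes a hypothetical model shape (32 layers, 32 query heads, head dimension 128, 16-bit values, 8,192-token context) picked only to keep the numbers round; it is not the configuration of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for one batch of sequences."""
    # Factor 2: every layer stores one tensor of keys and one of values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT = 32, 32, 128, 8192

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, CONTEXT)            # 8 KV heads, 4 query heads each
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, CONTEXT)            # a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads: 4x smaller with 8 groups, 32x smaller with a single shared head.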

Pre-training Scales

Predict the next token across 20+ trillion tokens of books, code, and synthetic datasets. Hundreds of billions to trillions of weights absorb a compressed mathematical model of human knowledge. A frontier training run today scales beyond 10^26 FLOPs and costs $100M to $1B+ in compute.
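
A back-of-the-envelope check on those numbers, using the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token (for a mixture-of-experts model, count only the active parameters). Every input below, including the accelerator throughput, utilization, and hourly price, is an illustrative assumption rather than a figure from a real run.

```python
def training_flops(num_params, num_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens

def training_cost_usd(flops, peak_flops_per_sec, utilization, usd_per_gpu_hour):
    """Convert total FLOPs into accelerator-hours, then into dollars."""
    gpu_seconds = flops / (peak_flops_per_sec * utilization)
    return gpu_seconds / 3600 * usd_per_gpu_hour

# Illustrative assumptions only: 1T parameters, 20T tokens,
# 1e15 FLOP/s peak per accelerator at 40% utilization, $2 per accelerator-hour.
flops = training_flops(num_params=1e12, num_tokens=20e12)
cost = training_cost_usd(flops, peak_flops_per_sec=1e15,
                         utilization=0.4, usd_per_gpu_hour=2.0)
print(f"{flops:.2e} FLOPs, roughly ${cost / 1e6:.0f}M")
```

Changing any single assumption moves the dollar figure a lot, which is one reason published cost estimates span an order of magnitude.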

Post-training Alignment

Raw models are refined into helpful assistants via a three-stage pipeline: Supervised Fine-Tuning (SFT); preference alignment, where DPO has largely replaced PPO and cuts GPU memory use by roughly 50% because it needs no separate reward model; and Reinforcement Learning with Verifiable Rewards (RLVR), often optimized with GRPO, for coding and math.
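
For the preference-alignment stage, the DPO objective for a single preference pair fits in a few lines. This is a minimal sketch of the published loss, assuming the four sequence log-probabilities have already been computed by the policy and by a frozen reference model; the function names and the beta value are choices made for this example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response.
    """
    # Implicit rewards: how much more the policy likes each response than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss falls as the chosen response's margin over the rejected one grows.
    return -log_sigmoid(chosen_reward - rejected_reward)

# Hypothetical log-probabilities for illustration.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy identical to reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy now prefers the chosen response
print(untrained, improved)
```

When the policy matches the reference, the loss is log 2; it drops as the policy shifts probability toward the preferred response. No reward model appears anywhere in the computation.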

Reasoning & thinking

Reasoning models (OpenAI o1/o3, DeepSeek-R1, Qwen's QwQ) scale test-time compute. By emitting long internal 'thinking' traces, often 20,000–60,000 tokens on hard problems, models perform multi-step search, self-correction, and logical verification before producing their final answer.
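
The underlying trade, more inference compute for a better chance of a correct answer, can be shown with a deliberately crude generate-and-verify loop. This is a toy stand-in, not how any of the models above work internally; `solve_with_budget` and the arithmetic task are invented for the example.

```python
def solve_with_budget(candidates, verify, budget):
    """Check up to `budget` candidates and return the first one that verifies.

    Returns (answer, attempts_used); answer is None if the budget runs out.
    """
    attempts = 0
    for candidate in candidates:
        if attempts >= budget:
            break
        attempts += 1
        if verify(candidate):
            return candidate, attempts
    return None, attempts

def is_root_of_1369(x):
    """Toy verifier: is x the integer square root of 1369?"""
    return x * x == 1369

small = solve_with_budget(range(1, 101), is_root_of_1369, budget=10)
large = solve_with_budget(range(1, 101), is_root_of_1369, budget=50)
print(small, large)
```

With a budget of 10 checks the search gives up; with 50 it finds the answer. Reasoning models replace the blind enumeration with learned proposals and self-checks, but the budget knob plays the same role.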

Flow Matching & Generation

Modern visual generation has shifted from classic pixel-space diffusion to Latent Diffusion Models (LDMs) and flow matching, including its rectified-flow variant. Rectified flow trains the neural network to predict the velocity that carries a sample along a straight-line path from noise to data, which improves training stability and sample quality.
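
The straight-line idea is compact enough to write down. In the sketch below, a training example is a point on the line between a noise sample and a data sample, and the regression target is the constant velocity along that line. The helper names are made up for this example, and an oracle function stands in for the trained network.

```python
def interpolate(noise, data, t):
    """Point at time t on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity of the straight-line path: constant, equal to data minus noise."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predicted_velocity, noise, data):
    """Mean squared error between a predicted velocity and the straight-line target."""
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)

def euler_sample(velocity_fn, start, num_steps):
    """Generate a sample by integrating the velocity field from t=0 to t=1."""
    x, dt = list(start), 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 2-D example with one noise/data pair.
noise, data = [0.0, 2.0], [1.0, -1.0]

def oracle(x, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(noise, data)

sample = euler_sample(oracle, noise, num_steps=4)
print(interpolate(noise, data, 0.5), sample)
```

Because the path is straight, even a handful of Euler steps lands on the data point when the velocity is predicted well, which hints at why these models can sample in few steps.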

Agentic Execution

The model runs in loops: writing code, executing it in sandboxes, calling APIs, and reading documents. Rather than giving single-turn answers, tools like Claude Code and Devin behave as autonomous teammates, combining planning and execution.
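
Stripped to its skeleton, such a loop looks something like the sketch below. Everything here is hypothetical: `scripted_model` is a hard-coded stand-in for a language model, the action format is invented, and the single tool is a trivial word counter. Real agents add sandboxing, permissions, and error handling.

```python
def run_agent(model, tools, task, max_steps=5):
    """Alternate between asking the model for an action and running the chosen tool."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = model(history)
        if "final" in action:              # the model decided it is done
            return action["final"], history
        tool = tools[action["tool"]]       # look up the requested tool by name
        observation = tool(*action["args"])
        history.append((action["tool"], observation))  # feed the result back in
    return None, history                   # step budget exhausted

def word_count(text):
    """A deliberately trivial tool."""
    return len(text.split())

def scripted_model(history):
    """Hard-coded stand-in for an LLM: call one tool, then answer."""
    kind, value = history[-1]
    if kind == "task":
        return {"tool": "word_count", "args": ["the quick brown fox"]}
    return {"final": f"The text has {value} words."}

answer, history = run_agent(scripted_model, {"word_count": word_count},
                            task="Count the words in the text.")
print(answer)
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer would loop forever.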

Field glossary

The vocabulary of the frontier.

GQA & MQA: Grouped-Query and Multi-Query Attention. MQA shares a single key/value head across all query heads; GQA groups them. Crucial for reducing the Key-Value cache memory bottleneck during long-context inference.
Mixture of Experts (MoE): Only a subset of parameters ('experts') activates per token. Enables scaling total parameters (e.g. DeepSeek V4's ~1T total parameters / 50–60B active, Llama 4 Maverick's 400B total / 17B active) while keeping per-token compute low.
State Space Models (SSMs): Transformer alternatives, such as Mamba 3, that process sequences with linear computational complexity O(n) instead of the quadratic O(n²) of self-attention, offering large speedups for long context windows.
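Linear Recurrent Units (LRU): Recurrent architectures that keep the state update linear, with no non-linearity inside the recurrence. Because a linear recurrence can be computed with a parallel scan, training parallelizes across the sequence (similar to transformers) while inference retains the constant-size, O(1) state of RNNs.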
Test-Time Compute: Shifting scaling laws from training to inference. Instead of relying only on larger models, reasoning models spend additional compute at inference time, using behaviors learned through reinforcement learning to self-correct during generation.
Constitutional AI: Anthropic's post-training method. The model critiques and refines its own outputs based on a written list of principles (an 80-page constitution as of 2025), reducing human label costs by 100-1,000×.
DPO vs PPO: Direct Preference Optimization trains the model directly on pairs of preferred and rejected responses without a separate reward model. It is more stable and requires roughly half the VRAM of standard Proximal Policy Optimization (PPO).
Model Collapse: A degradation loop where AI models trained on synthetic data from prior models progressively lose the rare cases in the original data distribution, then diversity and quality overall. It highlights the need for validation loops like compilers or formal math proof checkers.
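
To ground the Mixture of Experts entry above, here is a minimal sketch of top-k routing on a single scalar input. The experts are plain functions and the router logits are hard-coded; in a real model both are learned, inputs are vectors, and the exact gating formula (for instance, whether weights are renormalized after the top-k cut, as assumed here) varies between architectures.

```python
import math

def softmax(logits):
    """Convert router scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

calls = []  # records which experts actually ran

def make_expert(index, scale):
    def expert(x):
        calls.append(index)
        return scale * x
    return expert

# Four hypothetical experts; the router logits would normally come from a learned layer.
experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
output = moe_forward(10.0, experts, router_logits=[0.1, 2.0, -1.0, 1.5], k=2)
print(output, sorted(calls))
```

Only two of the four experts execute for this input. That is the mechanism behind figures like 17B active out of 400B total: about 4% of the weights do the work for any given token.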

Why this matters

Every leap of the last five years — ChatGPT, image generation, code agents, real-time voice, reasoning models, video — comes from the same recipe: bigger transformers, better data, smarter training, more compute. The recipe is still scaling.
