- GQA & MQA
- Grouped-Query and Multi-Query Attention. MQA shares a single key/value head across all query heads; GQA splits the query heads into groups and shares one key/value head per group. Both shrink the key-value (KV) cache, which is the main memory bottleneck during long-context inference.
- Mixture of Experts (MoE)
- Only a subset of parameters ('experts') activates per token, selected by a learned router. This lets total parameter count scale (e.g. a reported ~1T total / 50-60B active for DeepSeek V4, and 400B total / 17B active for Llama 4 Maverick) while per-token compute tracks the active parameters rather than the total.
- State Space Models (SSMs)
- Alternatives to the transformer, such as Mamba 3, that process a sequence of length n in linear time O(n) rather than the quadratic O(n²) of self-attention. The gap widens as the context grows, so the benefit is largest for long-context windows.
- Linear Recurrent Units (LRU)
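- Recurrent architectures that remove the non-linearity from the state update. Because a linear recurrence can be evaluated with a parallel scan, training parallelizes across the sequence (as with transformers), while inference keeps the fixed-size state of an RNN: O(1) memory with respect to sequence length.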
- Test-Time Compute
- Shifting scaling from training to inference. Instead of relying only on larger models, reasoning models spend additional compute at inference time by generating longer chains of thought. Reinforcement learning during post-training teaches them to check and correct their own reasoning as they generate.
- Constitutional AI
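- Anthropic's post-training method. The model critiques and revises its own outputs against a written list of principles (an 80-page constitution as of 2025), and AI-generated preference labels then stand in for most human labels during reinforcement learning, reducing human labeling costs by an estimated 100-1,000×.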
- DPO vs PPO
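- Direct Preference Optimization (DPO) trains the model directly on pairs of preferred and rejected responses, without a separate reward model. It is generally more stable to train, and it holds only two models in memory (the policy and a frozen reference) instead of the four used by standard Proximal Policy Optimization (PPO) pipelines (policy, reference, reward model, and value model), which roughly halves VRAM use.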
- Model Collapse
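- A degradation loop in which models trained on synthetic data from earlier models progressively lose the rare cases in the original data distribution and eventually degrade in overall quality. It highlights the need to filter synthetic data through external validators such as compilers or formal proof checkers.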
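
To make the GQA & MQA entry concrete, the following is a rough back-of-the-envelope sketch of KV cache size as a function of the number of key/value heads. The model shape (32 layers, 32 query heads, head dimension 128, 32,768-token context, 16-bit cache entries) is a hypothetical example and is not taken from any specific model; `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for a batch of sequences."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 32_768

configs = {
    "MHA (32 KV heads)": 32,  # one KV head per query head
    "GQA (8 KV heads)": 8,    # 4 query heads share each KV head
    "MQA (1 KV head)": 1,     # all query heads share one KV head
}

for name, n_kv_heads in configs.items():
    gib = kv_cache_bytes(N_LAYERS, n_kv_heads, HEAD_DIM, SEQ_LEN) / 2**30
    print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, since every other factor in the product stays fixed.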
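
The routing idea in the Mixture of Experts entry can be sketched in a few lines. This is a toy, single-token version under stated assumptions: the router is a plain matrix of hypothetical weights, each "expert" is a trivial scaling function rather than a feed-forward network, and gate weights are renormalized over the selected experts (one common choice, not the only one). Names such as `moe_forward` and `make_expert` are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    # Toy expert: scales its input. A real expert would be a feed-forward network.
    return lambda x: [scale * xi for xi in x]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route one token vector x to its top_k experts and mix their outputs."""
    # Router: one logit per expert (dot product of x with that expert's router row).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        y = experts[i](x)  # only the chosen experts are ever evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


# Four toy experts and an identity router, so each logit equals one input coordinate.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

token = [0.1, 2.0, -1.0, 1.0]
output, chosen = moe_forward(token, router_weights, experts, top_k=2)
print("experts used:", chosen)
print("output:", [round(v, 4) for v in output])
```

Only two of the four experts run for this token, which is the sense in which active parameters stay well below total parameters.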
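
The Linear Recurrent Units entry rests on the fact that a linear recurrence h_t = a_t * h_{t-1} + b_t can be computed with an associative operator, which is what makes a parallel scan possible. The sketch below uses scalar states and a simple doubling scan in plain Python to show the equivalence; it is an illustration under those simplifying assumptions, not the formulation of any particular LRU or SSM implementation, which typically use vector states and hardware-specific scan kernels.

```python
def sequential_scan(a, b):
    """Reference RNN-style loop: h_t = a_t * h_{t-1} + b_t, with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def doubling_scan(a, b):
    """Inclusive prefix scan using the associative `combine` operator."""
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        # Each position within a pass is independent of the others,
        # so on parallel hardware a pass could run in one step.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
    # With h_0 = 0, the offset term of each composed map is h_t itself.
    return [b_t for _, b_t in elems]


a = [0.9, 0.5, 0.8, 0.7, 0.6]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sequential_scan(a, b))
print(doubling_scan(a, b))
```

The doubling scan needs about log2(n) passes instead of n sequential steps. A non-linearity inside the state update would break the associativity that `combine` relies on, which is why these architectures keep the recurrence linear.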
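
For the DPO vs PPO entry, the DPO objective for a single preference pair can be written directly in terms of sequence log-probabilities from the policy and the frozen reference model. The sketch below assumes those four log-probabilities have already been computed; the numeric inputs and the `beta` value are made-up illustrations, and `dpo_loss` is a hypothetical helper rather than a library API.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
baseline = dpo_loss(-13.0, -14.0, -13.0, -14.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)

print(f"baseline: {baseline:.4f}, improved: {improved:.4f}")
```

No reward model or value model appears anywhere in the computation; the only extra network needed beyond the policy is the frozen reference that supplies the two `ref_*` log-probabilities.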