World Models for Autonomous Driving Simulation: A Deep Dive
Focus: Generative world models as learned simulators for autonomous driving
Key Systems: GAIA-1 (Wayve), Waymo World Model (Genie 3), DriveDreamer, Epona, GenAD
Read Time: 60 min
Table of Contents
- Executive Summary
- Background and Motivation
- Foundational Concepts
- Key World Models for AD
- Three-Tier Taxonomy
- Technical Deep Dive
- World Models vs Reconstruction-Based Simulation
- Code Examples
- Mental Models and Diagrams
- Hands-On Exercises
- Interview Questions
- References
Executive Summary
The Core Idea
World models are neural networks that learn to simulate the world. Instead of hand-engineering physics, rendering pipelines, and agent behaviors, a world model observes real driving data and learns to predict what happens next -- given the current state and an action, it generates the future state of the world.
Traditional Simulation: World Model Simulation:
Hand-crafted Scripted Learned from Emergent
physics engine behaviors real data behaviors
| | | |
v v v v
+-----------+ +-----------+ +---------------------------+
| Physics | | Behavior | | Neural World Model |
| Engine |--->| Scripts | | (Single Learned Model) |
| (Bullet/ | | (Rule- | | |
| PhysX) | | based) | | Input: state + action |
+-----------+ +-----------+ | Output: next state |
| | +---------------------------+
v v |
+-----------+ +-----------+ v
| Renderer | | Sensor | +---------------------------+
| (Raster/ | | Sim | | Camera, LiDAR, BEV, |
| RT) | | | | Occupancy -- all in one |
+-----------+ +-----------+ +---------------------------+
Why This Matters Now
Three converging trends have made world models for AD practical:
- Scaling laws for video generation: Models like Sora, Genie 2/3, and Cosmos demonstrated that video generation quality scales predictably with compute and data.
- Sensor-realistic generation: Recent models generate not just camera images but also LiDAR point clouds, BEV maps, and occupancy grids -- the full sensor suite needed for AD testing.
- Controllability breakthroughs: Language-conditioned and action-conditioned generation means engineers can specify scenarios ("a pedestrian jaywalks from behind a parked truck") rather than hand-scripting them.
The Landscape (Early 2026)
| System | Organization | Date | Key Innovation |
|---|---|---|---|
| GAIA-1 | Wayve | Jul 2023 | First large-scale (9B param) AD world model |
| DriveDreamer | GigaAI | Oct 2023 | First from real driving data, diffusion-based |
| ADriver-I | CASIA | Nov 2023 | Interleaved vision-action tokens |
| GenAD | CUHK/SenseTime | Mar 2024 | Temporal reasoning, action-conditioned |
| OccWorld | PKU | May 2024 | 3D occupancy world model |
| Vista | MIT/MBZUAI | May 2024 | High-fidelity long-horizon driving simulator |
| Dreamland | NVIDIA | Jun 2024 | Physics simulator + video generator hybrid |
| Epona | Various | 2025 | Autoregressive diffusion world model |
| Waymo WM | Waymo/DeepMind | Feb 2026 | Genie 3 backbone, camera+LiDAR, language control |
Background and Motivation
What Are World Models?
A world model is a learned function that predicts the next state of the environment given the current state and an action:
s_{t+1} = f_theta(s_t, a_t)
Where:
- s_t is the current world state (images, point clouds, maps, agent states)
- a_t is the action taken (steering, acceleration, or higher-level commands)
- f_theta is a neural network with learned parameters theta
- s_{t+1} is the predicted next state
The term originates from cognitive science (Kenneth Craik, 1943) and was formalized for RL by Ha and Schmidhuber (2018) in "World Models" -- where an agent learns a compressed representation of the environment to plan actions within an imagined future.
For autonomous driving, this concept becomes:
Given: current camera images, LiDAR scans, HD map, ego vehicle action
Predict: future camera images, LiDAR scans, positions of all agents
More formally:
{I_{t+1}, L_{t+1}, B_{t+1}} = WorldModel({I_t, L_t, B_t, M}, a_t)
Where:
I = camera images (multi-view)
L = LiDAR point cloud
B = BEV representation
M = HD map / road topology
a = ego action (steering, acceleration)
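As a concrete (toy) illustration of this contract, here is a minimal Python sketch. The `WorldState` container and the identity dynamics are illustrative stand-ins for exposition, not any real system's API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class WorldState:
    """One time step of the simulated world (shapes are illustrative)."""
    images: np.ndarray   # [num_cams, H, W, 3] multi-view camera images
    lidar: np.ndarray    # [N, 3] LiDAR point cloud
    bev: np.ndarray      # [H_bev, W_bev, C] bird's-eye-view features


def world_model_step(state: WorldState, action: np.ndarray) -> WorldState:
    """Predict s_{t+1} = f_theta(s_t, a_t). Stubbed with identity dynamics."""
    # A real model would run a learned network here; we pass the state
    # through unchanged to show the contract: (state, action) -> next state.
    assert action.shape == (2,)  # (steering_angle, acceleration)
    return WorldState(state.images, state.lidar, state.bev)


s0 = WorldState(np.zeros((6, 4, 4, 3)), np.zeros((10, 3)), np.zeros((4, 4, 8)))
s1 = world_model_step(s0, np.array([0.1, 0.5]))
```

Rolling this step function forward in a loop is exactly what "imagining the future" means in the world-model literature.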
Why Generative Simulation Matters
Traditional simulation has a fundamental problem: everything must be authored. Every building, every pedestrian behavior, every lighting condition, every sensor artifact must be explicitly modeled by engineers. This creates three critical gaps:
1. The Long-Tail Problem
Frequency of Scenarios in Real Driving:
|
|####
|########
|############ <-- Common scenarios
|################## (well-covered by traditional sim)
|########################
|##############################
|####################################
|############################################
|######################################################
|##################################################################
|########################################################################
| ^^^^^^^^^^^^^^^^^^^^^^^^
| Long-tail scenarios
| (hard to author manually,
| but critical for safety)
+---------------------------------------------------------------------->
Scenario Type (ordered by frequency)
Real driving encounters an enormous variety of rare situations. Manually authoring these is expensive and incomplete. World models can generate novel scenarios by recombining patterns learned from data.
2. The Realism Problem
Hand-crafted simulators produce outputs that look synthetic. Neural networks trained on these synthetic outputs develop biases. World models generate outputs that are distributionally closer to real sensor data because they learned directly from it.
3. The Scalability Problem
Creating a new city in a traditional simulator takes months. A world model trained on diverse data can generate new environments by interpolating between learned representations. Conditioning on maps, language, or style tokens enables rapid scenario creation.
Reconstruction-Based vs Generative Approaches
The AD simulation community has two major paradigms for creating realistic sensor data:
Reconstruction-Based (3DGS / NeRF)
These approaches reconstruct a 3D representation of a specific real-world scene from captured sensor data, then re-render from novel viewpoints.
Real Drive Log ──> 3D Reconstruction ──> Novel View Synthesis
(3DGS, NeRF) (move camera, change
agent positions)
Key systems: UniSim (Waabi), MARS, StreetGaussians, SplatAD
Strengths:
- Photorealistic for the specific captured scene
- Geometrically consistent (real 3D structure)
- Good for replay-with-perturbation testing
Limitations:
- Bound to captured scenes (cannot generate truly new environments)
- Requires dense capture data for each scene
- Editing (adding/removing objects) is challenging
- Dynamic objects are particularly hard
Generative / World Models
These approaches learn a generative model from large datasets and can synthesize entirely new scenarios.
Large Driving Dataset ──> Train World Model ──> Generate New Scenarios
(diffusion, AR) (action-conditioned,
language-controlled)
Key systems: GAIA-1, Waymo World Model, DriveDreamer, GenAD
Strengths:
- Can generate entirely new scenarios never seen in training data
- Naturally handles long-tail and rare events
- Scalable -- one model covers many environments
- Controllable via language, actions, or layouts
Limitations:
- May hallucinate physically implausible scenarios
- Geometric consistency not guaranteed
- Evaluation is harder (no ground truth for generated scenes)
- Computationally expensive at inference time
Comparison Table
| Dimension | Reconstruction (3DGS/NeRF) | Generative (World Models) |
|---|---|---|
| Scene Coverage | Only captured scenes | Any scene (generalized) |
| Geometric Accuracy | High (real 3D) | Approximate (learned) |
| Photorealism | Very high for captured scene | High but can hallucinate |
| Novel Scenario Gen | Limited (perturbation only) | Strong (recombination) |
| Data Requirements | Dense per-scene capture | Large diverse dataset |
| Edit/Control | Limited (object removal hard) | Flexible (language, action) |
| Physical Consistency | Depends on physics model | Learned (can be wrong) |
| Multi-Sensor | Per-sensor reconstruction | Joint generation possible |
| Compute (Training) | Per-scene optimization | One-time large training |
| Compute (Inference) | Fast rendering | Slow generation |
| Industry Example | Applied Intuition | Waymo World Model |
| Best For | Regression testing on real routes | Scenario exploration, long-tail |
Foundational Concepts
Autoregressive Sequence Modeling for Video and Scenes
Autoregressive (AR) models generate sequences one token at a time, where each new token is conditioned on all previous tokens. For world models, the "tokens" can be image patches, latent codes, or discrete codes from a VQ-VAE.
Autoregressive Video Generation:
Frame 1 Frame 2 Frame 3 Frame 4
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ │───>│ │───>│ │───>│ ? │
│ z_1 │ │ z_2 │ │ z_3 │ │ z_4 │
└─────┘ └─────┘ └─────┘ └─────┘
| | | ^
v v v |
┌──────────────────────────────────────┐
│ Transformer (causal mask) │
│ │
│ P(z_4 | z_1, z_2, z_3, action) │
└──────────────────────────────────────┘
Each z_t is a set of discrete tokens (VQ-VAE codes)
or continuous latent vectors (VAE embeddings).
How GAIA-1 uses this: GAIA-1 tokenizes video frames, text descriptions, and actions into a single sequence of discrete tokens, then trains a large Transformer (9B parameters) to predict the next token autoregressively. This is conceptually identical to how GPT generates text -- but the "vocabulary" includes visual tokens.
Key design choices:
- Tokenization: VQ-VAE or VQ-GAN converts images to discrete tokens
- Sequence ordering: Raster-scan within frame, temporal order across frames
- Conditioning: Actions and text are interleaved as special tokens
- Sampling: Temperature and top-k/top-p control generation diversity
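To make the sampling choices above concrete, here is a toy autoregressive rollout with temperature and top-k sampling. The `next_token_logits` function is a fixed stand-in for the trained transformer, and the 16-token vocabulary is purely illustrative (GAIA-1's codebook is far larger):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy codebook size


def next_token_logits(tokens: list[int]) -> np.ndarray:
    """Stand-in for the transformer: unnormalized scores over the codebook,
    conditioned (here only trivially) on the token history."""
    return np.sin(np.arange(VOCAB) * (1 + len(tokens)))


def sample_top_k(logits: np.ndarray, k: int, temperature: float) -> int:
    """Temperature + top-k sampling, the knobs that control diversity."""
    scaled = logits / temperature
    top = np.argsort(scaled)[-k:]                  # keep the k best tokens
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))


# Autoregressive rollout: each new token conditions on all previous ones.
tokens: list[int] = []
for _ in range(8):
    tokens.append(sample_top_k(next_token_logits(tokens), k=4, temperature=0.8))
```

In a real model the rollout alternates visual, action, and text tokens in one interleaved sequence, but the sampling loop has the same shape.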
Diffusion Models for Generation
Diffusion models generate data by learning to reverse a noise-addition process. Starting from pure Gaussian noise, the model iteratively denoises to produce a clean sample.
Forward Process (Training):
Clean Image ──> Add Noise ──> Add More Noise ──> ... ──> Pure Noise
x_0 x_1 x_2 x_T
q(x_t | x_{t-1}) = N(sqrt(1-beta_t) * x_{t-1}, beta_t * I)
Reverse Process (Generation):
Pure Noise ──> Denoise ──> Denoise ──> ... ──> Clean Image
x_T x_{T-1} x_{T-2} x_0
p_theta(x_{t-1} | x_t) -- learned by neural network
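The forward process has a closed form that lets training jump directly to any noise level: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta_t). A minimal numpy sketch, assuming a linear noise schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (an assumption)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # alpha_bar_t = prod_{s<=t} alpha_s


def q_sample(x0: np.ndarray, t: int) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in one step using the closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps


x0 = np.ones((8, 8))                   # a toy "clean image"
x_early = q_sample(x0, t=10)           # still close to x0
x_late = q_sample(x0, t=T - 1)         # nearly pure Gaussian noise
```

Training draws random (x0, t) pairs, applies `q_sample`, and teaches a network to predict the noise `eps` that was added.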
For driving world models, diffusion offers several advantages over AR:
- Spatial coherence: Generates the entire frame at once (no raster-scan artifacts)
- Continuous outputs: Naturally handles continuous-valued sensor data
- Classifier-free guidance: Strong controllability through conditioning
- Quality: State-of-the-art image quality at high resolutions
Video diffusion extends this to sequences by operating on space-time volumes:
Video Diffusion Model:
Input: Noisy video volume [T, H, W, C] + conditioning signals
|
v
┌──────────────────────────┐
│ 3D U-Net / DiT │
│ │
│ Spatial Attention │
│ Temporal Attention │
│ Cross-Attention (cond) │
│ │
└──────────────────────────┘
|
v
Predicted noise (or clean video)
Action-Conditioned Generation
The defining feature of a world model (vs a generic video generator) is action conditioning -- the model's output changes based on what action the ego vehicle takes.
Action-Conditioning Mechanisms:
1. TOKEN CONCATENATION (GAIA-1 style):
[img_tokens_t, action_t, img_tokens_{t+1}, action_{t+1}, ...]
- Action is just another token in the sequence
2. CROSS-ATTENTION (Diffusion style):
┌──────────────────┐
│ Video Denoiser │ <─── Cross-Attend ──── Action Embedding
└──────────────────┘
3. CONTROL SIGNAL INJECTION (ControlNet style):
┌──────────────────┐
│ Video Denoiser │
│ + │
│ Action Branch │ <─── Action sequence [a_1, ..., a_T]
│ (parallel net) │
└──────────────────┘
4. FiLM CONDITIONING:
For each layer: h = gamma(action) * h + beta(action)
- Action modulates feature maps via learned affine transforms
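A minimal FiLM layer in numpy, following the formula in mechanism 4 above; the linear map from action to (gamma, beta) is an illustrative stand-in for a learned encoder:

```python
import numpy as np


class FiLM:
    """Feature-wise Linear Modulation: h <- gamma(action) * h + beta(action).
    The action vector is mapped (here by a fixed random linear layer,
    an illustrative stand-in) to per-channel scale and shift parameters."""

    def __init__(self, action_dim: int, channels: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((action_dim, 2 * channels)) * 0.1
        self.channels = channels

    def __call__(self, h: np.ndarray, action: np.ndarray) -> np.ndarray:
        gb = action @ self.W                       # [2C] scale and shift
        gamma, beta = gb[: self.channels], gb[self.channels:]
        # Broadcast over spatial dims: modulate each channel of the feature map.
        return gamma * h + beta


film = FiLM(action_dim=2, channels=8)
h = np.ones((4, 4, 8))                             # [H, W, C] feature map
out = film(h, np.array([0.2, -0.5]))               # (steering, acceleration)
```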
Action representations for driving:
| Representation | Dimensionality | Description |
|---|---|---|
| Low-level | 2D | (steering_angle, acceleration) |
| Waypoints | T x 2D | Future ego positions [(x1,y1), ..., (xT,yT)] |
| High-level | Categorical | {turn_left, go_straight, turn_right, stop} |
| Language | Variable | "Turn right at the next intersection" |
| Trajectory + speed | T x 3D | [(x, y, speed)_1, ..., (x, y, speed)_T] |
Multi-Modal Output Generation
Real AD systems consume data from multiple sensors. A world model must generate consistent outputs across these modalities.
Multi-Modal World Model:
┌─────────────────────────┐
│ Shared Latent Space │
│ │
│ z_t (world state) │
│ │
└────┬──────┬──────┬──────┘
│ │ │
┌────┘ │ └────┐
│ │ │
v v v
┌──────────┐ ┌────────┐ ┌──────────┐
│ Camera │ │ LiDAR │ │ BEV │
│ Decoder │ │ Decoder│ │ Decoder │
└──────────┘ └────────┘ └──────────┘
│ │ │
v v v
Multi-view Point Bird's Eye
Images Cloud View Map
The joint generation challenge: Camera images and LiDAR point clouds must be geometrically consistent. A car visible in the camera image must also appear as points in the LiDAR scan at the correct 3D position. This is typically achieved in one of three ways:
- Shared 3D representation: Generate a 3D scene first, then render to each sensor
- Cross-modal attention: Let the camera and LiDAR decoders attend to each other
- Joint latent space: Encode both modalities into a unified latent, decode together
The Waymo World Model (2026) achieves this by operating in a shared representation space where camera and LiDAR features are jointly processed before being decoded to their respective modalities.
Key World Models for AD
GAIA-1 (Wayve, July 2023)
GAIA-1 was a landmark paper that demonstrated world models could work at scale for autonomous driving. It showed that the same scaling laws powering LLMs apply to learned driving simulators.
Architecture:
GAIA-1 Architecture:
Video Frames ──> Image Tokenizer (VQ-VAE) ──> Visual Tokens
Text Prompts ──> Text Tokenizer ──> Text Tokens
Actions ──> Action Tokenizer ──> Action Tokens
│ │ │
└────────────────┼───────────────────────────┘
│
v
┌──────────────────────┐
│ World Model │
│ (Transformer, 9B) │
│ │
│ Autoregressive next │
│ token prediction │
└──────────────────────┘
│
v
┌──────────────────────┐
│ Video Decoder │
│ (VQ-VAE decoder + │
│ upsampling) │
└──────────────────────┘
│
v
Generated Video
(realistic driving scenes)
Key specifications:
| Property | Value |
|---|---|
| Parameters | 9 billion |
| Training Data | Wayve's proprietary UK driving data |
| Input Modalities | Video, text, action |
| Output | Video (camera images) |
| Resolution | 288 x 512 (base), upsampled to higher res |
| Frame Rate | ~25 FPS generation |
| Sequence Length | Variable, demonstrated up to ~30 seconds |
| Tokenizer | VQ-VAE with 8192 codebook size |
Key contributions:
- Scaling: Showed that bigger models produce better simulations (9B >> 1B)
- Multi-modal conditioning: Text descriptions + actions jointly control generation
- Emergent understanding: The model learned 3D structure, object permanence, and basic physics without explicit supervision
- Generalization: Could generate scenarios not seen in training data
Limitations:
- Camera-only (no LiDAR generation)
- Proprietary data and model (not reproducible)
- Temporal consistency degrades over long horizons
- No explicit physics guarantees
Waymo World Model (February 2026)
Waymo's World Model represents the most capable publicly announced AD world model as of early 2026. Built on DeepMind's Genie 3 foundation, it combines advances in video generation with AD-specific capabilities.
Architecture Overview:
Waymo World Model (built on Genie 3):
┌───────────────────────────────────────────────────────────┐
│ INPUT ENCODING │
│ │
│ Multi-view Cameras ──> Visual Encoder ──┐ │
│ LiDAR Point Cloud ──> LiDAR Encoder ──┼──> Fused │
│ HD Map / Road Graph ──> Map Encoder ──┘ State z_t │
│ Language Command ──> Text Encoder ──────┘ │
│ Ego Action ──> Action Encoder ─────┘ │
└───────────────────────────────────────────────────────────┘
│
v
┌───────────────────────────────────────────────────────────┐
│ GENIE 3 BACKBONE │
│ │
│ Spatiotemporal Transformer with: │
│ - Latent action model (learns action space) │
│ - Dynamics model (predicts next latent state) │
│ - Scalable architecture (billions of parameters) │
│ │
│ Key innovation from Genie 3: │
│ - Operates on latent tokens, not raw pixels │
│ - Learned action representations │
│ - Consistent generation over long horizons │
└───────────────────────────────────────────────────────────┘
│
v
┌───────────────────────────────────────────────────────────┐
│ MULTI-MODAL DECODERS │
│ │
│ z_{t+1} ──> Camera Decoder ──> Multi-view images │
│ ──> LiDAR Decoder ──> Point cloud │
│ ──> BEV Decoder ──> Bird's eye view │
│ ──> Agent Decoder ──> Bounding boxes / tracks │
└───────────────────────────────────────────────────────────┘
Key capabilities:
- Multi-sensor generation: Simultaneously generates camera images AND LiDAR point clouds that are geometrically consistent with each other.
- Language controllability: Natural language commands control what happens in the generated scenario.
  - "A cyclist enters the intersection from the left"
  - "The lead vehicle brakes suddenly"
  - "Heavy rain reduces visibility"
- Rare event generation: Can generate safety-critical scenarios that are extremely rare in real driving data.
  - Pedestrian darting from behind occluded vehicle
  - Multi-vehicle chain-reaction collisions
  - Sensor degradation scenarios (rain, fog, glare)
- Long-horizon consistency: Built on Genie 3's architecture for maintaining coherent generation over extended time periods.
- Closed-loop interaction: Generated world responds to the AD system's actions, enabling closed-loop testing.
Why Genie 3 matters as a foundation:
Genie (1, 2, 3) is DeepMind's line of generative interactive environment models. The key innovations that carry over to the Waymo World Model:
- Latent action models: Instead of requiring predefined action spaces, Genie learns meaningful action representations from observation data alone
- Scalable spatiotemporal architecture: Efficiently processes video-length sequences with attention mechanisms optimized for space-time data
- Interactive generation: The model generates frames that respond causally to input actions, not just plausible-looking video continuations
DriveDreamer (GigaAI, October 2023)
DriveDreamer was the first world model built entirely from real-world driving scenarios using a diffusion-based approach.
Architecture:
DriveDreamer Pipeline:
Real Driving Data
│
├──> 3D Boxes + HDMap ──> Structured Representation
│ │
│ v
│ ┌─────────────────────┐
│ │ Layout Encoder │
│ │ (3D box positions, │
│ │ road topology) │
│ └─────────┬───────────┘
│ │
v v
┌──────────────┐ ┌──────────────────────┐
│ Past Frames │────────>│ Video Diffusion │
│ (context) │ │ Model │
└──────────────┘ │ (Stable Diffusion │
│ + temporal layers) │
└──────────┬───────────┘
│
v
Future Frame Prediction
Key contributions:
- First to use structured driving representations (3D boxes, HD maps) as conditioning for a diffusion-based video generator
- Demonstrated that diffusion models can produce temporally consistent driving videos
- Introduced a two-stage training: (1) align visual features with structured representations, (2) train conditional video generation
Epona (ICCV 2025)
Epona combines the best of autoregressive and diffusion approaches into a unified autoregressive diffusion world model.
Epona: Autoregressive Diffusion
Frame t-2 Frame t-1 Frame t Frame t+1 (generate)
┌──────┐ ┌──────┐ ┌──────┐ ┌──────────────┐
│ │ │ │ │ │ │ Diffusion │
│ z │───>│ z │───>│ z │────>│ Process │
│ │ │ │ │ │ │ (iterative │
└──────┘ └──────┘ └──────┘ │ denoising) │
└──────────────┘
│
Autoregressive: each frame depends v
on all previous frames Generated Frame t+1
Diffusion: each individual frame
is generated via diffusion denoising
HYBRID BENEFIT:
- AR gives temporal consistency (causal structure)
- Diffusion gives spatial quality (parallel pixel generation)
Key insight: Rather than choosing between AR (good temporal coherence) and diffusion (good spatial quality), Epona uses AR to model the temporal sequence while using diffusion to generate each individual frame. This is conceptually similar to how MAR (Masked Autoregressive generation) works for images, but applied to video.
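The hybrid structure can be sketched as follows. The `denoise_step` function is a toy stand-in for Epona's learned denoiser, included only to make the AR-outer / diffusion-inner nesting explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 4  # latent dimensions per frame (toy)


def denoise_step(x: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Stand-in for one learned denoising step, conditioned on the AR
    context (past frames). Here it simply pulls x toward the context mean."""
    return 0.9 * x + 0.1 * context.mean(axis=0)


def generate_frame(history: list[np.ndarray], n_steps: int = 20) -> np.ndarray:
    """Diffusion inner loop: start from noise, iteratively denoise,
    conditioned on the full frame history."""
    context = np.stack(history)
    x = rng.standard_normal(LATENT)
    for _ in range(n_steps):
        x = denoise_step(x, context)
    return x


# Autoregressive outer loop: each new frame conditions on all previous ones.
frames = [np.zeros(LATENT)]
for _ in range(3):
    frames.append(generate_frame(frames))
```

The causal outer loop gives temporal consistency; the denoising inner loop gives per-frame spatial quality.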
Dreamland (NVIDIA, 2024)
Dreamland takes a unique hybrid approach: it couples a traditional physics simulator with a neural video generator to get the best of both worlds.
Dreamland Architecture:
┌─────────────────────────────────────────────────────┐
│ PHYSICS SIMULATOR │
│ │
│ Vehicle dynamics ──> Positions, velocities │
│ Collision detection ──> Contact forces │
│ Road network ──> Valid trajectories │
│ │
│ OUTPUT: Physically-valid agent states │
└────────────────────────┬────────────────────────────┘
│ Agent positions,
│ velocities, poses
v
┌─────────────────────────────────────────────────────┐
│ VIDEO GENERATOR │
│ │
│ Takes physically-valid states as conditioning │
│ Generates photorealistic sensor observations │
│ │
│ Neural renderer: │
│ - Conditioned on agent layouts from physics sim │
│ - Generates multi-view camera images │
│ - Maintains visual consistency │
└─────────────────────────────────────────────────────┘
The key insight: Physics simulators are good at vehicle dynamics and collision detection. Neural generators are good at photorealism. Dreamland does not try to replace physics -- it uses a physics engine for what it does well and neural generation for what it does well.
Advantages:
- Physically valid trajectories (no cars driving through walls)
- Photorealistic rendering (no synthetic look)
- Controllable (adjust physics parameters directly)
GenAD and Other Recent Models
GenAD (CUHK/SenseTime, 2024) focuses on temporal reasoning for AD world models:
- Uses a temporal reasoning module to capture long-range dependencies
- Action-conditioned generation with ego trajectory as input
- Demonstrates strong performance on nuScenes benchmark
OccWorld (PKU, 2024) operates in 3D occupancy space:
- Generates future 3D occupancy grids instead of images
- Better for planning (3D geometric reasoning)
- Avoids the photorealism challenge entirely by working in occupancy space
Vista (MIT/MBZUAI, 2024) targets high-fidelity, long-horizon simulation:
- Extended horizon generation (minutes, not seconds)
- Multi-view consistency through explicit camera models
- Demonstrated closed-loop driving with real AD stacks
ADriver-I (CASIA, 2023) uses an interleaved token approach:
- Vision tokens and action tokens are interleaved in a single sequence
- Trained with next-token prediction (like an LLM)
- Demonstrated that LLM-style training works for driving world models
Three-Tier Taxonomy
Recent survey papers (Wang et al. 2024, Gao et al. 2024) organize AD world models into three tiers based on their level of integration with planning:
THREE-TIER TAXONOMY OF AD WORLD MODELS
┌─────────────────────────────────────────────────────────────┐
│ TIER 3: Interactive Prediction + Planning │
│ │
│ World model directly participates in decision-making. │
│ The planner uses the world model to simulate consequences │
│ of actions (model-based RL / model-predictive control). │
│ │
│ Examples: MILE, Think2Drive, DriveWM │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Plan ──> Simulate ──> Evaluate ──> Replan │ │
│ │ ^ │ │ │
│ │ └────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ TIER 2: Behavior Planning for Intelligent Agents │
│ │
│ World model predicts agent behaviors and trajectories. │
│ Used for forecasting what other agents will do. │
│ │
│ Examples: TrafficBots, CtRL-Sim, BehaviorGPT │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Current State ──> World Model ──> Agent Futures │ │
│ │ (trajectories) │ │
│ └──────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ TIER 1: Generation of Future Physical World │
│ │
│ World model generates future sensor observations │
│ (images, point clouds, BEV maps, occupancy grids). │
│ Used primarily for data generation and simulation. │
│ │
│ Examples: GAIA-1, DriveDreamer, GenAD, Vista, Waymo WM │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Current Sensors ──> World Model ──> Future │ │
│ │ + Action Sensors │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Tier 1: Generation of Future Physical World
This tier focuses on generating realistic sensor observations. The output types include:
| Output Type | Representation | Models | Use Case |
|---|---|---|---|
| RGB Images | Multi-view camera images | GAIA-1, DriveDreamer, Vista | Perception testing |
| BEV Maps | Top-down semantic grids | BEVWorld, BEVGen | Planning evaluation |
| Occupancy | 3D voxel grids | OccWorld, OccSora | 3D reasoning |
| Point Cloud | 3D point sets | LiDARGen, RangeLDM | LiDAR perception testing |
| Multi-modal | Camera + LiDAR + BEV | Waymo WM | Full-stack testing |
Quality metrics for Tier 1:
- FID (Frechet Inception Distance) for image quality
- FVD (Frechet Video Distance) for video quality
- LPIPS (Learned Perceptual Image Patch Similarity) for perceptual similarity
- Chamfer Distance for point cloud accuracy
- mIoU for semantic/occupancy correctness
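Of the metrics above, Chamfer distance is simple enough to state in a few lines. A numpy sketch of the symmetric variant (conventions vary -- some papers use squared distances or skip the symmetrization):

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a [N, 3] and b [M, 3]:
    mean nearest-neighbor distance from a to b, plus from b to a."""
    # Pairwise squared distances, shape [N, M].
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())


pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = pts + np.array([0.0, 0.1, 0.0])   # every point displaced by 0.1
d_shift = chamfer_distance(pts, shifted)    # 0.1 in each direction -> 0.2
```

For real point clouds (10^5+ points) the brute-force pairwise matrix is replaced with a KD-tree nearest-neighbor query.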
Tier 2: Behavior Planning for Intelligent Agents
This tier predicts how other agents (vehicles, pedestrians, cyclists) will behave:
Tier 2 World Model for Agent Behavior:
Scene Context Agent Behavior Prediction
┌───────────────┐ ┌────────────────────────┐
│ Agent states │ │ Vehicle A: [traj_1, │
│ Road topology │──>WM──> │ traj_2, │
│ Traffic rules │ │ traj_3] │
│ Ego plan │ │ Ped B: [traj_1] │
└───────────────┘ │ Cyclist C: [traj_1, │
│ traj_2] │
└────────────────────────┘
(multiple possible futures
per agent, with probabilities)
Models like BehaviorGPT and CtRL-Sim fall in this category. They do not generate images but predict the future trajectories of all agents in the scene.
Tier 3: Interactive Prediction and Planning
This tier integrates the world model directly into the planning loop:
Tier 3: World Model in the Planning Loop
┌──────────────────────────────────────────────────────────┐
│ │
│ 1. Propose candidate action: a_candidate │
│ │ │
│ v │
│ 2. Imagine future: s_{t+1} = WM(s_t, a_candidate) │
│ │ │
│ v │
│ 3. Evaluate: reward = R(s_{t+1}) │
│ │ │
│ v │
│ 4. Repeat for multiple candidates │
│ │ │
│ v │
│ 5. Execute best: a* = argmax_a R(WM(s_t, a)) │
│ │
└──────────────────────────────────────────────────────────┘
This is essentially model-based reinforcement learning applied to driving. The world model serves as a learned dynamics model that the planner uses to simulate consequences of actions before committing to one.
Examples:
- MILE: Uses a world model for end-to-end driving in CARLA
- Think2Drive: Plans by imagining future scenarios
- DriveWM: Integrates world model prediction with planning
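The five-step loop above is a form of random-shooting model-predictive control. A toy 1D sketch, with a stub world model and reward standing in for the learned components, makes the imagine-evaluate-execute structure explicit:

```python
import numpy as np


def world_model(state: float, action: float) -> float:
    """Toy learned dynamics: next position given a velocity-like action."""
    return state + action


def reward(state: float, goal: float = 1.0) -> float:
    """Prefer imagined states close to the goal position."""
    return -abs(state - goal)


def plan(state: float, candidates: np.ndarray) -> float:
    """Model-predictive step: imagine each candidate action with the world
    model, score the imagined next state, return the best action
    (a* = argmax_a R(WM(s_t, a)))."""
    scores = [reward(world_model(state, a)) for a in candidates]
    return float(candidates[int(np.argmax(scores))])


best = plan(state=0.0, candidates=np.linspace(-1.0, 1.0, 21))
```

Real systems roll the world model out for many steps per candidate and score whole trajectories, but the structure is the same.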
Technical Deep Dive
Video Diffusion Models for Driving Scenes
Most state-of-the-art driving world models use some variant of video diffusion. The core architecture typically involves:
Video Diffusion Architecture for Driving:
┌────────────────────────────────────────────────────────────────┐
│ LATENT SPACE │
│ │
│ Input frames ──> VAE Encoder ──> Latent z [T, h, w, c] │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Denoising Network (U-Net or DiT) │ │
│ │ │ │
│ │ For each denoising step t = T, T-1, ..., 1, 0: │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ Spatial Self-Attention │ │ │
│ │ │ (within each frame) │ │ │
│ │ ├──────────────────────────────────────────────────┤ │ │
│ │ │ Temporal Self-Attention │ │ │
│ │ │ (across frames at same spatial location) │ │ │
│ │ ├──────────────────────────────────────────────────┤ │ │
│ │ │ Cross-Attention with Conditioning │ │ │
│ │ │ (action, text, map, layout) │ │ │
│ │ ├──────────────────────────────────────────────────┤ │ │
│ │ │ Feedforward + ResNet blocks │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Denoised latent ──> VAE Decoder ──> Output frames │
└────────────────────────────────────────────────────────────────┘
Key design patterns across driving diffusion models:
- Latent diffusion: Operate in the latent space of a pretrained VAE, not in pixel space. This dramatically reduces compute (from 512x1024x3 to 64x128x4).
- Factored attention: Separating spatial and temporal attention reduces the quadratic cost. Instead of attending jointly over all T*H*W tokens, spatial attention covers the H*W tokens within each frame and temporal attention covers the T tokens at each spatial position.
- Multi-view extension: For multi-camera setups (e.g., 6 cameras on Waymo/nuScenes), add a cross-view attention layer between cameras sharing the same timestep.
- Progressive generation: Generate keyframes first, then interpolate intermediate frames. This improves long-range temporal consistency.
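Back-of-the-envelope arithmetic shows why factoring the attention matters. For an illustrative latent video of 16 frames at 32x64 tokens, counting pairwise token interactions:

```python
# Attention cost comparison for a latent video of T frames of h x w tokens.
T, h, w = 16, 32, 64                 # illustrative sizes, not any real model's

# Joint attention over all space-time tokens: (T*h*w)^2 interactions.
full = (T * h * w) ** 2

# Factored: spatial attention per frame + temporal attention per position.
factored = T * (h * w) ** 2 + (h * w) * T ** 2

ratio = full / factored              # roughly 16x fewer interactions here
```

The savings grow with sequence length, which is why essentially all long-horizon video diffusion models factor their attention.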
Conditioning Mechanisms
Driving world models support diverse conditioning signals. Here is how each type is typically integrated:
CONDITIONING MECHANISMS IN DRIVING WORLD MODELS
┌─────────────────────────────────────────────────────────────┐
│ │
│ ACTION CONDITIONING │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │
│ │ Steering│───>│ MLP │───>│ Cross-Attention │ │
│ │ Accel │ │ Encoder │ │ or FiLM layers │ │
│ └─────────┘ └──────────┘ └─────────────────┘ │
│ │
│ LANGUAGE CONDITIONING │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │
│ │ "A car │───>│ CLIP / │───>│ Cross-Attention │ │
│ │ cuts │ │ T5 │ │ (like Stable │ │
│ │ in" │ │ Encoder │ │ Diffusion) │ │
│ └─────────┘ └──────────┘ └─────────────────┘ │
│ │
│ LAYOUT / MAP CONDITIONING │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │
│ │ HD Map │───>│ Spatial │───>│ Addition to │ │
│ │ BBoxes │ │ Encoder │ │ input or │ │
│ │ Lanes │ │ (Conv) │ │ ControlNet │ │
│ └─────────┘ └──────────┘ └─────────────────┘ │
│ │
│ PAST FRAMES CONDITIONING │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │
│ │ Frame │───>│ VAE │───>│ Concatenation │ │
│ │ t-k to │ │ Encoder │ │ to noisy input │ │
│ │ t │ │ │ │ (channel-wise) │ │
│ └─────────┘ └──────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Classifier-free guidance (CFG) is widely used for controllability:
During training:
- Randomly drop conditioning with probability p_drop (e.g., 10%)
- Model learns both conditional and unconditional generation
During inference:
- epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
- w > 1 amplifies the conditioning signal
- Higher w = more controllable but less diverse
- Lower w = more diverse but less responsive to conditioning
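The guidance rule above is one line of code; a numpy sketch:

```python
import numpy as np


def cfg(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance:
    eps_guided = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)


eps_u = np.zeros(4)                  # unconditional noise prediction
eps_c = np.ones(4)                   # action/text-conditional prediction
guided = cfg(eps_u, eps_c, w=3.0)    # w > 1 extrapolates past the conditional
```

At w = 1 this recovers the plain conditional prediction; larger w trades diversity for responsiveness to the conditioning signal.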
Multi-Sensor Generation
Generating consistent camera and LiDAR data simultaneously is one of the hardest challenges. Here are the main approaches:
Approach 1: Independent Generation with Consistency Loss
Camera Branch: z_t ──> Camera Diffusion ──> Images
│
├──> Consistency
│ Loss
│
LiDAR Branch: z_t ──> LiDAR Diffusion ──> Points
Simple but weak -- consistency is only encouraged, not guaranteed.
Approach 2: Shared Backbone with Modality-Specific Heads
z_t
│
v
┌──────────────────┐
│ Shared Backbone │
│ (Transformer) │
└────┬─────────┬───┘
│ │
v v
┌────────┐ ┌────────┐
│Camera │ │LiDAR │
│Head │ │Head │
└────────┘ └────────┘
Better consistency because features are shared. Used by several recent models.
Approach 3: Generate 3D First, Then Render
z_t ──> 3D World Generator ──> 3D Scene Representation
│
┌─────────┼─────────┐
│ │ │
v v v
Camera LiDAR BEV
Render Render Render
Best consistency but most expensive. The 3D representation can be a neural radiance field, point cloud, voxel grid, or tri-plane feature.
Evaluation Metrics for Generated Driving Scenarios
Evaluating world models is notoriously difficult. Here is a taxonomy of metrics:
Image/Video Quality Metrics:
| Metric | What It Measures | Formula/Method |
|---|---|---|
| FID | Distribution similarity | Frechet distance between real and generated Inception features |
| FVD | Video distribution similarity | FID extended to video (I3D features) |
| LPIPS | Perceptual similarity | Distance in VGG feature space |
| PSNR | Pixel-level accuracy | 10 * log10(MAX^2 / MSE) |
| SSIM | Structural similarity | Luminance, contrast, structure comparison |
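The PSNR entry in the table can be computed directly from its formula. A minimal sketch; production evaluations typically use library implementations (e.g., torchmetrics) rather than hand-rolled metrics.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE); higher is better."""
    mse = torch.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * torch.log10(max_val ** 2 / mse))

a = torch.rand(3, 64, 64)
noisy = (a + 0.05 * torch.randn_like(a)).clamp(0, 1)
print(f"PSNR vs itself: {psnr(a, a)}")
print(f"PSNR vs noisy:  {psnr(a, noisy):.1f} dB")
```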
Driving-Specific Metrics:
| Metric | What It Measures |
|---|---|
| Scene Consistency | Do objects maintain identity across frames? |
| Physical Plausibility | Do vehicles obey physics (no teleportation, clipping)? |
| Action Fidelity | Does the world respond correctly to ego actions? |
| Downstream Task Performance | Does a perception model trained on generated data work in real world? |
| Collision Rate | Are generated scenarios physically valid (no interpenetration)? |
| Scenario Diversity | How varied are the generated scenarios? |
The downstream evaluation paradigm:
Real Driving Data ──> Train World Model ──> Generate Synthetic Data
│
v
Train Perception Model
on Synthetic Data
│
v
Evaluate on Real
Test Set
│
v
Performance Delta =
proxy for generation quality
If a perception model trained on generated data performs well on a real test set, that is strong evidence the world model generates realistic, diverse, and useful data.
Controllability vs Diversity Trade-off
A fundamental tension exists in conditional generation:
Controllability-Diversity Trade-off:
High Control ────────────────────────────── Low Control
(exact spec) (free generation)
│ │
v v
Low Diversity High Diversity
(one output per (many varied
condition) outputs)
┌──────────────────────────────────────────────────┐
│ │
│ Safety Testing ──> Need HIGH control │
│ "Generate exactly this scenario" │
│ │
│ Training Data ──> Need HIGH diversity │
│ "Generate varied scenarios" │
│ │
│ Coverage Testing ──> Need BOTH │
│ "Generate diverse variants of this scenario" │
│ │
└──────────────────────────────────────────────────┘
Knobs for controlling this trade-off:
- CFG weight (w): Higher = more control, less diversity
- Temperature: Lower = more deterministic, less diverse
- Conditioning level: More detailed condition = more control
- Sampling method: DDIM (deterministic) vs DDPM (stochastic)
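The temperature knob is easiest to see in autoregressive token sampling, as used by discrete-token world models. A toy sketch with assumed logits (not tied to any particular model):

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sample next tokens from logits; temperature -> 0 approaches argmax.

    Lower temperature sharpens the distribution (more deterministic,
    less diverse); higher temperature flattens it (more diverse).
    """
    if temperature <= 0:
        return logits.argmax(dim=-1)  # fully deterministic
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

torch.manual_seed(0)
# 1000 copies of the same 3-way distribution, so we can estimate frequencies
logits = torch.tensor([[2.0, 1.0, 0.1]]).repeat(1000, 1)
for temp in (0.1, 1.0, 5.0):
    tokens = sample_token(logits, temp)
    frac_top = (tokens == 0).float().mean().item()
    print(f"T={temp}: fraction choosing top token = {frac_top:.2f}")
```

As temperature rises, the fraction of samples landing on the highest-logit token drops toward uniform, which is exactly the diversity-for-determinism trade described above.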
World Models vs Reconstruction-Based Simulation
When to Use Each Approach
DECISION TREE: RECONSTRUCTION vs WORLD MODEL
What is your primary use case?
│
├─── Regression testing on specific real routes?
│ │
│ └──> RECONSTRUCTION (3DGS/NeRF)
│ - You need exact scene fidelity
│ - You want to re-run real scenarios with perturbations
│ - Applied Intuition / Waabi UniSim approach
│
├─── Generating novel/rare scenarios for safety?
│ │
│ └──> WORLD MODEL (generative)
│ - You need scenarios not in your driving logs
│ - Language-controllable scenario creation
│ - Waymo World Model approach
│
├─── Training data augmentation?
│ │
│ └──> WORLD MODEL (generative)
│ - You need diverse, varied training samples
│ - Data scaling beyond what you have captured
│
├─── Sensor simulation for perception testing?
│ │
│ └──> RECONSTRUCTION (3DGS/NeRF) for accuracy
│ WORLD MODEL for coverage
│ BOTH is the ideal
│
└─── End-to-end planning in imagination?
│
└──> WORLD MODEL (Tier 3)
- Model-based RL
- Planning by imagining futures
Complementary Strengths
Rather than viewing these as competing approaches, the industry is converging on using both:
HYBRID SIMULATION PIPELINE (Emerging Best Practice):
┌──────────────────────────────────────────────────────────┐
│ │
│ RECONSTRUCTION LAYER │
│ ┌────────────────────────────────────────────────┐ │
│ │ 3DGS/NeRF scene reconstructions │ │
│ │ - Faithful replay of real driving logs │ │
│ │ - High geometric accuracy │ │
│ │ - Novel view synthesis for sensor sim │ │
│ └────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ WORLD MODEL LAYER │
│ ┌────────────────────────────────────────────────┐ │
│ │ Generative world model │ │
│ │ - Adds new agents to reconstructed scenes │ │
│ │ - Generates counterfactual scenarios │ │
│ │ - Creates weather/lighting variations │ │
│ │ - Simulates sensor degradation │ │
│ └────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ BEHAVIOR MODEL LAYER │
│ ┌────────────────────────────────────────────────┐ │
│ │ Learned agent behavior models │ │
│ │ - Realistic multi-agent interactions │ │
│ │ - Reactive to ego vehicle actions │ │
│ │ - Diverse behavioral modes │ │
│ └────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
Applied Intuition vs Waymo Comparison
These two companies represent different philosophies for AD simulation:
| Dimension | Applied Intuition | Waymo (World Model) |
|---|---|---|
| Core Philosophy | Reconstruction + editing | Generation from learned prior |
| Scene Source | Real captured scenes | Generated or real + generated |
| Primary Method | 3DGS/NeRF reconstruction, asset libraries | Genie 3-based world model |
| Scenario Creation | Manual + algorithmic perturbation | Language-controlled generation |
| Sensor Sim | Physics-based + neural rendering | Learned generation |
| Agent Behavior | Rule-based + learned models | Emergent from world model |
| Strengths | Geometric accuracy, determinism | Novel scenarios, scalability |
| Limitations | Bound to captured data, labor-intensive | May hallucinate, less precise |
| Target Users | OEMs needing validation tools | Internal Waymo AD development |
| Business Model | SaaS platform for multiple customers | Internal tool + research |
Future Convergence
The boundary between reconstruction and generation is blurring:
- Reconstruction-guided generation: Use 3DGS scenes as conditioning for world models, getting geometric accuracy AND generative flexibility.
- Generative inpainting on reconstructions: Reconstruct the static scene with 3DGS, then use a world model to generate dynamic objects and their behaviors.
- Foundation model approach: Train a single large model that can do both reconstruction (when given dense input views) and generation (when given sparse conditioning). This is analogous to how large language models can both complete existing text and generate new text.
- Learned physics in latent space: Rather than separate physics engines and neural renderers, learn physics implicitly in the latent space of the world model while maintaining hard constraints (e.g., conservation laws) through architectural design.
Code Examples
Example 1: Simple Action-Conditioned Video Prediction
This demonstrates the core concept of a world model: given a current frame and an action, predict the next frame.
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleActionConditionedPredictor(nn.Module):
"""
Minimal world model: predicts next frame given current frame + action.
This is a simplified version to illustrate the concept.
Real models use diffusion, transformers, and operate in latent space.
"""
def __init__(
self,
img_channels: int = 3,
action_dim: int = 2, # (steering, acceleration)
hidden_dim: int = 256,
latent_dim: int = 64,
):
super().__init__()
# Encode current frame to latent representation
self.encoder = nn.Sequential(
nn.Conv2d(img_channels, 64, 4, stride=2, padding=1), # /2
nn.ReLU(),
nn.Conv2d(64, 128, 4, stride=2, padding=1), # /4
nn.ReLU(),
nn.Conv2d(128, hidden_dim, 4, stride=2, padding=1), # /8
nn.ReLU(),
)
# Action embedding via FiLM conditioning
# FiLM: Feature-wise Linear Modulation
# h_new = gamma(action) * h + beta(action)
self.action_to_gamma = nn.Sequential(
nn.Linear(action_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
)
self.action_to_beta = nn.Sequential(
nn.Linear(action_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
)
# Predict next latent state
self.dynamics = nn.Sequential(
nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
nn.ReLU(),
nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
nn.ReLU(),
)
# Decode back to image space
self.decoder = nn.Sequential(
nn.ConvTranspose2d(hidden_dim, 128, 4, stride=2, padding=1), # *2
nn.ReLU(),
nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), # *4
nn.ReLU(),
nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1), # *8
nn.Sigmoid(), # Pixel values in [0, 1]
)
def forward(
self,
current_frame: torch.Tensor, # [B, C, H, W]
action: torch.Tensor, # [B, action_dim]
) -> torch.Tensor: # [B, C, H, W] predicted next frame
# 1. Encode current frame
z = self.encoder(current_frame) # [B, hidden_dim, h, w]
# 2. Condition on action via FiLM
gamma = self.action_to_gamma(action) # [B, hidden_dim]
beta = self.action_to_beta(action) # [B, hidden_dim]
# Reshape for broadcasting: [B, hidden_dim, 1, 1]
gamma = gamma.unsqueeze(-1).unsqueeze(-1)
beta = beta.unsqueeze(-1).unsqueeze(-1)
z_conditioned = gamma * z + beta # FiLM modulation
# 3. Predict dynamics (next latent state)
z_next = self.dynamics(z_conditioned)
# 4. Decode to next frame
next_frame = self.decoder(z_next)
return next_frame
# --- Training Loop ---
def train_simple_world_model():
"""Train the simple world model on driving data."""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleActionConditionedPredictor().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Simulated driving data (in practice, load from nuScenes/Waymo)
batch_size = 16
H, W = 128, 256 # Typical driving aspect ratio
for step in range(10000):
# Simulate a batch of (current_frame, action, next_frame) triplets
current_frames = torch.randn(batch_size, 3, H, W, device=device).sigmoid()
actions = torch.randn(batch_size, 2, device=device) # (steer, accel)
target_frames = torch.randn(batch_size, 3, H, W, device=device).sigmoid()
# Forward pass
predicted_frames = model(current_frames, actions)
# Loss: pixel-wise MSE (real models add perceptual + adversarial losses)
loss = F.mse_loss(predicted_frames, target_frames)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
if step % 1000 == 0:
print(f"Step {step}, Loss: {loss.item():.4f}")
# --- Inference: Roll out a trajectory ---
@torch.no_grad()
def rollout_world_model(
model: SimpleActionConditionedPredictor,
initial_frame: torch.Tensor, # [1, 3, H, W]
action_sequence: torch.Tensor, # [T, 2]
) -> list[torch.Tensor]:
"""
Generate a sequence of future frames by autoregressively
applying the world model.
"""
frames = [initial_frame]
current = initial_frame
for t in range(len(action_sequence)):
action = action_sequence[t].unsqueeze(0) # [1, 2]
next_frame = model(current, action)
frames.append(next_frame)
current = next_frame # Autoregressive: feed prediction back
return frames # List of [1, 3, H, W] tensors
Example 2: Diffusion Model for Driving Scene Generation
This example shows a simplified video diffusion model with action and layout conditioning, following the architecture used by DriveDreamer and similar models.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
@dataclass
class DiffusionConfig:
"""Configuration for the driving diffusion model."""
num_frames: int = 8 # Number of frames to generate
height: int = 64 # Latent height (image_h / 8)
width: int = 128 # Latent width (image_w / 8)
latent_channels: int = 4 # VAE latent channels
model_channels: int = 256 # Base model channel dimension
num_heads: int = 8 # Attention heads
action_dim: int = 3 # (steering, acceleration, speed)
text_dim: int = 512 # Text embedding dimension
num_timesteps: int = 1000 # Diffusion timesteps
num_res_blocks: int = 2 # ResNet blocks per level
class SinusoidalTimestepEmbedding(nn.Module):
"""Standard sinusoidal embedding for diffusion timestep."""
def __init__(self, dim: int):
super().__init__()
self.dim = dim
def forward(self, t: torch.Tensor) -> torch.Tensor:
half_dim = self.dim // 2
emb = math.log(10000) / (half_dim - 1)
emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)
emb = t[:, None].float() * emb[None, :]
return torch.cat([emb.sin(), emb.cos()], dim=-1)
class SpatioTemporalAttention(nn.Module):
"""
Factored attention: spatial within each frame, temporal across frames.
This is the core building block of video diffusion models.
"""
def __init__(self, channels: int, num_heads: int, num_frames: int):
super().__init__()
self.num_frames = num_frames
# Spatial self-attention (within each frame)
self.spatial_norm = nn.GroupNorm(32, channels)
self.spatial_attn = nn.MultiheadAttention(
channels, num_heads, batch_first=True
)
# Temporal self-attention (across frames at same position)
self.temporal_norm = nn.GroupNorm(32, channels)
self.temporal_attn = nn.MultiheadAttention(
channels, num_heads, batch_first=True
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: [B*T, C, H, W] where T = num_frames
Returns:
[B*T, C, H, W]
"""
BT, C, H, W = x.shape
B = BT // self.num_frames
T = self.num_frames
# --- Spatial attention (within each frame) ---
h = self.spatial_norm(x)
h = h.reshape(BT, C, H * W).permute(0, 2, 1) # [BT, HW, C]
h_attn, _ = self.spatial_attn(h, h, h)
h = h_attn.permute(0, 2, 1).reshape(BT, C, H, W)
x = x + h
# --- Temporal attention (across frames) ---
h = self.temporal_norm(x)
# Reshape to [B*H*W, T, C] for temporal attention
h = h.reshape(B, T, C, H, W)
h = h.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
h_attn, _ = self.temporal_attn(h, h, h)
h = h_attn.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)
h = h.reshape(BT, C, H, W)
x = x + h
return x
class CrossAttentionConditioning(nn.Module):
"""Cross-attention for text/action conditioning."""
def __init__(self, channels: int, context_dim: int, num_heads: int):
super().__init__()
self.norm = nn.GroupNorm(32, channels)
self.to_q = nn.Linear(channels, channels)
self.to_k = nn.Linear(context_dim, channels)
self.to_v = nn.Linear(context_dim, channels)
self.out_proj = nn.Linear(channels, channels)
self.num_heads = num_heads
def forward(
self,
x: torch.Tensor, # [B, C, H, W]
context: torch.Tensor, # [B, seq_len, context_dim]
) -> torch.Tensor:
B, C, H, W = x.shape
h = self.norm(x)
h = h.reshape(B, C, H * W).permute(0, 2, 1) # [B, HW, C]
q = self.to_q(h)
k = self.to_k(context)
v = self.to_v(context)
# Simple scaled dot-product attention
head_dim = C // self.num_heads
scale = head_dim ** -0.5
attn = torch.bmm(q * scale, k.transpose(-2, -1))
attn = attn.softmax(dim=-1)
out = torch.bmm(attn, v)
out = self.out_proj(out)
out = out.permute(0, 2, 1).reshape(B, C, H, W)
return x + out
class DrivingDiffusionModel(nn.Module):
"""
Simplified video diffusion model for driving scene generation.
Generates a sequence of latent frames conditioned on:
- Past frames (context)
- Ego vehicle actions (steering, acceleration)
- Text description (optional)
- Diffusion timestep
"""
def __init__(self, config: DiffusionConfig):
super().__init__()
self.config = config
C = config.model_channels
# Timestep embedding
self.time_embed = nn.Sequential(
SinusoidalTimestepEmbedding(C),
nn.Linear(C, C * 4),
nn.SiLU(),
nn.Linear(C * 4, C),
)
        # Action embedding, projected to text_dim so the action vector can
        # join the cross-attention context alongside text embeddings
        self.action_embed = nn.Sequential(
            nn.Linear(config.action_dim * config.num_frames, C),
            nn.SiLU(),
            nn.Linear(C, config.text_dim),
        )
# Input projection (latent channels -> model channels)
self.input_proj = nn.Conv2d(config.latent_channels, C, 3, padding=1)
# Core blocks: ResNet + SpatioTemporal Attention + Cross-Attention
self.blocks = nn.ModuleList()
for _ in range(config.num_res_blocks):
self.blocks.append(nn.ModuleDict({
"resnet": nn.Sequential(
nn.GroupNorm(32, C),
nn.SiLU(),
nn.Conv2d(C, C, 3, padding=1),
nn.GroupNorm(32, C),
nn.SiLU(),
nn.Conv2d(C, C, 3, padding=1),
),
"st_attn": SpatioTemporalAttention(C, config.num_heads, config.num_frames),
"cross_attn": CrossAttentionConditioning(C, config.text_dim, config.num_heads),
}))
# Output projection (model channels -> latent channels)
self.output_proj = nn.Sequential(
nn.GroupNorm(32, C),
nn.SiLU(),
nn.Conv2d(C, config.latent_channels, 3, padding=1),
)
def forward(
self,
noisy_latents: torch.Tensor, # [B, T, latent_c, H, W]
timestep: torch.Tensor, # [B]
actions: torch.Tensor, # [B, T, action_dim]
text_embeddings: torch.Tensor | None = None, # [B, seq_len, text_dim]
) -> torch.Tensor:
"""Predict the noise (or clean sample) for the given noisy input."""
B, T, LC, H, W = noisy_latents.shape
# Flatten batch and time for spatial operations
x = noisy_latents.reshape(B * T, LC, H, W)
x = self.input_proj(x) # [B*T, C, H, W]
# Embed timestep and add to features
t_emb = self.time_embed(timestep) # [B, C]
t_emb = t_emb.unsqueeze(1).repeat(1, T, 1).reshape(B * T, -1)
x = x + t_emb[:, :, None, None]
        # Embed actions (action_embed outputs text_dim-sized vectors so they
        # share the cross-attention context space with text embeddings)
        a_emb = self.action_embed(actions.reshape(B, -1))  # [B, text_dim]
        # If no text, use the action embedding alone as the context
        if text_embeddings is None:
            context = a_emb.unsqueeze(1)  # [B, 1, text_dim]
        else:
            # Concatenate action embedding with text embeddings
            context = torch.cat([
                a_emb.unsqueeze(1),
                text_embeddings,
            ], dim=1)  # [B, 1 + seq_len, text_dim]
# Repeat context for each frame
context_expanded = context.unsqueeze(1).repeat(1, T, 1, 1)
context_expanded = context_expanded.reshape(B * T, -1, context.shape[-1])
# Apply blocks
for block in self.blocks:
# ResNet block
h = block["resnet"](x)
x = x + h
# Spatiotemporal attention
x = block["st_attn"](x)
# Cross-attention with conditioning
x = block["cross_attn"](x, context_expanded)
# Output
noise_pred = self.output_proj(x) # [B*T, latent_c, H, W]
noise_pred = noise_pred.reshape(B, T, LC, H, W)
return noise_pred
# --- DDPM Sampling ---
class DDPMSampler:
"""Simple DDPM sampler for the driving diffusion model."""
def __init__(self, num_timesteps: int = 1000):
self.num_timesteps = num_timesteps
# Linear beta schedule
self.betas = torch.linspace(1e-4, 0.02, num_timesteps)
self.alphas = 1.0 - self.betas
self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)
@torch.no_grad()
def sample(
self,
model: DrivingDiffusionModel,
actions: torch.Tensor,
text_embeddings: torch.Tensor | None = None,
cfg_scale: float = 7.5,
) -> torch.Tensor:
"""
Generate driving video latents using DDPM sampling.
Args:
model: The diffusion model
actions: [B, T, action_dim] ego actions
text_embeddings: Optional text conditioning
cfg_scale: Classifier-free guidance scale
Returns:
[B, T, latent_c, H, W] generated latents
"""
device = actions.device
B = actions.shape[0]
config = model.config
# Start from pure noise
shape = (B, config.num_frames, config.latent_channels,
config.height, config.width)
x = torch.randn(shape, device=device)
alpha_cumprod = self.alpha_cumprod.to(device)
for t in reversed(range(self.num_timesteps)):
t_batch = torch.full((B,), t, device=device, dtype=torch.long)
# Predict noise (with classifier-free guidance)
noise_pred_cond = model(x, t_batch, actions, text_embeddings)
if cfg_scale > 1.0 and text_embeddings is not None:
noise_pred_uncond = model(x, t_batch, actions, None)
noise_pred = noise_pred_uncond + cfg_scale * (
noise_pred_cond - noise_pred_uncond
)
else:
noise_pred = noise_pred_cond
# DDPM update step
alpha_t = self.alphas[t]
alpha_bar_t = alpha_cumprod[t]
alpha_bar_prev = alpha_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
# Predicted x_0
x0_pred = (x - (1 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
x0_pred = x0_pred.clamp(-1, 1)
# Posterior mean
coeff1 = (alpha_bar_prev.sqrt() * self.betas[t]) / (1 - alpha_bar_t)
coeff2 = (alpha_t.sqrt() * (1 - alpha_bar_prev)) / (1 - alpha_bar_t)
mean = coeff1 * x0_pred + coeff2 * x
# Add noise (except at t=0)
if t > 0:
noise = torch.randn_like(x)
sigma = ((1 - alpha_bar_prev) / (1 - alpha_bar_t) * self.betas[t]).sqrt()
x = mean + sigma * noise
else:
x = mean
return x
# --- Usage Example ---
def generate_driving_scenario():
"""Example: generate a driving scenario with the diffusion model."""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = DiffusionConfig()
model = DrivingDiffusionModel(config).to(device)
sampler = DDPMSampler(config.num_timesteps)
# Define ego actions: gentle left turn with constant speed
B = 1
actions = torch.zeros(B, config.num_frames, config.action_dim, device=device)
actions[:, :, 0] = 0.3 # Steering: slight left
actions[:, :, 1] = 0.0 # Acceleration: constant
actions[:, :, 2] = 0.5 # Speed: moderate
# Generate (in practice, decode latents with a VAE decoder)
latents = sampler.sample(model, actions, cfg_scale=1.0)
print(f"Generated latents shape: {latents.shape}")
# -> [1, 8, 4, 64, 128] = 8 frames of 64x128 latent maps
return latents
Example 3: Evaluation of Generated Scenarios
import torch
import torch.nn as nn
import numpy as np
from typing import NamedTuple
from scipy import linalg
class ScenarioMetrics(NamedTuple):
"""Metrics for evaluating a generated driving scenario."""
fid: float # Frechet Inception Distance
fvd: float # Frechet Video Distance
action_fidelity: float # Does world respond to actions correctly?
physical_plausibility: float # Are physics constraints respected?
temporal_consistency: float # Are objects consistent across frames?
def compute_fid(
real_features: np.ndarray, # [N, D] Inception features of real images
gen_features: np.ndarray, # [M, D] Inception features of generated images
) -> float:
"""
Compute Frechet Inception Distance between real and generated distributions.
FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r @ Sigma_g)^0.5)
Lower is better. FID = 0 means identical distributions.
"""
mu_r = real_features.mean(axis=0)
mu_g = gen_features.mean(axis=0)
sigma_r = np.cov(real_features, rowvar=False)
sigma_g = np.cov(gen_features, rowvar=False)
diff = mu_r - mu_g
covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
# Numerical stability
if np.iscomplexobj(covmean):
covmean = covmean.real
fid = diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean)
return float(fid)
def compute_action_fidelity(
actions: torch.Tensor, # [B, T, action_dim]
generated_frames: torch.Tensor, # [B, T, C, H, W]
optical_flow_model: nn.Module, # Pretrained optical flow estimator
) -> float:
"""
Measure whether generated videos respond correctly to actions.
Intuition: If we command a left turn, the optical flow in the
generated video should show rightward motion (world moving right
as ego turns left).
Returns a score in [0, 1] where 1 = perfect action fidelity.
"""
B, T, C, H, W = generated_frames.shape
scores = []
for t in range(T - 1):
frame_curr = generated_frames[:, t]
frame_next = generated_frames[:, t + 1]
# Compute optical flow between consecutive frames
flow = optical_flow_model(frame_curr, frame_next) # [B, 2, H, W]
# Expected flow direction based on action
steering = actions[:, t, 0] # positive = left turn
expected_horizontal_flow = -steering # left turn -> rightward flow
# Compute mean horizontal flow
mean_h_flow = flow[:, 0].mean(dim=(-2, -1)) # Average horizontal flow
# Correlation between expected and actual flow direction
# Normalize to get direction agreement
direction_agreement = (
torch.sign(mean_h_flow) == torch.sign(expected_horizontal_flow)
).float()
scores.append(direction_agreement.mean().item())
return float(np.mean(scores))
def compute_physical_plausibility(
generated_bboxes: torch.Tensor, # [B, T, N_agents, 7] (x,y,z,l,w,h,yaw)
) -> float:
"""
Check physical plausibility of generated scenarios.
Checks for:
1. No interpenetration (bounding boxes don't overlap)
2. Reasonable velocities (no teleportation)
3. Vehicles stay on road surface
Returns a score in [0, 1] where 1 = fully plausible.
"""
B, T, N, _ = generated_bboxes.shape
violations = 0
total_checks = 0
for b in range(B):
for t in range(T):
# Check 1: No interpenetration (simplified 2D IoU check)
positions = generated_bboxes[b, t, :, :2] # [N, 2] (x, y)
sizes = generated_bboxes[b, t, :, 3:5] # [N, 2] (l, w)
for i in range(N):
for j in range(i + 1, N):
dist = torch.norm(positions[i] - positions[j])
min_dist = (sizes[i].norm() + sizes[j].norm()) / 2
if dist < min_dist * 0.5: # Significant overlap
violations += 1
total_checks += 1
# Check 2: Reasonable velocities (no teleportation)
if t > 0:
prev_positions = generated_bboxes[b, t - 1, :, :2]
velocities = (positions - prev_positions) # Assuming dt = 0.1s
speeds = torch.norm(velocities, dim=-1) / 0.1 # m/s
max_reasonable_speed = 50.0 # ~180 km/h
violations += (speeds > max_reasonable_speed).sum().item()
total_checks += N
if total_checks == 0:
return 1.0
return 1.0 - (violations / total_checks)
def compute_temporal_consistency(
generated_frames: torch.Tensor, # [B, T, C, H, W]
) -> float:
"""
Measure temporal consistency via inter-frame LPIPS stability.
Consistent videos should have smooth LPIPS changes between
consecutive frames (no sudden jumps in appearance).
Returns a score in [0, 1] where 1 = perfectly consistent.
"""
B, T, C, H, W = generated_frames.shape
# Compute pairwise L2 distances between consecutive frames
# (Simplified -- real implementation uses LPIPS)
deltas = []
for t in range(T - 1):
diff = (generated_frames[:, t] - generated_frames[:, t + 1]).pow(2)
frame_dist = diff.mean(dim=(-3, -2, -1)) # [B]
deltas.append(frame_dist)
deltas = torch.stack(deltas, dim=1) # [B, T-1]
# Consistency = low variance in inter-frame distances
# (smooth changes, no sudden jumps)
variance = deltas.var(dim=1).mean()
# Normalize to [0, 1] (heuristic scaling)
consistency = torch.exp(-10.0 * variance).item()
return consistency
def evaluate_world_model(
world_model: nn.Module,
real_dataset: torch.utils.data.DataLoader,
num_eval_samples: int = 1000,
) -> ScenarioMetrics:
"""
Full evaluation pipeline for a driving world model.
In practice, this would:
1. Generate scenarios using the world model
2. Compute FID/FVD against real driving data
3. Measure action fidelity, physical plausibility, etc.
"""
print("Evaluation pipeline:")
print(" 1. Generating scenarios from world model...")
print(" 2. Extracting Inception features for FID...")
print(" 3. Extracting I3D features for FVD...")
print(" 4. Computing action fidelity...")
print(" 5. Checking physical plausibility...")
print(" 6. Measuring temporal consistency...")
# Placeholder values (real implementation would compute these)
return ScenarioMetrics(
fid=24.5, # Lower is better
fvd=180.3, # Lower is better
action_fidelity=0.87, # Higher is better [0,1]
physical_plausibility=0.93, # Higher is better [0,1]
temporal_consistency=0.81, # Higher is better [0,1]
)
Mental Models and Diagrams
World Model Architecture Overview
WORLD MODEL ARCHITECTURE (Generalized)
┌──────────────────────────────────────────────────────────────────┐
│ INPUTS │
│ │
│ Past Observations Actions Conditions │
│ ┌───────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Cameras (t-k │ │ Steering │ │ Text prompt │ │
│ │ to t) │ │ Acceleration │ │ HD Map │ │
│ │ LiDAR (t-k │ │ Or: waypoints│ │ Weather │ │
│ │ to t) │ │ Or: language │ │ Time of day │ │
│ └───────┬───────┘ └──────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ v v v │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ENCODER MODULE │ │
│ │ │ │
│ │ Visual Encoder (ViT / ResNet / VQ-VAE) │ │
│ │ Action Encoder (MLP / Embedding) │ │
│ │ Condition Encoder (CLIP / T5 / Spatial Encoder) │ │
│ │ │ │
│ └──────────────────────┬───────────────────────────────┘ │
│ │ │
│ v │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ WORLD MODEL CORE │ │
│ │ │ │
│ │ Option A: Autoregressive Transformer │ │
│ │ P(z_{t+1} | z_1, ..., z_t, actions, conditions) │ │
│ │ │ │
│ │ Option B: Diffusion Model │ │
│ │ Score function: s_theta(z_t, t, actions, conds) │ │
│ │ │ │
│ │ Option C: Hybrid (AR + Diffusion) │ │
│ │ AR for temporal structure, Diffusion per frame │ │
│ │ │ │
│ └──────────────────────┬───────────────────────────────┘ │
│ │ │
│ v │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ DECODER MODULE │ │
│ │ │ │
│ │ Camera Decoder ──> Multi-view RGB images │ │
│ │ LiDAR Decoder ──> Point clouds / range images │ │
│ │ BEV Decoder ──> Bird's eye view semantic maps │ │
│ │ Agent Decoder ──> 3D bounding boxes + tracks │ │
│ │ Occ Decoder ──> 3D occupancy grids │ │
│ │ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ OUTPUTS │
│ Future multi-modal sensor observations (t+1 to t+H) │
└──────────────────────────────────────────────────────────────────┘
Reconstruction vs Generation Pipeline Comparison
RECONSTRUCTION-BASED PIPELINE (Applied Intuition / Waabi style):
CAPTURE PHASE RECONSTRUCTION PHASE SIMULATION PHASE
┌──────────────┐ ┌───────────────────┐ ┌──────────────────┐
│ Drive real │ │ Per-scene 3D │ │ Render from new │
│ vehicle with │──────────>│ reconstruction │────────>│ viewpoints │
│ sensor suite │ │ (3DGS / NeRF) │ │ │
│ │ │ │ │ Move agents │
│ Collect: │ │ Optimize: │ │ Change lighting │
│ - Images │ │ - Gaussian splats │ │ Add rain/fog │
│ - LiDAR │ │ - Neural radiance │ │ │
│ - GPS/IMU │ │ - Signed distance │ │ RE-RENDER the │
│ - Boxes │ │ │ │ same scene with │
└──────────────┘ └───────────────────┘ │ modifications │
└──────────────────┘
Scene A ──> Reconstruct A ──> Simulate in A
Scene B ──> Reconstruct B ──> Simulate in B Each scene is separate!
Scene C ──> Reconstruct C ──> Simulate in C
GENERATIVE PIPELINE (Waymo World Model style):
DATA PHASE TRAINING PHASE GENERATION PHASE
┌──────────────┐ ┌───────────────────┐ ┌──────────────────┐
│ Collect │ │ Train ONE world │ │ Generate ANY │
│ large-scale │──────────>│ model on ALL data │────────>│ scenario: │
│ driving data │ │ │ │ │
│ │ │ Learn: │ │ "Rainy highway │
│ 1000s of │ │ - Visual patterns │ │ with a truck │
│ hours across │ │ - Physics rules │ │ cutting in" │
│ many cities │ │ - Agent behaviors │ │ │
│ │ │ - Sensor models │ │ "Pedestrian runs │
└──────────────┘ └───────────────────┘ │ across 4-lane │
│ road at night" │
ONE model covers │ │
ALL scenarios │ "Snowy parking │
│ lot with kids" │
└──────────────────┘
Multi-Modal Generation Flow
MULTI-MODAL GENERATION FLOW (Camera + LiDAR + BEV)
Step 1: Encode current multi-modal observations
┌─────────────────────────────────────────────────────────┐
│ │
│ Front Camera ──┐ │
│ Left Camera ──┼──> Visual ──┐ │
│ Right Camera ──┤ Encoder │ │
│ Rear Camera ──┘ │ │
│ ├──> Fused Latent z_t │
│ LiDAR Scan ──> Point Cloud ──┤ │
│ Encoder │ │
│ │ │
│ HD Map ──────> Map Encoder ──┘ │
│ │
└─────────────────────────────────────────────────────────┘
│
v
Step 2: Apply world model dynamics
┌─────────────────────────────────────────────────────────┐
│ │
│ z_t + action_t ──> World Model Core ──> z_{t+1} │
│ │
│ The latent z_{t+1} encodes the FULL next world state │
│ including all modalities in a shared representation │
│ │
└─────────────────────────────────────────────────────────┘
│
v
Step 3: Decode to each modality (geometrically consistent)
┌─────────────────────────────────────────────────────────┐
│ │
│ z_{t+1} │
│ │ │
│ ┌───────┼───────┬──────────┐ │
│ │ │ │ │ │
│ v v v v │
│ Camera LiDAR BEV Occupancy │
│ Decoder Decoder Decoder Decoder │
│ │ │ │ │ │
│ v v v v │
│ ┌────┐ ┌────┐ ┌────────┐ ┌────────┐ │
│ │ │ │ . │ │ Roads │ │ Voxels │ │
│ │ │ │. . │ │ Cars │ │ 3D occ │ │
│ │ │ │. .│ │ Lanes │ │ grid │ │
│ └────┘ └────┘ └────────┘ └────────┘ │
│ 6 views 64-beam Semantic 256^3 │
│ images LiDAR map voxels │
│ │
│ KEY: All outputs are geometrically consistent because │
│ they decode from the SAME latent z_{t+1} │
│ │
└─────────────────────────────────────────────────────────┘
The Scaling Hypothesis for World Models
SCALING BEHAVIOR OF WORLD MODELS
Quality
(FVD, lower = better)
|
| *
|
| * Scaling law:
| FVD ~ C * N^(-alpha)
| *
| where N = model parameters
| * and alpha ~ 0.3-0.5
| *
| *
| * * * *
| * * *
+-----+-----+-----+-----+-----+-----+-----+----->
100M 500M 1B 2B 5B 10B 50B
Model Size (parameters)
Evidence:
- GAIA-1: the ~9B-parameter model outperforms smaller variants on all metrics
- Sora: Scaling improved video quality predictably
- Genie 3: Larger models = more consistent physics
Implication: AD world models will get dramatically better
as compute budgets increase. This is not an architecture
problem -- it is a scaling problem.
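The hypothesized power law can be recovered from (model size, FVD) measurements with a log-log linear fit. The numbers below are synthetic, generated from the stated form; they are not real benchmark results.

```python
import numpy as np

# Hypothetical (model size, FVD) pairs following FVD = C * N^(-alpha)
params = np.array([1e8, 5e8, 1e9, 2e9, 5e9, 1e10])
C_true, alpha_true = 5e4, 0.4
fvd = C_true * params ** (-alpha_true)

# In log space the power law is linear: log FVD = log C - alpha * log N
slope, intercept = np.polyfit(np.log(params), np.log(fvd), deg=1)
alpha_est = -slope
C_est = np.exp(intercept)
print(f"alpha ~ {alpha_est:.2f}, C ~ {C_est:.0f}")
```

With real measurements the fit is noisy, but the same two-parameter regression gives the scaling exponent alpha and lets you extrapolate quality at larger compute budgets.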
Hands-On Exercises
Exercise 1: Build a Minimal World Model
Goal: Implement and train a simple frame prediction model on a toy driving dataset.
Steps:
- Use the SimpleActionConditionedPredictor from Code Example 1
- Create a synthetic dataset: moving colored squares on a gray background, where the action controls the camera's panning direction
- Train for 5000 steps and visualize predictions vs ground truth
- Experiment with different action values and observe how predictions change
Expected outcome: The model should learn that steering left causes the scene to shift right, and acceleration causes objects to grow (approach).
Stretch goal: Add a second object that moves independently of the ego action. Does the model learn to predict both ego-induced and independent motion?
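For step 2, the synthetic dataset can be as simple as the following sketch. The function name and the shift convention are illustrative choices, not part of Code Example 1:

```python
import numpy as np

def make_panning_frame_pair(action: float, size: int = 64, rng=None):
    """One (frame_t, action, frame_t1) training sample for the toy dataset.

    A colored square sits on a gray background; the scalar action in
    [-1, 1] pans the camera, shifting the square horizontally by up to
    max_shift pixels (steering left -> scene content moves right).
    """
    rng = rng or np.random.default_rng()
    max_shift = 8
    x = int(rng.integers(max_shift, size - 16 - max_shift))
    y = int(rng.integers(0, size - 16))

    def render(cx):
        frame = np.full((size, size, 3), 0.5, dtype=np.float32)  # gray background
        frame[y : y + 16, cx : cx + 16] = (1.0, 0.2, 0.2)        # red square
        return frame

    shift = int(round(action * max_shift))
    # action < 0 (steer left) gives a negative shift, so the square
    # is rendered further right in the next frame
    return render(x), action, render(x - shift)

frame_t, a, frame_t1 = make_panning_frame_pair(action=-1.0)
print(frame_t.shape)  # (64, 64, 3)
```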
Exercise 2: Implement Classifier-Free Guidance
Goal: Add classifier-free guidance to the diffusion model from Code Example 2.
Steps:
- Modify the training loop to randomly drop conditioning with probability 0.1
- During generation, compute both conditional and unconditional predictions
- Apply the CFG formula: pred = uncond + w * (cond - uncond)
- Generate scenarios with w = 1.0, 3.0, 7.5, and 15.0
- Compare the results: how does increasing w affect:
- Action responsiveness
- Visual quality
- Diversity (generate 10 samples with same condition, measure variance)
Expected outcome: Higher w produces more action-responsive but less diverse outputs. Very high w (>15) causes artifacts.
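The generation-time combination can be sketched generically. Here `denoiser` is a placeholder for your trained model from Code Example 2, assumed to accept `cond=None` for the unconditional branch trained by the 10% conditioning dropout:

```python
import numpy as np

def cfg_prediction(denoiser, x_noisy, t, cond, w: float):
    """Classifier-free guidance: blend conditional and unconditional
    denoiser outputs at one denoising step.
    """
    pred_cond = denoiser(x_noisy, t, cond)
    pred_uncond = denoiser(x_noisy, t, None)
    # w = 1 recovers the plain conditional prediction; larger w pushes
    # the sample harder toward the conditioning signal.
    return pred_uncond + w * (pred_cond - pred_uncond)

# toy denoiser just to exercise the formula: conditioning adds an offset
toy = lambda x, t, c: x + (0.0 if c is None else c)
x = np.zeros(4)
print(cfg_prediction(toy, x, t=0, cond=2.0, w=3.0))  # 0 + 3 * (2 - 0) = 6.0 everywhere
```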
Exercise 3: Multi-View Consistency Check
Goal: Evaluate whether a world model maintains geometric consistency across multiple camera views.
Steps:
- Given generated multi-view images (front, left, right cameras)
- Run a 3D object detector on each view independently
- Project detected 3D boxes from each view into a common BEV coordinate frame
- Measure consistency: do the same objects appear at the same 3D locations across views?
- Compute a "multi-view consistency score" as IoU of projected boxes
def multi_view_consistency_score(
    front_detections_bev: list,  # List of (x, y, w, l, yaw) in BEV
    left_detections_bev: list,
    right_detections_bev: list,
    iou_threshold: float = 0.3,
) -> float:
    """
    Compute how consistently objects appear across camera views.

    For each object detected in the front view, check if a matching
    object exists in overlapping regions of left/right views.

    Returns: fraction of front-view objects with cross-view matches.
    """
    # Your implementation here:
    # 1. For each front detection, find nearest detection in left/right
    # 2. Compute BEV IoU between matched pairs
    # 3. Count matches above iou_threshold
    # 4. Return match_count / total_front_detections
    pass
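One possible solution sketch, with a deliberate simplification: it matches detections by BEV center distance rather than rotated-box IoU (exact BEV IoU requires polygon clipping, e.g. via shapely), so treat it as a starting point rather than the intended final answer:

```python
import math

def multi_view_consistency_score_simple(
    front_detections_bev, left_detections_bev, right_detections_bev,
    dist_threshold: float = 1.0,
) -> float:
    """Fraction of front-view objects with a nearby match in the left
    or right view, using BEV center distance as a proxy for box IoU.
    """
    others = left_detections_bev + right_detections_bev
    if not front_detections_bev:
        return 1.0  # vacuously consistent
    matched = 0
    for (x, y, *_rest) in front_detections_bev:
        # distance to the nearest detection in any other view
        best = min(
            (math.hypot(x - ox, y - oy) for (ox, oy, *_r) in others),
            default=math.inf,
        )
        if best <= dist_threshold:
            matched += 1
    return matched / len(front_detections_bev)

# two front-view objects; the left camera only confirms the first one
front = [(10.0, 2.0, 4.5, 1.8, 0.0), (30.0, -5.0, 4.5, 1.8, 0.0)]
left = [(10.2, 2.1, 4.5, 1.8, 0.0)]
print(multi_view_consistency_score_simple(front, left, []))  # 0.5
```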
Exercise 4: Action Fidelity Benchmark
Goal: Quantitatively measure whether a world model responds correctly to actions.
Steps:
- Generate 100 scenarios with action = "strong left turn"
- Generate 100 scenarios with action = "strong right turn"
- Generate 100 scenarios with action = "go straight"
- For each generated video, compute the average optical flow direction
- Verify: left-turn videos should have rightward flow, right-turn should have leftward flow, straight should have forward (downward in image) flow
- Compute classification accuracy: can you tell the action from the flow alone?
Expected outcome: A good world model should achieve >90% action classification accuracy from optical flow analysis.
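In practice you would compute dense optical flow (e.g. OpenCV's `calcOpticalFlowFarneback`) and average it. As a dependency-free stand-in for step 4, the dominant horizontal shift between two frames can be estimated by brute force:

```python
import numpy as np

def dominant_horizontal_shift(frame_a: np.ndarray, frame_b: np.ndarray,
                              max_shift: int = 16) -> int:
    """Crude proxy for mean horizontal optical flow: test integer
    shifts of frame_b against frame_a and pick the one minimizing
    mean squared difference. Positive = scene content moved right.
    """
    h, w = frame_a.shape
    best_shift, best_err = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        # compare a central crop of frame_a against frame_b shifted by s
        a = frame_a[:, max_shift : w - max_shift]
        b = frame_b[:, max_shift + s : w - max_shift + s]
        err = np.mean((a - b) ** 2)
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift

# frame_b is frame_a with content moved 5 px right -> expect +5
frame_a = np.random.default_rng(0).random((32, 64))
frame_b = np.roll(frame_a, 5, axis=1)
print(dominant_horizontal_shift(frame_a, frame_b))  # 5
```

A left-turn video should then yield consistently positive shifts (rightward apparent motion), matching the verification rule in step 5.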
Exercise 5: Compare AR vs Diffusion Generation
Goal: Understand the quality trade-offs between autoregressive and diffusion world models.
Steps:
- Implement a simple AR model (predict frame t+1 from frame t using a CNN)
- Implement a simple diffusion model (generate frame t+1 conditioned on frame t)
- Train both on the same dataset
- Compare:
- Per-frame quality (PSNR, SSIM)
- Temporal consistency (frame-to-frame LPIPS variance)
- Diversity (generate 10 samples, measure inter-sample variance)
- Inference speed (wall-clock time per frame)
- Plot quality vs sequence length for both models
Expected outcome: AR models degrade faster over long sequences (error accumulation). Diffusion models have higher per-frame quality but are slower.
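For the per-frame quality comparison, PSNR is straightforward to implement directly; a minimal NumPy version is below (SSIM and LPIPS need more machinery, e.g. scikit-image and the lpips package):

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray,
         max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two frames with pixel values
    in [0, max_val]. Higher is better; identical frames give +inf.
    """
    mse = np.mean((reference.astype(np.float64)
                   - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 0.1  # uniform error of 0.1 -> MSE = 0.01 -> 20 dB
print(f"{psnr(ref, noisy):.1f} dB")  # 20.0 dB
```

Plotting this per frame index for both models should make the AR error-accumulation curve visible directly.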
Exercise 6: Language-Conditioned Scenario Generation
Goal: Add language conditioning to a world model and evaluate controllability.
Steps:
- Extend the diffusion model from Code Example 2 to accept CLIP text embeddings
- Create a small dataset of (video, caption) pairs:
- "Car braking in front" -> videos with decelerating lead vehicle
- "Pedestrian crossing" -> videos with pedestrian in crosswalk
- "Highway driving" -> videos of open highway
- Train with text conditioning (randomly drop text 10% for CFG)
- At inference, generate scenarios from text prompts
- Evaluate: can a human rater correctly identify which prompt generated which video?
- Compute a CLIP-score between generated frames and the text prompt
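Given frame and text embeddings already extracted with a CLIP model (e.g. via the open_clip package), the CLIP-score in the last step reduces to a mean cosine similarity:

```python
import numpy as np

def clip_score(frame_embeddings: np.ndarray,
               text_embedding: np.ndarray) -> float:
    """Mean cosine similarity between per-frame image embeddings of
    shape (T, D) and a single prompt text embedding of shape (D,).
    Assumes the embeddings were produced by a CLIP image/text encoder.
    """
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1,
                                          keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    return float(np.mean(f @ t))

# toy check: frames whose embeddings are parallel to the text score 1.0
text = np.array([1.0, 0.0, 0.0])
frames = np.tile(text * 3.0, (4, 1))  # (4, 3)
print(clip_score(frames, text))  # 1.0
```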
Interview Questions
Question 1: What is a world model and how does it differ from a traditional simulator?
Answer hints: A world model is a learned neural network that predicts future states of the environment given current state and actions. Unlike traditional simulators that use hand-crafted physics engines, rendering pipelines, and scripted behaviors, world models learn these dynamics from data. Key differences: (1) world models can capture complex behaviors that are hard to hand-code, (2) they generalize to new scenarios by interpolating learned patterns, (3) they may hallucinate physically implausible scenarios since they have no hard physics constraints. The term originates from Ha and Schmidhuber (2018) and has been applied to AD by systems like GAIA-1 and the Waymo World Model.
Question 2: Compare autoregressive and diffusion-based approaches for driving world models. When would you choose one over the other?
Answer hints: Autoregressive models generate frames sequentially, conditioning each frame on all previous ones. They naturally capture temporal dependencies and are conceptually simple (next-token prediction). However, they suffer from error accumulation over long sequences and generate pixels/tokens sequentially within each frame. Diffusion models generate entire frames (or video clips) by iteratively denoising from noise. They produce higher-quality individual frames, naturally handle continuous outputs, and offer strong controllability via CFG. However, they require many denoising steps (slow) and can struggle with long-range temporal consistency. Choose AR for real-time applications or when temporal coherence over many frames is critical. Choose diffusion when per-frame quality and controllability are priorities. Hybrid approaches (like Epona) combine both.
Question 3: How does the Waymo World Model leverage DeepMind's Genie 3?
Answer hints: Genie 3 provides the foundation architecture -- a scalable spatiotemporal transformer designed for interactive environment generation. Key elements carried over: (1) latent action models that learn meaningful action representations from data rather than requiring predefined action spaces, (2) an architecture optimized for generating long, consistent video sequences that respond causally to input actions, (3) scalable training infrastructure. Waymo extends this with AD-specific capabilities: multi-sensor output (camera + LiDAR), driving-specific conditioning (HD maps, traffic rules), and language controllability for rare event generation.
Question 4: What are the main evaluation metrics for a driving world model? Why is evaluation difficult?
Answer hints: Image quality (FID, LPIPS, PSNR, SSIM), video quality (FVD), driving-specific metrics (action fidelity, physical plausibility, temporal consistency, multi-view consistency), and downstream task performance (train a perception model on generated data, test on real data). Evaluation is difficult because: (1) there is no single ground truth for generated scenarios (many valid futures exist), (2) standard image metrics (FID) do not capture driving-specific quality (physics, consistency), (3) human evaluation is expensive and subjective, (4) the best metric -- downstream AD stack performance -- is extremely expensive to compute.
Question 5: Explain the controllability-diversity trade-off in world models. How is it managed in practice?
Answer hints: Higher controllability (the output closely follows the conditioning signal) reduces diversity (fewer varied outputs for the same condition). This is managed primarily through classifier-free guidance (CFG) weight: higher weight increases control at the cost of diversity. Temperature scaling and sampling strategies (DDIM vs DDPM) also affect this trade-off. In practice, different use cases need different settings: safety testing needs high control (specific scenario), training data generation needs high diversity (varied examples), and coverage testing needs both (diverse variants of specific scenario types).
Question 6: How can world models generate safety-critical scenarios that are rare in training data?
Answer hints: Several approaches: (1) Language conditioning -- describe the rare scenario in natural language ("pedestrian darts from behind parked bus"); (2) Guided sampling -- use a reward model to bias generation toward high-risk scenarios; (3) Latent space manipulation -- interpolate between known scenarios in latent space to create novel combinations; (4) Compositional generation -- combine common elements (intersection + cyclist + occluder) to create rare compositions; (5) Adversarial generation -- optimize the conditioning to find scenarios that cause the AD stack to fail. The Waymo World Model specifically emphasizes rare event generation as a key capability.
Question 7: What is the three-tier taxonomy of AD world models?
Answer hints: Tier 1 (Future Physical World Generation) generates sensor observations -- images, point clouds, BEV maps, occupancy grids. Used for data augmentation and sensor simulation. Tier 2 (Agent Behavior Prediction) predicts future trajectories of other agents. Used for motion forecasting and behavior simulation. Tier 3 (Interactive Prediction + Planning) integrates the world model into the planning loop -- the planner imagines consequences of actions using the world model before committing. This is model-based RL applied to driving. Most current production systems use Tier 1 and 2. Tier 3 is an active research frontier.
Question 8: How do world models handle multi-sensor consistency (camera + LiDAR)?
Answer hints: Three main approaches: (1) Independent generation with consistency loss -- separate models for each sensor, trained with a loss encouraging agreement. Simple but weak consistency. (2) Shared backbone with modality-specific heads -- a single encoder-decoder with shared latent representations and separate output heads. Better consistency through shared features. (3) Generate 3D first, then render -- create a 3D scene representation (voxels, point cloud, neural field) and render/raytrace to each sensor modality. Best consistency but most expensive. The Waymo World Model uses a shared representation approach where camera and LiDAR features are jointly processed.
Question 9: Compare reconstruction-based simulation (3DGS/NeRF) with generative world models. When would you use each?
Answer hints: Reconstruction excels at: exact replay of real routes, high geometric accuracy, sensor-level fidelity for specific scenes. Use it for regression testing, perception validation on known routes, and when you need deterministic, reproducible results. Generative world models excel at: novel scenario creation, long-tail event simulation, scalable scenario coverage, and language-controlled specification. Use them for safety scenario exploration, training data augmentation, and testing against scenarios not in your driving logs. The industry is converging on hybrid approaches that use reconstruction for the base scene and generation for dynamic elements, weather, and novel situations.
Question 10: What are the main unsolved challenges in AD world models as of 2026?
Answer hints: (1) Long-horizon consistency -- maintaining coherent, physically valid generation over minutes, not just seconds; (2) Geometric precision -- ensuring generated 3D structure is accurate enough for planning (pixel-level realism does not mean geometric accuracy); (3) Real-time generation -- current models are too slow for hardware-in-the-loop testing; (4) Evaluation -- no consensus on how to measure whether a world model is "good enough" for safety validation; (5) Guaranteed physical validity -- preventing the model from generating impossible scenarios (cars clipping through walls); (6) Multi-agent interaction -- generating realistic reactive behaviors for all agents, not just ego-centric prediction; (7) Sim-to-real transfer -- ensuring that AD systems tested in world-model simulation perform similarly in the real world.
References
Foundational Papers
- Ha, D. and Schmidhuber, J. (2018). "World Models." arXiv:1803.10122. The foundational paper defining world models for RL agents.
- Ho, J., Jain, A., and Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS 2020. The paper that launched modern diffusion models.
- Rombach, R. et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. Latent diffusion (Stable Diffusion) -- the foundation for many driving world models.
Key AD World Model Papers
- Hu, A. et al. (2023). "GAIA-1: A Generative World Model for Autonomous Driving." arXiv:2309.17080. Wayve's 9B-parameter world model.
- Wang, W. et al. (2023). "DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving." arXiv:2309.09777. First diffusion-based world model derived from real driving scenarios.
- Yang, Z. et al. (2024). "GenAD: Generalized Predictive Model for Autonomous Driving." CVPR 2024. Action-conditioned generation with temporal reasoning.
- Zheng, W. et al. (2024). "OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving." ECCV 2024. World model in 3D occupancy space.
- Gao, G. et al. (2024). "Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability." arXiv:2405.17398.
Surveys
- Wang, X. et al. (2024). "World Models for Autonomous Driving: An In-Depth Survey." arXiv:2403.02622. Comprehensive survey with taxonomy.
- Gao, Y. et al. (2024). "A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming." Covers video generation foundations.
Related Systems
- Bruce, J. et al. (2024). "Genie: Generative Interactive Environments." ICML 2024. DeepMind's interactive environment generation (foundation for Genie 3).
- Yang, L. et al. (2024). "Generative Data-Driven Simulation." NVIDIA Dreamland -- hybrid physics + neural generation.
- Waymo (2026). "Introducing the Waymo World Model." Blog post describing the Genie 3-based world model for camera + LiDAR generation.
Reconstruction-Based Comparisons
- Yang, J. et al. (2023). "UniSim: A Neural Closed-Loop Sensor Simulator." CVPR 2023. Waabi's neural rendering approach.
- Kerbl, B. et al. (2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering." SIGGRAPH 2023. Foundation for 3DGS-based simulation.
- Yan, Z. et al. (2024). "Street Gaussians for Modeling Dynamic Urban Scenes." ECCV 2024. 3DGS applied to driving scenes.
Additional Resources
- NVIDIA Cosmos (2025). Foundation world model platform for physical AI.
- OpenAI Sora (2024). Large-scale video generation model demonstrating scaling laws for video quality.
- Gulino, C. et al. (2023). "Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research." NeurIPS 2023. Complements world models as a behavior-level simulator.
- WOSAC Challenge (2024). Waymo Open Sim Agents Challenge -- benchmark for evaluating agent behavior simulation quality.