
World Models for AD

GAIA-1, Waymo World Model, DriveDreamer, and generative simulation as an alternative to reconstruction-based approaches.

World Models for Autonomous Driving Simulation: A Deep Dive

Focus: Generative world models as learned simulators for autonomous driving
Key Systems: GAIA-1 (Wayve), Waymo World Model (Genie 3), DriveDreamer, Epona, GenAD
Read Time: 60 min


Table of Contents

  1. Executive Summary
  2. Background and Motivation
  3. Foundational Concepts
  4. Key World Models for AD
  5. Three-Tier Taxonomy
  6. Technical Deep Dive
  7. World Models vs Reconstruction-Based Simulation
  8. Code Examples
  9. Mental Models and Diagrams
  10. Hands-On Exercises
  11. Interview Questions
  12. References

Executive Summary

The Core Idea

World models are neural networks that learn to simulate the world. Instead of hand-engineering physics, rendering pipelines, and agent behaviors, a world model observes real driving data and learns to predict what happens next -- given the current state and an action, it generates the future state of the world.

Traditional Simulation:                 World Model Simulation:

  Hand-crafted       Scripted              Learned from         Emergent
  physics engine     behaviors             real data            behaviors
       |                |                      |                    |
       v                v                      v                    v
  +-----------+    +-----------+         +---------------------------+
  | Physics   |    | Behavior  |         |   Neural World Model      |
  | Engine    |--->| Scripts   |         |   (Single Learned Model)  |
  | (Bullet/  |    | (Rule-    |         |                           |
  |  PhysX)   |    |  based)   |         |   Input: state + action   |
  +-----------+    +-----------+         |   Output: next state      |
       |                |                +---------------------------+
       v                v                           |
  +-----------+    +-----------+                    v
  | Renderer  |    | Sensor    |         +---------------------------+
  | (Raster/  |    | Sim       |         |  Camera, LiDAR, BEV,     |
  |  RT)      |    |           |         |  Occupancy -- all in one  |
  +-----------+    +-----------+         +---------------------------+

Why This Matters Now

Three converging trends have made world models for AD practical:

  1. Scaling laws for video generation: Models like Sora, Genie 2/3, and Cosmos demonstrated that video generation quality scales predictably with compute and data.

  2. Sensor-realistic generation: Recent models generate not just camera images but also LiDAR point clouds, BEV maps, and occupancy grids -- the full sensor suite needed for AD testing.

  3. Controllability breakthroughs: Language-conditioned and action-conditioned generation means engineers can specify scenarios ("a pedestrian jaywalks from behind a parked truck") rather than hand-scripting them.

The Landscape (Early 2026)

System       | Organization   | Date     | Key Innovation
------------ | -------------- | -------- | ------------------------------------------------
GAIA-1       | Wayve          | Jul 2023 | First large-scale (9B param) AD world model
DriveDreamer | GigaAI         | Oct 2023 | First from real driving data, diffusion-based
ADriver-I    | CASIA          | Nov 2023 | Interleaved vision-action tokens
GenAD        | CUHK/SenseTime | Mar 2024 | Temporal reasoning, action-conditioned
OccWorld     | PKU            | May 2024 | 3D occupancy world model
Vista        | MIT/MBZUAI     | May 2024 | High-fidelity long-horizon driving simulator
Dreamland    | NVIDIA         | Jun 2024 | Physics simulator + video generator hybrid
Epona        | Various        | 2025     | Autoregressive diffusion world model
Waymo WM     | Waymo/DeepMind | Feb 2026 | Genie 3 backbone, camera+LiDAR, language control

Background and Motivation

What Are World Models?

A world model is a learned function that predicts the next state of the environment given the current state and an action:

s_{t+1} = f_theta(s_t, a_t)

Where:

  • s_t is the current world state (images, point clouds, maps, agent states)
  • a_t is the action taken (steering, acceleration, or higher-level commands)
  • f_theta is a neural network with learned parameters theta
  • s_{t+1} is the predicted next state

The term originates from cognitive science (Kenneth Craik, 1943) and was formalized for RL by Ha and Schmidhuber (2018) in "World Models" -- where an agent learns a compressed representation of the environment to plan actions within an imagined future.

For autonomous driving, this concept becomes:

 Given: current camera images, LiDAR scans, HD map, ego vehicle action
 Predict: future camera images, LiDAR scans, positions of all agents

 More formally:
    {I_{t+1}, L_{t+1}, B_{t+1}} = WorldModel({I_t, L_t, B_t, M}, a_t)

 Where:
    I = camera images (multi-view)
    L = LiDAR point cloud
    B = BEV representation
    M = HD map / road topology
    a = ego action (steering, acceleration)
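
As a concrete (if schematic) rendering of this interface in PyTorch-style Python -- all names here are illustrative, not taken from any of the systems discussed:

from dataclasses import dataclass

import torch


@dataclass
class WorldState:
    """One time slice of the simulated world (names are illustrative)."""
    images: torch.Tensor   # I: multi-view camera images, [num_cams, 3, H, W]
    lidar: torch.Tensor    # L: LiDAR point cloud, [N, 3]
    bev: torch.Tensor      # B: BEV representation, [C_bev, H_bev, W_bev]


def world_model_step(f_theta, state: WorldState, hd_map: torch.Tensor,
                     action: torch.Tensor) -> WorldState:
    """One simulation step: {I, L, B}_{t+1} = f_theta({I, L, B}_t, M, a_t)."""
    return f_theta(state, hd_map, action)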

Why Generative Simulation Matters

Traditional simulation has a fundamental problem: everything must be authored. Every building, every pedestrian behavior, every lighting condition, every sensor artifact must be explicitly modeled by engineers. This creates three critical gaps:

1. The Long-Tail Problem

Frequency of Scenarios in Real Driving:

  |
  |####
  |########
  |############                     <-- Common scenarios
  |##################                   (well-covered by traditional sim)
  |########################
  |##############################
  |####################################
  |############################################
  |######################################################
  |##################################################################
  |########################################################################
  |                                            ^^^^^^^^^^^^^^^^^^^^^^^^
  |                                            Long-tail scenarios
  |                                            (hard to author manually,
  |                                             but critical for safety)
  +---------------------------------------------------------------------->
                     Scenario Type (ordered by frequency)

Real driving encounters an enormous variety of rare situations. Manually authoring these is expensive and incomplete. World models can generate novel scenarios by recombining patterns learned from data.

2. The Realism Problem

Hand-crafted simulators produce outputs that look synthetic. Neural networks trained on these synthetic outputs develop biases. World models generate outputs that are distributionally closer to real sensor data because they learned directly from it.

3. The Scalability Problem

Creating a new city in a traditional simulator takes months. A world model trained on diverse data can generate new environments by interpolating between learned representations. Conditioning on maps, language, or style tokens enables rapid scenario creation.

Reconstruction-Based vs Generative Approaches

The AD simulation community has two major paradigms for creating realistic sensor data:

Reconstruction-Based (3DGS / NeRF)

These approaches reconstruct a 3D representation of a specific real-world scene from captured sensor data, then re-render from novel viewpoints.

Real Drive Log ──> 3D Reconstruction ──> Novel View Synthesis
                   (3DGS, NeRF)          (move camera, change
                                           agent positions)

Key systems: UniSim (Waabi), MARS, StreetGaussians, SplatAD

Strengths:

  • Photorealistic for the specific captured scene
  • Geometrically consistent (real 3D structure)
  • Good for replay-with-perturbation testing

Limitations:

  • Bound to captured scenes (cannot generate truly new environments)
  • Requires dense capture data for each scene
  • Editing (adding/removing objects) is challenging
  • Dynamic objects are particularly hard

Generative / World Models

These approaches learn a generative model from large datasets and can synthesize entirely new scenarios.

Large Driving Dataset ──> Train World Model ──> Generate New Scenarios
                          (diffusion, AR)        (action-conditioned,
                                                   language-controlled)

Key systems: GAIA-1, Waymo World Model, DriveDreamer, GenAD

Strengths:

  • Can generate entirely new scenarios never seen in training data
  • Naturally handles long-tail and rare events
  • Scalable -- one model covers many environments
  • Controllable via language, actions, or layouts

Limitations:

  • May hallucinate physically implausible scenarios
  • Geometric consistency not guaranteed
  • Evaluation is harder (no ground truth for generated scenes)
  • Computationally expensive at inference time

Comparison Table

Dimension            | Reconstruction (3DGS/NeRF)        | Generative (World Models)
-------------------- | --------------------------------- | --------------------------------
Scene Coverage       | Only captured scenes              | Any scene (generalized)
Geometric Accuracy   | High (real 3D)                    | Approximate (learned)
Photorealism         | Very high for captured scene      | High but can hallucinate
Novel Scenario Gen   | Limited (perturbation only)       | Strong (recombination)
Data Requirements    | Dense per-scene capture           | Large diverse dataset
Edit/Control         | Limited (object removal hard)     | Flexible (language, action)
Physical Consistency | Depends on physics model          | Learned (can be wrong)
Multi-Sensor         | Per-sensor reconstruction         | Joint generation possible
Compute (Training)   | Per-scene optimization            | One-time large training
Compute (Inference)  | Fast rendering                    | Slow generation
Industry Example     | Applied Intuition                 | Waymo World Model
Best For             | Regression testing on real routes | Scenario exploration, long-tail

Foundational Concepts

Autoregressive Sequence Modeling for Video and Scenes

Autoregressive (AR) models generate sequences one token at a time, where each new token is conditioned on all previous tokens. For world models, the "tokens" can be image patches, latent codes, or discrete codes from a VQ-VAE.

Autoregressive Video Generation:

  Frame 1     Frame 2     Frame 3     Frame 4
  ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
  │     │───>│     │───>│     │───>│  ?  │
  │ z_1 │    │ z_2 │    │ z_3 │    │ z_4 │
  └─────┘    └─────┘    └─────┘    └─────┘
     |           |           |          ^
     v           v           v          |
  ┌──────────────────────────────────────┐
  │      Transformer (causal mask)       │
  │                                      │
  │  P(z_4 | z_1, z_2, z_3, action)     │
  └──────────────────────────────────────┘

  Each z_t is a set of discrete tokens (VQ-VAE codes)
  or continuous latent vectors (VAE embeddings).

How GAIA-1 uses this: GAIA-1 tokenizes video frames, text descriptions, and actions into a single sequence of discrete tokens, then trains a large Transformer (9B parameters) to predict the next token autoregressively. This is conceptually identical to how GPT generates text -- but the "vocabulary" includes visual tokens.

Key design choices:

  • Tokenization: VQ-VAE or VQ-GAN converts images to discrete tokens
  • Sequence ordering: Raster-scan within frame, temporal order across frames
  • Conditioning: Actions and text are interleaved as special tokens
  • Sampling: Temperature and top-k/top-p control generation diversity
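
To make the last two choices concrete, here is a hedged sketch of the per-frame sampling loop -- not Wayve's actual code; `model` stands for any causal Transformer over the mixed token vocabulary:

import torch
import torch.nn.functional as F


@torch.no_grad()
def sample_next_frame_tokens(
    model,                   # causal Transformer: [B, S] token ids -> [B, S, vocab] logits
    context_tokens,          # [B, S] interleaved text/action/visual tokens so far
    tokens_per_frame: int,   # number of discrete codes that make up one frame
    temperature: float = 0.9,
    top_k: int = 50,
):
    """Sample the discrete codes of the next frame, one token at a time."""
    seq = context_tokens
    for _ in range(tokens_per_frame):
        logits = model(seq)[:, -1, :] / temperature           # next-token logits
        topk_vals, _ = torch.topk(logits, top_k, dim=-1)
        logits[logits < topk_vals[:, [-1]]] = -float("inf")   # top-k filtering
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # stochastic sampling
        seq = torch.cat([seq, next_token], dim=1)
    return seq[:, -tokens_per_frame:]                         # codes for the new frame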

Diffusion Models for Generation

Diffusion models generate data by learning to reverse a noise-addition process. Starting from pure Gaussian noise, the model iteratively denoises to produce a clean sample.

Forward Process (Training):
  Clean Image ──> Add Noise ──> Add More Noise ──> ... ──> Pure Noise
     x_0           x_1              x_2                      x_T
                q(x_t | x_{t-1}) = N(sqrt(1-beta_t) * x_{t-1}, beta_t * I)

Reverse Process (Generation):
  Pure Noise ──> Denoise ──> Denoise ──> ... ──> Clean Image
     x_T          x_{T-1}     x_{T-2}              x_0
                p_theta(x_{t-1} | x_t) -- learned by neural network
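
The training side of this process reduces to noise regression. A minimal sketch of the standard DDPM objective, assuming `denoiser` is any network that takes a noisy sample and a timestep:

import torch
import torch.nn.functional as F


def make_noise_schedule(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linear beta schedule; returns alpha_bar_t = prod_{s<=t} (1 - beta_s), shape [T]."""
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)


def ddpm_training_loss(denoiser, x0: torch.Tensor, alpha_bars: torch.Tensor):
    """Noise a clean sample to a random timestep and regress the injected noise."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(B, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # q(x_t | x_0) in closed form
    return F.mse_loss(denoiser(x_t, t), eps)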

For driving world models, diffusion offers several advantages over AR:

  1. Spatial coherence: Generates the entire frame at once (no raster-scan artifacts)
  2. Continuous outputs: Naturally handles continuous-valued sensor data
  3. Classifier-free guidance: Strong controllability through conditioning
  4. Quality: State-of-the-art image quality at high resolutions

Video diffusion extends this to sequences by operating on space-time volumes:

Video Diffusion Model:

  Input: Noisy video volume [T, H, W, C] + conditioning signals
                                  |
                                  v
                    ┌──────────────────────────┐
                    │   3D U-Net / DiT         │
                    │                          │
                    │   Spatial Attention       │
                    │   Temporal Attention      │
                    │   Cross-Attention (cond)  │
                    │                          │
                    └──────────────────────────┘
                                  |
                                  v
                    Predicted noise (or clean video)

Action-Conditioned Generation

The defining feature of a world model (vs a generic video generator) is action conditioning -- the model's output changes based on what action the ego vehicle takes.

Action-Conditioning Mechanisms:

  1. TOKEN CONCATENATION (GAIA-1 style):
     [img_tokens_t, action_t, img_tokens_{t+1}, action_{t+1}, ...]
     - Action is just another token in the sequence

  2. CROSS-ATTENTION (Diffusion style):
     ┌──────────────────┐
     │  Video Denoiser  │ <─── Cross-Attend ──── Action Embedding
     └──────────────────┘

  3. CONTROL SIGNAL INJECTION (ControlNet style):
     ┌──────────────────┐
     │  Video Denoiser  │
     │        +         │
     │  Action Branch   │ <─── Action sequence [a_1, ..., a_T]
     │  (parallel net)  │
     └──────────────────┘

  4. FiLM CONDITIONING:
     For each layer: h = gamma(action) * h + beta(action)
     - Action modulates feature maps via learned affine transforms

Action representations for driving:

Representation     | Dimensionality | Description
------------------ | -------------- | ---------------------------------------------
Low-level          | 2D             | (steering_angle, acceleration)
Waypoints          | T x 2D         | Future ego positions [(x1,y1), ..., (xT,yT)]
High-level         | Categorical    | {turn_left, go_straight, turn_right, stop}
Language           | Variable       | "Turn right at the next intersection"
Trajectory + speed | T x 3D         | [(x, y, speed)_1, ..., (x, y, speed)_T]

Multi-Modal Output Generation

Real AD systems consume data from multiple sensors. A world model must generate consistent outputs across these modalities.

Multi-Modal World Model:

                    ┌─────────────────────────┐
                    │    Shared Latent Space   │
                    │                         │
                    │   z_t (world state)     │
                    │                         │
                    └────┬──────┬──────┬──────┘
                         │      │      │
                    ┌────┘      │      └────┐
                    │           │           │
                    v           v           v
              ┌──────────┐ ┌────────┐ ┌──────────┐
              │  Camera   │ │ LiDAR  │ │   BEV    │
              │  Decoder  │ │ Decoder│ │  Decoder │
              └──────────┘ └────────┘ └──────────┘
                    │           │           │
                    v           v           v
              Multi-view    Point       Bird's Eye
              Images        Cloud       View Map

The joint generation challenge: Camera images and LiDAR point clouds must be geometrically consistent. A car visible in the camera image must also appear as points in the LiDAR scan at the correct 3D position. This requires either:

  1. Shared 3D representation: Generate a 3D scene first, then render to each sensor
  2. Cross-modal attention: Let the camera and LiDAR decoders attend to each other
  3. Joint latent space: Encode both modalities into a unified latent, decode together

The Waymo World Model (2026) achieves this by operating in a shared representation space where camera and LiDAR features are jointly processed before being decoded to their respective modalities.
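
Whichever of the three mechanisms is used, the consistency claim can be checked directly: project generated LiDAR points into the generated camera view using sensor calibration and verify they land on the right pixels. A minimal sketch (calibration inputs and names are illustrative):

import torch


def project_lidar_to_image(
    points_lidar: torch.Tensor,      # [N, 3] generated LiDAR points, sensor frame
    T_cam_from_lidar: torch.Tensor,  # [4, 4] extrinsic calibration
    K: torch.Tensor,                 # [3, 3] camera intrinsics
    image_hw: tuple[int, int],
):
    """Project LiDAR points to pixel coordinates; returns (uv, validity mask)."""
    N = points_lidar.shape[0]
    homog = torch.cat([points_lidar, torch.ones(N, 1, device=points_lidar.device)], dim=1)
    pts_cam = (T_cam_from_lidar @ homog.T).T[:, :3]   # transform into camera frame
    in_front = pts_cam[:, 2] > 0.1                    # drop points behind the camera
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)     # perspective divide
    H, W = image_hw
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv, in_front & in_image

A generated car should yield LiDAR points whose projections land on the car's pixels; the fraction that do is a cheap cross-sensor consistency score, and the same projection can back the consistency loss discussed in the Technical Deep Dive below.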


Key World Models for AD

GAIA-1 (Wayve, July 2023)

GAIA-1 was a landmark paper that demonstrated world models could work at scale for autonomous driving. It showed that the same scaling laws powering LLMs apply to learned driving simulators.

Architecture:

GAIA-1 Architecture:

  Video Frames ──> Image Tokenizer (VQ-VAE) ──> Visual Tokens
  Text Prompts ──> Text Tokenizer             ──> Text Tokens
  Actions      ──> Action Tokenizer           ──> Action Tokens
       │                │                           │
       └────────────────┼───────────────────────────┘
                        │
                        v
              ┌──────────────────────┐
              │  World Model         │
              │  (Transformer, 9B)   │
              │                      │
              │  Autoregressive next │
              │  token prediction    │
              └──────────────────────┘
                        │
                        v
              ┌──────────────────────┐
              │  Video Decoder       │
              │  (VQ-VAE decoder +   │
              │   upsampling)        │
              └──────────────────────┘
                        │
                        v
                  Generated Video
                  (realistic driving scenes)

Key specifications:

Property         | Value
---------------- | ------------------------------------------
Parameters       | 9 billion
Training Data    | Wayve's proprietary UK driving data
Input Modalities | Video, text, action
Output           | Video (camera images)
Resolution       | 288 x 512 (base), upsampled to higher res
Frame Rate       | ~25 FPS generation
Sequence Length  | Variable, demonstrated up to ~30 seconds
Tokenizer        | VQ-VAE with 8192 codebook size

Key contributions:

  1. Scaling: Showed that bigger models produce better simulations (9B >> 1B)
  2. Multi-modal conditioning: Text descriptions + actions jointly control generation
  3. Emergent understanding: The model learned 3D structure, object permanence, and basic physics without explicit supervision
  4. Generalization: Could generate scenarios not seen in training data

Limitations:

  • Camera-only (no LiDAR generation)
  • Proprietary data and model (not reproducible)
  • Temporal consistency degrades over long horizons
  • No explicit physics guarantees

Waymo World Model (February 2026)

Waymo's World Model represents the most capable publicly announced AD world model as of early 2026. Built on DeepMind's Genie 3 foundation, it combines advances in video generation with AD-specific capabilities.

Architecture Overview:

Waymo World Model (built on Genie 3):

  ┌───────────────────────────────────────────────────────────┐
  │                    INPUT ENCODING                         │
  │                                                           │
  │  Multi-view Cameras  ──> Visual Encoder ──┐                │
  │  LiDAR Point Cloud   ──> LiDAR Encoder  ──┤                │
  │  HD Map / Road Graph ──> Map Encoder    ──┼──> Fused       │
  │  Language Command    ──> Text Encoder   ──┤    State z_t   │
  │  Ego Action          ──> Action Encoder ──┘                │
  └───────────────────────────────────────────────────────────┘
                              │
                              v
  ┌───────────────────────────────────────────────────────────┐
  │              GENIE 3 BACKBONE                             │
  │                                                           │
  │  Spatiotemporal Transformer with:                         │
  │  - Latent action model (learns action space)              │
  │  - Dynamics model (predicts next latent state)            │
  │  - Scalable architecture (billions of parameters)         │
  │                                                           │
  │  Key innovation from Genie 3:                             │
  │  - Operates on latent tokens, not raw pixels              │
  │  - Learned action representations                         │
  │  - Consistent generation over long horizons               │
  └───────────────────────────────────────────────────────────┘
                              │
                              v
  ┌───────────────────────────────────────────────────────────┐
  │              MULTI-MODAL DECODERS                         │
  │                                                           │
  │  z_{t+1} ──> Camera Decoder ──> Multi-view images        │
  │          ──> LiDAR Decoder  ──> Point cloud               │
  │          ──> BEV Decoder    ──> Bird's eye view           │
  │          ──> Agent Decoder  ──> Bounding boxes / tracks   │
  └───────────────────────────────────────────────────────────┘

Key capabilities:

  1. Multi-sensor generation: Simultaneously generates camera images AND LiDAR point clouds that are geometrically consistent with each other.

  2. Language controllability: Natural language commands control what happens in the generated scenario.

    • "A cyclist enters the intersection from the left"
    • "The lead vehicle brakes suddenly"
    • "Heavy rain reduces visibility"
  3. Rare event generation: Can generate safety-critical scenarios that are extremely rare in real driving data.

    • Pedestrian darting from behind occluded vehicle
    • Multi-vehicle chain-reaction collisions
    • Sensor degradation scenarios (rain, fog, glare)
  4. Long-horizon consistency: Built on Genie 3's architecture for maintaining coherent generation over extended time periods.

  5. Closed-loop interaction: Generated world responds to the AD system's actions, enabling closed-loop testing.

Why Genie 3 matters as a foundation:

Genie (1, 2, 3) is DeepMind's line of generative interactive environment models. The key innovations that carry over to the Waymo World Model:

  • Latent action models: Instead of requiring predefined action spaces, Genie learns meaningful action representations from observation data alone
  • Scalable spatiotemporal architecture: Efficiently processes video-length sequences with attention mechanisms optimized for space-time data
  • Interactive generation: The model generates frames that respond causally to input actions, not just plausible-looking video continuations

DriveDreamer (GigaAI, October 2023)

DriveDreamer was the first world model built entirely from real-world driving scenarios using a diffusion-based approach.

Architecture:

DriveDreamer Pipeline:

  Real Driving Data
       │
       ├──> 3D Boxes + HDMap ──> Structured Representation
       │                                │
       │                                v
       │                    ┌─────────────────────┐
       │                    │  Layout Encoder      │
       │                    │  (3D box positions,  │
       │                    │   road topology)     │
       │                    └─────────┬───────────┘
       │                              │
       v                              v
  ┌──────────────┐         ┌──────────────────────┐
  │  Past Frames │────────>│  Video Diffusion     │
  │  (context)   │         │  Model               │
  └──────────────┘         │  (Stable Diffusion   │
                           │   + temporal layers)  │
                           └──────────┬───────────┘
                                      │
                                      v
                             Future Frame Prediction

Key contributions:

  • First to use structured driving representations (3D boxes, HD maps) as conditioning for a diffusion-based video generator
  • Demonstrated that diffusion models can produce temporally consistent driving videos
  • Introduced a two-stage training scheme: (1) align visual features with structured representations, (2) train conditional video generation

Epona (ICCV 2025)

Epona combines the best of autoregressive and diffusion approaches into a unified autoregressive diffusion world model.

Epona: Autoregressive Diffusion

  Frame t-2    Frame t-1    Frame t      Frame t+1 (generate)
  ┌──────┐    ┌──────┐    ┌──────┐     ┌──────────────┐
  │      │    │      │    │      │     │  Diffusion    │
  │ z    │───>│ z    │───>│ z    │────>│  Process      │
  │      │    │      │    │      │     │  (iterative   │
  └──────┘    └──────┘    └──────┘     │   denoising)  │
                                       └──────────────┘
                                              │
  Autoregressive: each frame depends          v
  on all previous frames              Generated Frame t+1

  Diffusion: each individual frame
  is generated via diffusion denoising

  HYBRID BENEFIT:
  - AR gives temporal consistency (causal structure)
  - Diffusion gives spatial quality (parallel pixel generation)

Key insight: Rather than choosing between AR (good temporal coherence) and diffusion (good spatial quality), Epona uses AR to model the temporal sequence while using diffusion to generate each individual frame. This is conceptually similar to how MAR (Masked Autoregressive generation) works for images, but applied to video.
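
A schematic rollout loop for such a hybrid -- a sketch of the general pattern, not Epona's published code -- where `temporal_model` summarizes history into conditioning and `frame_denoiser` performs one denoising update:

import torch


@torch.no_grad()
def ar_diffusion_rollout(
    temporal_model,              # history latents [B, T, C, h, w] -> conditioning
    frame_denoiser,              # (z, step, cond) -> slightly-less-noisy z
    past_latents: torch.Tensor,  # [B, T, C, h, w] encoded context frames
    num_future_frames: int,
    num_denoise_steps: int,
) -> torch.Tensor:
    """Outer loop is autoregressive over frames; inner loop is diffusion per frame."""
    for _ in range(num_future_frames):
        cond = temporal_model(past_latents)          # causal temporal context
        z = torch.randn_like(past_latents[:, -1])    # each new frame starts from noise
        for step in reversed(range(num_denoise_steps)):
            z = frame_denoiser(z, step, cond)        # iterative denoising of this frame
        past_latents = torch.cat([past_latents, z.unsqueeze(1)], dim=1)
    return past_latents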

Dreamland (NVIDIA, 2024)

Dreamland takes a unique hybrid approach: it couples a traditional physics simulator with a neural video generator to get the best of both worlds.

Dreamland Architecture:

  ┌─────────────────────────────────────────────────────┐
  │                PHYSICS SIMULATOR                     │
  │                                                     │
  │   Vehicle dynamics ──> Positions, velocities        │
  │   Collision detection ──> Contact forces             │
  │   Road network ──> Valid trajectories                │
  │                                                     │
  │   OUTPUT: Physically-valid agent states              │
  └────────────────────────┬────────────────────────────┘
                           │ Agent positions,
                           │ velocities, poses
                           v
  ┌─────────────────────────────────────────────────────┐
  │              VIDEO GENERATOR                         │
  │                                                     │
  │   Takes physically-valid states as conditioning      │
  │   Generates photorealistic sensor observations       │
  │                                                     │
  │   Neural renderer:                                   │
  │   - Conditioned on agent layouts from physics sim    │
  │   - Generates multi-view camera images               │
  │   - Maintains visual consistency                     │
  └─────────────────────────────────────────────────────┘

The key insight: Physics simulators are good at vehicle dynamics and collision detection. Neural generators are good at photorealism. Dreamland does not try to replace physics -- it uses a physics engine for what it does well and neural generation for what it does well.

Advantages:

  • Physically valid trajectories (no cars driving through walls)
  • Photorealistic rendering (no synthetic look)
  • Controllable (adjust physics parameters directly)

GenAD and Other Recent Models

GenAD (CUHK/SenseTime, 2024) focuses on temporal reasoning for AD world models:

  • Uses a temporal reasoning module to capture long-range dependencies
  • Action-conditioned generation with ego trajectory as input
  • Demonstrates strong performance on nuScenes benchmark

OccWorld (PKU, 2024) operates in 3D occupancy space:

  • Generates future 3D occupancy grids instead of images
  • Better for planning (3D geometric reasoning)
  • Avoids the photorealism challenge entirely by working in occupancy space

Vista (MIT/MBZUAI, 2024) targets high-fidelity, long-horizon simulation:

  • Extended horizon generation (minutes, not seconds)
  • Multi-view consistency through explicit camera models
  • Demonstrated closed-loop driving with real AD stacks

ADriver-I (CASIA, 2023) uses an interleaved token approach:

  • Vision tokens and action tokens are interleaved in a single sequence
  • Trained with next-token prediction (like an LLM)
  • Demonstrated that LLM-style training works for driving world models

Three-Tier Taxonomy

Recent survey papers (Wang et al. 2024, Gao et al. 2024) organize AD world models into three tiers based on their level of integration with planning:

THREE-TIER TAXONOMY OF AD WORLD MODELS

  ┌─────────────────────────────────────────────────────────────┐
  │  TIER 3: Interactive Prediction + Planning                   │
  │                                                             │
  │  World model directly participates in decision-making.       │
  │  The planner uses the world model to simulate consequences   │
  │  of actions (model-based RL / model-predictive control).     │
  │                                                             │
  │  Examples: MILE, Think2Drive, DriveWM                        │
  │  ┌──────────────────────────────────────────────────┐       │
  │  │  Plan ──> Simulate ──> Evaluate ──> Replan       │       │
  │  │   ^                                    │         │       │
  │  │   └────────────────────────────────────┘         │       │
  │  └──────────────────────────────────────────────────┘       │
  ├─────────────────────────────────────────────────────────────┤
  │  TIER 2: Behavior Planning for Intelligent Agents            │
  │                                                             │
  │  World model predicts agent behaviors and trajectories.      │
  │  Used for forecasting what other agents will do.             │
  │                                                             │
  │  Examples: TrafficBots, CtRL-Sim, BehaviorGPT               │
  │  ┌──────────────────────────────────────────────────┐       │
  │  │  Current State ──> World Model ──> Agent Futures  │       │
  │  │                                    (trajectories) │       │
  │  └──────────────────────────────────────────────────┘       │
  ├─────────────────────────────────────────────────────────────┤
  │  TIER 1: Generation of Future Physical World                 │
  │                                                             │
  │  World model generates future sensor observations            │
  │  (images, point clouds, BEV maps, occupancy grids).          │
  │  Used primarily for data generation and simulation.          │
  │                                                             │
  │  Examples: GAIA-1, DriveDreamer, GenAD, Vista, Waymo WM     │
  │  ┌──────────────────────────────────────────────────┐       │
  │  │  Current Sensors ──> World Model ──> Future       │       │
  │  │  + Action                            Sensors      │       │
  │  └──────────────────────────────────────────────────┘       │
  └─────────────────────────────────────────────────────────────┘

Tier 1: Generation of Future Physical World

This tier focuses on generating realistic sensor observations. The output types include:

Output Type | Representation           | Models                      | Use Case
----------- | ------------------------ | --------------------------- | ------------------------
RGB Images  | Multi-view camera images | GAIA-1, DriveDreamer, Vista | Perception testing
BEV Maps    | Top-down semantic grids  | BEVWorld, BEVGen            | Planning evaluation
Occupancy   | 3D voxel grids           | OccWorld, OccSora           | 3D reasoning
Point Cloud | 3D point sets            | LiDARGen, RangeLDM          | LiDAR perception testing
Multi-modal | Camera + LiDAR + BEV     | Waymo WM                    | Full-stack testing

Quality metrics for Tier 1:

  • FID (Frechet Inception Distance) for image quality
  • FVD (Frechet Video Distance) for video quality
  • LPIPS (Learned Perceptual Image Patch Similarity) for perceptual similarity
  • Chamfer Distance for point cloud accuracy
  • mIoU for semantic/occupancy correctness

Tier 2: Behavior Planning for Intelligent Agents

This tier predicts how other agents (vehicles, pedestrians, cyclists) will behave:

Tier 2 World Model for Agent Behavior:

  Scene Context                Agent Behavior Prediction
  ┌───────────────┐           ┌────────────────────────┐
  │  Agent states  │           │  Vehicle A: [traj_1,   │
  │  Road topology │──>WM──>  │              traj_2,   │
  │  Traffic rules │           │              traj_3]   │
  │  Ego plan      │           │  Ped B:     [traj_1]   │
  └───────────────┘           │  Cyclist C: [traj_1,   │
                              │              traj_2]   │
                              └────────────────────────┘
                              (multiple possible futures
                               per agent, with probabilities)

Models like BehaviorGPT and CtRL-Sim fall in this category. They do not generate images but predict the future trajectories of all agents in the scene.

Tier 3: Interactive Prediction and Planning

This tier integrates the world model directly into the planning loop:

Tier 3: World Model in the Planning Loop

  ┌──────────────────────────────────────────────────────────┐
  │                                                          │
  │   1. Propose candidate action:  a_candidate              │
  │                      │                                   │
  │                      v                                   │
  │   2. Imagine future: s_{t+1} = WM(s_t, a_candidate)     │
  │                      │                                   │
  │                      v                                   │
  │   3. Evaluate:       reward = R(s_{t+1})                 │
  │                      │                                   │
  │                      v                                   │
  │   4. Repeat for multiple candidates                      │
  │                      │                                   │
  │                      v                                   │
  │   5. Execute best:   a* = argmax_a R(WM(s_t, a))        │
  │                                                          │
  └──────────────────────────────────────────────────────────┘

This is essentially model-based reinforcement learning applied to driving. The world model serves as a learned dynamics model that the planner uses to simulate consequences of actions before committing to one.
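
A hedged sketch of this loop as simple random-shooting MPC; `world_model` and `reward_fn` are stand-ins for the learned dynamics and a scenario-scoring function:

import torch


@torch.no_grad()
def plan_with_world_model(
    world_model,       # (state, action) -> next state: the learned dynamics
    reward_fn,         # state -> scalar (progress, comfort, collision penalty, ...)
    state,             # current world state s_t
    candidate_plans,   # list of action sequences, each [horizon, action_dim]
):
    """Score each candidate plan by imagined rollout; return the best first action."""
    best_plan, best_return = None, -float("inf")
    for plan in candidate_plans:
        s, total = state, 0.0
        for action in plan:               # imagine the future under this plan
            s = world_model(s, action)
            total += reward_fn(s)
        if total > best_return:
            best_return, best_plan = total, plan
    return best_plan[0]                   # execute only the first action (MPC style)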

Examples:

  • MILE: Uses a world model for end-to-end driving in CARLA
  • Think2Drive: Plans by imagining future scenarios
  • DriveWM: Integrates world model prediction with planning

Technical Deep Dive

Video Diffusion Models for Driving Scenes

Most state-of-the-art driving world models use some variant of video diffusion. The core architecture typically involves:

Video Diffusion Architecture for Driving:

  ┌────────────────────────────────────────────────────────────────┐
  │                      LATENT SPACE                              │
  │                                                                │
  │  Input frames ──> VAE Encoder ──> Latent z [T, h, w, c]       │
  │                                                                │
  │  ┌──────────────────────────────────────────────────────────┐  │
  │  │            Denoising Network (U-Net or DiT)              │  │
  │  │                                                          │  │
  │  │  For each denoising step t = T, T-1, ..., 1, 0:         │  │
  │  │                                                          │  │
  │  │  ┌──────────────────────────────────────────────────┐    │  │
  │  │  │  Spatial Self-Attention                          │    │  │
  │  │  │  (within each frame)                             │    │  │
  │  │  ├──────────────────────────────────────────────────┤    │  │
  │  │  │  Temporal Self-Attention                         │    │  │
  │  │  │  (across frames at same spatial location)        │    │  │
  │  │  ├──────────────────────────────────────────────────┤    │  │
  │  │  │  Cross-Attention with Conditioning               │    │  │
  │  │  │  (action, text, map, layout)                     │    │  │
  │  │  ├──────────────────────────────────────────────────┤    │  │
  │  │  │  Feedforward + ResNet blocks                     │    │  │
  │  │  └──────────────────────────────────────────────────┘    │  │
  │  │                                                          │  │
  │  └──────────────────────────────────────────────────────────┘  │
  │                                                                │
  │  Denoised latent ──> VAE Decoder ──> Output frames             │
  └────────────────────────────────────────────────────────────────┘

Key design patterns across driving diffusion models:

  1. Latent diffusion: Operate in the latent space of a pretrained VAE, not in pixel space. This dramatically reduces compute (from 512x1024x3 to 64x128x4).

  2. Factored attention: Separating spatial and temporal attention reduces the quadratic cost: instead of attending jointly over all (T*H*W) tokens, spatial attention runs over (H*W) tokens per frame and temporal attention over (T) tokens per spatial position (see the cost sketch after this list).

  3. Multi-view extension: For multi-camera setups (e.g., 6 cameras on Waymo/nuScenes), add a cross-view attention layer between cameras sharing the same timestep.

  4. Progressive generation: Generate keyframes first, then interpolate intermediate frames. This improves long-range temporal consistency.
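
A quick back-of-the-envelope count shows why factored attention (pattern 2) matters; the latent sizes below are hypothetical but typical:

# Pairwise token interactions for T frames of h x w latents (hypothetical sizes).
T, h, w = 8, 64, 128
full_spacetime = (T * h * w) ** 2                      # joint attention over all tokens
factored = T * (h * w) ** 2 + (h * w) * T ** 2         # spatial per frame + temporal per position
print(f"reduction: {full_spacetime / factored:.1f}x")  # ~8x fewer interactions here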

Conditioning Mechanisms

Driving world models support diverse conditioning signals. Here is how each type is typically integrated:

CONDITIONING MECHANISMS IN DRIVING WORLD MODELS

  ┌─────────────────────────────────────────────────────────────┐
  │                                                             │
  │  ACTION CONDITIONING                                        │
  │  ┌─────────┐    ┌──────────┐    ┌─────────────────┐        │
  │  │ Steering│───>│  MLP     │───>│ Cross-Attention  │        │
  │  │ Accel   │    │  Encoder │    │ or FiLM layers   │        │
  │  └─────────┘    └──────────┘    └─────────────────┘        │
  │                                                             │
  │  LANGUAGE CONDITIONING                                      │
  │  ┌─────────┐    ┌──────────┐    ┌─────────────────┐        │
  │  │ "A car  │───>│  CLIP /  │───>│ Cross-Attention  │        │
  │  │  cuts   │    │  T5      │    │ (like Stable     │        │
  │  │  in"    │    │  Encoder │    │  Diffusion)      │        │
  │  └─────────┘    └──────────┘    └─────────────────┘        │
  │                                                             │
  │  LAYOUT / MAP CONDITIONING                                  │
  │  ┌─────────┐    ┌──────────┐    ┌─────────────────┐        │
  │  │ HD Map  │───>│  Spatial │───>│ Addition to      │        │
  │  │ BBoxes  │    │  Encoder │    │ input or         │        │
  │  │ Lanes   │    │  (Conv)  │    │ ControlNet       │        │
  │  └─────────┘    └──────────┘    └─────────────────┘        │
  │                                                             │
  │  PAST FRAMES CONDITIONING                                   │
  │  ┌─────────┐    ┌──────────┐    ┌─────────────────┐        │
  │  │ Frame   │───>│  VAE     │───>│ Concatenation    │        │
  │  │  t-k to │    │  Encoder │    │ to noisy input   │        │
  │  │  t      │    │          │    │ (channel-wise)   │        │
  │  └─────────┘    └──────────┘    └─────────────────┘        │
  │                                                             │
  └─────────────────────────────────────────────────────────────┘

Classifier-free guidance (CFG) is widely used for controllability:

During training:
  - Randomly drop conditioning with probability p_drop (e.g., 10%)
  - Model learns both conditional and unconditional generation

During inference:
  - epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
  - w > 1 amplifies the conditioning signal
  - Higher w = more controllable but less diverse
  - Lower w = more diverse but less responsive to conditioning
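
One guided denoising step, as a sketch under the epsilon-prediction convention (`null_cond` stands for the learned embedding used when conditioning was dropped during training):

import torch


def cfg_epsilon(denoiser, x_t: torch.Tensor, t: torch.Tensor, cond, null_cond, w: float):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    eps_cond = denoiser(x_t, t, cond)          # conditional noise prediction
    eps_uncond = denoiser(x_t, t, null_cond)   # unconditional noise prediction
    return eps_uncond + w * (eps_cond - eps_uncond)   # w = 1 recovers plain conditional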

Multi-Sensor Generation

Generating consistent camera and LiDAR data simultaneously is one of the hardest challenges. Here are the main approaches:

Approach 1: Independent Generation with Consistency Loss

Camera Branch:  z_t ──> Camera Diffusion ──> Images
                                                │
                                                ├──> Consistency
                                                │    Loss
                                                │
LiDAR Branch:   z_t ──> LiDAR Diffusion  ──> Points

Simple but weak -- consistency is only encouraged, not guaranteed.

Approach 2: Shared Backbone with Modality-Specific Heads

                    z_t
                     │
                     v
          ┌──────────────────┐
          │  Shared Backbone  │
          │  (Transformer)    │
          └────┬─────────┬───┘
               │         │
               v         v
          ┌────────┐ ┌────────┐
          │Camera  │ │LiDAR   │
          │Head    │ │Head    │
          └────────┘ └────────┘

Better consistency because features are shared. Used by several recent models.

Approach 3: Generate 3D First, Then Render

          z_t ──> 3D World Generator ──> 3D Scene Representation
                                              │
                                    ┌─────────┼─────────┐
                                    │         │         │
                                    v         v         v
                               Camera    LiDAR     BEV
                               Render    Render    Render

Best consistency but most expensive. The 3D representation can be a neural radiance field, point cloud, voxel grid, or tri-plane feature.

Evaluation Metrics for Generated Driving Scenarios

Evaluating world models is notoriously difficult. Here is a taxonomy of metrics:

Image/Video Quality Metrics:

Metric | What It Measures              | Formula/Method
------ | ----------------------------- | ---------------------------------------------------------------
FID    | Distribution similarity       | Frechet distance between real and generated Inception features
FVD    | Video distribution similarity | FID extended to video (I3D features)
LPIPS  | Perceptual similarity         | Distance in VGG feature space
PSNR   | Pixel-level accuracy          | 10 * log10(MAX^2 / MSE)
SSIM   | Structural similarity         | Luminance, contrast, structure comparison
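
As an example of the first row, FID has a closed form once features are extracted: the Frechet distance between two Gaussians fitted to the feature sets. A sketch (in practice the [N, D] features come from a pretrained Inception network):

import numpy as np
from scipy import linalg


def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}); inputs are [N, D]."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):   # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))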

Driving-Specific Metrics:

Metric                      | What It Measures
--------------------------- | --------------------------------------------------------------------------
Scene Consistency           | Do objects maintain identity across frames?
Physical Plausibility       | Do vehicles obey physics (no teleportation, clipping)?
Action Fidelity             | Does the world respond correctly to ego actions?
Downstream Task Performance | Does a perception model trained on generated data work in the real world?
Collision Rate              | Are generated scenarios physically valid (no interpenetration)?
Scenario Diversity          | How varied are the generated scenarios?

The downstream evaluation paradigm:

Real Driving Data ──> Train World Model ──> Generate Synthetic Data
                                                    │
                                                    v
                                           Train Perception Model
                                           on Synthetic Data
                                                    │
                                                    v
                                           Evaluate on Real
                                           Test Set
                                                    │
                                                    v
                                           Performance Delta =
                                           proxy for generation quality

If perception models trained on generated data perform well on real test sets, the world model must be generating realistic, diverse, and useful data.

Controllability vs Diversity Trade-off

A fundamental tension exists in conditional generation:

Controllability-Diversity Trade-off:

  High Control ────────────────────────────── Low Control
  (exact spec)                                (free generation)
       │                                           │
       v                                           v
  Low Diversity                              High Diversity
  (one output per                            (many varied
   condition)                                 outputs)

  ┌──────────────────────────────────────────────────┐
  │                                                  │
  │   Safety Testing ──> Need HIGH control           │
  │   "Generate exactly this scenario"               │
  │                                                  │
  │   Training Data ──> Need HIGH diversity           │
  │   "Generate varied scenarios"                     │
  │                                                  │
  │   Coverage Testing ──> Need BOTH                  │
  │   "Generate diverse variants of this scenario"    │
  │                                                  │
  └──────────────────────────────────────────────────┘

  Knobs for controlling this trade-off:
  - CFG weight (w):     Higher = more control, less diversity
  - Temperature:        Lower = more deterministic, less diverse
  - Conditioning level: More detailed condition = more control
  - Sampling method:    DDIM (deterministic) vs DDPM (stochastic)
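
The last knob is worth unpacking: a single DDIM update interpolates between the two regimes, with eta = 0 fully deterministic and eta = 1 recovering DDPM-like stochasticity. A sketch using the standard DDIM formulation (variable names are ours; the alpha_bar values are passed in as tensors):

import torch


def ddim_step(x_t, eps_pred, ab_t, ab_prev, eta: float = 0.0):
    """One DDIM update from timestep t to the previous one (alpha_bar values given)."""
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()   # implied clean sample
    sigma = eta * ((1 - ab_prev) / (1 - ab_t)).sqrt() * (1 - ab_t / ab_prev).sqrt()
    dir_xt = (1 - ab_prev - sigma**2).sqrt() * eps_pred            # direction toward x_{t-1}
    noise = sigma * torch.randn_like(x_t) if eta > 0 else 0.0      # eta = 0: no fresh noise
    return ab_prev.sqrt() * x0_pred + dir_xt + noise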

World Models vs Reconstruction-Based Simulation

When to Use Each Approach

DECISION TREE: RECONSTRUCTION vs WORLD MODEL

  What is your primary use case?
       │
       ├─── Regression testing on specific real routes?
       │         │
       │         └──> RECONSTRUCTION (3DGS/NeRF)
       │              - You need exact scene fidelity
       │              - You want to re-run real scenarios with perturbations
       │              - Applied Intuition / Waabi UniSim approach
       │
       ├─── Generating novel/rare scenarios for safety?
       │         │
       │         └──> WORLD MODEL (generative)
       │              - You need scenarios not in your driving logs
       │              - Language-controllable scenario creation
       │              - Waymo World Model approach
       │
       ├─── Training data augmentation?
       │         │
       │         └──> WORLD MODEL (generative)
       │              - You need diverse, varied training samples
       │              - Data scaling beyond what you have captured
       │
       ├─── Sensor simulation for perception testing?
       │         │
       │         └──> RECONSTRUCTION (3DGS/NeRF) for accuracy
       │              WORLD MODEL for coverage
       │              BOTH is the ideal
       │
       └─── End-to-end planning in imagination?
                 │
                 └──> WORLD MODEL (Tier 3)
                      - Model-based RL
                      - Planning by imagining futures

Complementary Strengths

Rather than viewing these as competing approaches, the industry is converging on using both:

HYBRID SIMULATION PIPELINE (Emerging Best Practice):

  ┌──────────────────────────────────────────────────────────┐
  │                                                          │
  │  RECONSTRUCTION LAYER                                    │
  │  ┌────────────────────────────────────────────────┐      │
  │  │  3DGS/NeRF scene reconstructions               │      │
  │  │  - Faithful replay of real driving logs         │      │
  │  │  - High geometric accuracy                      │      │
  │  │  - Novel view synthesis for sensor sim          │      │
  │  └────────────────────────────────────────────────┘      │
  │                         │                                │
  │                         v                                │
  │  WORLD MODEL LAYER                                       │
  │  ┌────────────────────────────────────────────────┐      │
  │  │  Generative world model                         │      │
  │  │  - Adds new agents to reconstructed scenes      │      │
  │  │  - Generates counterfactual scenarios            │      │
  │  │  - Creates weather/lighting variations           │      │
  │  │  - Simulates sensor degradation                  │      │
  │  └────────────────────────────────────────────────┘      │
  │                         │                                │
  │                         v                                │
  │  BEHAVIOR MODEL LAYER                                    │
  │  ┌────────────────────────────────────────────────┐      │
  │  │  Learned agent behavior models                   │      │
  │  │  - Realistic multi-agent interactions            │      │
  │  │  - Reactive to ego vehicle actions               │      │
  │  │  - Diverse behavioral modes                      │      │
  │  └────────────────────────────────────────────────┘      │
  │                                                          │
  └──────────────────────────────────────────────────────────┘

Applied Intuition vs Waymo Comparison

These two companies represent different philosophies for AD simulation:

Dimension         | Applied Intuition                          | Waymo (World Model)
----------------- | ------------------------------------------ | --------------------------------
Core Philosophy   | Reconstruction + editing                   | Generation from learned prior
Scene Source      | Real captured scenes                       | Generated or real + generated
Primary Method    | 3DGS/NeRF reconstruction, asset libraries  | Genie 3-based world model
Scenario Creation | Manual + algorithmic perturbation          | Language-controlled generation
Sensor Sim        | Physics-based + neural rendering           | Learned generation
Agent Behavior    | Rule-based + learned models                | Emergent from world model
Strengths         | Geometric accuracy, determinism            | Novel scenarios, scalability
Limitations       | Bound to captured data, labor-intensive    | May hallucinate, less precise
Target Users      | OEMs needing validation tools              | Internal Waymo AD development
Business Model    | SaaS platform for multiple customers       | Internal tool + research

Future Convergence

The boundary between reconstruction and generation is blurring:

  1. Reconstruction-guided generation: Use 3DGS scenes as conditioning for world models, getting geometric accuracy AND generative flexibility.

  2. Generative inpainting on reconstructions: Reconstruct the static scene with 3DGS, then use a world model to generate dynamic objects and their behaviors.

  3. Foundation model approach: Train a single large model that can do both reconstruction (when given dense input views) and generation (when given sparse conditioning). This is analogous to how large language models can both complete existing text and generate new text.

  4. Learned physics in latent space: Rather than separate physics engines and neural renderers, learn physics implicitly in the latent space of the world model while maintaining hard constraints (e.g., conservation laws) through architectural design.


Code Examples

Example 1: Simple Action-Conditioned Video Prediction

This demonstrates the core concept of a world model: given a current frame and an action, predict the next frame.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleActionConditionedPredictor(nn.Module):
    """
    Minimal world model: predicts next frame given current frame + action.

    This is a simplified version to illustrate the concept.
    Real models use diffusion, transformers, and operate in latent space.
    """

    def __init__(
        self,
        img_channels: int = 3,
        action_dim: int = 2,       # (steering, acceleration)
        hidden_dim: int = 256,
        latent_dim: int = 64,
    ):
        super().__init__()

        # Encode current frame to latent representation
        self.encoder = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1),   # /2
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),            # /4
            nn.ReLU(),
            nn.Conv2d(128, hidden_dim, 4, stride=2, padding=1),    # /8
            nn.ReLU(),
        )

        # Action embedding via FiLM conditioning
        # FiLM: Feature-wise Linear Modulation
        # h_new = gamma(action) * h + beta(action)
        self.action_to_gamma = nn.Sequential(
            nn.Linear(action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.action_to_beta = nn.Sequential(
            nn.Linear(action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

        # Predict next latent state
        self.dynamics = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
        )

        # Decode back to image space
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, 128, 4, stride=2, padding=1),  # *2
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),          # *4
            nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1), # *8
            nn.Sigmoid(),  # Pixel values in [0, 1]
        )

    def forward(
        self,
        current_frame: torch.Tensor,   # [B, C, H, W]
        action: torch.Tensor,          # [B, action_dim]
    ) -> torch.Tensor:                 # [B, C, H, W] predicted next frame
        # 1. Encode current frame
        z = self.encoder(current_frame)  # [B, hidden_dim, h, w]

        # 2. Condition on action via FiLM
        gamma = self.action_to_gamma(action)  # [B, hidden_dim]
        beta = self.action_to_beta(action)    # [B, hidden_dim]

        # Reshape for broadcasting: [B, hidden_dim, 1, 1]
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)

        z_conditioned = gamma * z + beta  # FiLM modulation

        # 3. Predict dynamics (next latent state)
        z_next = self.dynamics(z_conditioned)

        # 4. Decode to next frame
        next_frame = self.decoder(z_next)

        return next_frame


# --- Training Loop ---

def train_simple_world_model():
    """Train the simple world model on driving data."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = SimpleActionConditionedPredictor().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Simulated driving data (in practice, load from nuScenes/Waymo)
    batch_size = 16
    H, W = 128, 256  # Typical driving aspect ratio

    for step in range(10000):
        # Simulate a batch of (current_frame, action, next_frame) triplets
        current_frames = torch.randn(batch_size, 3, H, W, device=device).sigmoid()
        actions = torch.randn(batch_size, 2, device=device)  # (steer, accel)
        target_frames = torch.randn(batch_size, 3, H, W, device=device).sigmoid()

        # Forward pass
        predicted_frames = model(current_frames, actions)

        # Loss: pixel-wise MSE (real models add perceptual + adversarial losses)
        loss = F.mse_loss(predicted_frames, target_frames)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % 1000 == 0:
            print(f"Step {step}, Loss: {loss.item():.4f}")


# --- Inference: Roll out a trajectory ---

@torch.no_grad()
def rollout_world_model(
    model: SimpleActionConditionedPredictor,
    initial_frame: torch.Tensor,    # [1, 3, H, W]
    action_sequence: torch.Tensor,  # [T, 2]
) -> list[torch.Tensor]:
    """
    Generate a sequence of future frames by autoregressively
    applying the world model.
    """
    frames = [initial_frame]
    current = initial_frame

    for t in range(len(action_sequence)):
        action = action_sequence[t].unsqueeze(0)  # [1, 2]
        next_frame = model(current, action)
        frames.append(next_frame)
        current = next_frame  # Autoregressive: feed prediction back

    return frames  # List of [1, 3, H, W] tensors

Example 2: Diffusion Model for Driving Scene Generation

This example shows a simplified video diffusion model with action and layout conditioning, following the architecture used by DriveDreamer and similar models.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass


@dataclass
class DiffusionConfig:
    """Configuration for the driving diffusion model."""
    num_frames: int = 8           # Number of frames to generate
    height: int = 64              # Latent height (image_h / 8)
    width: int = 128              # Latent width (image_w / 8)
    latent_channels: int = 4     # VAE latent channels
    model_channels: int = 256    # Base model channel dimension
    num_heads: int = 8           # Attention heads
    action_dim: int = 3          # (steering, acceleration, speed)
    text_dim: int = 512          # Text embedding dimension
    num_timesteps: int = 1000    # Diffusion timesteps
    num_res_blocks: int = 2      # ResNet blocks per level


class SinusoidalTimestepEmbedding(nn.Module):
    """Standard sinusoidal embedding for diffusion timestep."""

    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        return torch.cat([emb.sin(), emb.cos()], dim=-1)


class SpatioTemporalAttention(nn.Module):
    """
    Factored attention: spatial within each frame, temporal across frames.
    This is the core building block of video diffusion models.
    """

    def __init__(self, channels: int, num_heads: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames

        # Spatial self-attention (within each frame)
        self.spatial_norm = nn.GroupNorm(32, channels)
        self.spatial_attn = nn.MultiheadAttention(
            channels, num_heads, batch_first=True
        )

        # Temporal self-attention (across frames at same position)
        self.temporal_norm = nn.GroupNorm(32, channels)
        self.temporal_attn = nn.MultiheadAttention(
            channels, num_heads, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [B*T, C, H, W] where T = num_frames
        Returns:
            [B*T, C, H, W]
        """
        BT, C, H, W = x.shape
        B = BT // self.num_frames
        T = self.num_frames

        # --- Spatial attention (within each frame) ---
        h = self.spatial_norm(x)
        h = h.reshape(BT, C, H * W).permute(0, 2, 1)  # [BT, HW, C]
        h_attn, _ = self.spatial_attn(h, h, h)
        h = h_attn.permute(0, 2, 1).reshape(BT, C, H, W)
        x = x + h

        # --- Temporal attention (across frames) ---
        h = self.temporal_norm(x)
        # Reshape to [B*H*W, T, C] for temporal attention
        h = h.reshape(B, T, C, H, W)
        h = h.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        h_attn, _ = self.temporal_attn(h, h, h)
        h = h_attn.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)
        h = h.reshape(BT, C, H, W)
        x = x + h

        return x


class CrossAttentionConditioning(nn.Module):
    """Cross-attention for text/action conditioning."""

    def __init__(self, channels: int, context_dim: int, num_heads: int):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(context_dim, channels)
        self.to_v = nn.Linear(context_dim, channels)
        self.out_proj = nn.Linear(channels, channels)
        self.num_heads = num_heads

    def forward(
        self,
        x: torch.Tensor,       # [B, C, H, W]
        context: torch.Tensor,  # [B, seq_len, context_dim]
    ) -> torch.Tensor:
        B, C, H, W = x.shape
        h = self.norm(x)
        h = h.reshape(B, C, H * W).permute(0, 2, 1)  # [B, HW, C]

        q = self.to_q(h)
        k = self.to_k(context)
        v = self.to_v(context)

        # Simplified single-head scaled dot-product attention
        # (num_heads only sets the scale here; a full implementation
        # would split q/k/v into separate heads)
        head_dim = C // self.num_heads
        scale = head_dim ** -0.5
        attn = torch.bmm(q * scale, k.transpose(-2, -1))
        attn = attn.softmax(dim=-1)
        out = torch.bmm(attn, v)

        out = self.out_proj(out)
        out = out.permute(0, 2, 1).reshape(B, C, H, W)
        return x + out


class DrivingDiffusionModel(nn.Module):
    """
    Simplified video diffusion model for driving scene generation.

    Generates a sequence of latent frames conditioned on:
    - Past frames (context)
    - Ego vehicle actions (steering, acceleration)
    - Text description (optional)
    - Diffusion timestep
    """

    def __init__(self, config: DiffusionConfig):
        super().__init__()
        self.config = config
        C = config.model_channels

        # Timestep embedding
        self.time_embed = nn.Sequential(
            SinusoidalTimestepEmbedding(C),
            nn.Linear(C, C * 4),
            nn.SiLU(),
            nn.Linear(C * 4, C),
        )

        # Action embedding, projected to text_dim so that action and text
        # conditioning share a single cross-attention context space
        self.action_embed = nn.Sequential(
            nn.Linear(config.action_dim * config.num_frames, C),
            nn.SiLU(),
            nn.Linear(C, config.text_dim),
        )

        # Input projection (latent channels -> model channels)
        self.input_proj = nn.Conv2d(config.latent_channels, C, 3, padding=1)

        # Core blocks: ResNet + SpatioTemporal Attention + Cross-Attention
        self.blocks = nn.ModuleList()
        for _ in range(config.num_res_blocks):
            self.blocks.append(nn.ModuleDict({
                "resnet": nn.Sequential(
                    nn.GroupNorm(32, C),
                    nn.SiLU(),
                    nn.Conv2d(C, C, 3, padding=1),
                    nn.GroupNorm(32, C),
                    nn.SiLU(),
                    nn.Conv2d(C, C, 3, padding=1),
                ),
                "st_attn": SpatioTemporalAttention(C, config.num_heads, config.num_frames),
                "cross_attn": CrossAttentionConditioning(C, config.text_dim, config.num_heads),
            }))

        # Output projection (model channels -> latent channels)
        self.output_proj = nn.Sequential(
            nn.GroupNorm(32, C),
            nn.SiLU(),
            nn.Conv2d(C, config.latent_channels, 3, padding=1),
        )

    def forward(
        self,
        noisy_latents: torch.Tensor,   # [B, T, latent_c, H, W]
        timestep: torch.Tensor,        # [B]
        actions: torch.Tensor,         # [B, T, action_dim]
        text_embeddings: torch.Tensor | None = None,  # [B, seq_len, text_dim]
    ) -> torch.Tensor:
        """Predict the noise (or clean sample) for the given noisy input."""
        B, T, LC, H, W = noisy_latents.shape

        # Flatten batch and time for spatial operations
        x = noisy_latents.reshape(B * T, LC, H, W)
        x = self.input_proj(x)  # [B*T, C, H, W]

        # Embed timestep and add to features
        t_emb = self.time_embed(timestep)  # [B, C]
        t_emb = t_emb.unsqueeze(1).repeat(1, T, 1).reshape(B * T, -1)
        x = x + t_emb[:, :, None, None]

        # Embed actions into the cross-attention context space
        a_emb = self.action_embed(actions.reshape(B, -1))  # [B, text_dim]

        # If no text, use the action embedding alone as context
        if text_embeddings is None:
            context = a_emb.unsqueeze(1)  # [B, 1, text_dim]
        else:
            # Concatenate action embedding with text embeddings
            # (both live in text_dim after the projection above)
            context = torch.cat([
                a_emb.unsqueeze(1),
                text_embeddings
            ], dim=1)  # [B, 1 + seq_len, text_dim]

        # Repeat context for each frame
        context_expanded = context.unsqueeze(1).repeat(1, T, 1, 1)
        context_expanded = context_expanded.reshape(B * T, -1, context.shape[-1])

        # Apply blocks
        for block in self.blocks:
            # ResNet block
            h = block["resnet"](x)
            x = x + h

            # Spatiotemporal attention
            x = block["st_attn"](x)

            # Cross-attention with conditioning
            x = block["cross_attn"](x, context_expanded)

        # Output
        noise_pred = self.output_proj(x)  # [B*T, latent_c, H, W]
        noise_pred = noise_pred.reshape(B, T, LC, H, W)

        return noise_pred


# --- DDPM Sampling ---

class DDPMSampler:
    """Simple DDPM sampler for the driving diffusion model."""

    def __init__(self, num_timesteps: int = 1000):
        self.num_timesteps = num_timesteps
        # Linear beta schedule
        self.betas = torch.linspace(1e-4, 0.02, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)

    @torch.no_grad()
    def sample(
        self,
        model: DrivingDiffusionModel,
        actions: torch.Tensor,
        text_embeddings: torch.Tensor | None = None,
        cfg_scale: float = 7.5,
    ) -> torch.Tensor:
        """
        Generate driving video latents using DDPM sampling.

        Args:
            model: The diffusion model
            actions: [B, T, action_dim] ego actions
            text_embeddings: Optional text conditioning
            cfg_scale: Classifier-free guidance scale

        Returns:
            [B, T, latent_c, H, W] generated latents
        """
        device = actions.device
        B = actions.shape[0]
        config = model.config

        # Start from pure noise
        shape = (B, config.num_frames, config.latent_channels,
                 config.height, config.width)
        x = torch.randn(shape, device=device)

        alpha_cumprod = self.alpha_cumprod.to(device)

        for t in reversed(range(self.num_timesteps)):
            t_batch = torch.full((B,), t, device=device, dtype=torch.long)

            # Predict noise (with classifier-free guidance)
            noise_pred_cond = model(x, t_batch, actions, text_embeddings)

            if cfg_scale > 1.0 and text_embeddings is not None:
                noise_pred_uncond = model(x, t_batch, actions, None)
                noise_pred = noise_pred_uncond + cfg_scale * (
                    noise_pred_cond - noise_pred_uncond
                )
            else:
                noise_pred = noise_pred_cond

            # DDPM update step
            alpha_t = self.alphas[t]
            alpha_bar_t = alpha_cumprod[t]
            alpha_bar_prev = alpha_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

            # Predicted x_0
            x0_pred = (x - (1 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
            x0_pred = x0_pred.clamp(-1, 1)

            # Posterior mean
            coeff1 = (alpha_bar_prev.sqrt() * self.betas[t]) / (1 - alpha_bar_t)
            coeff2 = (alpha_t.sqrt() * (1 - alpha_bar_prev)) / (1 - alpha_bar_t)
            mean = coeff1 * x0_pred + coeff2 * x

            # Add noise (except at t=0)
            if t > 0:
                noise = torch.randn_like(x)
                sigma = ((1 - alpha_bar_prev) / (1 - alpha_bar_t) * self.betas[t]).sqrt()
                x = mean + sigma * noise
            else:
                x = mean

        return x


# --- Usage Example ---

def generate_driving_scenario():
    """Example: generate a driving scenario with the diffusion model."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    config = DiffusionConfig()
    model = DrivingDiffusionModel(config).to(device)
    sampler = DDPMSampler(config.num_timesteps)

    # Define ego actions: gentle left turn with constant speed
    B = 1
    actions = torch.zeros(B, config.num_frames, config.action_dim, device=device)
    actions[:, :, 0] = 0.3   # Steering: slight left
    actions[:, :, 1] = 0.0   # Acceleration: constant
    actions[:, :, 2] = 0.5   # Speed: moderate

    # Generate (in practice, decode latents with a VAE decoder)
    latents = sampler.sample(model, actions, cfg_scale=1.0)
    print(f"Generated latents shape: {latents.shape}")
    # -> [1, 8, 4, 64, 128] = 8 frames of 64x128 latent maps

    return latents

Example 3: Evaluation of Generated Scenarios

This example sketches the core metrics for evaluating generated scenarios: distribution-level realism (FID/FVD), action fidelity, physical plausibility, and temporal consistency.

import torch
import torch.nn as nn
import numpy as np
from typing import NamedTuple
from scipy import linalg


class ScenarioMetrics(NamedTuple):
    """Metrics for evaluating a generated driving scenario."""
    fid: float                   # Frechet Inception Distance
    fvd: float                   # Frechet Video Distance
    action_fidelity: float       # Does world respond to actions correctly?
    physical_plausibility: float # Are physics constraints respected?
    temporal_consistency: float  # Are objects consistent across frames?


def compute_fid(
    real_features: np.ndarray,   # [N, D] Inception features of real images
    gen_features: np.ndarray,    # [M, D] Inception features of generated images
) -> float:
    """
    Compute Frechet Inception Distance between real and generated distributions.

    FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r @ Sigma_g)^0.5)

    Lower is better. FID = 0 means identical distributions.
    """
    mu_r = real_features.mean(axis=0)
    mu_g = gen_features.mean(axis=0)
    sigma_r = np.cov(real_features, rowvar=False)
    sigma_g = np.cov(gen_features, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)

    # Numerical stability
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    fid = diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean)
    return float(fid)
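
A quick sanity check with synthetic features (not real Inception activations): two samples from the same Gaussian should score near zero, while shifting the mean by 0.5 in each of 64 dimensions should add roughly 64 * 0.5^2 = 16 through the mean term.

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(5000, 64))
same = rng.normal(0.0, 1.0, size=(5000, 64))
shifted = rng.normal(0.5, 1.0, size=(5000, 64))
print(compute_fid(real, same))     # ~0 (sampling noise only)
print(compute_fid(real, shifted))  # ~16, dominated by ||mu_r - mu_g||^2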


def compute_action_fidelity(
    actions: torch.Tensor,         # [B, T, action_dim]
    generated_frames: torch.Tensor, # [B, T, C, H, W]
    optical_flow_model: nn.Module,  # Pretrained optical flow estimator
) -> float:
    """
    Measure whether generated videos respond correctly to actions.

    Intuition: If we command a left turn, the optical flow in the
    generated video should show rightward motion (world moving right
    as ego turns left).

    Returns a score in [0, 1] where 1 = perfect action fidelity.
    """
    B, T, C, H, W = generated_frames.shape
    scores = []

    for t in range(T - 1):
        frame_curr = generated_frames[:, t]
        frame_next = generated_frames[:, t + 1]

        # Compute optical flow between consecutive frames
        flow = optical_flow_model(frame_curr, frame_next)  # [B, 2, H, W]

        # Expected flow direction based on action. Convention: flow x is
        # positive to the right, so a left turn (positive steering) makes
        # the static scene sweep rightward across the image.
        steering = actions[:, t, 0]  # positive = left turn
        expected_horizontal_flow = steering  # left turn -> rightward (positive) flow

        # Compute mean horizontal flow
        mean_h_flow = flow[:, 0].mean(dim=(-2, -1))  # Average horizontal flow

        # Correlation between expected and actual flow direction
        # Normalize to get direction agreement
        direction_agreement = (
            torch.sign(mean_h_flow) == torch.sign(expected_horizontal_flow)
        ).float()
        scores.append(direction_agreement.mean().item())

    return float(np.mean(scores))


def compute_physical_plausibility(
    generated_bboxes: torch.Tensor,  # [B, T, N_agents, 7] (x,y,z,l,w,h,yaw)
) -> float:
    """
    Check physical plausibility of generated scenarios.

    Checks implemented below:
    1. No interpenetration (bounding boxes don't overlap)
    2. Reasonable velocities (no teleportation)
    (A third check -- that vehicles stay on the road surface -- would
    require map data and is omitted from this sketch.)

    Returns a score in [0, 1] where 1 = fully plausible.
    """
    B, T, N, _ = generated_bboxes.shape
    violations = 0
    total_checks = 0

    for b in range(B):
        for t in range(T):
            # Check 1: No interpenetration (simplified 2D IoU check)
            positions = generated_bboxes[b, t, :, :2]  # [N, 2] (x, y)
            sizes = generated_bboxes[b, t, :, 3:5]     # [N, 2] (l, w)

            for i in range(N):
                for j in range(i + 1, N):
                    dist = torch.norm(positions[i] - positions[j])
                    min_dist = (sizes[i].norm() + sizes[j].norm()) / 2
                    if dist < min_dist * 0.5:  # Significant overlap
                        violations += 1
                    total_checks += 1

            # Check 2: Reasonable velocities (no teleportation)
            if t > 0:
                prev_positions = generated_bboxes[b, t - 1, :, :2]
                velocities = (positions - prev_positions)  # Assuming dt = 0.1s
                speeds = torch.norm(velocities, dim=-1) / 0.1  # m/s
                max_reasonable_speed = 50.0  # ~180 km/h
                violations += (speeds > max_reasonable_speed).sum().item()
                total_checks += N

    if total_checks == 0:
        return 1.0
    return 1.0 - (violations / total_checks)


def compute_temporal_consistency(
    generated_frames: torch.Tensor,  # [B, T, C, H, W]
) -> float:
    """
    Measure temporal consistency via inter-frame LPIPS stability.

    Consistent videos should have smooth LPIPS changes between
    consecutive frames (no sudden jumps in appearance).

    Returns a score in [0, 1] where 1 = perfectly consistent.
    """
    B, T, C, H, W = generated_frames.shape

    # Compute pairwise L2 distances between consecutive frames
    # (Simplified -- real implementation uses LPIPS)
    deltas = []
    for t in range(T - 1):
        diff = (generated_frames[:, t] - generated_frames[:, t + 1]).pow(2)
        frame_dist = diff.mean(dim=(-3, -2, -1))  # [B]
        deltas.append(frame_dist)

    deltas = torch.stack(deltas, dim=1)  # [B, T-1]

    # Consistency = low variance in inter-frame distances
    # (smooth changes, no sudden jumps)
    variance = deltas.var(dim=1).mean()

    # Normalize to [0, 1] (heuristic scaling)
    consistency = torch.exp(-10.0 * variance).item()
    return consistency


def evaluate_world_model(
    world_model: nn.Module,
    real_dataset: torch.utils.data.DataLoader,
    num_eval_samples: int = 1000,
) -> ScenarioMetrics:
    """
    Full evaluation pipeline for a driving world model.

    In practice, this would:
    1. Generate scenarios using the world model
    2. Compute FID/FVD against real driving data
    3. Measure action fidelity, physical plausibility, etc.
    """
    print("Evaluation pipeline:")
    print("  1. Generating scenarios from world model...")
    print("  2. Extracting Inception features for FID...")
    print("  3. Extracting I3D features for FVD...")
    print("  4. Computing action fidelity...")
    print("  5. Checking physical plausibility...")
    print("  6. Measuring temporal consistency...")

    # Placeholder values (real implementation would compute these)
    return ScenarioMetrics(
        fid=24.5,                    # Lower is better
        fvd=180.3,                   # Lower is better
        action_fidelity=0.87,        # Higher is better [0,1]
        physical_plausibility=0.93,  # Higher is better [0,1]
        temporal_consistency=0.81,   # Higher is better [0,1]
    )

Mental Models and Diagrams

World Model Architecture Overview

WORLD MODEL ARCHITECTURE (Generalized)

  ┌──────────────────────────────────────────────────────────────────┐
  │                         INPUTS                                   │
  │                                                                  │
  │   Past Observations          Actions           Conditions        │
  │   ┌───────────────┐    ┌──────────────┐   ┌───────────────┐     │
  │   │ Cameras (t-k  │    │ Steering     │   │ Text prompt   │     │
  │   │  to t)        │    │ Acceleration │   │ HD Map        │     │
  │   │ LiDAR (t-k    │    │ Or: waypoints│   │ Weather       │     │
  │   │  to t)        │    │ Or: language │   │ Time of day   │     │
  │   └───────┬───────┘    └──────┬───────┘   └───────┬───────┘     │
  │           │                   │                   │              │
  │           v                   v                   v              │
  │   ┌──────────────────────────────────────────────────────┐       │
  │   │                  ENCODER MODULE                      │       │
  │   │                                                      │       │
  │   │  Visual Encoder (ViT / ResNet / VQ-VAE)              │       │
  │   │  Action Encoder (MLP / Embedding)                    │       │
  │   │  Condition Encoder (CLIP / T5 / Spatial Encoder)     │       │
  │   │                                                      │       │
  │   └──────────────────────┬───────────────────────────────┘       │
  │                          │                                       │
  │                          v                                       │
  │   ┌──────────────────────────────────────────────────────┐       │
  │   │              WORLD MODEL CORE                        │       │
  │   │                                                      │       │
  │   │  Option A: Autoregressive Transformer                │       │
  │   │    P(z_{t+1} | z_1, ..., z_t, actions, conditions)   │       │
  │   │                                                      │       │
  │   │  Option B: Diffusion Model                           │       │
  │   │    Score function: s_theta(z_t, t, actions, conds)   │       │
  │   │                                                      │       │
  │   │  Option C: Hybrid (AR + Diffusion)                   │       │
  │   │    AR for temporal structure, Diffusion per frame    │       │
  │   │                                                      │       │
  │   └──────────────────────┬───────────────────────────────┘       │
  │                          │                                       │
  │                          v                                       │
  │   ┌──────────────────────────────────────────────────────┐       │
  │   │                  DECODER MODULE                      │       │
  │   │                                                      │       │
  │   │  Camera Decoder ──> Multi-view RGB images            │       │
  │   │  LiDAR Decoder  ──> Point clouds / range images      │       │
  │   │  BEV Decoder    ──> Bird's eye view semantic maps    │       │
  │   │  Agent Decoder  ──> 3D bounding boxes + tracks       │       │
  │   │  Occ Decoder    ──> 3D occupancy grids               │       │
  │   │                                                      │       │
  │   └──────────────────────────────────────────────────────┘       │
  │                                                                  │
  │                         OUTPUTS                                  │
  │   Future multi-modal sensor observations (t+1 to t+H)           │
  └──────────────────────────────────────────────────────────────────┘

Reconstruction vs Generation Pipeline Comparison

RECONSTRUCTION-BASED PIPELINE (Applied Intuition / Waabi style):

  CAPTURE PHASE                RECONSTRUCTION PHASE          SIMULATION PHASE
  ┌──────────────┐            ┌───────────────────┐         ┌──────────────────┐
  │ Drive real   │            │ Per-scene 3D      │         │ Render from new  │
  │ vehicle with │───────────>│ reconstruction    │────────>│ viewpoints       │
  │ sensor suite │            │ (3DGS / NeRF)     │         │                  │
  │              │            │                   │         │ Move agents      │
  │ Collect:     │            │ Optimize:         │         │ Change lighting  │
  │ - Images     │            │ - Gaussian splats │         │ Add rain/fog     │
  │ - LiDAR      │            │ - Neural radiance │         │                  │
  │ - GPS/IMU    │            │ - Signed distance │         │ RE-RENDER the    │
  │ - Boxes      │            │                   │         │ same scene with  │
  └──────────────┘            └───────────────────┘         │ modifications    │
                                                            └──────────────────┘
       Scene A ──> Reconstruct A ──> Simulate in A
       Scene B ──> Reconstruct B ──> Simulate in B    Each scene is separate!
       Scene C ──> Reconstruct C ──> Simulate in C


GENERATIVE PIPELINE (Waymo World Model style):

  DATA PHASE                   TRAINING PHASE               GENERATION PHASE
  ┌──────────────┐            ┌───────────────────┐         ┌──────────────────┐
  │ Collect      │            │ Train ONE world   │         │ Generate ANY     │
  │ large-scale  │───────────>│ model on ALL data │────────>│ scenario:        │
  │ driving data │            │                   │         │                  │
  │              │            │ Learn:            │         │ "Rainy highway   │
  │ 1000s of     │            │ - Visual patterns │         │  with a truck    │
  │ hours across │            │ - Physics rules   │         │  cutting in"     │
  │ many cities  │            │ - Agent behaviors │         │                  │
  │              │            │ - Sensor models   │         │ "Pedestrian runs │
  └──────────────┘            └───────────────────┘         │  across 4-lane   │
                                                            │  road at night"  │
                               ONE model covers             │                  │
                               ALL scenarios                │ "Snowy parking   │
                                                            │  lot with kids"  │
                                                            └──────────────────┘

Multi-Modal Generation Flow

MULTI-MODAL GENERATION FLOW (Camera + LiDAR + BEV)

  Step 1: Encode current multi-modal observations
  ┌─────────────────────────────────────────────────────────┐
  │                                                         │
  │   Front Camera ──┐                                      │
  │   Left Camera  ──┼──> Visual    ──┐                     │
  │   Right Camera ──┤    Encoder     │                     │
  │   Rear Camera  ──┘                │                     │
  │                                   ├──> Fused Latent z_t │
  │   LiDAR Scan ──> Point Cloud  ────┤                     │
  │                   Encoder         │                     │
  │                                   │                     │
  │   HD Map ──────> Map Encoder  ────┘                     │
  │                                                         │
  └─────────────────────────────────────────────────────────┘
                         │
                         v
  Step 2: Apply world model dynamics
  ┌─────────────────────────────────────────────────────────┐
  │                                                         │
  │   z_t + action_t ──> World Model Core ──> z_{t+1}      │
  │                                                         │
  │   The latent z_{t+1} encodes the FULL next world state  │
  │   including all modalities in a shared representation   │
  │                                                         │
  └─────────────────────────────────────────────────────────┘
                         │
                         v
  Step 3: Decode to each modality (geometrically consistent)
  ┌─────────────────────────────────────────────────────────┐
  │                                                         │
  │          z_{t+1}                                        │
  │            │                                            │
  │    ┌───────┼───────┬──────────┐                         │
  │    │       │       │          │                         │
  │    v       v       v          v                         │
  │  Camera  LiDAR   BEV      Occupancy                    │
  │  Decoder Decoder Decoder  Decoder                      │
  │    │       │       │          │                         │
  │    v       v       v          v                         │
  │  ┌────┐ ┌────┐ ┌────────┐ ┌────────┐                   │
  │  │    │ │ .  │ │ Roads  │ │ Voxels │                   │
  │  │    │ │. . │ │ Cars   │ │ 3D occ │                   │
  │  │    │ │.  .│ │ Lanes  │ │ grid   │                   │
  │  └────┘ └────┘ └────────┘ └────────┘                   │
  │  6 views  64-beam  Semantic  256^3                      │
  │  images   LiDAR    map       voxels                     │
  │                                                         │
  │  KEY: All outputs are geometrically consistent because  │
  │  they decode from the SAME latent z_{t+1}               │
  │                                                         │
  └─────────────────────────────────────────────────────────┘

The Scaling Hypothesis for World Models

SCALING BEHAVIOR OF WORLD MODELS

  Quality
  (FVD, lower = better)

  |
  |  *
  |
  |      *                        Scaling law:
  |                               FVD ~ C * N^(-alpha)
  |           *
  |                               where N = model parameters
  |                *              and alpha ~ 0.3-0.5
  |                     *
  |                          *
  |                              * * *  *
  |                                        *  *  *
  +-----+-----+-----+-----+-----+-----+-----+----->
       100M   500M    1B    2B    5B   10B   50B
                    Model Size (parameters)

  Evidence:
  - GAIA-1: 9B params >> smaller variants on all metrics
  - Sora: Scaling improved video quality predictably
  - Genie 3: Larger models = more consistent physics

  Implication: AD world models will get dramatically better
  as compute budgets increase. This is not an architecture
  problem -- it is a scaling problem.
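
A toy illustration of the hypothesized curve (C and alpha are made-up constants chosen only to produce FVD-scale numbers, not fitted values):

# FVD ~ C * N^(-alpha) with illustrative constants, purely qualitative.
C, alpha = 5e5, 0.4
for n in [1e8, 5e8, 1e9, 2e9, 5e9, 1e10, 5e10]:
    print(f"{n:10.0e} params -> projected FVD {C * n ** -alpha:6.1f}")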

Hands-On Exercises

Exercise 1: Build a Minimal World Model

Goal: Implement and train a simple frame prediction model on a toy driving dataset.

Steps:

  1. Use the SimpleActionConditionedPredictor from Code Example 1
  2. Create a synthetic dataset: moving colored squares on a gray background, where the action controls the camera's panning direction (a data-generation sketch follows this exercise)
  3. Train for 5000 steps and visualize predictions vs ground truth
  4. Experiment with different action values and observe how predictions change

Expected outcome: The model should learn that steering left causes the scene to shift right, and acceleration causes objects to grow (approach).

Stretch goal: Add a second object that moves independently of the ego action. Does the model learn to predict both ego-induced and independent motion?
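
A minimal data-generation sketch for step 2 (the square size, pan speed, and value ranges are arbitrary choices, not from any benchmark):

import torch

def make_panning_batch(batch_size=16, H=128, W=256, square=24):
    """Synthetic (frame, action, next_frame) triplets: one colored square
    on a gray background; action[0] pans the camera horizontally."""
    current = torch.full((batch_size, 3, H, W), 0.5)
    target = torch.full((batch_size, 3, H, W), 0.5)
    actions = torch.zeros(batch_size, 2)
    for b in range(batch_size):
        x = torch.randint(square, W - 2 * square, (1,)).item()
        y = torch.randint(0, H - square, (1,)).item()
        color = torch.rand(3, 1, 1)
        steer = torch.empty(1).uniform_(-1, 1).item()
        actions[b, 0] = steer
        shift = int(round(steer * 10))  # positive (left) steer -> scene shifts right
        current[b, :, y:y + square, x:x + square] = color
        x2 = min(max(x + shift, 0), W - square)
        target[b, :, y:y + square, x2:x2 + square] = color
    return current, actions, target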

Exercise 2: Implement Classifier-Free Guidance

Goal: Add classifier-free guidance to the diffusion model from Code Example 2.

Steps:

  1. Modify the training loop to randomly drop conditioning with probability 0.1 (see the sketch after this exercise)
  2. During generation, compute both conditional and unconditional predictions
  3. Apply the CFG formula: pred = uncond + w * (cond - uncond)
  4. Generate scenarios with w = 1.0, 3.0, 7.5, and 15.0
  5. Compare the results: how does increasing w affect:
    • Action responsiveness
    • Visual quality
    • Diversity (generate 10 samples with same condition, measure variance)

Expected outcome: Higher w produces more action-responsive but less diverse outputs. Very high w (>15) causes artifacts.
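
A minimal sketch of step 1's conditioning dropout (hedged: noisy_latents, timesteps, actions, text_emb, and noise are assumed to come from an outer training loop, which Code Example 2 does not include):

# Drop text conditioning for ~10% of batches so the model also learns
# the unconditional distribution that CFG needs at sampling time.
# (Production code usually drops conditioning per-sample, not per-batch.)
drop_prob = 0.1
text_cond = None if torch.rand(()).item() < drop_prob else text_emb
noise_pred = model(noisy_latents, timesteps, actions, text_cond)
loss = F.mse_loss(noise_pred, noise)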

Exercise 3: Multi-View Consistency Check

Goal: Evaluate whether a world model maintains geometric consistency across multiple camera views.

Steps:

  1. Given generated multi-view images (front, left, right cameras)
  2. Run a 3D object detector on each view independently
  3. Project detected 3D boxes from each view into a common BEV coordinate frame
  4. Measure consistency: do the same objects appear at the same 3D locations across views?
  5. Compute a "multi-view consistency score" as IoU of projected boxes, starting from the skeleton below

def multi_view_consistency_score(
    front_detections_bev: list,   # List of (x, y, w, l, yaw) in BEV
    left_detections_bev: list,
    right_detections_bev: list,
    iou_threshold: float = 0.3,
) -> float:
    """
    Compute how consistently objects appear across camera views.

    For each object detected in the front view, check if a matching
    object exists in overlapping regions of left/right views.

    Returns: fraction of front-view objects with cross-view matches.
    """
    # Your implementation here:
    # 1. For each front detection, find nearest detection in left/right
    # 2. Compute BEV IoU between matched pairs
    # 3. Count matches above iou_threshold
    # 4. Return match_count / total_front_detections
    pass

Exercise 4: Action Fidelity Benchmark

Goal: Quantitatively measure whether a world model responds correctly to actions.

Steps:

  1. Generate 100 scenarios with action = "strong left turn"
  2. Generate 100 scenarios with action = "strong right turn"
  3. Generate 100 scenarios with action = "go straight"
  4. For each generated video, compute the average optical flow direction
  5. Verify: left-turn videos should have rightward flow, right-turn should have leftward flow, straight should have forward (downward in image) flow
  6. Compute classification accuracy: can you tell the action from the flow alone? (a classification sketch follows below)

Expected outcome: A good world model should achieve >90% action classification accuracy from optical flow analysis.
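
A hedged sketch of the classification in steps 4-6 (the flow threshold is arbitrary, and mean_flows/true_labels stand in for values you would compute in step 4):

def classify_action_from_flow(mean_h_flow, threshold=0.5):
    # Sign convention matches compute_action_fidelity above: positive
    # steering (left turn) should produce positive (rightward) flow.
    if mean_h_flow > threshold:
        return "left_turn"
    if mean_h_flow < -threshold:
        return "right_turn"
    return "straight"

mean_flows = [2.1, -1.8, 0.1]  # toy values, one per generated video
true_labels = ["left_turn", "right_turn", "straight"]
preds = [classify_action_from_flow(f) for f in mean_flows]
accuracy = sum(p == t for p, t in zip(preds, true_labels)) / len(preds)
print(f"Action classification accuracy: {accuracy:.2f}")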

Exercise 5: Compare AR vs Diffusion Generation

Goal: Understand the quality trade-offs between autoregressive and diffusion world models.

Steps:

  1. Implement a simple AR model (predict frame t+1 from frame t using a CNN)
  2. Implement a simple diffusion model (generate frame t+1 conditioned on frame t)
  3. Train both on the same dataset
  4. Compare:
    • Per-frame quality (PSNR, SSIM; a PSNR helper is sketched below)
    • Temporal consistency (frame-to-frame LPIPS variance)
    • Diversity (generate 10 samples, measure inter-sample variance)
    • Inference speed (wall-clock time per frame)
  5. Plot quality vs sequence length for both models

Expected outcome: AR models degrade faster over long sequences (error accumulation). Diffusion models have higher per-frame quality but are slower.
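
A small helper for the per-frame quality comparison in step 4 (a sketch only; a real evaluation would add SSIM and LPIPS from standard libraries):

import torch
import torch.nn.functional as F

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio for frames scaled to [0, max_val]."""
    mse = F.mse_loss(pred, target)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()

# For step 5, plot per-step PSNR along a rollout, assuming `rollout` and
# `ground_truth` are equal-length lists of [1, 3, H, W] tensors:
# psnr_curve = [psnr(p, g) for p, g in zip(rollout, ground_truth)]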

Exercise 6: Language-Conditioned Scenario Generation

Goal: Add language conditioning to a world model and evaluate controllability.

Steps:

  1. Extend the diffusion model from Code Example 2 to accept CLIP text embeddings
  2. Create a small dataset of (video, caption) pairs:
    • "Car braking in front" -> videos with decelerating lead vehicle
    • "Pedestrian crossing" -> videos with pedestrian in crosswalk
    • "Highway driving" -> videos of open highway
  3. Train with text conditioning (randomly drop text 10% for CFG)
  4. At inference, generate scenarios from text prompts
  5. Evaluate: can a human rater correctly identify which prompt generated which video?
  6. Compute a CLIP-score between generated frames and the text prompt
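
A hedged sketch of the CLIP-score in step 6 using the openai clip package (any CLIP implementation with image and text encoders works; frames are assumed to be already resized to 224x224 and CLIP-normalized):

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def clip_score(frames: torch.Tensor, prompt: str) -> float:
    """Mean cosine similarity between frame and prompt embeddings.
    frames: [T, 3, 224, 224], already CLIP-preprocessed."""
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    img_feat = model.encode_image(frames.to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ text_feat.T).mean().item()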

Interview Questions

Question 1: What is a world model and how does it differ from a traditional simulator?

Answer hints: A world model is a learned neural network that predicts future states of the environment given current state and actions. Unlike traditional simulators that use hand-crafted physics engines, rendering pipelines, and scripted behaviors, world models learn these dynamics from data. Key differences: (1) world models can capture complex behaviors that are hard to hand-code, (2) they generalize to new scenarios by interpolating learned patterns, (3) they may hallucinate physically implausible scenarios since they have no hard physics constraints. The term originates from Ha and Schmidhuber (2018) and has been applied to AD by systems like GAIA-1 and the Waymo World Model.

Question 2: Compare autoregressive and diffusion-based approaches for driving world models. When would you choose one over the other?

Answer hints: Autoregressive models generate frames sequentially, conditioning each frame on all previous ones. They naturally capture temporal dependencies and are conceptually simple (next-token prediction). However, they suffer from error accumulation over long sequences and generate pixels/tokens sequentially within each frame. Diffusion models generate entire frames (or video clips) by iteratively denoising from noise. They produce higher-quality individual frames, naturally handle continuous outputs, and offer strong controllability via CFG. However, they require many denoising steps (slow) and can struggle with long-range temporal consistency. Choose AR for real-time applications or when temporal coherence over many frames is critical. Choose diffusion when per-frame quality and controllability are priorities. Hybrid approaches (like Epona) combine both.

Question 3: How does the Waymo World Model leverage DeepMind's Genie 3?

Answer hints: Genie 3 provides the foundation architecture -- a scalable spatiotemporal transformer designed for interactive environment generation. Key elements carried over: (1) latent action models that learn meaningful action representations from data rather than requiring predefined action spaces, (2) an architecture optimized for generating long, consistent video sequences that respond causally to input actions, (3) scalable training infrastructure. Waymo extends this with AD-specific capabilities: multi-sensor output (camera + LiDAR), driving-specific conditioning (HD maps, traffic rules), and language controllability for rare event generation.

Question 4: What are the main evaluation metrics for a driving world model? Why is evaluation difficult?

Answer hints: Image quality (FID, LPIPS, PSNR, SSIM), video quality (FVD), driving-specific metrics (action fidelity, physical plausibility, temporal consistency, multi-view consistency), and downstream task performance (train a perception model on generated data, test on real data). Evaluation is difficult because: (1) there is no single ground truth for generated scenarios (many valid futures exist), (2) standard image metrics (FID) do not capture driving-specific quality (physics, consistency), (3) human evaluation is expensive and subjective, (4) the best metric -- downstream AD stack performance -- is extremely expensive to compute.

Question 5: Explain the controllability-diversity trade-off in world models. How is it managed in practice?

Answer hints: Higher controllability (the output closely follows the conditioning signal) reduces diversity (fewer varied outputs for the same condition). This is managed primarily through classifier-free guidance (CFG) weight: higher weight increases control at the cost of diversity. Temperature scaling and sampling strategies (DDIM vs DDPM) also affect this trade-off. In practice, different use cases need different settings: safety testing needs high control (specific scenario), training data generation needs high diversity (varied examples), and coverage testing needs both (diverse variants of specific scenario types).

Question 6: How can world models generate safety-critical scenarios that are rare in training data?

Answer hints: Several approaches: (1) Language conditioning -- describe the rare scenario in natural language ("pedestrian darts from behind parked bus"); (2) Guided sampling -- use a reward model to bias generation toward high-risk scenarios; (3) Latent space manipulation -- interpolate between known scenarios in latent space to create novel combinations; (4) Compositional generation -- combine common elements (intersection + cyclist + occluder) to create rare compositions; (5) Adversarial generation -- optimize the conditioning to find scenarios that cause the AD stack to fail. The Waymo World Model specifically emphasizes rare event generation as a key capability.

Question 7: What is the three-tier taxonomy of AD world models?

Answer hints: Tier 1 (Future Physical World Generation) generates sensor observations -- images, point clouds, BEV maps, occupancy grids. Used for data augmentation and sensor simulation. Tier 2 (Agent Behavior Prediction) predicts future trajectories of other agents. Used for motion forecasting and behavior simulation. Tier 3 (Interactive Prediction + Planning) integrates the world model into the planning loop -- the planner imagines consequences of actions using the world model before committing. This is model-based RL applied to driving. Most current production systems use Tier 1 and 2. Tier 3 is an active research frontier.

Question 8: How do world models handle multi-sensor consistency (camera + LiDAR)?

Answer hints: Three main approaches: (1) Independent generation with consistency loss -- separate models for each sensor, trained with a loss encouraging agreement. Simple but weak consistency. (2) Shared backbone with modality-specific heads -- a single encoder-decoder with shared latent representations and separate output heads. Better consistency through shared features. (3) Generate 3D first, then render -- create a 3D scene representation (voxels, point cloud, neural field) and render/raytrace to each sensor modality. Best consistency but most expensive. The Waymo World Model uses a shared representation approach where camera and LiDAR features are jointly processed.

Question 9: Compare reconstruction-based simulation (3DGS/NeRF) with generative world models. When would you use each?

Answer hints: Reconstruction excels at: exact replay of real routes, high geometric accuracy, sensor-level fidelity for specific scenes. Use it for regression testing, perception validation on known routes, and when you need deterministic, reproducible results. Generative world models excel at: novel scenario creation, long-tail event simulation, scalable scenario coverage, and language-controlled specification. Use them for safety scenario exploration, training data augmentation, and testing against scenarios not in your driving logs. The industry is converging on hybrid approaches that use reconstruction for the base scene and generation for dynamic elements, weather, and novel situations.

Question 10: What are the main unsolved challenges in AD world models as of 2026?

Answer hints: (1) Long-horizon consistency -- maintaining coherent, physically valid generation over minutes, not just seconds; (2) Geometric precision -- ensuring generated 3D structure is accurate enough for planning (pixel-level realism does not mean geometric accuracy); (3) Real-time generation -- current models are too slow for hardware-in-the-loop testing; (4) Evaluation -- no consensus on how to measure whether a world model is "good enough" for safety validation; (5) Guaranteed physical validity -- preventing the model from generating impossible scenarios (cars clipping through walls); (6) Multi-agent interaction -- generating realistic reactive behaviors for all agents, not just ego-centric prediction; (7) Sim-to-real transfer -- ensuring that AD systems tested in world-model simulation perform similarly in the real world.


References

Foundational Papers

  1. Ha, D. and Schmidhuber, J. (2018). "World Models." arXiv:1803.10122. The foundational paper defining world models for RL agents.

  2. Ho, J., Jain, A., and Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS 2020. The paper that launched modern diffusion models.

  3. Rombach, R. et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. Latent diffusion (Stable Diffusion) -- foundation for many driving world models.

Key AD World Model Papers

  1. Hu, A. et al. (2023). "GAIA-1: A Generative World Model for Autonomous Driving." arXiv:2309.17080. Wayve's 9B parameter world model.

  2. Wang, W. et al. (2023). "DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving." arXiv:2309.09777. First diffusion-based world model from real driving scenarios.

  3. Yang, Z. et al. (2024). "GenAD: Generalized Predictive Model for Autonomous Driving." CVPR 2024. Action-conditioned generation with temporal reasoning.

  4. Zheng, W. et al. (2024). "OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving." ECCV 2024. World model in 3D occupancy space.

  5. Gao, G. et al. (2024). "Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability." arXiv:2405.17398.

Surveys

  1. Wang, X. et al. (2024). "World Models for Autonomous Driving: An In-Depth Survey." arXiv:2403.02622. Comprehensive survey with taxonomy.

  2. Gao, Y. et al. (2024). "A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming." Covers video generation foundations.

Industry Systems

  1. Bruce, J. et al. (2024). "Genie: Generative Interactive Environments." ICML 2024. DeepMind's interactive environment generation (foundation for Genie 3).

  2. Yang, L. et al. (2024). "Generative Data-Driven Simulation." NVIDIA Dreamland -- hybrid physics + neural generation.

  3. Waymo (2026). "Introducing the Waymo World Model." Blog post describing the Genie 3-based world model for camera + LiDAR generation.

Reconstruction-Based Comparisons

  1. Yang, J. et al. (2023). "UniSim: A Neural Closed-Loop Sensor Simulator." CVPR 2023. Waabi's neural rendering approach.

  2. Kerbl, B. et al. (2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering." SIGGRAPH 2023. Foundation for 3DGS-based simulation.

  3. Yan, Z. et al. (2024). "Street Gaussians for Modeling Dynamic Urban Scenes." ECCV 2024. 3DGS applied to driving scenes.

Additional Resources

  1. NVIDIA Cosmos (2025). Foundation world model platform for physical AI.

  2. OpenAI Sora (2024). Large-scale video generation model demonstrating scaling laws for video quality.

  3. Waymax (Gulino et al., 2023). "Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research." NeurIPS 2023. Complements world models as a behavior-level simulator.

  4. WOSAC Challenge (2024). Waymo Open Sim Agents Challenge -- benchmark for evaluating agent behavior simulation quality.