Track C: Agent Behavior & Infrastructure · Intermediate · 15-20 hours

Closed-Loop Evaluation System

Build evaluation harness comparing open-loop vs closed-loop performance.

Simulation · Metrics · Python

Build an evaluation harness that exposes the dangerous gap between open-loop metrics and closed-loop reality -- the gap that separates simulation theater from safety-critical engineering.


Overview

In this project you will build a complete evaluation harness that runs the same autonomous driving policy in two fundamentally different simulation modes -- open-loop replay and closed-loop reactive simulation -- and then rigorously quantifies how well one predicts the other. The system takes a set of WOMD scenarios and a driving policy as input, executes evaluation in both modes using Waymax, computes a comprehensive suite of driving quality metrics, and produces a statistical analysis of the correlation gap between the two evaluation regimes.

This matters because the autonomous driving industry has repeatedly discovered that open-loop metrics (Average Displacement Error, Final Displacement Error, miss rate) are unreliable proxies for real-world driving performance. A policy that achieves state-of-the-art ADE on a motion forecasting benchmark can perform catastrophically in closed-loop deployment because open-loop evaluation ignores compounding errors, distributional shift, and multi-agent interaction effects. Closed-loop evaluation -- where all agents react to each other in real time -- is the gold standard for validating AD systems, and understanding why open-loop metrics fail is essential knowledge for anyone building or evaluating driving policies.

Your deliverable is a modular Python evaluation framework consisting of: (1) open-loop and closed-loop evaluator modules built on Waymax, (2) a metric computation suite covering safety, comfort, progress, and realism dimensions, (3) a correlation analysis pipeline that quantifies which open-loop metrics are predictive and which are misleading, (4) a policy comparison framework that demonstrates rank reversals across evaluation modes, and (5) a visual dashboard summarizing all findings. The entire system should be reproducible, well-documented, and extensible to new policies and metrics.


Learning Objectives

By completing this project, you will be able to:

  • Understand the structural differences between open-loop and closed-loop simulation and articulate why this distinction is the most important axis in AD evaluation design.
  • Implement evaluation protocols in Waymax using both waymax.env.PlanningAgentEnvironment for closed-loop and direct trajectory replay for open-loop, with proper scenario lifecycle management.
  • Compute a comprehensive suite of driving quality metrics including collision rate, off-road rate, route progress, jerk, lateral acceleration, time-to-collision, and WOSAC-style log-divergence measures.
  • Build a reproducible evaluation harness with clean separation between simulation execution, metric computation, aggregation, and reporting, following software engineering best practices.
  • Perform statistical analysis of correlation gaps using Pearson and Spearman correlations, paired significance tests, and confidence intervals to determine which open-loop metrics are predictive of closed-loop outcomes.
  • Create clear visualizations of evaluation results including radar plots, correlation heatmaps, scatter plots with regression lines, and per-scenario failure breakdowns that communicate findings to both technical and non-technical audiences.

Prerequisites

  • Required: Python proficiency (comfortable with NumPy, dataclasses, and module organization), familiarity with Waymax basics (loading scenarios, stepping simulations, accessing state), and a conceptual understanding of simulation in the AD context.
  • Recommended: JAX fundamentals (pure functions, jax.jit, jax.vmap), basic statistics (correlation coefficients, hypothesis testing, confidence intervals), and experience with Matplotlib or a similar plotting library.
  • Deep Dive Reading:
    • Waymax Deep Dive -- Comprehensive guide to Waymax architecture, the simulation loop, and the data model. Essential background for Steps 2-4.
    • WOSAC Challenge Deep Dive -- Detailed coverage of realism metrics, evaluation protocols, and the WOSAC benchmark. Directly relevant to the metric computation and correlation analysis portions of this project.

Key Concepts

Open-Loop vs Closed-Loop Simulation

The distinction between open-loop and closed-loop simulation is the single most important concept in AD evaluation.

In open-loop evaluation, the ego vehicle's planned trajectory is compared against recorded ground-truth trajectories, but the simulation never actually executes those plans. Other agents follow their logged trajectories regardless of what the ego vehicle does. This is equivalent to asking: "Given this exact traffic situation at time t, what would the policy do?" -- but never asking the follow-up question: "And then what happens next, given that the ego actually moved?"

Open-Loop (Replay):
  t=0: Observe logged state -> Policy predicts trajectory -> Compare to logged trajectory
  t=1: Observe LOGGED state (not affected by policy) -> Predict -> Compare
  t=2: Observe LOGGED state -> Predict -> Compare
  ...
  Result: ADE = 1.2m, FDE = 2.8m  (looks good!)

In closed-loop evaluation, the ego vehicle's actions are actually executed in the simulator, the world state updates accordingly, and other agents react (or at minimum, the ego must deal with the consequences of its own prior actions). This creates a feedback loop where small errors compound over time.

Closed-Loop (Reactive):
  t=0: Observe state -> Policy acts -> World updates (all agents react)
  t=1: Observe NEW state (affected by policy) -> Policy acts -> World updates
  t=2: Observe NEW state (errors compound) -> Policy acts -> World updates
  ...
  Result: Collision at t=14, off-road at t=22  (reality check!)

In Waymax, you control this distinction through the environment configuration. The waymax.env.PlanningAgentEnvironment class supports both modes: you can configure which agents are controlled by the policy under test and which follow their logged trajectories. For full closed-loop evaluation, non-ego agents can be driven by reactive models like the Intelligent Driver Model (IDM) via waymax.agents.IDMRoutePolicy.

The Correlation Gap

Research from multiple groups (including Waymo and Motional, whose nuPlan benchmark was built around this problem) has consistently found that open-loop metrics are poor predictors of closed-loop performance. A 2023 analysis on the nuPlan benchmark showed that the Spearman rank correlation between open-loop ADE and closed-loop collision rate was as low as 0.15 -- barely better than random.

The reasons are structural:

  1. Compounding errors: In open-loop, each prediction starts from the ground-truth state. In closed-loop, errors at time t propagate to time t+1. A policy with 0.5m ADE per step can drift meters off course over a 5-second rollout (a toy simulation of this effect appears after this list).

  2. Distributional shift: The policy was trained (or the logged trajectory was recorded) under a specific distribution of states. Once the ego deviates from the logged path, it encounters states that may be out of distribution, causing cascading failures.

  3. Interaction effects: Open-loop evaluation cannot capture how other agents respond to the ego's behavior. A lane change that looks smooth in open-loop might force another vehicle to brake hard in closed-loop, creating a dangerous situation that open-loop metrics cannot detect.

  4. Metric insensitivity: Open-loop metrics like ADE treat all displacement errors equally. A 2m error that places the ego in an adjacent (empty) lane scores the same as a 2m error that places the ego inside another vehicle. Closed-loop metrics naturally capture this distinction through collision detection.
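
The compounding effect in (1) can be demonstrated with a toy model that has nothing to do with driving: draw the same per-step position noise in both regimes, but let open-loop reset to ground truth at each step while closed-loop inherits its accumulated drift. A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
steps, sigma = 50, 0.5  # 5 s at 10 Hz, 0.5 m of position noise per step

noise = rng.normal(0.0, sigma, size=steps)
open_loop_error = np.abs(noise)               # each step restarts from ground truth
closed_loop_error = np.abs(np.cumsum(noise))  # each step inherits prior drift

print(f"open-loop mean error:    {open_loop_error.mean():.2f} m")
print(f"closed-loop final drift: {closed_loop_error[-1]:.2f} m")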

Driving Quality Metrics

A comprehensive evaluation requires metrics across multiple dimensions:

Safety Metrics measure how dangerous the driving behavior is:

  • Collision rate: Fraction of scenarios where the ego overlaps with another object (vehicle, pedestrian, cyclist). In Waymax, check overlap using waymax.metrics.OverlapMetric or compute bounding-box intersection directly from sim_state.sim_trajectory.
  • Time-to-collision (TTC): Minimum time until a collision would occur given current velocities. Lower values indicate more dangerous situations even if no collision occurs.
  • Near-miss rate: Fraction of scenarios where TTC drops below a threshold (e.g., 1.5 seconds) without actual collision.

Comfort Metrics measure ride quality and human acceptability:

  • Longitudinal jerk: Rate of change of acceleration. High jerk indicates jerky, uncomfortable driving. Compute as the finite difference of acceleration over the trajectory.
  • Lateral acceleration: Acceleration perpendicular to the heading. High values indicate aggressive turning.
  • Lane deviation: Distance from lane center, measuring how well the ego tracks its intended lane.

Progress Metrics measure whether the ego accomplishes its driving objective:

  • Route completion rate: Fraction of the planned route that is completed before the scenario ends or the ego fails.
  • Speed maintenance: How close the ego's speed is to the road's speed limit or the flow of traffic.
  • Time efficiency: Ratio of actual travel time to optimal travel time.

Realism Metrics measure how human-like the driving behavior is:

  • Log-divergence: Distance between the ego's trajectory and the logged (human) trajectory, weighted by time. This is the core WOSAC metric.
  • Kinematic Infeasibility Rate (KIR): Fraction of timesteps where the ego's state transition violates kinematic constraints (e.g., exceeds maximum steering rate or acceleration). Waymax provides waymax.metrics.KinematicsInfeasibilityMetric for this.
  • Speed distribution divergence: KL-divergence between the ego's speed distribution and the distribution in the logged data.
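
As a sketch of the speed-distribution divergence above: histogram both speed samples onto shared bins and compute a smoothed KL divergence. The bin count and smoothing constant here are illustrative choices, not a WOSAC specification.

import numpy as np

def speed_kl_divergence(ego_speeds: np.ndarray, log_speeds: np.ndarray,
                        num_bins: int = 20, eps: float = 1e-6) -> float:
    """KL(ego || log) between histogrammed speed distributions."""
    hi = float(max(ego_speeds.max(), log_speeds.max())) + eps
    bins = np.linspace(0.0, hi, num_bins + 1)
    p, _ = np.histogram(ego_speeds, bins=bins)
    q, _ = np.histogram(log_speeds, bins=bins)
    p = (p + eps) / (p + eps).sum()  # additive smoothing avoids log(0)
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))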

Waymax Evaluation Framework

Waymax provides the building blocks for both evaluation modes through a consistent API:

  • waymax.datatypes.SimulatorState: The central state object containing trajectories, validity masks, and metadata for all agents across all timesteps. This is your primary data source for metric computation.
  • waymax.env.PlanningAgentEnvironment: Wraps the Waymax dynamics model into a standard environment interface with reset(), step(), and observe() methods. Configure controlled_agents to specify which agents are driven by the policy under test.
  • waymax.metrics: Module containing built-in metric functions including OverlapMetric, OffroadMetric, KinematicsInfeasibilityMetric, and LogDivergenceMetric. Each operates on SimulatorState and returns per-agent, per-timestep values.
  • waymax.agents: Module containing baseline agents. IDMRoutePolicy implements the Intelligent Driver Model for reactive non-ego agents. LogPlaybackPolicy replays logged trajectories.
  • waymax.dynamics: Dynamics models (InvertibleBicycleModel, DeltaLocal, DeltaGlobal) that govern how actions translate to state transitions. The choice of dynamics model affects which kinematic constraints are enforced.

For batch evaluation across many scenarios, use jax.vmap over the scenario dimension to evaluate hundreds of scenarios in parallel on GPU/TPU, achieving the throughput needed for statistically meaningful results.
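
As a sketch of that pattern, a metric written for a single (T, 5) trajectory lifts to a stacked (B, T, 5) batch with jax.vmap; the stacking of scenarios into one array is assumed to happen upstream.

import jax
import jax.numpy as jnp

def ade(pred: jnp.ndarray, logged: jnp.ndarray) -> jnp.ndarray:
    """Average displacement error for one (T, 5) trajectory pair."""
    return jnp.mean(jnp.linalg.norm(pred[:, :2] - logged[:, :2], axis=-1))

# Lift to a batch of scenarios and compile once: (B, T, 5) x (B, T, 5) -> (B,)
batched_ade = jax.jit(jax.vmap(ade))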

Statistical Evaluation

Evaluating a driving policy on a single scenario is meaningless. Robust evaluation requires:

  • Multiple scenarios: Evaluate across hundreds or thousands of WOMD scenarios to capture the diversity of driving situations (highway merges, unprotected left turns, pedestrian crossings, etc.).
  • Confidence intervals: Report metrics with bootstrap confidence intervals rather than point estimates. A collision rate of 3.2% +/- 0.8% communicates uncertainty that a bare number does not.
  • Paired tests: When comparing two policies, use paired statistical tests (Wilcoxon signed-rank or paired t-test) that account for scenario-level pairing. Policy A and Policy B evaluated on the same scenario are not independent samples.
  • Effect size: Beyond p-values, report effect sizes (Cohen's d or rank-biserial correlation) to distinguish statistically significant but practically meaningless differences from genuinely important ones.
  • Multiple comparisons: When comparing many metrics across many policies, apply corrections (Bonferroni or Benjamini-Hochberg) to control false discovery rate.
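
The Benjamini-Hochberg procedure from the last bullet is a few lines of NumPy. This sketch operates on the flat list of p-values produced by the correlation tests in Step 5; statsmodels' multipletests offers the same correction if you prefer a library implementation.

import numpy as np

def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.nonzero(below)[0].max())  # largest i with p_(i) <= (i/m) * alpha
        rejected[order[: k + 1]] = True
    return rejected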

Step-by-Step Implementation Guide

Step 1: Environment Setup (45 min)

Set up the project structure and verify that all dependencies work correctly.

1.1 Project Directory Structure

closed-loop-eval/
├── eval/
│   ├── __init__.py
│   ├── open_loop.py          # Open-loop evaluator
│   ├── closed_loop.py        # Closed-loop evaluator
│   ├── metrics.py            # Metric computation suite
│   ├── correlation.py        # Correlation analysis
│   ├── comparison.py         # Policy comparison framework
│   └── dashboard.py          # Visualization and reporting
├── configs/
│   └── eval_config.py        # Evaluation configuration dataclass
├── notebooks/
│   ├── 01_open_vs_closed_loop.ipynb
│   ├── 02_driving_metrics.ipynb
│   ├── 03_correlation_analysis.ipynb
│   └── 04_evaluation_harness.ipynb
├── results/                  # Output directory for evaluation results
├── requirements.txt
└── run_eval.py               # Main entry point

1.2 Install Dependencies

python -m venv .venv
source .venv/bin/activate

pip install jax jaxlib
pip install waymax  # or install from source per Waymax docs
pip install numpy matplotlib scipy pandas seaborn

1.3 Verify Setup

import jax
import jax.numpy as jnp
from waymax import config as waymax_config
from waymax import dataloader
from waymax import datatypes
from waymax import dynamics
from waymax import env as waymax_env
from waymax import agents

print(f"JAX devices: {jax.devices()}")
print(f"JAX version: {jax.__version__}")

# Load a single scenario to verify data pipeline
data_config = waymax_config.DatasetConfig(
    path='path/to/womd/training',
    max_num_objects=32,
    data_format=waymax_config.DataFormat.WOMD
)
data_iter = dataloader.simulator_state_generator(config=data_config)
scenario = next(data_iter)
print(f"Scenario loaded: {scenario.shape}")
print(f"Number of objects: {scenario.log_trajectory.shape[-2]}")
print(f"Number of timesteps: {scenario.log_trajectory.shape[-1]}")

1.4 Define Evaluation Configuration

# configs/eval_config.py
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EvalConfig:
    """Configuration for the evaluation harness."""
    # Data
    data_path: str = 'path/to/womd/validation'
    max_num_objects: int = 32
    num_scenarios: int = 100  # Number of scenarios to evaluate

    # Simulation
    sim_steps: int = 80  # 8 seconds at 10 Hz
    dt: float = 0.1

    # Metrics
    ttc_threshold: float = 1.5  # seconds, for near-miss detection
    max_jerk: float = 10.0  # m/s^3, comfort threshold
    max_lat_accel: float = 4.0  # m/s^2, comfort threshold

    # Analysis
    confidence_level: float = 0.95
    bootstrap_samples: int = 1000

    # Output
    results_dir: str = 'results/'
    export_html: bool = True

Step 2: Implement Open-Loop Evaluator (2 hours)

The open-loop evaluator runs a driving policy against logged scenarios where other agents are unaffected by the ego's actions and the ego always observes ground-truth states.

2.1 Core Open-Loop Evaluation Loop

# eval/open_loop.py
import jax
import jax.numpy as jnp
from waymax import datatypes
from waymax import dynamics
from typing import Dict, Any, Callable, List
from dataclasses import dataclass

@dataclass
class OpenLoopResult:
    """Results from a single open-loop evaluation."""
    scenario_id: str
    ego_trajectory: jnp.ndarray       # (T, 5) - x, y, heading, vx, vy
    logged_trajectory: jnp.ndarray    # (T, 5) - ground-truth
    other_trajectories: jnp.ndarray   # (N, T, 5) - other agents (from log)
    metrics: Dict[str, float]
    valid_mask: jnp.ndarray           # (T,) - which timesteps are valid

def evaluate_open_loop(
    scenario: datatypes.SimulatorState,
    policy_fn: Callable,
    config: 'EvalConfig'
) -> OpenLoopResult:
    """
    Evaluate a policy in open-loop mode.

    In open-loop, at each timestep:
    1. The policy observes the LOGGED state (ground truth)
    2. The policy produces an action or trajectory
    3. We record the policy's output but do NOT update the sim state
    4. We move to the next LOGGED timestep

    This means the policy always sees perfect information and never
    has to deal with the consequences of its own actions.
    """
    num_steps = config.sim_steps
    ego_idx = scenario.object_metadata.is_sdc  # Boolean mask for ego

    ego_predictions = []

    for t in range(num_steps):
        # Policy observes the ground-truth logged state at time t
        observation = _extract_observation(scenario, t)

        # Policy predicts action or trajectory
        action = policy_fn(observation, t)
        ego_predictions.append(action)

    ego_trajectory = jnp.stack(ego_predictions, axis=0)
    logged_trajectory = scenario.log_trajectory[ego_idx]

    # Compute open-loop metrics (ADE, FDE, miss rate, etc.)
    metrics = _compute_open_loop_metrics(
        ego_trajectory, logged_trajectory, scenario, config
    )

    return OpenLoopResult(
        scenario_id=_get_scenario_id(scenario),
        ego_trajectory=ego_trajectory,
        logged_trajectory=logged_trajectory,
        other_trajectories=scenario.log_trajectory[~ego_idx],
        metrics=metrics,
        valid_mask=scenario.log_trajectory.valid[ego_idx, :num_steps]
    )

2.2 Open-Loop Metric Computation

def _compute_open_loop_metrics(
    predicted: jnp.ndarray,
    logged: jnp.ndarray,
    scenario: datatypes.SimulatorState,
    config: 'EvalConfig'
) -> Dict[str, float]:
    """Compute standard open-loop metrics."""
    # Position displacement at each timestep
    displacement = jnp.linalg.norm(
        predicted[:, :2] - logged[:, :2], axis=-1
    )

    metrics = {
        # Average Displacement Error
        'ade': float(jnp.mean(displacement)),
        # Final Displacement Error
        'fde': float(displacement[-1]),
        # Miss rate (FDE > 2m threshold)
        'miss_rate_2m': float(displacement[-1] > 2.0),
        # Heading error (radians)
        'heading_error_mean': float(jnp.mean(
            jnp.abs(_wrap_angle(predicted[:, 2] - logged[:, 2]))
        )),
        # Speed error
        'speed_error_mean': float(jnp.mean(jnp.abs(
            jnp.linalg.norm(predicted[:, 3:5], axis=-1) -
            jnp.linalg.norm(logged[:, 3:5], axis=-1)
        ))),
    }
    return metrics

2.3 Batch Open-Loop Evaluation

def evaluate_open_loop_batch(
    scenarios: List[datatypes.SimulatorState],
    policy_fn: Callable,
    config: 'EvalConfig'
) -> List[OpenLoopResult]:
    """Evaluate a policy across multiple scenarios in open-loop."""
    results = []
    for scenario in scenarios:
        result = evaluate_open_loop(scenario, policy_fn, config)
        results.append(result)
    return results

def aggregate_open_loop_results(
    results: List[OpenLoopResult]
) -> Dict[str, Any]:
    """Aggregate open-loop results across scenarios."""
    all_metrics = {}
    for key in results[0].metrics.keys():
        values = [r.metrics[key] for r in results]
        all_metrics[key] = {
            'mean': float(jnp.mean(jnp.array(values))),
            'std': float(jnp.std(jnp.array(values))),
            'median': float(jnp.median(jnp.array(values))),
            'values': values,  # Keep per-scenario values for correlation
        }
    return all_metrics

Key implementation notes for Step 2:

  • The logged replay baseline (using waymax.agents.LogPlaybackPolicy) should produce near-zero open-loop error by construction -- use this as a sanity check; a minimal version is sketched after these notes.
  • Store per-scenario results with scenario IDs so you can pair them with closed-loop results in Step 5.
  • Consider using jax.vmap over the scenario batch dimension if scenarios are stacked into a single array, but be mindful of memory for large batches.
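
A minimal version of that sanity check, assuming the Trajectory fields x, y, yaw, vel_x, vel_y and the scenario/config objects from Step 1 (a hand-rolled stand-in for LogPlaybackPolicy, adapted to the policy_fn(observation, t) interface used above):

import jax.numpy as jnp

def make_log_replay_policy(scenario):
    """Stand-in for log playback: return the logged ego state at each step."""
    ego = scenario.object_metadata.is_sdc
    traj = scenario.log_trajectory

    def policy_fn(observation, t):
        # Logged ego state [x, y, heading, vx, vy] at timestep t
        return jnp.stack([
            traj.x[ego, t], traj.y[ego, t], traj.yaw[ego, t],
            traj.vel_x[ego, t], traj.vel_y[ego, t],
        ], axis=-1).squeeze(0)

    return policy_fn

result = evaluate_open_loop(scenario, make_log_replay_policy(scenario), config)
assert result.metrics['ade'] < 1e-3, "log replay should give near-zero ADE"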

Step 3: Implement Closed-Loop Evaluator (2.5 hours)

The closed-loop evaluator runs the same policy but in reactive simulation where the ego's actions affect the world state and (optionally) other agents react.

3.1 Core Closed-Loop Evaluation Loop

# eval/closed_loop.py
import jax
import jax.numpy as jnp
from waymax import config as waymax_config
from waymax import datatypes
from waymax import dynamics
from waymax import env as waymax_env
from waymax import agents
from typing import Dict, Any, Callable, List
from dataclasses import dataclass

@dataclass
class ClosedLoopResult:
    """Results from a single closed-loop evaluation."""
    scenario_id: str
    ego_trajectory: jnp.ndarray       # (T, 5) actual executed trajectory
    logged_trajectory: jnp.ndarray    # (T, 5) what the log says
    other_trajectories: jnp.ndarray   # (N, T, 5) other agents' actual paths
    metrics: Dict[str, float]
    valid_mask: jnp.ndarray
    terminated_early: bool
    termination_reason: str           # 'completed', 'collision', 'off_road'
    termination_step: int

def evaluate_closed_loop(
    scenario: datatypes.SimulatorState,
    policy_fn: Callable,
    config: 'EvalConfig',
    non_ego_policy: str = 'log_playback'  # 'log_playback' or 'idm'
) -> ClosedLoopResult:
    """
    Evaluate a policy in closed-loop mode.

    In closed-loop, at each timestep:
    1. The policy observes the CURRENT sim state (affected by prior actions)
    2. The policy produces an action
    3. The action is executed via the dynamics model
    4. The world state updates (and optionally other agents react)
    5. We move to the NEXT state (which reflects the ego's action)

    Errors compound. Distributional shift occurs. Reality bites.
    """
    # Configure Waymax environment for closed-loop
    env_config = waymax_config.EnvironmentConfig(
        controlled_object=waymax_config.ObjectType.SDC,
    )

    dynamics_model = dynamics.InvertibleBicycleModel()

    env = waymax_env.PlanningAgentEnvironment(
        dynamics_model=dynamics_model,
        config=env_config,
    )

    # Select non-ego agent policy
    if non_ego_policy == 'idm':
        bg_agent = agents.IDMRoutePolicy(
            is_controlled_func=lambda state: ~state.object_metadata.is_sdc
        )
    else:  # log_playback
        bg_agent = agents.LogPlaybackPolicy()

    # Initialize simulation
    state = env.reset(scenario)
    ego_states = []
    terminated_early = False
    termination_reason = 'completed'
    termination_step = config.sim_steps

    for t in range(config.sim_steps):
        # Extract observation for the ego policy
        observation = env.observe(state)

        # Ego policy produces action
        ego_action = policy_fn(observation, t)

        # Background agents produce actions
        bg_action = bg_agent.select_action(state)

        # Combine actions and step
        # (Implementation depends on how you merge ego + background actions)
        combined_action = _merge_actions(ego_action, bg_action, state)
        state = env.step(state, combined_action)

        # Record ego state
        ego_states.append(_extract_ego_state(state))

        # Check termination conditions
        if _check_collision(state):
            terminated_early = True
            termination_reason = 'collision'
            termination_step = t
            break
        if _check_offroad(state):
            terminated_early = True
            termination_reason = 'off_road'
            termination_step = t
            break

    ego_trajectory = jnp.stack(ego_states, axis=0)

    # Compute closed-loop metrics
    metrics = _compute_closed_loop_metrics(
        ego_trajectory, state, scenario, config,
        termination_step, termination_reason
    )

    return ClosedLoopResult(
        scenario_id=_get_scenario_id(scenario),
        ego_trajectory=ego_trajectory,
        logged_trajectory=scenario.log_trajectory[
            scenario.object_metadata.is_sdc
        ],
        other_trajectories=_extract_other_trajectories(state),
        metrics=metrics,
        valid_mask=jnp.ones(len(ego_states), dtype=bool),
        terminated_early=terminated_early,
        termination_reason=termination_reason,
        termination_step=termination_step,
    )

3.2 Closed-Loop Specific Considerations

def _compute_closed_loop_metrics(
    ego_trajectory: jnp.ndarray,
    final_state: datatypes.SimulatorState,
    scenario: datatypes.SimulatorState,
    config: 'EvalConfig',
    termination_step: int,
    termination_reason: str,
) -> Dict[str, float]:
    """
    Compute closed-loop specific metrics.

    Key difference from open-loop: these metrics reflect what
    ACTUALLY HAPPENED in the simulation, not what would have
    happened in an idealized replay.
    """
    metrics = {}

    # Binary outcomes
    metrics['collision'] = float(termination_reason == 'collision')
    metrics['off_road'] = float(termination_reason == 'off_road')
    metrics['completed'] = float(termination_reason == 'completed')

    # Progress metrics (how far the ego got before termination)
    metrics['progress_ratio'] = termination_step / config.sim_steps
    metrics['distance_traveled'] = float(_compute_distance(ego_trajectory))

    # Log divergence (how far ego deviated from the human trajectory)
    logged = scenario.log_trajectory[scenario.object_metadata.is_sdc]
    displacement = jnp.linalg.norm(
        ego_trajectory[:termination_step, :2] -
        logged[:termination_step, :2],
        axis=-1
    )
    metrics['log_divergence_mean'] = float(jnp.mean(displacement))
    metrics['log_divergence_final'] = float(displacement[-1])

    return metrics

3.3 Handling Early Termination

A critical design decision: when the ego collides or goes off-road, do you stop the simulation or continue? Both choices have implications:

  • Stop on failure: Reflects real-world consequences (a collision ends the scenario). Progress metrics are meaningful. But you get fewer timesteps of data for metric computation.
  • Continue after failure: Gives you a full trajectory for metric computation and allows observing recovery behavior. But collision metrics become ambiguous (does one collision count the same as 10 overlapping timesteps?).

The recommended approach: stop on first failure but record a full set of metrics up to the termination point. For metrics that require a full trajectory (like comfort metrics), pad or normalize by the number of valid timesteps.
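
One way to implement that normalization, sketched here: compute the per-timestep metric over the full horizon, then average only over the valid prefix.

import jax.numpy as jnp

def masked_mean(per_step: jnp.ndarray, valid: jnp.ndarray) -> float:
    """Mean of a per-timestep metric over valid timesteps only."""
    w = valid.astype(per_step.dtype)
    return float(jnp.sum(per_step * w) / jnp.maximum(jnp.sum(w), 1.0))

# e.g. for a rollout that terminated at step 14 of 80:
# mean_jerk = masked_mean(jerk_magnitude, valid_mask[: jerk_magnitude.shape[0]])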


Step 4: Build Metric Computation Suite (2.5 hours)

Implement a unified metric computation module that works with both open-loop and closed-loop trajectories.

4.1 Safety Metrics

# eval/metrics.py
import jax
import jax.numpy as jnp
from waymax import datatypes
from waymax import metrics as waymax_metrics
from typing import Dict

def compute_collision_rate(
    ego_traj: jnp.ndarray,
    other_trajs: jnp.ndarray,
    ego_extent: jnp.ndarray,
    other_extents: jnp.ndarray,
) -> Dict[str, float]:
    """
    Compute collision metrics using bounding box overlap.

    For Waymax-native collision detection, use:
        waymax.metrics.OverlapMetric
    which handles the rotated bounding box geometry correctly.
    """
    # Simplified axis-aligned check (for illustration)
    # In production, use waymax_metrics.OverlapMetric on SimulatorState
    collisions_per_step = _check_bbox_overlap(
        ego_traj, other_trajs, ego_extent, other_extents
    )

    return {
        'collision_rate': float(jnp.any(collisions_per_step)),
        'collision_count': int(jnp.sum(collisions_per_step)),
        'first_collision_step': int(
            jnp.argmax(collisions_per_step)
        ) if jnp.any(collisions_per_step) else -1,
    }


def compute_time_to_collision(
    ego_traj: jnp.ndarray,
    other_trajs: jnp.ndarray,
    dt: float = 0.1,
) -> Dict[str, float]:
    """
    Compute time-to-collision (TTC) assuming constant velocity.

    TTC at time t = distance / closing_speed
    Only defined when closing_speed > 0 (vehicles approaching).
    """
    T = ego_traj.shape[0]
    N = other_trajs.shape[0]

    ttc_values = []
    for t in range(T):
        ego_pos = ego_traj[t, :2]
        ego_vel = ego_traj[t, 3:5]

        for n in range(N):
            other_pos = other_trajs[n, t, :2]
            other_vel = other_trajs[n, t, 3:5]

            rel_pos = other_pos - ego_pos
            rel_vel = other_vel - ego_vel
            distance = jnp.linalg.norm(rel_pos)
            closing_speed = -jnp.dot(rel_pos, rel_vel) / (distance + 1e-6)

            if closing_speed > 0:
                ttc = distance / closing_speed
                ttc_values.append(float(ttc))

    ttc_array = jnp.array(ttc_values) if ttc_values else jnp.array([float('inf')])

    return {
        'min_ttc': float(jnp.min(ttc_array)),
        'mean_ttc': float(jnp.mean(ttc_array)),
        # Binary per-scenario flag (min TTC below the 1.5 s threshold);
        # averaging it across scenarios yields the near-miss rate defined above.
        'near_miss_rate': float(jnp.min(ttc_array) < 1.5),
    }

4.2 Comfort Metrics

def compute_comfort_metrics(
    ego_traj: jnp.ndarray,
    dt: float = 0.1,
) -> Dict[str, float]:
    """
    Compute ride comfort metrics from ego trajectory.

    ego_traj: (T, 5) with columns [x, y, heading, vx, vy]
    """
    # Velocity magnitude
    speed = jnp.linalg.norm(ego_traj[:, 3:5], axis=-1)

    # Acceleration (finite difference of velocity)
    accel = jnp.diff(ego_traj[:, 3:5], axis=0) / dt
    accel_magnitude = jnp.linalg.norm(accel, axis=-1)

    # Jerk (finite difference of acceleration)
    jerk = jnp.diff(accel, axis=0) / dt
    jerk_magnitude = jnp.linalg.norm(jerk, axis=-1)

    # Lateral acceleration
    # Project acceleration onto the perpendicular of heading
    heading = ego_traj[:-1, 2]  # heading at each accel timestep
    lat_direction = jnp.stack([-jnp.sin(heading), jnp.cos(heading)], axis=-1)
    lat_accel = jnp.abs(jnp.sum(accel * lat_direction, axis=-1))

    # Longitudinal acceleration
    lon_direction = jnp.stack([jnp.cos(heading), jnp.sin(heading)], axis=-1)
    lon_accel = jnp.sum(accel * lon_direction, axis=-1)

    return {
        'mean_jerk': float(jnp.mean(jerk_magnitude)),
        'max_jerk': float(jnp.max(jerk_magnitude)),
        'mean_lat_accel': float(jnp.mean(lat_accel)),
        'max_lat_accel': float(jnp.max(lat_accel)),
        'mean_lon_accel': float(jnp.mean(jnp.abs(lon_accel))),
        'max_decel': float(jnp.min(lon_accel)),  # Most negative = hardest brake
        'speed_std': float(jnp.std(speed)),
    }

4.3 Progress Metrics

def compute_progress_metrics(
    ego_traj: jnp.ndarray,
    logged_traj: jnp.ndarray,
    road_speed_limit: float = 13.4,  # m/s (~30 mph default)
    total_steps: int = 80,
) -> Dict[str, float]:
    """Compute route progress and efficiency metrics."""
    # Distance traveled by ego
    ego_distance = _path_length(ego_traj[:, :2])
    log_distance = _path_length(logged_traj[:, :2])

    # Speed
    ego_speed = jnp.linalg.norm(ego_traj[:, 3:5], axis=-1)

    # Route completion: ratio of ego distance to logged distance
    route_completion = float(ego_distance / (log_distance + 1e-6))

    return {
        'distance_traveled': float(ego_distance),
        'route_completion': min(route_completion, 1.0),
        'mean_speed': float(jnp.mean(ego_speed)),
        'speed_limit_compliance': float(
            jnp.mean(ego_speed <= road_speed_limit * 1.1)
        ),
        'time_stationary': float(jnp.mean(ego_speed < 0.1)),
    }

def _path_length(positions: jnp.ndarray) -> float:
    """Total path length from (T, 2) position array."""
    deltas = jnp.diff(positions, axis=0)
    return float(jnp.sum(jnp.linalg.norm(deltas, axis=-1)))

4.4 Realism Metrics (WOSAC-style)

def compute_realism_metrics(
    ego_traj: jnp.ndarray,
    logged_traj: jnp.ndarray,
    sim_state: datatypes.SimulatorState,
    dt: float = 0.1,
) -> Dict[str, float]:
    """
    Compute WOSAC-inspired realism metrics.

    For full WOSAC compliance, use waymax.metrics directly:
        - waymax.metrics.LogDivergenceMetric
        - waymax.metrics.KinematicsInfeasibilityMetric
        - waymax.metrics.OffroadMetric
    """
    # Log divergence (position)
    pos_divergence = jnp.linalg.norm(
        ego_traj[:, :2] - logged_traj[:, :2], axis=-1
    )

    # Heading divergence
    heading_divergence = jnp.abs(
        _wrap_angle(ego_traj[:, 2] - logged_traj[:, 2])
    )

    # Speed divergence
    ego_speed = jnp.linalg.norm(ego_traj[:, 3:5], axis=-1)
    log_speed = jnp.linalg.norm(logged_traj[:, 3:5], axis=-1)
    speed_divergence = jnp.abs(ego_speed - log_speed)

    # Kinematic infeasibility: check if accelerations exceed vehicle limits
    accel = jnp.diff(ego_traj[:, 3:5], axis=0) / dt
    accel_mag = jnp.linalg.norm(accel, axis=-1)
    MAX_ACCEL = 8.0  # m/s^2 (reasonable vehicle limit)
    kir = float(jnp.mean(accel_mag > MAX_ACCEL))

    return {
        'log_divergence_ade': float(jnp.mean(pos_divergence)),
        'log_divergence_fde': float(pos_divergence[-1]),
        'heading_divergence_mean': float(jnp.mean(heading_divergence)),
        'speed_divergence_mean': float(jnp.mean(speed_divergence)),
        'kinematic_infeasibility_rate': kir,
    }

def _wrap_angle(angle: jnp.ndarray) -> jnp.ndarray:
    """Wrap angle to [-pi, pi]."""
    return (angle + jnp.pi) % (2 * jnp.pi) - jnp.pi

4.5 Unified Metric Runner

def compute_all_metrics(
    ego_traj: jnp.ndarray,
    logged_traj: jnp.ndarray,
    other_trajs: jnp.ndarray,
    sim_state: datatypes.SimulatorState,
    config: 'EvalConfig',
) -> Dict[str, float]:
    """Compute all metrics for a single scenario evaluation."""
    metrics = {}
    metrics.update(compute_collision_rate(
        ego_traj, other_trajs,
        _get_ego_extent(sim_state), _get_other_extents(sim_state)
    ))
    metrics.update(compute_time_to_collision(ego_traj, other_trajs, config.dt))
    metrics.update(compute_comfort_metrics(ego_traj, config.dt))
    metrics.update(compute_progress_metrics(
        ego_traj, logged_traj, total_steps=config.sim_steps
    ))
    metrics.update(compute_realism_metrics(
        ego_traj, logged_traj, sim_state, config.dt
    ))
    return metrics

Step 5: Correlation Analysis (2 hours)

This is the core analytical contribution: quantifying how well open-loop metrics predict closed-loop performance.

5.1 Pair Results by Scenario

# eval/correlation.py
import numpy as np
from scipy import stats
from typing import Dict, List, Tuple

def pair_results(
    open_loop_results: List['OpenLoopResult'],
    closed_loop_results: List['ClosedLoopResult'],
) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]]:
    """
    Pair open-loop and closed-loop results by scenario ID.

    Returns matched arrays for each metric.
    """
    # Build lookup by scenario ID
    ol_by_id = {r.scenario_id: r.metrics for r in open_loop_results}
    cl_by_id = {r.scenario_id: r.metrics for r in closed_loop_results}

    # Find common scenarios
    common_ids = set(ol_by_id.keys()) & set(cl_by_id.keys())
    assert len(common_ids) > 0, "No matching scenarios between OL and CL results"

    ol_metrics = {}
    cl_metrics = {}

    for metric_name in ol_by_id[next(iter(common_ids))].keys():
        ol_metrics[metric_name] = np.array([
            ol_by_id[sid][metric_name] for sid in sorted(common_ids)
        ])

    for metric_name in cl_by_id[next(iter(common_ids))].keys():
        cl_metrics[metric_name] = np.array([
            cl_by_id[sid][metric_name] for sid in sorted(common_ids)
        ])

    return ol_metrics, cl_metrics

5.2 Compute Correlation Matrix

def compute_correlation_matrix(
    ol_metrics: Dict[str, np.ndarray],
    cl_metrics: Dict[str, np.ndarray],
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """
    Compute pairwise correlations between all open-loop
    and all closed-loop metrics.

    Returns nested dict: ol_metric -> cl_metric -> {pearson, spearman, p_value}
    """
    results = {}

    for ol_name, ol_values in ol_metrics.items():
        results[ol_name] = {}
        for cl_name, cl_values in cl_metrics.items():
            # Skip if either is constant (no variance)
            if np.std(ol_values) < 1e-10 or np.std(cl_values) < 1e-10:
                results[ol_name][cl_name] = {
                    'pearson': 0.0, 'spearman': 0.0,
                    'p_pearson': 1.0, 'p_spearman': 1.0,
                }
                continue

            pearson_r, pearson_p = stats.pearsonr(ol_values, cl_values)
            spearman_r, spearman_p = stats.spearmanr(ol_values, cl_values)

            results[ol_name][cl_name] = {
                'pearson': float(pearson_r),
                'spearman': float(spearman_r),
                'p_pearson': float(pearson_p),
                'p_spearman': float(spearman_p),
            }

    return results

5.3 Identify Predictive and Misleading Metrics

def analyze_predictiveness(
    correlation_matrix: Dict,
    target_cl_metrics: List[str] = ['collision_rate', 'off_road', 'route_completion'],
    significance_threshold: float = 0.05,
) -> Dict[str, List[Dict]]:
    """
    Identify which open-loop metrics are most/least predictive
    of key closed-loop outcomes.
    """
    analysis = {}

    for cl_metric in target_cl_metrics:
        predictors = []
        for ol_metric, correlations in correlation_matrix.items():
            if cl_metric in correlations:
                corr = correlations[cl_metric]
                predictors.append({
                    'ol_metric': ol_metric,
                    'spearman': corr['spearman'],
                    'p_value': corr['p_spearman'],
                    'significant': corr['p_spearman'] < significance_threshold,
                    'predictive_quality': _classify_correlation(corr['spearman']),
                })

        # Sort by absolute correlation strength
        predictors.sort(key=lambda x: abs(x['spearman']), reverse=True)
        analysis[cl_metric] = predictors

    return analysis

def _classify_correlation(r: float) -> str:
    """Classify correlation strength."""
    abs_r = abs(r)
    if abs_r >= 0.7:
        return 'strong'
    elif abs_r >= 0.4:
        return 'moderate'
    elif abs_r >= 0.2:
        return 'weak'
    else:
        return 'negligible'
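
A hypothetical end-to-end check of this pipeline on synthetic data: one open-loop metric constructed to correlate with the closed-loop outcome and one that is pure noise. The noisy metric should land in the 'negligible' bucket.

import numpy as np

rng = np.random.default_rng(0)
n = 200
cl_metrics = {'collision_rate': rng.binomial(1, 0.2, size=n).astype(float)}
ol_metrics = {
    'ade': 1.5 * cl_metrics['collision_rate'] + rng.normal(0, 0.5, size=n),
    'fde': rng.normal(0, 1.0, size=n),  # uncorrelated by construction
}

corr = compute_correlation_matrix(ol_metrics, cl_metrics)
report = analyze_predictiveness(corr, target_cl_metrics=['collision_rate'])
for entry in report['collision_rate']:
    print(entry['ol_metric'], round(entry['spearman'], 2),
          entry['predictive_quality'])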

5.4 Bootstrap Confidence Intervals

def bootstrap_correlation(
    x: np.ndarray,
    y: np.ndarray,
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
) -> Dict[str, float]:
    """
    Compute bootstrap confidence interval for Spearman correlation.
    """
    n = len(x)
    rng = np.random.default_rng(42)
    boot_correlations = []

    for _ in range(n_bootstrap):
        indices = rng.choice(n, size=n, replace=True)
        r, _ = stats.spearmanr(x[indices], y[indices])
        boot_correlations.append(r)

    boot_correlations = np.array(boot_correlations)
    alpha = 1 - confidence
    ci_lower = np.percentile(boot_correlations, 100 * alpha / 2)
    ci_upper = np.percentile(boot_correlations, 100 * (1 - alpha / 2))

    return {
        'mean': float(np.mean(boot_correlations)),
        'ci_lower': float(ci_lower),
        'ci_upper': float(ci_upper),
        'std': float(np.std(boot_correlations)),
    }

Step 6: Policy Comparison Framework (2 hours)

Compare multiple policies to demonstrate that open-loop and closed-loop evaluations can produce contradictory rankings.

6.1 Define Baseline Policies

# eval/comparison.py
import jax.numpy as jnp
import numpy as np
from scipy import stats
from waymax import agents

from .open_loop import evaluate_open_loop_batch, aggregate_open_loop_results

def get_baseline_policies():
    """
    Define 2-3 baseline policies for comparison.

    These policies have deliberately different characteristics
    to illustrate rank reversals between open-loop and closed-loop.
    """
    policies = {}

    # Policy 1: Log Replay (perfect in open-loop, baseline in closed-loop)
    # Replays the logged human trajectory. By definition, ADE = 0 in
    # open-loop. In closed-loop, it does not react to deviations.
    policies['log_replay'] = agents.LogPlaybackPolicy()

    # Policy 2: IDM (Intelligent Driver Model)
    # Reactive car-following model. Moderate open-loop error (it doesn't
    # know the exact logged trajectory), but handles closed-loop
    # interactions well because it reacts to the vehicle ahead.
    policies['idm'] = agents.IDMRoutePolicy(
        is_controlled_func=lambda s: s.object_metadata.is_sdc
    )

    # Policy 3: Constant Velocity
    # Maintains current speed and heading. Low jerk (smooth), but
    # cannot handle curves, stops, or other agents. Interesting because
    # it may have decent open-loop metrics on straight roads but
    # catastrophic closed-loop performance on curves.
    policies['constant_velocity'] = ConstantVelocityPolicy()

    return policies


class ConstantVelocityPolicy:
    """Simple policy that maintains the current velocity vector."""

    def select_action(self, observation, timestep):
        # Zero acceleration and zero steering under the bicycle dynamics model
        return jnp.zeros(2)  # [accel, steering]

    # Allow the instance to be used directly as policy_fn(observation, t),
    # matching the interface expected by the evaluators in Steps 2-3.
    __call__ = select_action

6.2 Run All Policies Through Both Modes

def compare_policies(
    scenarios: list,
    policies: dict,
    config: 'EvalConfig',
) -> Dict[str, Dict[str, Any]]:
    """
    Run every policy through both evaluation modes on the same scenarios.

    Returns nested dict: policy_name -> mode -> aggregated_metrics
    """
    results = {}

    for policy_name, policy in policies.items():
        print(f"Evaluating policy: {policy_name}")

        # Open-loop evaluation
        ol_results = evaluate_open_loop_batch(scenarios, policy, config)
        ol_aggregated = aggregate_open_loop_results(ol_results)

        # Closed-loop evaluation (the batch and aggregation helpers mirror
        # the open-loop versions from Step 2)
        cl_results = evaluate_closed_loop_batch(scenarios, policy, config)
        cl_aggregated = aggregate_closed_loop_results(cl_results)

        results[policy_name] = {
            'open_loop': ol_aggregated,
            'closed_loop': cl_aggregated,
            'ol_raw': ol_results,
            'cl_raw': cl_results,
        }

    return results

6.3 Detect Rank Reversals

def detect_rank_reversals(
    comparison_results: Dict[str, Dict[str, Any]],
    metrics_to_check: List[str],
) -> List[Dict]:
    """
    Find metrics where open-loop and closed-loop disagree
    on which policy is better.

    A rank reversal occurs when Policy A beats Policy B in
    open-loop but Policy B beats Policy A in closed-loop
    (or vice versa) on a given metric.
    """
    reversals = []
    policy_names = list(comparison_results.keys())

    for metric in metrics_to_check:
        # Get rankings in each mode
        ol_scores = {}
        cl_scores = {}
        for name in policy_names:
            ol_data = comparison_results[name]['open_loop']
            cl_data = comparison_results[name]['closed_loop']
            if metric in ol_data:
                ol_scores[name] = ol_data[metric]['mean']
            if metric in cl_data:
                cl_scores[name] = cl_data[metric]['mean']

        if not ol_scores or not cl_scores:
            continue

        # Rank policies assuming lower is better (true for error metrics);
        # for higher-is-better metrics like route_completion, negate the
        # scores before ranking.
        ol_ranking = sorted(ol_scores, key=ol_scores.get)
        cl_ranking = sorted(cl_scores, key=cl_scores.get)

        if ol_ranking != cl_ranking:
            reversals.append({
                'metric': metric,
                'open_loop_ranking': ol_ranking,
                'closed_loop_ranking': cl_ranking,
                'ol_scores': ol_scores,
                'cl_scores': cl_scores,
            })

    return reversals

6.4 Statistical Significance of Differences

def test_policy_difference(
    results_a: List,
    results_b: List,
    metric_name: str,
) -> Dict[str, float]:
    """
    Paired Wilcoxon signed-rank test for comparing two policies
    on the same set of scenarios.
    """
    values_a = np.array([r.metrics[metric_name] for r in results_a])
    values_b = np.array([r.metrics[metric_name] for r in results_b])

    # Paired test (same scenarios)
    stat, p_value = stats.wilcoxon(values_a, values_b, alternative='two-sided')

    # Effect size (rank-biserial correlation)
    n = len(values_a)
    effect_size = 1 - (2 * stat) / (n * (n + 1) / 2)

    return {
        'statistic': float(stat),
        'p_value': float(p_value),
        'effect_size': float(effect_size),
        'significant': p_value < 0.05,
        'mean_a': float(np.mean(values_a)),
        'mean_b': float(np.mean(values_b)),
    }

Step 7: Evaluation Dashboard (1.5 hours)

Build visualizations that communicate your findings clearly.

7.1 Summary Comparison Table

# eval/dashboard.py
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats  # linregress for the scatter-plot regression lines
from typing import Dict

def plot_summary_table(
    comparison_results: Dict,
    metrics: list,
    output_path: str = 'results/summary_table.png',
):
    """
    Create a policy x metric x mode summary table as a heatmap.
    """
    policies = list(comparison_results.keys())
    modes = ['open_loop', 'closed_loop']

    # Build data for display
    rows = []
    row_labels = []
    for policy in policies:
        for mode in modes:
            row_labels.append(f"{policy} ({mode[:2].upper()})")
            row = []
            for metric in metrics:
                data = comparison_results[policy][mode]
                if metric in data:
                    row.append(data[metric]['mean'])
                else:
                    row.append(np.nan)
            rows.append(row)

    data_array = np.array(rows)

    fig, ax = plt.subplots(figsize=(14, len(rows) * 0.8 + 2))
    sns.heatmap(
        data_array, annot=True, fmt='.3f',
        xticklabels=metrics, yticklabels=row_labels,
        cmap='RdYlGn_r', ax=ax
    )
    ax.set_title('Policy Evaluation Summary: Open-Loop vs Closed-Loop')
    plt.tight_layout()
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

7.2 Radar Plot for Policy Comparison

def plot_radar_comparison(
    comparison_results: Dict,
    metrics: list,
    mode: str = 'closed_loop',
    output_path: str = 'results/radar_comparison.png',
):
    """
    Radar plot comparing policies across metric dimensions.
    Normalize each metric to [0, 1] range for comparability.
    """
    policies = list(comparison_results.keys())
    N = len(metrics)

    # Compute angles
    angles = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()
    angles += angles[:1]  # Close the polygon

    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

    for policy in policies:
        values = []
        for metric in metrics:
            data = comparison_results[policy][mode]
            values.append(data.get(metric, {}).get('mean', 0))

        # Normalize to [0, 1]
        values = _normalize_radar_values(values, metrics, comparison_results, mode)
        values += values[:1]

        ax.plot(angles, values, 'o-', linewidth=2, label=policy)
        ax.fill(angles, values, alpha=0.1)

    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics, size=8)
    ax.set_title(f'Policy Comparison ({mode.replace("_", " ").title()})')
    ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
    plt.tight_layout()
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

7.3 Correlation Heatmap

def plot_correlation_heatmap(
    correlation_matrix: Dict,
    ol_metrics: list,
    cl_metrics: list,
    output_path: str = 'results/correlation_heatmap.png',
):
    """
    Heatmap showing Spearman correlation between each open-loop
    metric (rows) and each closed-loop metric (columns).

    This is the signature visualization for this project:
    it reveals which OL metrics predict CL outcomes and which don't.
    """
    data = np.zeros((len(ol_metrics), len(cl_metrics)))

    for i, ol in enumerate(ol_metrics):
        for j, cl in enumerate(cl_metrics):
            if ol in correlation_matrix and cl in correlation_matrix[ol]:
                data[i, j] = correlation_matrix[ol][cl]['spearman']

    fig, ax = plt.subplots(figsize=(12, 8))
    sns.heatmap(
        data, annot=True, fmt='.2f',
        xticklabels=cl_metrics, yticklabels=ol_metrics,
        cmap='RdBu_r', center=0, vmin=-1, vmax=1,
        ax=ax,
        cbar_kws={'label': 'Spearman Correlation'}
    )
    ax.set_xlabel('Closed-Loop Metrics')
    ax.set_ylabel('Open-Loop Metrics')
    ax.set_title('Open-Loop vs Closed-Loop Metric Correlation')
    plt.tight_layout()
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

7.4 Scatter Plots with Regression

def plot_metric_scatter(
    ol_values: np.ndarray,
    cl_values: np.ndarray,
    ol_name: str,
    cl_name: str,
    output_path: str,
):
    """
    Scatter plot of open-loop metric vs closed-loop metric
    across scenarios, with regression line and confidence band.
    """
    fig, ax = plt.subplots(figsize=(8, 6))

    ax.scatter(ol_values, cl_values, alpha=0.5, s=20, edgecolors='none')

    # Regression line
    if np.std(ol_values) > 0 and np.std(cl_values) > 0:
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            ol_values, cl_values
        )
        x_line = np.linspace(ol_values.min(), ol_values.max(), 100)
        y_line = slope * x_line + intercept
        ax.plot(x_line, y_line, 'r-', linewidth=2,
                label=f'r={r_value:.3f}, p={p_value:.2e}')

    ax.set_xlabel(f'Open-Loop: {ol_name}')
    ax.set_ylabel(f'Closed-Loop: {cl_name}')
    ax.set_title(f'{ol_name} vs {cl_name}')
    ax.legend()
    plt.tight_layout()
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

7.5 HTML Report Export

def export_html_report(
    comparison_results: Dict,
    correlation_analysis: Dict,
    rank_reversals: List[Dict],
    output_path: str = 'results/evaluation_report.html',
):
    """
    Export a self-contained HTML report with all figures and tables.
    Embed images as base64 for portability.
    """
    import base64
    from pathlib import Path

    html_parts = [
        '<html><head><title>Closed-Loop Evaluation Report</title>',
        '<style>',
        'body { font-family: Arial, sans-serif; max-width: 1200px; margin: auto; }',
        'table { border-collapse: collapse; width: 100%; margin: 20px 0; }',
        'th, td { border: 1px solid #ddd; padding: 8px; text-align: center; }',
        'th { background-color: #4a90d9; color: white; }',
        'img { max-width: 100%; }',
        '.reversal { background-color: #fff3cd; padding: 10px; margin: 10px 0; }',
        '</style></head><body>',
        '<h1>Closed-Loop Evaluation Report</h1>',
    ]

    # Embed each generated figure
    for img_name in ['summary_table.png', 'radar_comparison.png',
                     'correlation_heatmap.png']:
        img_path = Path('results') / img_name
        if img_path.exists():
            with open(img_path, 'rb') as f:
                b64 = base64.b64encode(f.read()).decode()
            html_parts.append(f'<h2>{img_name.replace("_", " ").title()}</h2>')
            html_parts.append(f'<img src="data:image/png;base64,{b64}" />')

    # Rank reversals section
    if rank_reversals:
        html_parts.append('<h2>Rank Reversals</h2>')
        for rev in rank_reversals:
            html_parts.append(f'<div class="reversal">')
            html_parts.append(f'<strong>Metric: {rev["metric"]}</strong><br>')
            html_parts.append(f'OL ranking: {rev["open_loop_ranking"]}<br>')
            html_parts.append(f'CL ranking: {rev["closed_loop_ranking"]}')
            html_parts.append('</div>')

    html_parts.append('</body></html>')

    with open(output_path, 'w') as f:
        f.write('\n'.join(html_parts))

Notebook Exercises

  1. 01_open_vs_closed_loop.ipynb (45 min): Run the same scenario in both modes using Waymax. Visualize the ego trajectory divergence over time. Observe how a small initial deviation at t=5 becomes a large deviation by t=40.
  2. 02_driving_metrics.ipynb (60 min): Implement the full metric suite (safety, comfort, progress, realism). Apply each metric to hand-crafted trajectories first (e.g., a trajectory that collides, one that goes off-road) to verify correctness before running on real data.
  3. 03_correlation_analysis.ipynb (60 min): Run 50+ scenarios through both modes. Compute the full correlation matrix. Identify which open-loop metrics are predictive of collision rate and which are not. Create the correlation heatmap.
  4. 04_evaluation_harness.ipynb (60 min): Full pipeline: load scenarios, evaluate 3 policies in both modes, detect rank reversals, generate the HTML dashboard. Write a 1-page summary of findings.

Expected Deliverables

  1. Evaluation harness Python module (eval/): A clean, importable package with separate modules for open-loop evaluation, closed-loop evaluation, metric computation, correlation analysis, policy comparison, and dashboard generation.

  2. Metric computation suite (eval/metrics.py): Implementations of at least 15 distinct metrics across safety, comfort, progress, and realism dimensions, each tested against known inputs.

  3. Correlation analysis report: A statistical analysis showing Pearson and Spearman correlations between all open-loop and closed-loop metric pairs, with confidence intervals, significance tests, and a clear narrative identifying which open-loop metrics are trustworthy and which are misleading.

  4. Policy comparison dashboard: An HTML report containing the summary table, radar plots, correlation heatmap, scatter plots, and rank reversal analysis for at least 3 baseline policies evaluated across at least 50 scenarios.

  5. Written analysis of findings (1-2 pages within the final notebook): Interpretation of the correlation gap, discussion of which open-loop metrics are most predictive, explanation of observed rank reversals, and practical recommendations for when open-loop evaluation is sufficient vs when closed-loop evaluation is necessary.


Evaluation Criteria

  • Evaluation Correctness (30%): Metrics are computed correctly in both open-loop and closed-loop modes. Collision detection handles rotated bounding boxes. Comfort metrics use proper coordinate projections. WOSAC-style metrics follow the published specification.
  • Analysis Quality (25%): Correlation analysis uses appropriate statistical methods (Spearman for non-linear relationships, bootstrap for confidence intervals, paired tests for policy comparison). Findings are well-supported by evidence and clearly communicated.
  • Code Architecture (20%): Clean separation of concerns between simulation, metrics, analysis, and visualization. Functions are composable and reusable. Configuration is externalized. Code runs reproducibly given the same data and config.
  • Visualization (15%): Dashboard communicates findings effectively. Correlation heatmap is readable. Scatter plots include regression statistics. Radar plots enable quick comparison. Figures have proper labels, legends, and titles.
  • Documentation (10%): README explains how to run the evaluation. Each module has docstrings. Notebook markdown cells explain the reasoning behind each step. Results are interpretable by someone who did not write the code.

Resources

  • Waymax Deep Dive: Comprehensive guide to Waymax architecture, the simulation loop, data model, and APIs. Essential reference for understanding how PlanningAgentEnvironment, SimulatorState, and the dynamics models work under the hood.

  • WOSAC Challenge Deep Dive: Detailed coverage of the Waymo Open Sim Agents Challenge including the realism meta-metric, log-divergence computation, kinematic infeasibility rate, and how the challenge evaluates multi-agent simulation fidelity. Directly informs the realism metrics in Step 4 and the statistical methodology in Step 5.


Next Steps

After completing this project, consider these follow-up projects that build directly on your evaluation harness:

  • Sim Agent with Realism Metrics: Use your closed-loop evaluation framework to train and evaluate a learned sim agent (e.g., a behavior cloning policy or a diffusion-based trajectory generator) that optimizes for WOSAC realism scores. Your metric suite becomes the reward signal.

  • Adversarial Scenario Generator: Extend your harness to identify scenarios where a given policy fails in closed-loop but succeeds in open-loop. Then build a scenario perturbation system that generates adversarial variants (modified agent trajectories, adjusted timing, shifted initial conditions) to stress-test policies at the boundary of their capability.

  • Multi-Agent Closed-Loop Evaluation: Scale from single-ego evaluation to full multi-agent evaluation where every agent in the scene is controlled by a policy. This requires extending your metric computation to handle N-agent interactions and adapting your correlation analysis to multi-agent realism scores.