Closed-Loop Evaluation System
Build an evaluation harness that exposes the dangerous gap between open-loop metrics and closed-loop reality -- the gap that separates simulation theater from safety-critical engineering.
Overview
In this project you will build a complete evaluation harness that runs the same autonomous driving policy in two fundamentally different simulation modes -- open-loop replay and closed-loop reactive simulation -- and then rigorously quantifies how well one predicts the other. The system takes a set of WOMD scenarios and a driving policy as input, executes evaluation in both modes using Waymax, computes a comprehensive suite of driving quality metrics, and produces a statistical analysis of the correlation gap between the two evaluation regimes.
This matters because the autonomous driving industry has repeatedly discovered that open-loop metrics (Average Displacement Error, Final Displacement Error, miss rate) are unreliable proxies for real-world driving performance. A policy that achieves state-of-the-art ADE on a motion forecasting benchmark can perform catastrophically in closed-loop deployment because open-loop evaluation ignores compounding errors, distributional shift, and multi-agent interaction effects. Closed-loop evaluation -- where all agents react to each other in real time -- is the gold standard for validating AD systems, and understanding why open-loop metrics fail is essential knowledge for anyone building or evaluating driving policies.
Your deliverable is a modular Python evaluation framework consisting of: (1) open-loop and closed-loop evaluator modules built on Waymax, (2) a metric computation suite covering safety, comfort, progress, and realism dimensions, (3) a correlation analysis pipeline that quantifies which open-loop metrics are predictive and which are misleading, (4) a policy comparison framework that demonstrates rank reversals across evaluation modes, and (5) a visual dashboard summarizing all findings. The entire system should be reproducible, well-documented, and extensible to new policies and metrics.
Learning Objectives
By completing this project, you will be able to:
- Understand the structural differences between open-loop and closed-loop simulation and articulate why this distinction is the most important axis in AD evaluation design.
- Implement evaluation protocols in Waymax using both waymax.env.PlanningAgentEnvironment for closed-loop and direct trajectory replay for open-loop, with proper scenario lifecycle management.
- Compute a comprehensive suite of driving quality metrics including collision rate, off-road rate, route progress, jerk, lateral acceleration, time-to-collision, and WOSAC-style log-divergence measures.
- Build a reproducible evaluation harness with clean separation between simulation execution, metric computation, aggregation, and reporting, following software engineering best practices.
- Perform statistical analysis of correlation gaps using Pearson and Spearman correlations, paired significance tests, and confidence intervals to determine which open-loop metrics are predictive of closed-loop outcomes.
- Create clear visualizations of evaluation results including radar plots, correlation heatmaps, scatter plots with regression lines, and per-scenario failure breakdowns that communicate findings to both technical and non-technical audiences.
Prerequisites
- Required: Python proficiency (comfortable with NumPy, dataclasses, and module organization), familiarity with Waymax basics (loading scenarios, stepping simulations, accessing state), and a conceptual understanding of simulation in the AD context.
- Recommended: JAX fundamentals (pure functions, jax.jit, jax.vmap), basic statistics (correlation coefficients, hypothesis testing, confidence intervals), and experience with Matplotlib or a similar plotting library.
- Deep Dive Reading:
- Waymax Deep Dive -- Comprehensive guide to Waymax architecture, the simulation loop, and the data model. Essential background for Steps 2-4.
- WOSAC Challenge Deep Dive -- Detailed coverage of realism metrics, evaluation protocols, and the WOSAC benchmark. Directly relevant to the metric computation and correlation analysis portions of this project.
Key Concepts
Open-Loop vs Closed-Loop Simulation
The distinction between open-loop and closed-loop simulation is the single most important concept in AD evaluation.
In open-loop evaluation, the ego vehicle's planned trajectory is compared against recorded ground-truth trajectories, but the simulation never actually executes those plans. Other agents follow their logged trajectories regardless of what the ego vehicle does. This is equivalent to asking: "Given this exact traffic situation at time t, what would the policy do?" -- but never asking the follow-up question: "And then what happens next, given that the ego actually moved?"
Open-Loop (Replay):
t=0: Observe logged state -> Policy predicts trajectory -> Compare to logged trajectory
t=1: Observe LOGGED state (not affected by policy) -> Predict -> Compare
t=2: Observe LOGGED state -> Predict -> Compare
...
Result: ADE = 1.2m, FDE = 2.8m (looks good!)
In closed-loop evaluation, the ego vehicle's actions are actually executed in the simulator, the world state updates accordingly, and other agents react (or at minimum, the ego must deal with the consequences of its own prior actions). This creates a feedback loop where small errors compound over time.
Closed-Loop (Reactive):
t=0: Observe state -> Policy acts -> World updates (all agents react)
t=1: Observe NEW state (affected by policy) -> Policy acts -> World updates
t=2: Observe NEW state (errors compound) -> Policy acts -> World updates
...
Result: Collision at t=14, off-road at t=22 (reality check!)
In Waymax, you control this distinction through the environment configuration. The waymax.env.PlanningAgentEnvironment class supports both modes: you can configure which agents are controlled by the policy under test and which follow their logged trajectories. For full closed-loop evaluation, non-ego agents can be driven by reactive models like the Intelligent Driver Model (IDM) via waymax.agents.IDMRoutePolicy.
The Correlation Gap
Research across multiple groups (Waymo, nuPlan, Motional) has consistently found that open-loop metrics are poor predictors of closed-loop performance. A 2023 analysis on the nuPlan benchmark showed that the Spearman rank correlation between open-loop ADE and closed-loop collision rate was as low as 0.15 -- barely better than random.
The reasons are structural:
- Compounding errors: In open-loop, each prediction starts from the ground-truth state. In closed-loop, errors at time t propagate to time t+1. A policy with 0.5m ADE per step can drift meters off course over a 5-second rollout.
- Distributional shift: The policy was trained (or the logged trajectory was recorded) under a specific distribution of states. Once the ego deviates from the logged path, it encounters states that may be out of distribution, causing cascading failures.
- Interaction effects: Open-loop evaluation cannot capture how other agents respond to the ego's behavior. A lane change that looks smooth in open-loop might force another vehicle to brake hard in closed-loop, creating a dangerous situation that open-loop metrics cannot detect.
- Metric insensitivity: Open-loop metrics like ADE treat all displacement errors equally. A 2m error that places the ego in an adjacent (empty) lane scores the same as a 2m error that places the ego inside another vehicle. Closed-loop metrics naturally capture this distinction through collision detection.
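The compounding-errors point can be made concrete with a toy random-walk model. This is an illustrative sketch with made-up numbers, not a Waymax computation: the same per-step error that stays bounded in open-loop accumulates into a large drift in closed-loop.

```python
import numpy as np

def simulate_error_growth(num_steps=50, step_error=0.5, seed=0):
    """Toy illustration of error compounding (illustrative numbers only).

    Open-loop: each step's error is measured from the ground-truth state,
    so errors never accumulate -- the per-step ADE stays ~step_error.
    Closed-loop: each step starts from the previous (erroneous) state,
    so per-step errors accumulate as a random walk.
    """
    rng = np.random.default_rng(seed)
    # Random 2D error vectors with magnitude ~step_error at each step
    errors = rng.normal(scale=step_error / np.sqrt(2), size=(num_steps, 2))
    open_loop_ade = float(np.mean(np.linalg.norm(errors, axis=-1)))
    # Closed-loop drift: the position offset is the running sum of errors
    drift = np.cumsum(errors, axis=0)
    closed_loop_fde = float(np.linalg.norm(drift[-1]))
    return open_loop_ade, closed_loop_fde
```

Averaged over many seeds, the closed-loop final displacement grows roughly with the square root of the rollout length while the open-loop ADE stays flat, which is exactly the gap the harness is designed to measure.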
Driving Quality Metrics
A comprehensive evaluation requires metrics across multiple dimensions:
Safety Metrics measure how dangerous the driving behavior is:
- Collision rate: Fraction of scenarios where the ego overlaps with another object (vehicle, pedestrian, cyclist). In Waymax, check overlap using waymax.metrics.OverlapMetric or compute bounding-box intersection directly from sim_state.sim_trajectory.
- Time-to-collision (TTC): Minimum time until a collision would occur given current velocities. Lower values indicate more dangerous situations even if no collision occurs.
- Near-miss rate: Fraction of scenarios where TTC drops below a threshold (e.g., 1.5 seconds) without actual collision.
Comfort Metrics measure ride quality and human acceptability:
- Longitudinal jerk: Rate of change of acceleration. High jerk indicates abrupt, uncomfortable driving. Compute as the finite difference of acceleration over the trajectory.
- Lateral acceleration: Acceleration perpendicular to the heading. High values indicate aggressive turning.
- Lane deviation: Distance from lane center, measuring how well the ego tracks its intended lane.
Progress Metrics measure whether the ego accomplishes its driving objective:
- Route completion rate: Fraction of the planned route that is completed before the scenario ends or the ego fails.
- Speed maintenance: How close the ego's speed is to the road's speed limit or the flow of traffic.
- Time efficiency: Ratio of actual travel time to optimal travel time.
Realism Metrics measure how human-like the driving behavior is:
- Log-divergence: Distance between the ego's trajectory and the logged (human) trajectory, weighted by time. This is the core WOSAC metric.
- Kinematic Infeasibility Rate (KIR): Fraction of timesteps where the ego's state transition violates kinematic constraints (e.g., exceeds maximum steering rate or acceleration). Waymax provides waymax.metrics.KinematicsInfeasibilityMetric for this.
- Speed distribution divergence: KL-divergence between the ego's speed distribution and the distribution in the logged data.
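A minimal sketch of the speed distribution divergence metric, using a histogram-based KL estimate. The bin count and smoothing constant are illustrative choices, not values prescribed by WOSAC:

```python
import numpy as np

def speed_kl_divergence(ego_speeds, log_speeds, num_bins=20, eps=1e-6):
    """Histogram-based estimate of KL(ego || log) over speed distributions.

    Bin edges are shared between the two histograms so the bins are
    comparable; `eps` smoothing avoids log(0) in empty bins.
    """
    ego_speeds = np.asarray(ego_speeds)
    log_speeds = np.asarray(log_speeds)
    lo = min(ego_speeds.min(), log_speeds.min())
    hi = max(ego_speeds.max(), log_speeds.max())
    edges = np.linspace(lo, hi, num_bins + 1)
    p, _ = np.histogram(ego_speeds, bins=edges)
    q, _ = np.histogram(log_speeds, bins=edges)
    # Smooth and normalize to probability distributions
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

Identical distributions give a divergence of zero, and the value grows as the ego's speed profile drifts away from the logged data.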
Waymax Evaluation Framework
Waymax provides the building blocks for both evaluation modes through a consistent API:
- waymax.datatypes.SimulatorState: The central state object containing trajectories, validity masks, and metadata for all agents across all timesteps. This is your primary data source for metric computation.
- waymax.env.PlanningAgentEnvironment: Wraps the Waymax dynamics model into a standard environment interface with reset(), step(), and observe() methods. Configure controlled_agents to specify which agents are driven by the policy under test.
- waymax.metrics: Module containing built-in metric functions including OverlapMetric, OffroadMetric, KinematicsInfeasibilityMetric, and LogDivergenceMetric. Each operates on SimulatorState and returns per-agent, per-timestep values.
- waymax.agents: Module containing baseline agents. IDMRoutePolicy implements the Intelligent Driver Model for reactive non-ego agents. LogPlaybackPolicy replays logged trajectories.
- waymax.dynamics: Dynamics models (InvertibleBicycleModel, DeltaLocal, DeltaGlobal) that govern how actions translate to state transitions. The choice of dynamics model affects which kinematic constraints are enforced.
For batch evaluation across many scenarios, use jax.vmap over the scenario dimension to evaluate hundreds of scenarios in parallel on GPU/TPU, achieving the throughput needed for statistically meaningful results.
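As a toy illustration of the vmap pattern -- shown here on a standalone ADE function over stacked trajectory arrays, not a full Waymax rollout (in practice you would vmap the environment step and policy over a batched SimulatorState):

```python
import jax
import jax.numpy as jnp

def ade(pred, logged):
    """Mean displacement error for one scenario's (T, 2) position arrays."""
    return jnp.mean(jnp.linalg.norm(pred - logged, axis=-1))

# Vectorize over a leading scenario dimension: (B, T, 2) -> (B,)
# and JIT-compile the batched function for GPU/TPU throughput.
batched_ade = jax.jit(jax.vmap(ade))
```

The same transformation applies to any per-scenario metric that is written as a pure function of arrays, which is one reason to keep metric computation free of Python-side control flow.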
Statistical Evaluation
Evaluating a driving policy on a single scenario is meaningless. Robust evaluation requires:
- Multiple scenarios: Evaluate across hundreds or thousands of WOMD scenarios to capture the diversity of driving situations (highway merges, unprotected left turns, pedestrian crossings, etc.).
- Confidence intervals: Report metrics with bootstrap confidence intervals rather than point estimates. A collision rate of 3.2% +/- 0.8% communicates uncertainty that a bare number does not.
- Paired tests: When comparing two policies, use paired statistical tests (Wilcoxon signed-rank or paired t-test) that account for scenario-level pairing. Policy A and Policy B evaluated on the same scenario are not independent samples.
- Effect size: Beyond p-values, report effect sizes (Cohen's d or rank-biserial correlation) to distinguish statistically significant but practically meaningless differences from genuinely important ones.
- Multiple comparisons: When comparing many metrics across many policies, apply corrections (Bonferroni or Benjamini-Hochberg) to control false discovery rate.
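The confidence-interval and paired-test machinery above can be sketched with NumPy and SciPy. The function names are hypothetical helpers for your analysis module, not part of any library:

```python
import numpy as np
from scipy import stats

def bootstrap_ci(values, num_samples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of a per-scenario metric."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample scenarios with replacement and record the mean each time
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(num_samples)
    ])
    alpha = (1.0 - confidence) / 2.0
    return float(np.quantile(means, alpha)), float(np.quantile(means, 1 - alpha))

def compare_policies_paired(metric_a, metric_b):
    """Paired Wilcoxon signed-rank test on per-scenario metric values.

    The pairing matters: both policies were evaluated on the same
    scenarios, so an unpaired test would waste that structure.
    """
    statistic, p_value = stats.wilcoxon(metric_a, metric_b)
    return float(statistic), float(p_value)
```

Report the CI alongside the point estimate (e.g. "collision rate 3.2%, 95% CI [2.4%, 4.0%]") and apply a multiple-comparisons correction before declaring any of the pairwise differences significant.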
Step-by-Step Implementation Guide
Step 1: Environment Setup (45 min)
Set up the project structure and verify that all dependencies work correctly.
1.1 Project Directory Structure
closed-loop-eval/
├── eval/
│ ├── __init__.py
│ ├── open_loop.py # Open-loop evaluator
│ ├── closed_loop.py # Closed-loop evaluator
│ ├── metrics.py # Metric computation suite
│ ├── correlation.py # Correlation analysis
│ ├── comparison.py # Policy comparison framework
│ └── dashboard.py # Visualization and reporting
├── configs/
│ └── eval_config.py # Evaluation configuration dataclass
├── notebooks/
│ ├── 01_open_vs_closed_loop.ipynb
│ ├── 02_driving_metrics.ipynb
│ ├── 03_correlation_analysis.ipynb
│ └── 04_evaluation_harness.ipynb
├── results/ # Output directory for evaluation results
├── requirements.txt
└── run_eval.py # Main entry point
1.2 Install Dependencies
python -m venv .venv
source .venv/bin/activate
pip install jax jaxlib
pip install waymax # or install from source per Waymax docs
pip install numpy matplotlib scipy pandas seaborn
1.3 Verify Setup
import jax
import jax.numpy as jnp
from waymax import config as waymax_config
from waymax import dataloader
from waymax import datatypes
from waymax import dynamics
from waymax import env as waymax_env
from waymax import agents
print(f"JAX devices: {jax.devices()}")
print(f"JAX version: {jax.__version__}")
# Load a single scenario to verify data pipeline
data_config = waymax_config.DatasetConfig(
path='path/to/womd/training',
max_num_objects=32,
data_format=waymax_config.DataFormat.WOMD
)
data_iter = dataloader.simulator_state_generator(config=data_config)
scenario = next(data_iter)
print(f"Scenario loaded: {scenario.shape}")
print(f"Number of objects: {scenario.log_trajectory.shape[-2]}")
print(f"Number of timesteps: {scenario.log_trajectory.shape[-1]}")
1.4 Define Evaluation Configuration
# configs/eval_config.py
from dataclasses import dataclass, field
from typing import List, Optional
@dataclass
class EvalConfig:
"""Configuration for the evaluation harness."""
# Data
data_path: str = 'path/to/womd/validation'
max_num_objects: int = 32
num_scenarios: int = 100 # Number of scenarios to evaluate
# Simulation
sim_steps: int = 80 # 8 seconds at 10 Hz
dt: float = 0.1
# Metrics
ttc_threshold: float = 1.5 # seconds, for near-miss detection
max_jerk: float = 10.0 # m/s^3, comfort threshold
max_lat_accel: float = 4.0 # m/s^2, comfort threshold
# Analysis
confidence_level: float = 0.95
bootstrap_samples: int = 1000
# Output
results_dir: str = 'results/'
export_html: bool = True
Step 2: Implement Open-Loop Evaluator (2 hours)
The open-loop evaluator runs a driving policy against logged scenarios where other agents are unaffected by the ego's actions and the ego always observes ground-truth states.
2.1 Core Open-Loop Evaluation Loop
# eval/open_loop.py
import jax
import jax.numpy as jnp
from waymax import datatypes
from waymax import dynamics
from typing import Dict, Any, Callable, List
from dataclasses import dataclass
@dataclass
class OpenLoopResult:
"""Results from a single open-loop evaluation."""
scenario_id: str
ego_trajectory: jnp.ndarray # (T, 5) - x, y, heading, vx, vy
logged_trajectory: jnp.ndarray # (T, 5) - ground-truth
other_trajectories: jnp.ndarray # (N, T, 5) - other agents (from log)
metrics: Dict[str, float]
valid_mask: jnp.ndarray # (T,) - which timesteps are valid
def evaluate_open_loop(
scenario: datatypes.SimulatorState,
policy_fn: Callable,
config: 'EvalConfig'
) -> OpenLoopResult:
"""
Evaluate a policy in open-loop mode.
In open-loop, at each timestep:
1. The policy observes the LOGGED state (ground truth)
2. The policy produces an action or trajectory
3. We record the policy's output but do NOT update the sim state
4. We move to the next LOGGED timestep
This means the policy always sees perfect information and never
has to deal with the consequences of its own actions.
"""
num_steps = config.sim_steps
ego_idx = scenario.object_metadata.is_sdc # Boolean mask for ego
ego_predictions = []
for t in range(num_steps):
# Policy observes the ground-truth logged state at time t
observation = _extract_observation(scenario, t)
# Policy predicts action or trajectory
action = policy_fn(observation, t)
ego_predictions.append(action)
ego_trajectory = jnp.stack(ego_predictions, axis=0)
logged_trajectory = scenario.log_trajectory[ego_idx]
# Compute open-loop metrics (ADE, FDE, miss rate, etc.)
metrics = _compute_open_loop_metrics(
ego_trajectory, logged_trajectory, scenario, config
)
return OpenLoopResult(
scenario_id=_get_scenario_id(scenario),
ego_trajectory=ego_trajectory,
logged_trajectory=logged_trajectory,
other_trajectories=scenario.log_trajectory[~ego_idx],
metrics=metrics,
valid_mask=scenario.log_trajectory.valid[ego_idx, :num_steps]
)
2.2 Open-Loop Metric Computation
def _compute_open_loop_metrics(
predicted: jnp.ndarray,
logged: jnp.ndarray,
scenario: datatypes.SimulatorState,
config: 'EvalConfig'
) -> Dict[str, float]:
"""Compute standard open-loop metrics."""
# Position displacement at each timestep
displacement = jnp.linalg.norm(
predicted[:, :2] - logged[:, :2], axis=-1
)
metrics = {
# Average Displacement Error
'ade': float(jnp.mean(displacement)),
# Final Displacement Error
'fde': float(displacement[-1]),
# Miss rate (FDE > 2m threshold)
'miss_rate_2m': float(displacement[-1] > 2.0),
# Heading error (radians)
'heading_error_mean': float(jnp.mean(
jnp.abs(_wrap_angle(predicted[:, 2] - logged[:, 2]))
)),
# Speed error
'speed_error_mean': float(jnp.mean(jnp.abs(
jnp.linalg.norm(predicted[:, 3:5], axis=-1) -
jnp.linalg.norm(logged[:, 3:5], axis=-1)
))),
}
return metrics
2.3 Batch Open-Loop Evaluation
def evaluate_open_loop_batch(
scenarios: List[datatypes.SimulatorState],
policy_fn: Callable,
config: 'EvalConfig'
) -> List[OpenLoopResult]:
"""Evaluate a policy across multiple scenarios in open-loop."""
results = []
for scenario in scenarios:
result = evaluate_open_loop(scenario, policy_fn, config)
results.append(result)
return results
def aggregate_open_loop_results(
results: List[OpenLoopResult]
) -> Dict[str, Any]:
"""Aggregate open-loop results across scenarios."""
all_metrics = {}
for key in results[0].metrics.keys():
values = [r.metrics[key] for r in results]
all_metrics[key] = {
'mean': float(jnp.mean(jnp.array(values))),
'std': float(jnp.std(jnp.array(values))),
'median': float(jnp.median(jnp.array(values))),
'values': values, # Keep per-scenario values for correlation
}
return all_metrics
Key implementation notes for Step 2:
- The logged replay baseline (using waymax.agents.LogPlaybackPolicy) should produce near-zero open-loop error by construction -- use this as a sanity check.
- Store per-scenario results with scenario IDs so you can pair them with closed-loop results in Step 5.
- Consider using jax.vmap over the scenario batch dimension if scenarios are stacked into a single array, but be mindful of memory for large batches.
Step 3: Implement Closed-Loop Evaluator (2.5 hours)
The closed-loop evaluator runs the same policy but in reactive simulation where the ego's actions affect the world state and (optionally) other agents react.
3.1 Core Closed-Loop Evaluation Loop
# eval/closed_loop.py
import jax
import jax.numpy as jnp
from waymax import config as waymax_config
from waymax import datatypes
from waymax import dynamics
from waymax import env as waymax_env
from waymax import agents
from typing import Dict, Any, Callable, List
from dataclasses import dataclass
@dataclass
class ClosedLoopResult:
"""Results from a single closed-loop evaluation."""
scenario_id: str
ego_trajectory: jnp.ndarray # (T, 5) actual executed trajectory
logged_trajectory: jnp.ndarray # (T, 5) what the log says
other_trajectories: jnp.ndarray # (N, T, 5) other agents' actual paths
metrics: Dict[str, float]
valid_mask: jnp.ndarray
terminated_early: bool
termination_reason: str # 'completed', 'collision', 'off_road'
termination_step: int
def evaluate_closed_loop(
scenario: datatypes.SimulatorState,
policy_fn: Callable,
config: 'EvalConfig',
non_ego_policy: str = 'log_playback' # 'log_playback' or 'idm'
) -> ClosedLoopResult:
"""
Evaluate a policy in closed-loop mode.
In closed-loop, at each timestep:
1. The policy observes the CURRENT sim state (affected by prior actions)
2. The policy produces an action
3. The action is executed via the dynamics model
4. The world state updates (and optionally other agents react)
5. We move to the NEXT state (which reflects the ego's action)
Errors compound. Distributional shift occurs. Reality bites.
"""
# Configure Waymax environment for closed-loop
env_config = waymax_config.EnvironmentConfig(
controlled_object=waymax_config.ObjectType.SDC,
)
dynamics_model = dynamics.InvertibleBicycleModel()
env = waymax_env.PlanningAgentEnvironment(
dynamics_model=dynamics_model,
config=env_config,
)
# Select non-ego agent policy
if non_ego_policy == 'idm':
bg_agent = agents.IDMRoutePolicy(
is_controlled_func=lambda state: ~state.object_metadata.is_sdc
)
else: # log_playback
bg_agent = agents.LogPlaybackPolicy()
# Initialize simulation
state = env.reset(scenario)
ego_states = []
terminated_early = False
termination_reason = 'completed'
termination_step = config.sim_steps
for t in range(config.sim_steps):
# Extract observation for the ego policy
observation = env.observe(state)
# Ego policy produces action
ego_action = policy_fn(observation, t)
# Background agents produce actions
bg_action = bg_agent.select_action(state)
# Combine actions and step
# (Implementation depends on how you merge ego + background actions)
combined_action = _merge_actions(ego_action, bg_action, state)
state = env.step(state, combined_action)
# Record ego state
ego_states.append(_extract_ego_state(state))
# Check termination conditions
if _check_collision(state):
terminated_early = True
termination_reason = 'collision'
termination_step = t
break
if _check_offroad(state):
terminated_early = True
termination_reason = 'off_road'
termination_step = t
break
ego_trajectory = jnp.stack(ego_states, axis=0)
# Compute closed-loop metrics
metrics = _compute_closed_loop_metrics(
ego_trajectory, state, scenario, config,
termination_step, termination_reason
)
return ClosedLoopResult(
scenario_id=_get_scenario_id(scenario),
ego_trajectory=ego_trajectory,
logged_trajectory=scenario.log_trajectory[
scenario.object_metadata.is_sdc
],
other_trajectories=_extract_other_trajectories(state),
metrics=metrics,
valid_mask=jnp.ones(len(ego_states), dtype=bool),
terminated_early=terminated_early,
termination_reason=termination_reason,
termination_step=termination_step,
)
3.2 Closed-Loop Specific Considerations
def _compute_closed_loop_metrics(
ego_trajectory: jnp.ndarray,
final_state: datatypes.SimulatorState,
scenario: datatypes.SimulatorState,
config: 'EvalConfig',
termination_step: int,
termination_reason: str,
) -> Dict[str, float]:
"""
Compute closed-loop specific metrics.
Key difference from open-loop: these metrics reflect what
ACTUALLY HAPPENED in the simulation, not what would have
happened in an idealized replay.
"""
metrics = {}
# Binary outcomes
metrics['collision'] = float(termination_reason == 'collision')
metrics['off_road'] = float(termination_reason == 'off_road')
metrics['completed'] = float(termination_reason == 'completed')
# Progress metrics (how far the ego got before termination)
metrics['progress_ratio'] = termination_step / config.sim_steps
metrics['distance_traveled'] = float(_compute_distance(ego_trajectory))
# Log divergence (how far ego deviated from the human trajectory)
logged = scenario.log_trajectory[scenario.object_metadata.is_sdc]
displacement = jnp.linalg.norm(
ego_trajectory[:termination_step, :2] -
logged[:termination_step, :2],
axis=-1
)
metrics['log_divergence_mean'] = float(jnp.mean(displacement))
metrics['log_divergence_final'] = float(displacement[-1])
return metrics
3.3 Handling Early Termination
A critical design decision: when the ego collides or goes off-road, do you stop the simulation or continue? Both choices have implications:
- Stop on failure: Reflects real-world consequences (a collision ends the scenario). Progress metrics are meaningful. But you get fewer timesteps of data for metric computation.
- Continue after failure: Gives you a full trajectory for metric computation and allows observing recovery behavior. But collision metrics become ambiguous (does one collision count the same as 10 overlapping timesteps?).
The recommended approach: stop on first failure but record a full set of metrics up to the termination point. For metrics that require a full trajectory (like comfort metrics), pad or normalize by the number of valid timesteps.
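The "record metrics up to the termination point" bookkeeping can be sketched as follows (a hypothetical helper using NumPy for clarity; the valid-prefix convention is the one recommended above):

```python
import numpy as np

def masked_metric_stats(values, termination_step):
    """Aggregate a per-timestep metric only over steps before termination.

    Comfort-style metrics are averaged over the valid prefix and reported
    together with the number of valid steps, so a rollout that ended at
    t=14 is not silently compared against a full 80-step rollout.
    """
    values = np.asarray(values)
    valid = values[:termination_step]
    if valid.size == 0:
        return {'mean': float('nan'), 'max': float('nan'), 'num_valid_steps': 0}
    return {
        'mean': float(valid.mean()),
        'max': float(valid.max()),
        'num_valid_steps': int(valid.size),
    }
```

Carrying num_valid_steps through to aggregation also lets you weight scenario-level means by rollout length if you choose to normalize that way.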
Step 4: Build Metric Computation Suite (2.5 hours)
Implement a unified metric computation module that works with both open-loop and closed-loop trajectories.
4.1 Safety Metrics
# eval/metrics.py
import jax
import jax.numpy as jnp
from waymax import datatypes
from waymax import metrics as waymax_metrics
from typing import Dict
def compute_collision_rate(
ego_traj: jnp.ndarray,
other_trajs: jnp.ndarray,
ego_extent: jnp.ndarray,
other_extents: jnp.ndarray,
) -> Dict[str, float]:
"""
Compute collision metrics using bounding box overlap.
For Waymax-native collision detection, use:
waymax.metrics.OverlapMetric
which handles the rotated bounding box geometry correctly.
"""
# Simplified axis-aligned check (for illustration)
# In production, use waymax_metrics.OverlapMetric on SimulatorState
collisions_per_step = _check_bbox_overlap(
ego_traj, other_trajs, ego_extent, other_extents
)
return {
'collision_rate': float(jnp.any(collisions_per_step)),
'collision_count': int(jnp.sum(collisions_per_step)),
'first_collision_step': int(
jnp.argmax(collisions_per_step)
) if jnp.any(collisions_per_step) else -1,
}
def compute_time_to_collision(
ego_traj: jnp.ndarray,
other_trajs: jnp.ndarray,
dt: float = 0.1,
) -> Dict[str, float]:
"""
Compute time-to-collision (TTC) assuming constant velocity.
TTC at time t = distance / closing_speed
Only defined when closing_speed > 0 (vehicles approaching).
"""
T = ego_traj.shape[0]
N = other_trajs.shape[0]
ttc_values = []
for t in range(T):
ego_pos = ego_traj[t, :2]
ego_vel = ego_traj[t, 3:5]
for n in range(N):
other_pos = other_trajs[n, t, :2]
other_vel = other_trajs[n, t, 3:5]
rel_pos = other_pos - ego_pos
rel_vel = other_vel - ego_vel
distance = jnp.linalg.norm(rel_pos)
closing_speed = -jnp.dot(rel_pos, rel_vel) / (distance + 1e-6)
if closing_speed > 0:
ttc = distance / closing_speed
ttc_values.append(float(ttc))
ttc_array = jnp.array(ttc_values) if ttc_values else jnp.array([float('inf')])
return {
'min_ttc': float(jnp.min(ttc_array)),
'mean_ttc': float(jnp.mean(ttc_array)),
'near_miss_rate': float(jnp.mean(ttc_array < 1.5)),
}
4.2 Comfort Metrics
def compute_comfort_metrics(
ego_traj: jnp.ndarray,
dt: float = 0.1,
) -> Dict[str, float]:
"""
Compute ride comfort metrics from ego trajectory.
ego_traj: (T, 5) with columns [x, y, heading, vx, vy]
"""
# Velocity magnitude
speed = jnp.linalg.norm(ego_traj[:, 3:5], axis=-1)
# Acceleration (finite difference of velocity)
accel = jnp.diff(ego_traj[:, 3:5], axis=0) / dt
accel_magnitude = jnp.linalg.norm(accel, axis=-1)
# Jerk (finite difference of acceleration)
jerk = jnp.diff(accel, axis=0) / dt
jerk_magnitude = jnp.linalg.norm(jerk, axis=-1)
# Lateral acceleration
# Project acceleration onto the perpendicular of heading
heading = ego_traj[:-1, 2] # heading at each accel timestep
lat_direction = jnp.stack([-jnp.sin(heading), jnp.cos(heading)], axis=-1)
lat_accel = jnp.abs(jnp.sum(accel * lat_direction, axis=-1))
# Longitudinal acceleration
lon_direction = jnp.stack([jnp.cos(heading), jnp.sin(heading)], axis=-1)
lon_accel = jnp.sum(accel * lon_direction, axis=-1)
return {
'mean_jerk': float(jnp.mean(jerk_magnitude)),
'max_jerk': float(jnp.max(jerk_magnitude)),
'mean_lat_accel': float(jnp.mean(lat_accel)),
'max_lat_accel': float(jnp.max(lat_accel)),
'mean_lon_accel': float(jnp.mean(jnp.abs(lon_accel))),
'max_decel': float(jnp.min(lon_accel)), # Most negative = hardest brake
'speed_std': float(jnp.std(speed)),
}
4.3 Progress Metrics
def compute_progress_metrics(
ego_traj: jnp.ndarray,
logged_traj: jnp.ndarray,
road_speed_limit: float = 13.4, # m/s (~30 mph default)
total_steps: int = 80,
) -> Dict[str, float]:
"""Compute route progress and efficiency metrics."""
# Distance traveled by ego
ego_distance = _path_length(ego_traj[:, :2])
log_distance = _path_length(logged_traj[:, :2])
# Speed
ego_speed = jnp.linalg.norm(ego_traj[:, 3:5], axis=-1)
# Route completion: ratio of ego distance to logged distance
route_completion = float(ego_distance / (log_distance + 1e-6))
return {
'distance_traveled': float(ego_distance),
'route_completion': min(route_completion, 1.0),
'mean_speed': float(jnp.mean(ego_speed)),
'speed_limit_compliance': float(
jnp.mean(ego_speed <= road_speed_limit * 1.1)
),
'time_stationary': float(jnp.mean(ego_speed < 0.1)),
}
def _path_length(positions: jnp.ndarray) -> float:
"""Total path length from (T, 2) position array."""
deltas = jnp.diff(positions, axis=0)
return float(jnp.sum(jnp.linalg.norm(deltas, axis=-1)))
4.4 Realism Metrics (WOSAC-style)
def compute_realism_metrics(
ego_traj: jnp.ndarray,
logged_traj: jnp.ndarray,
sim_state: datatypes.SimulatorState,
dt: float = 0.1,
) -> Dict[str, float]:
"""
Compute WOSAC-inspired realism metrics.
For full WOSAC compliance, use waymax.metrics directly:
- waymax.metrics.LogDivergenceMetric
- waymax.metrics.KinematicsInfeasibilityMetric
- waymax.metrics.OffroadMetric
"""
# Log divergence (position)
pos_divergence = jnp.linalg.norm(
ego_traj[:, :2] - logged_traj[:, :2], axis=-1
)
# Heading divergence
heading_divergence = jnp.abs(
_wrap_angle(ego_traj[:, 2] - logged_traj[:, 2])
)
# Speed divergence
ego_speed = jnp.linalg.norm(ego_traj[:, 3:5], axis=-1)
log_speed = jnp.linalg.norm(logged_traj[:, 3:5], axis=-1)
speed_divergence = jnp.abs(ego_speed - log_speed)
# Kinematic infeasibility: check if accelerations exceed vehicle limits
accel = jnp.diff(ego_traj[:, 3:5], axis=0) / dt
accel_mag = jnp.linalg.norm(accel, axis=-1)
MAX_ACCEL = 8.0 # m/s^2 (reasonable vehicle limit)
kir = float(jnp.mean(accel_mag > MAX_ACCEL))
return {
'log_divergence_ade': float(jnp.mean(pos_divergence)),
'log_divergence_fde': float(pos_divergence[-1]),
'heading_divergence_mean': float(jnp.mean(heading_divergence)),
'speed_divergence_mean': float(jnp.mean(speed_divergence)),
'kinematic_infeasibility_rate': kir,
}
def _wrap_angle(angle: jnp.ndarray) -> jnp.ndarray:
"""Wrap angle to [-pi, pi]."""
return (angle + jnp.pi) % (2 * jnp.pi) - jnp.pi
4.5 Unified Metric Runner
def compute_all_metrics(
ego_traj: jnp.ndarray,
logged_traj: jnp.ndarray,
other_trajs: jnp.ndarray,
sim_state: datatypes.SimulatorState,
config: 'EvalConfig',
) -> Dict[str, float]:
"""Compute all metrics for a single scenario evaluation."""
metrics = {}
metrics.update(compute_collision_rate(
ego_traj, other_trajs,
_get_ego_extent(sim_state), _get_other_extents(sim_state)
))
metrics.update(compute_time_to_collision(ego_traj, other_trajs, config.dt))
metrics.update(compute_comfort_metrics(ego_traj, config.dt))
metrics.update(compute_progress_metrics(
ego_traj, logged_traj, total_steps=config.sim_steps
))
metrics.update(compute_realism_metrics(
ego_traj, logged_traj, sim_state, config.dt
))
return metrics
Step 5: Correlation Analysis (2 hours)
This is the core analytical contribution: quantifying how well open-loop metrics predict closed-loop performance.
5.1 Pair Results by Scenario
# eval/correlation.py
import numpy as np
from scipy import stats
from typing import Dict, List, Tuple
def pair_results(
open_loop_results: List['OpenLoopResult'],
closed_loop_results: List['ClosedLoopResult'],
) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]]:
"""
Pair open-loop and closed-loop results by scenario ID.
Returns matched arrays for each metric.
"""
# Build lookup by scenario ID
ol_by_id = {r.scenario_id: r.metrics for r in open_loop_results}
cl_by_id = {r.scenario_id: r.metrics for r in closed_loop_results}
# Find common scenarios
common_ids = set(ol_by_id.keys()) & set(cl_by_id.keys())
assert len(common_ids) > 0, "No matching scenarios between OL and CL results"
ol_metrics = {}
cl_metrics = {}
for metric_name in ol_by_id[next(iter(common_ids))].keys():
ol_metrics[metric_name] = np.array([
ol_by_id[sid][metric_name] for sid in sorted(common_ids)
])
for metric_name in cl_by_id[next(iter(common_ids))].keys():
cl_metrics[metric_name] = np.array([
cl_by_id[sid][metric_name] for sid in sorted(common_ids)
])
return ol_metrics, cl_metrics
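The pairing logic can be exercised without Waymax. The core of `pair_results` boils down to the sketch below, where `MockResult` is a hypothetical stand-in for `OpenLoopResult`/`ClosedLoopResult` (only the `scenario_id` and `metrics` attributes matter):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MockResult:  # hypothetical stand-in for OpenLoopResult / ClosedLoopResult
    scenario_id: str
    metrics: dict

ol = [MockResult('s1', {'ade': 0.5}), MockResult('s2', {'ade': 1.2}),
      MockResult('s3', {'ade': 0.9})]
# The closed-loop run is missing 's3' (e.g. it crashed) -- pairing must drop it.
cl = [MockResult('s1', {'collision_rate': 0.0}),
      MockResult('s2', {'collision_rate': 1.0})]

common = sorted({r.scenario_id for r in ol} & {r.scenario_id for r in cl})
ol_by_id = {r.scenario_id: r.metrics for r in ol}
ade = np.array([ol_by_id[sid]['ade'] for sid in common])
# common == ['s1', 's2']; `ade` now lines up index-for-index with the
# closed-loop arrays built the same way, which is what correlation needs.
```

Sorting `common_ids` is what guarantees the open-loop and closed-loop arrays stay aligned; correlating misaligned arrays silently produces garbage.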
5.2 Compute Correlation Matrix
def compute_correlation_matrix(
ol_metrics: Dict[str, np.ndarray],
cl_metrics: Dict[str, np.ndarray],
) -> Dict[str, Dict[str, Dict[str, float]]]:
"""
Compute pairwise correlations between all open-loop
and all closed-loop metrics.
Returns nested dict: ol_metric -> cl_metric -> {pearson, spearman, p_value}
"""
results = {}
for ol_name, ol_values in ol_metrics.items():
results[ol_name] = {}
for cl_name, cl_values in cl_metrics.items():
# Skip if either is constant (no variance)
if np.std(ol_values) < 1e-10 or np.std(cl_values) < 1e-10:
results[ol_name][cl_name] = {
'pearson': 0.0, 'spearman': 0.0,
'p_pearson': 1.0, 'p_spearman': 1.0,
}
continue
pearson_r, pearson_p = stats.pearsonr(ol_values, cl_values)
spearman_r, spearman_p = stats.spearmanr(ol_values, cl_values)
results[ol_name][cl_name] = {
'pearson': float(pearson_r),
'spearman': float(spearman_r),
'p_pearson': float(pearson_p),
'p_spearman': float(spearman_p),
}
return results
5.3 Identify Predictive and Misleading Metrics
def analyze_predictiveness(
correlation_matrix: Dict,
target_cl_metrics: Tuple[str, ...] = ('collision_rate', 'off_road', 'route_completion'),
significance_threshold: float = 0.05,
) -> Dict[str, List[Dict]]:
"""
Identify which open-loop metrics are most/least predictive
of key closed-loop outcomes.
"""
analysis = {}
for cl_metric in target_cl_metrics:
predictors = []
for ol_metric, correlations in correlation_matrix.items():
if cl_metric in correlations:
corr = correlations[cl_metric]
predictors.append({
'ol_metric': ol_metric,
'spearman': corr['spearman'],
'p_value': corr['p_spearman'],
'significant': corr['p_spearman'] < significance_threshold,
'predictive_quality': _classify_correlation(corr['spearman']),
})
# Sort by absolute correlation strength
predictors.sort(key=lambda x: abs(x['spearman']), reverse=True)
analysis[cl_metric] = predictors
return analysis
def _classify_correlation(r: float) -> str:
"""Classify correlation strength."""
abs_r = abs(r)
if abs_r >= 0.7:
return 'strong'
elif abs_r >= 0.4:
return 'moderate'
elif abs_r >= 0.2:
return 'weak'
else:
return 'negligible'
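Why Spearman is the primary statistic here: many metric relationships in this analysis are monotone but far from linear, and Pearson understates monotone association when the relationship is curved. A synthetic illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 200)
y = np.exp(2 * x)  # perfectly monotone, strongly non-linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
# Spearman sees the perfect monotone relationship (rho = 1.0);
# Pearson is dragged down by the curvature and the heavy right tail.
```

This is why `analyze_predictiveness` ranks predictors by Spearman: a metric that monotonically tracks collision rate is predictive even if the relationship is non-linear.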
5.4 Bootstrap Confidence Intervals
def bootstrap_correlation(
x: np.ndarray,
y: np.ndarray,
n_bootstrap: int = 1000,
confidence: float = 0.95,
) -> Dict[str, float]:
"""
Compute bootstrap confidence interval for Spearman correlation.
"""
n = len(x)
rng = np.random.default_rng(42)
boot_correlations = []
for _ in range(n_bootstrap):
indices = rng.choice(n, size=n, replace=True)
r, _ = stats.spearmanr(x[indices], y[indices])
boot_correlations.append(r)
boot_correlations = np.array(boot_correlations)
alpha = 1 - confidence
ci_lower = np.percentile(boot_correlations, 100 * alpha / 2)
ci_upper = np.percentile(boot_correlations, 100 * (1 - alpha / 2))
return {
'mean': float(np.mean(boot_correlations)),
'ci_lower': float(ci_lower),
'ci_upper': float(ci_upper),
'std': float(np.std(boot_correlations)),
}
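A usage sketch on synthetic data shows why the interval matters: at the scenario counts suggested for the notebooks (~50-60), the 95% CI on a Spearman correlation is wide, and an apparently "predictive" metric can be noise. The loop below inlines the same resampling as `bootstrap_correlation`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 60  # roughly the scenario count suggested for the notebooks
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)  # true correlation around 0.45

# Resample scenario indices with replacement; recompute Spearman each time.
boot = []
for _ in range(1000):
    idx = rng.choice(n, size=n, replace=True)
    r, _ = stats.spearmanr(x[idx], y[idx])
    boot.append(r)
lo, hi = np.percentile(boot, [2.5, 97.5])
# The 95% CI spans a few tenths at n=60 -- report it alongside the point
# estimate, or "metric A predicts collisions" may just be sampling noise.
```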
Step 6: Policy Comparison Framework (2 hours)
Compare multiple policies to demonstrate that open-loop and closed-loop evaluations can produce contradictory rankings.
6.1 Define Baseline Policies
# eval/comparison.py
import jax.numpy as jnp
import numpy as np
from scipy import stats
from typing import Any, Dict, List
from waymax import agents
def get_baseline_policies():
"""
Define 2-3 baseline policies for comparison.
These policies have deliberately different characteristics
to illustrate rank reversals between open-loop and closed-loop.
"""
policies = {}
# Policy 1: Log Replay (perfect in open-loop, baseline in closed-loop)
# Replays the logged human trajectory. By definition, ADE = 0 in
# open-loop. In closed-loop, it does not react to deviations.
policies['log_replay'] = agents.LogPlaybackPolicy()
# Policy 2: IDM (Intelligent Driver Model)
# Reactive car-following model. Moderate open-loop error (it doesn't
# know the exact logged trajectory), but handles closed-loop
# interactions well because it reacts to the vehicle ahead.
policies['idm'] = agents.IDMRoutePolicy(
is_controlled_func=lambda s: s.object_metadata.is_sdc
)
# Policy 3: Constant Velocity
# Maintains current speed and heading. Low jerk (smooth), but
# cannot handle curves, stops, or other agents. Interesting because
# it may have decent open-loop metrics on straight roads but
# catastrophic closed-loop performance on curves.
policies['constant_velocity'] = ConstantVelocityPolicy()
return policies
class ConstantVelocityPolicy:
"""Simple policy that maintains current velocity vector."""
def select_action(self, observation, timestep):
# Action = zero acceleration, zero steering
return jnp.zeros(2) # [accel, steering_rate]
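A back-of-envelope calculation shows why constant velocity is catastrophic on curves: if the lane has curvature kappa and the ego holds speed v and heading, the heading error relative to the path grows as v*kappa*t, so the lateral offset grows roughly quadratically, d(t) ~ v^2 * kappa * t^2 / 2 (small-angle approximation; numbers below are illustrative):

```python
# Lateral deviation of a constant-velocity rollout from a curved lane,
# small-angle approximation: d(t) ~= 0.5 * v**2 * kappa * t**2.
v = 10.0      # m/s
kappa = 0.02  # 1/m (a 50 m radius curve)
offsets = {t: 0.5 * v**2 * kappa * t**2 for t in (1.0, 2.0, 4.0)}
# ~1 m after 1 s, ~4 m after 2 s, ~16 m after 4 s: well off the lane.
```

Over a short open-loop horizon on a straight road the policy looks fine; over a closed-loop rollout through a curve it is off-road within seconds. That asymmetry is exactly the kind of rank reversal this comparison is designed to surface.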
6.2 Run All Policies Through Both Modes
def compare_policies(
scenarios: list,
policies: dict,
config: 'EvalConfig',
) -> Dict[str, Dict[str, Any]]:
"""
Run every policy through both evaluation modes on the same scenarios.
Returns nested dict: policy_name -> mode -> aggregated_metrics
"""
results = {}
for policy_name, policy in policies.items():
print(f"Evaluating policy: {policy_name}")
# Open-loop evaluation
ol_results = evaluate_open_loop_batch(scenarios, policy, config)
ol_aggregated = aggregate_open_loop_results(ol_results)
# Closed-loop evaluation
cl_results = evaluate_closed_loop_batch(scenarios, policy, config)
cl_aggregated = aggregate_closed_loop_results(cl_results)
results[policy_name] = {
'open_loop': ol_aggregated,
'closed_loop': cl_aggregated,
'ol_raw': ol_results,
'cl_raw': cl_results,
}
return results
6.3 Detect Rank Reversals
def detect_rank_reversals(
comparison_results: Dict[str, Dict[str, Any]],
metrics_to_check: List[str],
) -> List[Dict]:
"""
Find metrics where open-loop and closed-loop disagree
on which policy is better.
A rank reversal occurs when Policy A beats Policy B in
open-loop but Policy B beats Policy A in closed-loop
(or vice versa) on a given metric.
"""
reversals = []
policy_names = list(comparison_results.keys())
for metric in metrics_to_check:
# Get rankings in each mode
ol_scores = {}
cl_scores = {}
for name in policy_names:
ol_data = comparison_results[name]['open_loop']
cl_data = comparison_results[name]['closed_loop']
if metric in ol_data:
ol_scores[name] = ol_data[metric]['mean']
if metric in cl_data:
cl_scores[name] = cl_data[metric]['mean']
if not ol_scores or not cl_scores:
continue
# Rank ascending: this assumes lower is better for every metric checked.
# For higher-is-better metrics (e.g. route_completion), negate the scores
# before calling so the ranking comparison stays meaningful.
ol_ranking = sorted(ol_scores, key=ol_scores.get)
cl_ranking = sorted(cl_scores, key=cl_scores.get)
if ol_ranking != cl_ranking:
reversals.append({
'metric': metric,
'open_loop_ranking': ol_ranking,
'closed_loop_ranking': cl_ranking,
'ol_scores': ol_scores,
'cl_scores': cl_scores,
})
return reversals
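The classic reversal this framework is built to expose, in toy numbers (illustrative, lower is better for both metrics): log replay wins open-loop ADE by construction, but a reactive policy like IDM wins closed-loop collision rate because it responds to deviations.

```python
# Toy per-policy scores; both metrics are "lower is better".
ol_ade = {'log_replay': 0.0, 'idm': 1.8, 'constant_velocity': 2.5}
cl_collision_rate = {'log_replay': 0.12, 'idm': 0.04, 'constant_velocity': 0.30}

# Same ranking logic as detect_rank_reversals.
ol_ranking = sorted(ol_ade, key=ol_ade.get)
cl_ranking = sorted(cl_collision_rate, key=cl_collision_rate.get)
reversal = ol_ranking != cl_ranking
# ol_ranking: ['log_replay', 'idm', 'constant_velocity']
# cl_ranking: ['idm', 'log_replay', 'constant_velocity'] -> reversal detected
```

If your real runs never produce a reversal, either the policies are too similar or the scenarios are too easy; both are worth reporting.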
6.4 Statistical Significance of Differences
def test_policy_difference(
results_a: List,
results_b: List,
metric_name: str,
) -> Dict[str, float]:
"""
Paired Wilcoxon signed-rank test for comparing two policies
on the same set of scenarios.
"""
values_a = np.array([r.metrics[metric_name] for r in results_a])
values_b = np.array([r.metrics[metric_name] for r in results_b])
# Paired test (same scenarios). scipy drops zero differences under the
# default zero_method, so the effect size must use the non-zero pair count.
stat, p_value = stats.wilcoxon(values_a, values_b, alternative='two-sided')
# Effect size (matched-pairs rank-biserial correlation, unsigned)
n = int(np.count_nonzero(values_a - values_b))
effect_size = 1 - (2 * stat) / (n * (n + 1) / 2)
return {
'statistic': float(stat),
'p_value': float(p_value),
'effect_size': float(effect_size),
'significant': p_value < 0.05,
'mean_a': float(np.mean(values_a)),
'mean_b': float(np.mean(values_b)),
}
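The point of pairing by scenario: per-scenario difficulty varies far more than the gap between two policies, and a paired test removes that variance. A synthetic check, with policy B consistently ~0.1 worse than A on every scenario:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
base = rng.uniform(0.5, 2.0, size=n)                  # per-scenario difficulty
values_a = base + rng.normal(0, 0.05, size=n)         # policy A
values_b = base + 0.1 + rng.normal(0, 0.05, size=n)   # policy B: +0.1 worse

# The paired test exploits the per-scenario pairing; an unpaired test on the
# same data is swamped by the between-scenario variance in `base`.
_, p_paired = stats.wilcoxon(values_a, values_b)
_, p_unpaired = stats.mannwhitneyu(values_a, values_b)
# p_paired is tiny; p_unpaired is much larger on identical data.
```

The same reasoning applies to policy comparisons in both evaluation modes: always evaluate on the same scenario set and test paired.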
Step 7: Evaluation Dashboard (1.5 hours)
Build visualizations that communicate your findings clearly.
7.1 Summary Comparison Table
# eval/dashboard.py
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
from typing import Dict, List
def plot_summary_table(
comparison_results: Dict,
metrics: list,
output_path: str = 'results/summary_table.png',
):
"""
Create a policy x metric x mode summary table as a heatmap.
"""
policies = list(comparison_results.keys())
modes = ['open_loop', 'closed_loop']
# Build data for display
rows = []
row_labels = []
for policy in policies:
for mode in modes:
row_labels.append(f"{policy} ({mode[:2].upper()})")
row = []
for metric in metrics:
data = comparison_results[policy][mode]
if metric in data:
row.append(data[metric]['mean'])
else:
row.append(np.nan)
rows.append(row)
data_array = np.array(rows)
fig, ax = plt.subplots(figsize=(14, len(rows) * 0.8 + 2))
sns.heatmap(
data_array, annot=True, fmt='.3f',
xticklabels=metrics, yticklabels=row_labels,
cmap='RdYlGn_r', ax=ax
)
ax.set_title('Policy Evaluation Summary: Open-Loop vs Closed-Loop')
plt.tight_layout()
plt.savefig(output_path, dpi=150, bbox_inches='tight')
plt.close()
7.2 Radar Plot for Policy Comparison
def plot_radar_comparison(
comparison_results: Dict,
metrics: list,
mode: str = 'closed_loop',
output_path: str = 'results/radar_comparison.png',
):
"""
Radar plot comparing policies across metric dimensions.
Normalize each metric to [0, 1] range for comparability.
"""
policies = list(comparison_results.keys())
N = len(metrics)
# Compute angles
angles = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()
angles += angles[:1] # Close the polygon
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
for policy in policies:
values = []
for metric in metrics:
data = comparison_results[policy][mode]
values.append(data.get(metric, {}).get('mean', 0))
# Normalize to [0, 1]
values = _normalize_radar_values(values, metrics, comparison_results, mode)
values += values[:1]
ax.plot(angles, values, 'o-', linewidth=2, label=policy)
ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics, size=8)
ax.set_title(f'Policy Comparison ({mode.replace("_", " ").title()})')
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
plt.tight_layout()
plt.savefig(output_path, dpi=150, bbox_inches='tight')
plt.close()
7.3 Correlation Heatmap
def plot_correlation_heatmap(
correlation_matrix: Dict,
ol_metrics: list,
cl_metrics: list,
output_path: str = 'results/correlation_heatmap.png',
):
"""
Heatmap showing Spearman correlation between each open-loop
metric (rows) and each closed-loop metric (columns).
This is the signature visualization for this project:
it reveals which OL metrics predict CL outcomes and which don't.
"""
data = np.zeros((len(ol_metrics), len(cl_metrics)))
for i, ol in enumerate(ol_metrics):
for j, cl in enumerate(cl_metrics):
if ol in correlation_matrix and cl in correlation_matrix[ol]:
data[i, j] = correlation_matrix[ol][cl]['spearman']
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(
data, annot=True, fmt='.2f',
xticklabels=cl_metrics, yticklabels=ol_metrics,
cmap='RdBu_r', center=0, vmin=-1, vmax=1,
ax=ax,
cbar_kws={'label': 'Spearman Correlation'}
)
ax.set_xlabel('Closed-Loop Metrics')
ax.set_ylabel('Open-Loop Metrics')
ax.set_title('Open-Loop vs Closed-Loop Metric Correlation')
plt.tight_layout()
plt.savefig(output_path, dpi=150, bbox_inches='tight')
plt.close()
7.4 Scatter Plots with Regression
def plot_metric_scatter(
ol_values: np.ndarray,
cl_values: np.ndarray,
ol_name: str,
cl_name: str,
output_path: str,
):
"""
Scatter plot of open-loop metric vs closed-loop metric
across scenarios, with regression line and confidence band.
"""
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(ol_values, cl_values, alpha=0.5, s=20, edgecolors='none')
# Regression line
if np.std(ol_values) > 0 and np.std(cl_values) > 0:
slope, intercept, r_value, p_value, std_err = stats.linregress(
ol_values, cl_values
)
x_line = np.linspace(ol_values.min(), ol_values.max(), 100)
y_line = slope * x_line + intercept
ax.plot(x_line, y_line, 'r-', linewidth=2,
label=f'r={r_value:.3f}, p={p_value:.2e}')
ax.set_xlabel(f'Open-Loop: {ol_name}')
ax.set_ylabel(f'Closed-Loop: {cl_name}')
ax.set_title(f'{ol_name} vs {cl_name}')
ax.legend()
plt.tight_layout()
plt.savefig(output_path, dpi=150, bbox_inches='tight')
plt.close()
7.5 HTML Report Export
def export_html_report(
comparison_results: Dict,
correlation_analysis: Dict,
rank_reversals: List[Dict],
output_path: str = 'results/evaluation_report.html',
):
"""
Export a self-contained HTML report with all figures and tables.
Embed images as base64 for portability.
"""
import base64
from pathlib import Path
html_parts = [
'<html><head><title>Closed-Loop Evaluation Report</title>',
'<style>',
'body { font-family: Arial, sans-serif; max-width: 1200px; margin: auto; }',
'table { border-collapse: collapse; width: 100%; margin: 20px 0; }',
'th, td { border: 1px solid #ddd; padding: 8px; text-align: center; }',
'th { background-color: #4a90d9; color: white; }',
'img { max-width: 100%; }',
'.reversal { background-color: #fff3cd; padding: 10px; margin: 10px 0; }',
'</style></head><body>',
'<h1>Closed-Loop Evaluation Report</h1>',
]
# Embed each generated figure
for img_name in ['summary_table.png', 'radar_comparison.png',
'correlation_heatmap.png']:
img_path = Path('results') / img_name
if img_path.exists():
with open(img_path, 'rb') as f:
b64 = base64.b64encode(f.read()).decode()
html_parts.append(f'<h2>{img_name.replace("_", " ").title()}</h2>')
html_parts.append(f'<img src="data:image/png;base64,{b64}" />')
# Rank reversals section
if rank_reversals:
html_parts.append('<h2>Rank Reversals</h2>')
for rev in rank_reversals:
html_parts.append('<div class="reversal">')
html_parts.append(f'<strong>Metric: {rev["metric"]}</strong><br>')
html_parts.append(f'OL ranking: {rev["open_loop_ranking"]}<br>')
html_parts.append(f'CL ranking: {rev["closed_loop_ranking"]}')
html_parts.append('</div>')
html_parts.append('</body></html>')
with open(output_path, 'w') as f:
f.write('\n'.join(html_parts))
Notebook Exercises
| # | Notebook | Focus | Time |
|---|---|---|---|
| 1 | 01_open_vs_closed_loop.ipynb | Run the same scenario in both modes using Waymax. Visualize the ego trajectory divergence over time. Observe how a small initial deviation at t=5 becomes a large deviation by t=40. | 45 min |
| 2 | 02_driving_metrics.ipynb | Implement the full metric suite (safety, comfort, progress, realism). Apply each metric to hand-crafted trajectories first (e.g., a trajectory that collides, one that goes off-road) to verify correctness before running on real data. | 60 min |
| 3 | 03_correlation_analysis.ipynb | Run 50+ scenarios through both modes. Compute the full correlation matrix. Identify which open-loop metrics are predictive of collision rate and which are not. Create the correlation heatmap. | 60 min |
| 4 | 04_evaluation_harness.ipynb | Full pipeline: load scenarios, evaluate 3 policies in both modes, detect rank reversals, generate the HTML dashboard. Write a 1-page summary of findings. | 60 min |
Expected Deliverables
- **Evaluation harness Python module (`eval/`)**: A clean, importable package with separate modules for open-loop evaluation, closed-loop evaluation, metric computation, correlation analysis, policy comparison, and dashboard generation.
- **Metric computation suite (`eval/metrics.py`)**: Implementations of at least 15 distinct metrics across safety, comfort, progress, and realism dimensions, each tested against known inputs.
- **Correlation analysis report**: A statistical analysis showing Pearson and Spearman correlations between all open-loop and closed-loop metric pairs, with confidence intervals, significance tests, and a clear narrative identifying which open-loop metrics are trustworthy and which are misleading.
- **Policy comparison dashboard**: An HTML report containing the summary table, radar plots, correlation heatmap, scatter plots, and rank reversal analysis for at least 3 baseline policies evaluated across at least 50 scenarios.
- **Written analysis of findings (1-2 pages within the final notebook)**: Interpretation of the correlation gap, discussion of which open-loop metrics are most predictive, explanation of observed rank reversals, and practical recommendations for when open-loop evaluation is sufficient vs. when closed-loop evaluation is necessary.
Evaluation Criteria
| Criteria | Weight | Description |
|---|---|---|
| Evaluation Correctness | 30% | Metrics are computed correctly in both open-loop and closed-loop modes. Collision detection handles rotated bounding boxes. Comfort metrics use proper coordinate projections. WOSAC-style metrics follow the published specification. |
| Analysis Quality | 25% | Correlation analysis uses appropriate statistical methods (Spearman for non-linear relationships, bootstrap for confidence intervals, paired tests for policy comparison). Findings are well-supported by evidence and clearly communicated. |
| Code Architecture | 20% | Clean separation of concerns between simulation, metrics, analysis, and visualization. Functions are composable and reusable. Configuration is externalized. Code runs reproducibly given the same data and config. |
| Visualization | 15% | Dashboard communicates findings effectively. Correlation heatmap is readable. Scatter plots include regression statistics. Radar plots enable quick comparison. Figures have proper labels, legends, and titles. |
| Documentation | 10% | README explains how to run the evaluation. Each module has docstrings. Notebook markdown cells explain the reasoning behind each step. Results are interpretable by someone who did not write the code. |
Related Deep Dives
- **Waymax Deep Dive**: Comprehensive guide to Waymax architecture, the simulation loop, data model, and APIs. Essential reference for understanding how `PlanningAgentEnvironment`, `SimulatorState`, and the dynamics models work under the hood.
- **WOSAC Challenge Deep Dive**: Detailed coverage of the Waymo Open Sim Agents Challenge, including the realism meta-metric, log-divergence computation, kinematic infeasibility rate, and how the challenge evaluates multi-agent simulation fidelity. Directly informs the realism metrics in Step 4 and the statistical methodology in Step 5.
Next Steps
After completing this project, consider these follow-up projects that build directly on your evaluation harness:
- **Sim Agent with Realism Metrics**: Use your closed-loop evaluation framework to train and evaluate a learned sim agent (e.g., a behavior cloning policy or a diffusion-based trajectory generator) that optimizes for WOSAC realism scores. Your metric suite becomes the reward signal.
- **Adversarial Scenario Generator**: Extend your harness to identify scenarios where a given policy fails in closed-loop but succeeds in open-loop. Then build a scenario perturbation system that generates adversarial variants (modified agent trajectories, adjusted timing, shifted initial conditions) to stress-test policies at the boundary of their capability.
- **Multi-Agent Closed-Loop Evaluation**: Scale from single-ego evaluation to full multi-agent evaluation where every agent in the scene is controlled by a policy. This requires extending your metric computation to handle N-agent interactions and adapting your correlation analysis to multi-agent realism scores.