Track A: Neural Simulation | Beginner | 12-15 hours

Neural Sim Quality Evaluator

Build a metrics pipeline for evaluating neural simulation quality.

Python | Metrics Computation | Visualization


Build a comprehensive metrics pipeline to measure how close neural-rendered driving scenes are to reality -- from pixel-level fidelity to downstream perception safety.

Overview

Neural rendering techniques like NeRF and 3D Gaussian Splatting promise photorealistic sensor simulation for autonomous driving. But "photorealistic" is subjective -- how do you actually measure whether a rendered image is good enough to train and test a self-driving stack? This project tackles that question head-on by building a complete evaluation pipeline that quantifies simulation quality at every level of abstraction.

You will implement a hierarchy of image quality metrics, starting from simple pixel-level measures (PSNR, MSE), progressing through structural and perceptual metrics (SSIM, LPIPS), scaling up to distributional measures (FID), and culminating in the metric that matters most for AD: downstream task performance (detection mAP/IoU). Along the way, you will discover a critical insight that drives modern simulation research -- metrics that look good on paper (high PSNR) can be misleading, while metrics aligned with human perception and task performance tell the real story.

By the end of this project, you will have a reusable Python toolkit that accepts pairs or sets of images (rendered vs. real), computes a full suite of quality metrics, and produces a visual dashboard summarizing simulation fidelity. This is the same type of evaluation pipeline used internally at companies like Waymo, NVIDIA, and Applied Intuition to validate their neural simulation systems before deploying them for safety-critical testing.

Learning Objectives

  • Implement classical image quality metrics (PSNR, MSE, SSIM) from scratch in NumPy and understand their mathematical foundations.
  • Use learned perceptual metrics (LPIPS) and understand why deep features capture human visual similarity better than raw pixels.
  • Build a FID computation pipeline using InceptionV3 features to evaluate the distributional quality of rendered image sets.
  • Connect simulation quality to downstream safety by measuring how rendering artifacts degrade object detection performance (mAP, IoU).
  • Critically analyze metric agreement and disagreement -- learn when PSNR is misleading, when SSIM breaks down, and why no single number captures "simulation quality."
  • Create publication-quality visualizations including side-by-side comparisons, error heatmaps, radar plots, and automated PDF reports.

Prerequisites

  • Required: Python 3.9+, NumPy, Matplotlib fundamentals (array operations, broadcasting, basic plotting).
  • Recommended: Basic image processing concepts (pixels, color channels, convolution, Gaussian blur). Familiarity with PyTorch tensors is helpful for Steps 4-6 but not strictly necessary.
  • Deep Dive Reading: Neural Rendering for AD Simulation -- read at least the Executive Summary and Background sections to understand why we need these metrics.

Key Concepts

Pixel-Level Metrics (PSNR, MSE)

Mean Squared Error (MSE) is the simplest measure of image difference: take the squared difference of every corresponding pixel, then average. Given two images I and J of dimensions H x W x C:

MSE = (1 / (H * W * C)) * sum((I[i,j,c] - J[i,j,c])^2)

Peak Signal-to-Noise Ratio (PSNR) converts MSE into a logarithmic decibel scale, making it easier to compare across images with different dynamic ranges:

PSNR = 10 * log10(MAX^2 / MSE)

where MAX is the maximum possible pixel value (255 for 8-bit images, 1.0 for normalized float images).
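
For example, an 8-bit image pair with MSE = 100 gives PSNR = 10 * log10(255^2 / 100), which works out to about 28.1 dB.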

Intuition: PSNR tells you the ratio of the "signal" (the image content) to the "noise" (the error). Higher is better. Typical ranges: 20-25 dB is noticeable degradation, 30-35 dB is good quality, 40+ dB is nearly indistinguishable.

When it works well: PSNR is reliable for measuring simple, uniform degradations like Gaussian noise or JPEG compression at a fixed quality level. It is fast to compute and universally understood.

When it fails: PSNR is notoriously poor at capturing perceptual quality. Shifting an image by a single pixel can produce a terrible PSNR score even though the shifted image looks identical to the original. Conversely, a blurry image can have decent PSNR while looking clearly worse to a human. This is the fundamental limitation that motivates all the metrics that follow.

Structural Similarity (SSIM)

SSIM was designed to address PSNR's perceptual blindness by comparing images along three dimensions that align with the human visual system. For local image patches x and y:

SSIM(x, y) = l(x, y) * c(x, y) * s(x, y)

The three components are:

  1. Luminance comparison l(x, y): Compares mean intensities. Humans are sensitive to absolute brightness differences.

    l(x, y) = (2 * mu_x * mu_y + C1) / (mu_x^2 + mu_y^2 + C1)
    
  2. Contrast comparison c(x, y): Compares standard deviations. Captures whether local texture energy is preserved.

    c(x, y) = (2 * sigma_x * sigma_y + C2) / (sigma_x^2 + sigma_y^2 + C2)
    
  3. Structure comparison s(x, y): Compares correlation between normalized patches. Captures whether edges and patterns are preserved.

    s(x, y) = (sigma_xy + C3) / (sigma_x * sigma_y + C3)
    

Constants C1, C2, C3 are small stabilizers to avoid division by zero, derived from the dynamic range: C1 = (K1 * L)^2, C2 = (K2 * L)^2, and C3 = C2 / 2, with K1 = 0.01, K2 = 0.03, and L = 255 for 8-bit images (so C1 is about 6.50 and C2 about 58.52).

SSIM is computed over sliding windows (typically 11x11 Gaussian-weighted) and averaged across the image. Values range from -1 to 1, with 1 indicating perfect similarity.

Why it's better than PSNR: SSIM is insensitive to uniform brightness/contrast shifts (which PSNR penalizes harshly) and focuses on structural patterns that humans actually notice. For neural rendering evaluation, SSIM catches artifacts like edge blurring and texture loss that PSNR might miss.

Multi-Scale SSIM (MS-SSIM) extends SSIM by computing it at multiple resolutions (downsampled by factors of 2), then combining the results. This captures quality at different viewing distances and is generally a stronger metric than single-scale SSIM.

Perceptual Metrics (LPIPS)

Learned Perceptual Image Patch Similarity (LPIPS) takes a fundamentally different approach: instead of hand-designing a formula, it learns what "similar" means by training on human perceptual judgments.

The idea is elegant: pass both images through a pretrained deep network (typically VGG-16 or AlexNet), extract feature activations at multiple layers, normalize them, and compute a weighted L2 distance in feature space. The weights are learned from a dataset of human "two-alternative forced choice" experiments -- people were shown a reference image and two distorted versions, then asked which distortion looks more similar to the reference.

LPIPS(I, J) = sum_l( w_l * ||phi_l(I) - phi_l(J)||_2^2 )

where phi_l extracts normalized features from layer l, and w_l are learned weights.

Why it works: Early CNN layers capture low-level features (edges, textures) while deeper layers capture high-level semantics (objects, scene layout). This multi-scale feature comparison mirrors how humans evaluate similarity -- we do not compare pixel values; we compare patterns, textures, and objects.

For neural rendering: LPIPS is now the de facto standard metric in nearly every neural rendering paper (NeRF, 3DGS, NeuRAD, SplatAD). It correlates much better with human judgments than PSNR or SSIM, and it penalizes the "blurry but pixel-accurate" failure mode that NeRF-based methods are prone to.

Lower LPIPS is better (it measures distance, not similarity).

Distribution-Level Metrics (FID)

All metrics so far operate on pairs of images. But in simulation, we often need to answer a different question: "Does my renderer produce images that, taken as a set, look like real driving images?" This is a distributional question.

Frechet Inception Distance (FID) measures the distance between two distributions of images. The procedure:

  1. Pass all images from both sets through a pretrained InceptionV3 network and extract the 2048-dimensional feature vector from the penultimate layer.
  2. Fit a multivariate Gaussian to each set: compute the mean vector (mu) and covariance matrix (Sigma) of the features.
  3. Compute the Frechet distance between the two Gaussians:
FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 * (Sigma_r * Sigma_g)^(1/2))

where subscripts r and g denote "real" and "generated" (rendered) distributions.

What FID captures that per-image metrics don't: FID measures whether the rendered set has the same diversity and statistical properties as the real set. A renderer that produces perfect copies of a few images would score well on per-image metrics but poorly on FID because it lacks diversity. Conversely, FID catches systematic biases like consistently wrong sky color or missing rare object types.

Practical notes: FID requires a reasonable sample size (at least 2,048 images per set, ideally 10,000+) to produce stable estimates. Smaller sets yield high-variance FID scores. Always report the sample size alongside FID.

Lower FID is better. Typical values: FID < 10 is excellent, 10-50 is reasonable, > 50 indicates significant distribution mismatch.

Downstream Task Metrics

The ultimate question for AD simulation is not "do the images look good?" but "do perception models perform the same on rendered images as on real images?" This is where downstream task metrics come in.

Detection mAP (mean Average Precision): Run a pretrained object detector on both real images and their rendered counterparts. Compute mAP for each set. The gap in mAP directly measures how much simulation artifacts degrade perception performance.

IoU (Intersection over Union): For corresponding detections in real vs. rendered images, compute IoU between bounding boxes. Low IoU means the renderer is distorting object appearance or position enough to shift detector outputs.

Per-category analysis: Often, renderers fail on specific categories (e.g., pedestrians rendered poorly while vehicles look fine). Breaking down mAP by category reveals exactly where the simulation pipeline needs improvement.

This is the metric that Applied Intuition, Waymo, and other companies care about most -- a renderer with mediocre PSNR but zero mAP gap is more useful than one with perfect PSNR but a 5% mAP drop.

Step-by-Step Implementation Guide

Step 1: Environment Setup (30 min)

Create and activate a virtual environment:

mkdir -p neural-sim-evaluator && cd neural-sim-evaluator
python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install torch torchvision          # PyTorch for LPIPS and detection models
pip install scikit-image               # Reference PSNR/SSIM implementations
pip install lpips                      # Learned Perceptual Image Patch Similarity
pip install scipy                      # For FID matrix square root
pip install matplotlib seaborn         # Visualization
pip install Pillow                     # Image loading
pip install tqdm                       # Progress bars
pip install pandas                     # Tabular results
pip install fpdf2                      # PDF report generation

Or create a requirements.txt:

torch>=2.0
torchvision>=0.15
scikit-image>=0.21
lpips>=0.1.4
scipy>=1.11
matplotlib>=3.7
seaborn>=0.12
Pillow>=10.0
tqdm>=4.65
pandas>=2.0
fpdf2>=2.7

Prepare sample data:

For this project, you need pairs of images: "real" driving images and their "rendered" counterparts. Options:

  1. KITTI dataset (easiest): Download a subset of KITTI images. Create synthetic degradations (blur, noise, color shifts) to simulate rendering artifacts.
  2. nuScenes mini split: Free, includes camera images from real driving. Pair with augmented versions.
  3. NeRF/3DGS paper data: Some papers release their rendered outputs alongside ground-truth images (e.g., NeuRAD provides evaluation renders).

For getting started, create a simple test with synthetic degradation:

# verify_setup.py
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
import torch
import lpips

# Create a simple test image pair
np.random.seed(42)
img_real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
noise = np.random.normal(0, 25, img_real.shape).astype(np.float64)
img_rendered = np.clip(img_real.astype(np.float64) + noise, 0, 255).astype(np.uint8)

# Test PSNR
psnr = peak_signal_noise_ratio(img_real, img_rendered)
print(f"PSNR: {psnr:.2f} dB")

# Test SSIM
ssim = structural_similarity(img_real, img_rendered, channel_axis=2)
print(f"SSIM: {ssim:.4f}")

# Test LPIPS
loss_fn = lpips.LPIPS(net='vgg')
t_real = torch.from_numpy(img_real).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1
t_rend = torch.from_numpy(img_rendered).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1
lpips_val = loss_fn(t_real, t_rend).item()
print(f"LPIPS: {lpips_val:.4f}")

print("\nSetup verified successfully!")

Run it:

python verify_setup.py

You should see PSNR around 20 dB for sigma=25 noise; the SSIM and LPIPS values are less meaningful here because the reference is itself pure random noise, so treat them only as a smoke test. If all three print without errors, your environment is ready.
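
Later steps (FID and detection) also expect paired directories of real and rendered images with matching filenames. Below is a minimal sketch that fabricates a "rendered" set by degrading real images -- the data/real/ and data/rendered/ paths and the specific blur/noise levels are placeholder assumptions, not part of any dataset:

# make_paired_data.py  (hypothetical helper -- adjust paths and degradations to your data)
from pathlib import Path

import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

REAL_DIR = Path('data/real')          # assumed location of real driving images
RENDERED_DIR = Path('data/rendered')  # degraded copies standing in for neural renders
RENDERED_DIR.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(0)

for path in sorted(REAL_DIR.glob('*.png')) + sorted(REAL_DIR.glob('*.jpg')):
    img = np.asarray(Image.open(path).convert('RGB'), dtype=np.float64)

    # Simulate typical rendering artifacts: mild blur plus sensor-like noise
    degraded = gaussian_filter(img, sigma=[1.0, 1.0, 0.0])
    degraded = degraded + rng.normal(0, 8, size=img.shape)

    # Keep the same filename so the evaluators can pair images by name
    out = np.clip(degraded, 0, 255).astype(np.uint8)
    Image.fromarray(out).save(RENDERED_DIR / path.name)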

Step 2: Implement PSNR and MSE (1.5 hours)

Goal: Implement MSE and PSNR from scratch, understand their behavior, and validate against reference implementations.

The math:

MSE(I, J) = (1/N) * sum_{i=1}^{N} (I_i - J_i)^2

PSNR(I, J) = 10 * log10(MAX^2 / MSE(I, J))

where N is the total number of pixel values (H x W x C) and MAX is the peak pixel value.

Implementation:

# metrics/pixel_metrics.py
import numpy as np


def compute_mse(img1: np.ndarray, img2: np.ndarray) -> float:
    """
    Compute Mean Squared Error between two images.

    Args:
        img1: Reference image, shape (H, W) or (H, W, C), dtype uint8 or float.
        img2: Test image, same shape and dtype as img1.

    Returns:
        MSE value (float). Lower is better. 0 means identical images.
    """
    assert img1.shape == img2.shape, f"Shape mismatch: {img1.shape} vs {img2.shape}"
    # Convert to float64 to avoid overflow with uint8
    diff = img1.astype(np.float64) - img2.astype(np.float64)
    return float(np.mean(diff ** 2))


def compute_psnr(
    img1: np.ndarray,
    img2: np.ndarray,
    max_val: float | None = None,
) -> float:
    """
    Compute Peak Signal-to-Noise Ratio between two images.

    Args:
        img1: Reference image.
        img2: Test image, same shape as img1.
        max_val: Maximum possible pixel value. Auto-detected if None
                 (255 for uint8, 1.0 for float images).

    Returns:
        PSNR in decibels (dB). Higher is better.
        Returns float('inf') if images are identical.
    """
    if max_val is None:
        if img1.dtype == np.uint8:
            max_val = 255.0
        else:
            max_val = 1.0

    mse = compute_mse(img1, img2)

    if mse == 0.0:
        return float('inf')

    return float(10.0 * np.log10(max_val ** 2 / mse))

Handle edge cases:

# Edge case tests
identical = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
assert compute_mse(identical, identical) == 0.0
assert compute_psnr(identical, identical) == float('inf')

black = np.zeros((64, 64, 3), dtype=np.uint8)
white = np.full((64, 64, 3), 255, dtype=np.uint8)
assert compute_mse(black, white) == 255.0 ** 2  # Maximum possible MSE
assert compute_psnr(black, white) == 0.0          # Minimum possible PSNR

Validate against scikit-image:

from skimage.metrics import peak_signal_noise_ratio as ski_psnr

img1 = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
img2 = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)

ours = compute_psnr(img1, img2)
ref = ski_psnr(img1, img2)

print(f"Ours: {ours:.6f} dB | scikit-image: {ref:.6f} dB | Diff: {abs(ours - ref):.2e}")
# Difference should be < 1e-10

Exercise -- PSNR vs. noise level:

import matplotlib.pyplot as plt

img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
sigmas = np.arange(1, 101, 2)
psnr_values = []

for sigma in sigmas:
    noise = np.random.normal(0, sigma, img.shape)
    noisy = np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    psnr_values.append(compute_psnr(img, noisy))

plt.figure(figsize=(8, 5))
plt.plot(sigmas, psnr_values, 'b-', linewidth=2)
plt.xlabel('Noise Standard Deviation (sigma)')
plt.ylabel('PSNR (dB)')
plt.title('PSNR vs. Gaussian Noise Level')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('psnr_vs_noise.png', dpi=150)
plt.show()

You should see a smooth monotonic decrease: for additive Gaussian noise (before clipping), MSE is roughly sigma^2, so PSNR drops roughly as 20 * log10(MAX / sigma).

Step 3: Implement SSIM (2 hours)

Goal: Implement SSIM with Gaussian-weighted windows, understand each component, and extend to multi-scale.

Gaussian window:

# metrics/ssim.py
import numpy as np
from scipy.ndimage import uniform_filter


def _gaussian_window(size: int = 11, sigma: float = 1.5) -> np.ndarray:
    """Create a 2D Gaussian kernel for SSIM windowing.

    Note: the reference implementation below uses scipy's uniform_filter for
    simplicity; this kernel is provided so you can experiment with the
    Gaussian weighting used in the original SSIM paper.
    """
    coords = np.arange(size) - size // 2
    g = np.exp(-(coords ** 2) / (2 * sigma ** 2))
    window = np.outer(g, g)
    return window / window.sum()

SSIM components:

def _ssim_components(
    img1: np.ndarray,
    img2: np.ndarray,
    window_size: int = 11,
    K1: float = 0.01,
    K2: float = 0.03,
    L: float = 255.0,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute luminance, contrast, and structure comparison maps.

    Returns:
        Tuple of (luminance_map, contrast_map, structure_map),
        each of shape (H, W).
    """
    C1 = (K1 * L) ** 2
    C2 = (K2 * L) ** 2
    C3 = C2 / 2

    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)

    # Local means (mu)
    mu1 = uniform_filter(img1, size=window_size)
    mu2 = uniform_filter(img2, size=window_size)

    # Local variances and covariance
    mu1_sq = mu1 ** 2
    mu2_sq = mu2 ** 2
    mu1_mu2 = mu1 * mu2

    sigma1_sq = uniform_filter(img1 ** 2, size=window_size) - mu1_sq
    sigma2_sq = uniform_filter(img2 ** 2, size=window_size) - mu2_sq
    sigma12 = uniform_filter(img1 * img2, size=window_size) - mu1_mu2

    # Clamp variances to avoid numerical issues
    sigma1_sq = np.maximum(sigma1_sq, 0.0)
    sigma2_sq = np.maximum(sigma2_sq, 0.0)

    sigma1 = np.sqrt(sigma1_sq)
    sigma2 = np.sqrt(sigma2_sq)

    # Three components
    luminance = (2 * mu1_mu2 + C1) / (mu1_sq + mu2_sq + C1)
    contrast = (2 * sigma1 * sigma2 + C2) / (sigma1_sq + sigma2_sq + C2)
    structure = (sigma12 + C3) / (sigma1 * sigma2 + C3)

    return luminance, contrast, structure

Full SSIM computation:

def compute_ssim(
    img1: np.ndarray,
    img2: np.ndarray,
    window_size: int = 11,
    channel_axis: int | None = 2,
) -> float:
    """
    Compute the mean Structural Similarity Index between two images.

    Args:
        img1: Reference image, shape (H, W) or (H, W, C).
        img2: Test image, same shape as img1.
        window_size: Size of the sliding window (default: 11).
        channel_axis: Axis for color channels. None for grayscale.

    Returns:
        Mean SSIM value in range [-1, 1]. 1 means identical.
    """
    if channel_axis is not None and img1.ndim == 3:
        # Compute per-channel and average
        n_channels = img1.shape[channel_axis]
        ssim_per_channel = []
        for c in range(n_channels):
            ch1 = img1[:, :, c] if channel_axis == 2 else img1[c]
            ch2 = img2[:, :, c] if channel_axis == 2 else img2[c]
            lum, con, struct = _ssim_components(ch1, ch2, window_size)
            ssim_map = lum * con * struct
            ssim_per_channel.append(np.mean(ssim_map))
        return float(np.mean(ssim_per_channel))
    else:
        lum, con, struct = _ssim_components(img1, img2, window_size)
        ssim_map = lum * con * struct
        return float(np.mean(ssim_map))

Multi-Scale SSIM (MS-SSIM):

def compute_ms_ssim(
    img1: np.ndarray,
    img2: np.ndarray,
    weights: list[float] | None = None,
    n_scales: int = 5,
) -> float:
    """
    Compute Multi-Scale SSIM.

    Iteratively downsamples images and computes SSIM components
    at each scale, then combines them with learned weights.
    """
    if weights is None:
        # Default weights from the original MS-SSIM paper
        weights = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

    assert len(weights) == n_scales

    # Convert to grayscale for simplicity (or loop over channels)
    if img1.ndim == 3:
        img1 = np.mean(img1.astype(np.float64), axis=2)
        img2 = np.mean(img2.astype(np.float64), axis=2)

    mssim_components = []

    for scale in range(n_scales):
        lum, con, struct = _ssim_components(img1, img2)

        if scale < n_scales - 1:
            # Store contrast * structure for intermediate scales
            cs = np.mean(con * struct)
            mssim_components.append(cs)
            # Downsample by factor of 2
            img1 = img1[::2, ::2]
            img2 = img2[::2, ::2]
        else:
            # At the coarsest scale, include luminance
            mssim_components.append(np.mean(lum * con * struct))

    # Weighted product (clamp components to avoid NaN from raising a
    # negative value to a fractional power)
    ms_ssim = 1.0
    for w, comp in zip(weights, mssim_components):
        ms_ssim *= max(comp, 1e-6) ** w

    return float(ms_ssim)

Validate against scikit-image:

from skimage.metrics import structural_similarity as ski_ssim

img1 = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
img2 = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)

ours = compute_ssim(img1, img2)
ref = ski_ssim(img1, img2, channel_axis=2)

print(f"Ours: {ours:.6f} | scikit-image: {ref:.6f} | Diff: {abs(ours - ref):.4f}")
# Note: small differences are expected because the windowing differs
# (our 11x11 uniform window vs. scikit-image's default 7x7 window).
# Values should agree to within ~0.02.

Step 4: Implement LPIPS (2 hours)

Goal: Use the lpips library, understand what it does internally, and discover cases where perceptual metrics disagree with pixel metrics.

Basic usage:

# metrics/perceptual_metrics.py
import torch
import lpips
import numpy as np
from PIL import Image


class LPIPSMetric:
    """Wrapper for LPIPS perceptual similarity metric."""

    def __init__(self, net: str = 'vgg', use_gpu: bool = False):
        """
        Args:
            net: Backbone network ('vgg', 'alex', or 'squeeze').
                 'vgg' is most common in rendering papers.
            use_gpu: Whether to use CUDA.
        """
        self.device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
        self.loss_fn = lpips.LPIPS(net=net).to(self.device)
        self.loss_fn.eval()

    def _preprocess(self, img: np.ndarray) -> torch.Tensor:
        """Convert numpy image to LPIPS input format: (1, 3, H, W), range [-1, 1]."""
        if img.dtype == np.uint8:
            tensor = torch.from_numpy(img).float() / 127.5 - 1.0
        else:
            tensor = torch.from_numpy(img).float() * 2.0 - 1.0
        if tensor.ndim == 3:
            tensor = tensor.permute(2, 0, 1).unsqueeze(0)
        return tensor.to(self.device)

    def compute(self, img1: np.ndarray, img2: np.ndarray) -> float:
        """
        Compute LPIPS distance between two images.

        Returns:
            LPIPS distance (float). Lower is better. 0 means perceptually identical.
        """
        with torch.no_grad():
            t1 = self._preprocess(img1)
            t2 = self._preprocess(img2)
            return self.loss_fn(t1, t2).item()

    def compute_batch(self, imgs1: list[np.ndarray], imgs2: list[np.ndarray]) -> list[float]:
        """Compute LPIPS for a batch of image pairs."""
        results = []
        with torch.no_grad():
            for i1, i2 in zip(imgs1, imgs2):
                results.append(self.compute(i1, i2))
        return results

Key experiment -- when PSNR and LPIPS disagree:

def demonstrate_metric_disagreement():
    """
    Show cases where PSNR ranking differs from LPIPS ranking.
    This is the key insight of perceptual metrics.
    """
    from metrics.pixel_metrics import compute_psnr

    # Load a reference image (use any driving scene image)
    ref = np.array(Image.open('data/reference.jpg').resize((256, 256)))

    lpips_metric = LPIPSMetric()

    distortions = {}

    # 1. Gaussian blur (looks bad, but PSNR is moderate)
    from scipy.ndimage import gaussian_filter
    blurred = gaussian_filter(ref.astype(np.float64), sigma=[3, 3, 0])
    blurred = np.clip(blurred, 0, 255).astype(np.uint8)
    distortions['Blur (sigma=3)'] = blurred

    # 2. Small spatial shift (looks fine, but PSNR is terrible)
    shifted = np.roll(ref, shift=2, axis=1)
    distortions['Shift (2px right)'] = shifted

    # 3. Gaussian noise (looks grainy, PSNR proportional to noise level)
    noise = np.random.normal(0, 15, ref.shape)
    noisy = np.clip(ref.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    distortions['Noise (sigma=15)'] = noisy

    # 4. JPEG compression (slight blocking, PSNR moderate)
    from io import BytesIO
    buf = BytesIO()
    Image.fromarray(ref).save(buf, format='JPEG', quality=10)
    buf.seek(0)
    jpeg = np.array(Image.open(buf))
    distortions['JPEG (q=10)'] = jpeg

    print(f"{'Distortion':<25} {'PSNR (dB)':>12} {'LPIPS':>12} {'PSNR Rank':>12} {'LPIPS Rank':>12}")
    print("-" * 75)

    results = []
    for name, dist in distortions.items():
        psnr = compute_psnr(ref, dist)
        lpips_val = lpips_metric.compute(ref, dist)
        results.append((name, psnr, lpips_val))

    # Rank by each metric (higher PSNR = better, lower LPIPS = better)
    psnr_ranked = sorted(results, key=lambda x: -x[1])
    lpips_ranked = sorted(results, key=lambda x: x[2])

    psnr_ranks = {name: i+1 for i, (name, _, _) in enumerate(psnr_ranked)}
    lpips_ranks = {name: i+1 for i, (name, _, _) in enumerate(lpips_ranked)}

    for name, psnr, lpips_val in results:
        print(f"{name:<25} {psnr:>12.2f} {lpips_val:>12.4f} {psnr_ranks[name]:>12} {lpips_ranks[name]:>12}")

You will typically observe that the spatial shift has terrible PSNR but excellent LPIPS, while blur has decent PSNR but poor LPIPS. This demonstrates why the rendering community adopted LPIPS -- it aligns with what humans perceive.

Step 5: Build FID Pipeline (2.5 hours)

Goal: Extract InceptionV3 features from image sets and compute the Frechet distance between their distributions.

# metrics/fid.py
import numpy as np
import torch
import torch.nn as nn
from torchvision import models, transforms
from scipy.linalg import sqrtm
from tqdm import tqdm
from PIL import Image
from pathlib import Path


class FIDComputer:
    """Compute Frechet Inception Distance between two image sets."""

    def __init__(self, use_gpu: bool = False):
        self.device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
        self.model = self._load_inception()
        self.transform = transforms.Compose([
            transforms.Resize((299, 299)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def _load_inception(self) -> nn.Module:
        """Load InceptionV3 and remove the final classification layer."""
        # The `pretrained=` flag is deprecated in recent torchvision; use the weights enum
        inception = models.inception_v3(
            weights=models.Inception_V3_Weights.DEFAULT, transform_input=False
        )
        # We want features from the last pooling layer (2048-dim)
        inception.fc = nn.Identity()
        inception.to(self.device)
        inception.eval()
        return inception

    def extract_features(self, image_dir: str | Path, batch_size: int = 32) -> np.ndarray:
        """
        Extract 2048-dim InceptionV3 features from all images in a directory.

        Returns:
            Feature matrix of shape (N, 2048).
        """
        image_dir = Path(image_dir)
        image_paths = sorted(
            p for p in image_dir.iterdir()
            if p.suffix.lower() in ('.jpg', '.jpeg', '.png', '.bmp')
        )

        features_list = []

        with torch.no_grad():
            batch = []
            for path in tqdm(image_paths, desc="Extracting features"):
                img = Image.open(path).convert('RGB')
                tensor = self.transform(img)
                batch.append(tensor)

                if len(batch) == batch_size:
                    batch_tensor = torch.stack(batch).to(self.device)
                    feats = self.model(batch_tensor)
                    features_list.append(feats.cpu().numpy())
                    batch = []

            if batch:
                batch_tensor = torch.stack(batch).to(self.device)
                feats = self.model(batch_tensor)
                features_list.append(feats.cpu().numpy())

        return np.concatenate(features_list, axis=0)

    def compute_statistics(self, features: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """Compute mean and covariance of feature vectors."""
        mu = np.mean(features, axis=0)
        sigma = np.cov(features, rowvar=False)
        return mu, sigma

    def compute_fid(self, mu1, sigma1, mu2, sigma2) -> float:
        """
        Compute Frechet distance between two multivariate Gaussians.

        FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2*sqrt(sigma1 @ sigma2))
        """
        diff = mu1 - mu2
        diff_sq = np.dot(diff, diff)

        # Compute matrix square root of product
        covmean, _ = sqrtm(sigma1 @ sigma2, disp=False)

        # Numerical stability: remove imaginary components from roundoff
        if np.iscomplexobj(covmean):
            if not np.allclose(np.imag(covmean), 0, atol=1e-3):
                raise ValueError("Imaginary component in sqrtm result is too large.")
            covmean = np.real(covmean)

        trace = np.trace(sigma1 + sigma2 - 2.0 * covmean)

        return float(diff_sq + trace)

    def evaluate(self, real_dir: str, rendered_dir: str) -> float:
        """
        Compute FID between two directories of images.

        Args:
            real_dir: Path to directory of real images.
            rendered_dir: Path to directory of rendered images.

        Returns:
            FID score (lower is better).
        """
        print("Extracting features from real images...")
        feats_real = self.extract_features(real_dir)
        print(f"  -> {feats_real.shape[0]} images, {feats_real.shape[1]}-dim features")

        print("Extracting features from rendered images...")
        feats_rendered = self.extract_features(rendered_dir)
        print(f"  -> {feats_rendered.shape[0]} images, {feats_rendered.shape[1]}-dim features")

        mu1, sigma1 = self.compute_statistics(feats_real)
        mu2, sigma2 = self.compute_statistics(feats_rendered)

        fid = self.compute_fid(mu1, sigma1, mu2, sigma2)
        print(f"\nFID: {fid:.2f}")

        return fid

Discuss sample size and confidence intervals:

def fid_sample_size_experiment(real_dir: str, rendered_dir: str):
    """
    Show how FID varies with sample size.
    Demonstrates why you need 2000+ images for reliable FID.
    """
    fid_computer = FIDComputer()
    feats_real = fid_computer.extract_features(real_dir)
    feats_rendered = fid_computer.extract_features(rendered_dir)

    sample_sizes = [50, 100, 250, 500, 1000, 2000, len(feats_real)]
    n_trials = 5

    for n in sample_sizes:
        if n > len(feats_real):
            continue
        fids = []
        for trial in range(n_trials):
            idx_r = np.random.choice(len(feats_real), size=n, replace=False)
            idx_g = np.random.choice(len(feats_rendered), size=n, replace=False)
            mu1, s1 = fid_computer.compute_statistics(feats_real[idx_r])
            mu2, s2 = fid_computer.compute_statistics(feats_rendered[idx_g])
            fids.append(fid_computer.compute_fid(mu1, s1, mu2, s2))

        print(f"N={n:>5d}: FID = {np.mean(fids):.2f} +/- {np.std(fids):.2f}")

Step 6: Downstream Detection Evaluation (2 hours)

Goal: Measure how simulation quality impacts a real perception task -- object detection.

# metrics/detection_metrics.py
import torch
import numpy as np
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2, FasterRCNN_ResNet50_FPN_V2_Weights
from torchvision import transforms
from PIL import Image
from pathlib import Path


COCO_VEHICLE_CLASSES = {3: 'car', 6: 'bus', 8: 'truck'}
COCO_PERSON_CLASS = {1: 'person'}
AD_CLASSES = {**COCO_VEHICLE_CLASSES, **COCO_PERSON_CLASS}


class DetectionEvaluator:
    """Evaluate how rendering quality affects object detection."""

    def __init__(self, score_threshold: float = 0.5, use_gpu: bool = False):
        self.device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
        self.score_threshold = score_threshold
        weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
        self.model = fasterrcnn_resnet50_fpn_v2(weights=weights).to(self.device)
        self.model.eval()
        self.preprocess = weights.transforms()

    def detect(self, image: np.ndarray) -> dict:
        """
        Run object detection on a single image.

        Returns:
            Dict with 'boxes', 'labels', 'scores' arrays.
        """
        img_tensor = self.preprocess(
            torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
        ).unsqueeze(0).to(self.device)

        with torch.no_grad():
            predictions = self.model(img_tensor)[0]

        # Filter by score and AD-relevant classes
        mask = predictions['scores'] >= self.score_threshold
        boxes = predictions['boxes'][mask].cpu().numpy()
        labels = predictions['labels'][mask].cpu().numpy()
        scores = predictions['scores'][mask].cpu().numpy()

        # Keep only AD-relevant classes
        ad_mask = np.isin(labels, list(AD_CLASSES.keys()))

        return {
            'boxes': boxes[ad_mask],
            'labels': labels[ad_mask],
            'scores': scores[ad_mask],
        }

    @staticmethod
    def compute_iou(box1: np.ndarray, box2: np.ndarray) -> float:
        """Compute Intersection over Union between two boxes [x1,y1,x2,y2]."""
        x1 = max(box1[0], box2[0])
        y1 = max(box1[1], box2[1])
        x2 = min(box1[2], box2[2])
        y2 = min(box1[3], box2[3])

        intersection = max(0, x2 - x1) * max(0, y2 - y1)
        area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
        area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
        union = area1 + area2 - intersection

        return intersection / union if union > 0 else 0.0

    def compare_detections(
        self,
        real_img: np.ndarray,
        rendered_img: np.ndarray,
        iou_threshold: float = 0.5,
    ) -> dict:
        """
        Compare detection results between real and rendered versions
        of the same scene.

        Returns:
            Dict with per-category counts of matched/missed/extra detections
            and average IoU for matched pairs.
        """
        det_real = self.detect(real_img)
        det_rendered = self.detect(rendered_img)

        results = {
            'real_count': len(det_real['boxes']),
            'rendered_count': len(det_rendered['boxes']),
            'matched': 0,
            'missed': 0,      # In real but not rendered
            'extra': 0,       # In rendered but not real
            'avg_iou': 0.0,
            'per_category': {},
        }

        if len(det_real['boxes']) == 0 and len(det_rendered['boxes']) == 0:
            return results

        # Greedy matching: for each real detection, find best-matching rendered detection
        matched_rendered = set()
        ious = []

        for i, (box_r, label_r) in enumerate(zip(det_real['boxes'], det_real['labels'])):
            best_iou = 0
            best_j = -1
            for j, (box_g, label_g) in enumerate(zip(det_rendered['boxes'], det_rendered['labels'])):
                if j in matched_rendered or label_g != label_r:
                    continue
                iou = self.compute_iou(box_r, box_g)
                if iou > best_iou:
                    best_iou = iou
                    best_j = j
            if best_iou >= iou_threshold and best_j >= 0:
                results['matched'] += 1
                matched_rendered.add(best_j)
                ious.append(best_iou)
            else:
                results['missed'] += 1

        results['extra'] = len(det_rendered['boxes']) - len(matched_rendered)
        results['avg_iou'] = float(np.mean(ious)) if ious else 0.0

        return results

    def evaluate_directory_pair(
        self,
        real_dir: str,
        rendered_dir: str,
    ) -> dict:
        """
        Evaluate detection consistency across paired image directories.
        Assumes images in both directories have matching filenames.
        """
        real_dir = Path(real_dir)
        rendered_dir = Path(rendered_dir)

        real_images = sorted(real_dir.glob('*.png')) + sorted(real_dir.glob('*.jpg'))
        all_results = []

        for real_path in real_images:
            rendered_path = rendered_dir / real_path.name
            if not rendered_path.exists():
                continue

            real_img = np.array(Image.open(real_path).convert('RGB'))
            rendered_img = np.array(Image.open(rendered_path).convert('RGB'))

            result = self.compare_detections(real_img, rendered_img)
            result['filename'] = real_path.name
            all_results.append(result)

        # Aggregate
        total_matched = sum(r['matched'] for r in all_results)
        total_missed = sum(r['missed'] for r in all_results)
        total_extra = sum(r['extra'] for r in all_results)
        total_real = sum(r['real_count'] for r in all_results)

        recall = total_matched / total_real if total_real > 0 else 0
        avg_iou = np.mean([r['avg_iou'] for r in all_results if r['avg_iou'] > 0])

        return {
            'n_images': len(all_results),
            'detection_recall': recall,
            'avg_matched_iou': float(avg_iou),
            'total_missed': total_missed,
            'total_extra': total_extra,
            'per_image': all_results,
        }
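
The evaluator above reports detection recall and matched IoU rather than full mAP, which requires ground-truth boxes. If you have labels (or treat the detections on the real images as pseudo-ground-truth), a minimal per-class average precision sketch might look like the following -- average_precision is a hypothetical helper added to the same module, not part of the class above; compute_iou is the static method already defined:

# Hypothetical addition to metrics/detection_metrics.py (numpy is already imported there)
def average_precision(
    pred_boxes: np.ndarray,    # (N, 4) predicted boxes for one class
    pred_scores: np.ndarray,   # (N,) confidence scores
    gt_boxes: np.ndarray,      # (M, 4) ground-truth (or pseudo-GT) boxes for that class
    iou_threshold: float = 0.5,
) -> float:
    """Single-class AP: area under the precision-recall curve at one IoU threshold."""
    if len(gt_boxes) == 0:
        return 0.0

    order = np.argsort(-pred_scores)   # evaluate highest-confidence predictions first
    matched_gt = set()
    tps = np.zeros(len(order))
    fps = np.zeros(len(order))

    for rank, idx in enumerate(order):
        ious = [DetectionEvaluator.compute_iou(pred_boxes[idx], gt) for gt in gt_boxes]
        best = int(np.argmax(ious))
        if ious[best] >= iou_threshold and best not in matched_gt:
            tps[rank] = 1
            matched_gt.add(best)
        else:
            fps[rank] = 1

    tp_cum = np.cumsum(tps)
    fp_cum = np.cumsum(fps)
    recall = tp_cum / len(gt_boxes)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)

    # 101-point interpolated AP: average the best precision at recall >= t
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 101):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 101.0
    return float(ap)

mAP is then simply the mean of this value over the AD-relevant classes; computing it on the real set and the rendered set and reporting the gap gives the downstream metric discussed in Key Concepts.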

Analysis tips: After running the evaluator, examine which categories suffer most. In neural rendering for AD, common failure modes include:

  • Pedestrians rendered with blurry textures, causing missed detections.
  • Distant vehicles losing shape detail, reducing IoU.
  • Reflective surfaces (windshields, wet roads) introducing artifacts that create false positives.

Step 7: Build Evaluation Dashboard (1.5 hours)

Goal: Bring all metrics together into a visual dashboard and automated report.

# dashboard.py
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import json
import pandas as pd
from pathlib import Path


def create_comparison_figure(
    real_img: np.ndarray,
    rendered_img: np.ndarray,
    metrics: dict,
    save_path: str | None = None,
):
    """
    Create a side-by-side comparison with metric overlay.

    Args:
        real_img: Reference real image.
        rendered_img: Rendered image.
        metrics: Dict with keys 'psnr', 'ssim', 'lpips'.
        save_path: Optional path to save the figure.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    axes[0].imshow(real_img)
    axes[0].set_title('Real', fontsize=14, fontweight='bold')
    axes[0].axis('off')

    axes[1].imshow(rendered_img)
    axes[1].set_title('Rendered', fontsize=14, fontweight='bold')
    axes[1].axis('off')

    # Error heatmap
    error = np.mean(np.abs(
        real_img.astype(np.float64) - rendered_img.astype(np.float64)
    ), axis=2)
    im = axes[2].imshow(error, cmap='hot', vmin=0, vmax=50)
    axes[2].set_title('Absolute Error', fontsize=14, fontweight='bold')
    axes[2].axis('off')
    plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)

    # Add metrics text
    metrics_text = (
        f"PSNR: {metrics['psnr']:.2f} dB  |  "
        f"SSIM: {metrics['ssim']:.4f}  |  "
        f"LPIPS: {metrics['lpips']:.4f}"
    )
    fig.suptitle(metrics_text, fontsize=12, y=0.02)

    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()


def create_radar_plot(metrics: dict, save_path: str | None = None):
    """
    Create a radar/spider plot summarizing all metrics.
    Normalizes each metric to [0, 1] where 1 is best.
    """
    # Normalize metrics: map to 0-1 where 1 = best quality
    normalized = {
        'PSNR': np.clip(metrics['psnr'] / 50.0, 0, 1),            # 50 dB = perfect
        'SSIM': np.clip(metrics['ssim'], 0, 1),                     # already 0-1
        'LPIPS': np.clip(1.0 - metrics['lpips'], 0, 1),            # invert: lower = better
        'FID': np.clip(1.0 - metrics.get('fid', 50) / 100.0, 0, 1),  # invert + scale
        'Det. Recall': metrics.get('detection_recall', 0.0),
    }

    categories = list(normalized.keys())
    values = list(normalized.values())
    values += values[:1]  # Close the polygon

    angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(7, 7), subplot_kw=dict(polar=True))
    ax.plot(angles, values, 'o-', linewidth=2, color='#2563eb')
    ax.fill(angles, values, alpha=0.15, color='#2563eb')
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories, fontsize=12)
    ax.set_ylim(0, 1)
    ax.set_yticks([0.25, 0.5, 0.75, 1.0])
    ax.set_yticklabels(['0.25', '0.50', '0.75', '1.00'], fontsize=9)
    ax.set_title('Neural Simulation Quality Profile', fontsize=14, fontweight='bold', pad=20)

    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()


def export_results(metrics: dict, output_dir: str):
    """Export evaluation results to JSON and CSV."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # JSON export
    with open(output_dir / 'metrics.json', 'w') as f:
        json.dump(metrics, f, indent=2, default=str)

    # CSV export for per-image metrics
    if 'per_image' in metrics:
        df = pd.DataFrame(metrics['per_image'])
        df.to_csv(output_dir / 'per_image_metrics.csv', index=False)

    print(f"Results exported to {output_dir}")


def generate_pdf_report(metrics: dict, output_path: str):
    """Generate a one-page PDF summary report."""
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('Helvetica', 'B', 20)
    pdf.cell(0, 15, 'Neural Simulation Quality Report', ln=True, align='C')
    pdf.ln(5)

    pdf.set_font('Helvetica', '', 11)
    pdf.cell(0, 8, f"Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}", ln=True)
    pdf.ln(5)

    # Summary table
    pdf.set_font('Helvetica', 'B', 14)
    pdf.cell(0, 10, 'Metric Summary', ln=True)
    pdf.set_font('Helvetica', '', 11)

    def _fmt(key: str, spec: str, suffix: str = '') -> str:
        # Guard against missing metrics: a format spec applied to the 'N/A' string would raise
        val = metrics.get(key)
        return f"{val:{spec}}{suffix}" if isinstance(val, (int, float)) else 'N/A'

    rows = [
        ('PSNR', _fmt('psnr', '.2f', ' dB'), 'Higher is better (>30 dB = good)'),
        ('SSIM', _fmt('ssim', '.4f'), 'Higher is better (>0.90 = good)'),
        ('LPIPS', _fmt('lpips', '.4f'), 'Lower is better (<0.10 = good)'),
        ('FID', _fmt('fid', '.2f'), 'Lower is better (<10 = excellent)'),
        ('Det. Recall', _fmt('detection_recall', '.1%'), 'Higher is better (>95% = good)'),
    ]

    for metric_name, value, interpretation in rows:
        pdf.cell(50, 8, metric_name, border=1)
        pdf.cell(40, 8, str(value), border=1, align='C')
        pdf.cell(0, 8, interpretation, border=1, ln=True)

    pdf.output(output_path)
    print(f"PDF report saved to {output_path}")

Notebook Exercises

  1. 01_metrics_basics.ipynb (45 min): Implement PSNR and MSE from scratch; plot PSNR vs. noise level; validate against scikit-image; explore edge cases (identical, inverted, shifted images).
  2. 02_ssim_deep_dive.ipynb (60 min): Implement SSIM components individually; visualize luminance, contrast, and structure maps; implement MS-SSIM; compare SSIM vs. PSNR on blur vs. noise distortions.
  3. 03_perceptual_and_fid.ipynb (60 min): Run LPIPS on image pairs; build the PSNR-vs-LPIPS disagreement demo; implement the full FID pipeline; analyze FID stability vs. sample size.
  4. 04_evaluation_dashboard.ipynb (60 min): Run detection evaluation on paired image sets; build side-by-side visualizations with error heatmaps; create radar plots; export JSON/CSV and generate PDF report.

Expected Deliverables

  1. neural_sim_evaluator/ Python package: A well-structured, importable Python module containing implementations of all metrics (PSNR, MSE, SSIM, MS-SSIM, LPIPS wrapper, FID pipeline, detection evaluation) with docstrings and type hints.

  2. CLI evaluation tool: A command-line script (evaluate.py) that accepts paths to real and rendered image directories and produces a full metrics report, invokable as: python evaluate.py --real data/real/ --rendered data/rendered/ --output results/.

  3. Visualization report: A set of generated figures (side-by-side comparisons, error heatmaps, PSNR-vs-LPIPS scatter plot, radar chart, per-category detection bar chart) and a one-page PDF summary, all saved to an output directory.

  4. Written analysis document (1-2 pages): A short write-up explaining which metrics agree, which disagree, and why. Include at least one concrete example where PSNR is misleading and LPIPS reveals the truth. Discuss implications for neural rendering research.

  5. Completed Jupyter notebooks (4 notebooks): All exercises completed with outputs, inline commentary explaining observations, and answers to embedded reflection questions.

Evaluation Criteria

  • Correctness (30%): Metric implementations match reference libraries (scikit-image, lpips) within stated tolerance. FID computation is numerically stable. Detection IoU handles edge cases.
  • Code Quality (20%): Clean, modular Python code. Functions have docstrings and type hints. No code duplication. Reasonable error handling. Package structure follows Python conventions.
  • Analysis Depth (25%): Written analysis demonstrates genuine understanding of when and why metrics agree or disagree. Includes concrete examples with images. Connects findings to implications for AD simulation.
  • Visualization (15%): Comparison figures are clear and publication-quality. Error heatmaps use appropriate colormaps. Radar plots are correctly normalized. Dashboard layout is logical.
  • Documentation (10%): README with installation and usage instructions. CLI tool has --help output. Notebooks include markdown cells explaining each step. Docstrings are accurate and complete.

Deep Dive Readings

  • Neural Rendering for AD Simulation: The foundational reading for this project. Covers NeRF, 3D Gaussian Splatting, NeuRAD, SplatAD, and Applied Intuition's Neural Sim architecture. Read the "Technical Deep Dive" section to understand what the metrics you are building are evaluating.

  • Bridging the Sim-to-Real Gap: Explores the broader challenge of making simulation useful for real-world AD development. The metrics pipeline you build in this project is a direct tool for measuring the sim-to-real gap that this deep dive discusses.

Next Steps

After completing this project, consider these natural follow-ups from Track A (Neural Simulation):

  1. NeRF Scene Reconstructor (Intermediate): Move from evaluating neural renders to generating them. Implement a simplified NeRF pipeline for driving scenes and use your metrics toolkit to benchmark your own renders.

  2. Gaussian Splatting for Driving Scenes (Intermediate): Implement 3D Gaussian Splatting on driving data, including dynamic object handling. Your evaluator becomes the test harness for measuring reconstruction quality.

  3. Sim-to-Real Domain Adaptation (Advanced): Use the metrics from this project to build a feedback loop -- train a domain adaptation model that minimizes the LPIPS/FID gap between rendered and real images, and verify that downstream detection mAP improves as a result.