Dynamic Scene Decomposition

Track A: Neural Simulation · Intermediate · 20-25 hours
Topics: Object Detection · Tracking · Scene Understanding

Separate static background from dynamic actors in driving logs -- the essential preprocessing step for editable, composable autonomous driving simulation.

Overview

Every driving scene is a mixture of two fundamentally different types of content: the static background (road surface, buildings, vegetation, traffic signs) that remains fixed across time, and the dynamic actors (vehicles, pedestrians, cyclists) that move, interact, and create the interesting behaviors we need to simulate. Separating these two components -- scene decomposition -- is the critical first step that enables editable simulation: once you have a clean background and individually modeled actors, you can remove actors, insert new ones, replay the scene with different behaviors, or test edge cases that never occurred in the original log.

This project builds a complete scene decomposition pipeline from scratch. You will start with raw driving data (3D bounding boxes and point clouds), implement multi-object tracking to establish consistent actor identities across frames, use those tracks to segment each frame into static and dynamic regions, and finally reconstruct a clean static background by accumulating observations across time and inpainting the holes left by removed actors. Along the way, you will encounter the core algorithmic challenges that production systems face: noisy detections, occlusion, data association ambiguity, and the gap between removing an object and plausibly filling in what was behind it.

The pipeline you build mirrors the preprocessing stage used by state-of-the-art neural driving simulators like UniSim, MARS, and NeuRAD. These systems all require decomposed scenes as input -- they cannot learn to generate editable content from raw, entangled driving logs. By completing this project, you will understand exactly what goes into the "data preparation" box that most papers gloss over with a single sentence, and you will have hands-on experience with every component.

By the end of this project, you will have a working Python pipeline that takes a driving sequence (with 3D detections and point clouds), tracks all dynamic objects, builds per-frame masks separating static from dynamic content, reconstructs a clean static background, and demonstrates composability by inserting a synthetic actor into the reconstructed scene. This is the same type of pipeline that powers scene editing capabilities at Waymo, NVIDIA, and Applied Intuition.

Learning Objectives

  • Represent and manipulate 3D bounding boxes -- understand center-dimension-yaw parameterization, compute corners, and transform between coordinate frames.
  • Implement 3D IoU computation from scratch and understand why axis-aligned IoU is insufficient for rotated boxes in driving scenes.
  • Build a multi-object tracker using the Hungarian algorithm for data association and Kalman filtering for state prediction, handling track birth, maintenance, and death.
  • Create per-frame dynamic masks by projecting 3D bounding boxes into image space and point cloud space, separating static from dynamic content.
  • Segment point clouds into static background and per-actor dynamic subsets using track assignments and ego-motion compensation.
  • Reconstruct static backgrounds by accumulating multi-frame observations of revealed background regions and inpainting remaining holes.
  • Evaluate reconstruction quality quantitatively using PSNR on unmasked ground-truth regions.
  • Demonstrate scene composability by inserting synthetic actors into the reconstructed background, validating the decomposition quality.

Prerequisites

  • Required: Python 3.10+ (the code uses modern type-hint syntax such as np.ndarray | None), NumPy, Matplotlib fundamentals. Familiarity with 3D coordinate systems (x, y, z axes, rotation matrices).
  • Recommended: Basic understanding of 3D bounding boxes in autonomous driving (center, dimensions, heading angle). Point cloud basics (what a point cloud is, how LiDAR works). Introductory linear algebra (matrix multiplication, rotation matrices, homogeneous coordinates).
  • Deep Dive Reading: Neural Rendering for AD Simulation -- read at least the sections on scene representation and dynamic scene handling to understand the downstream use of scene decomposition.

Key Concepts

3D Object Detection in Driving Scenes

In autonomous driving, objects are represented as 3D bounding boxes -- oriented cuboids in world coordinates. Each box is parameterized by seven values:

box = (cx, cy, cz, length, width, height, yaw)

where (cx, cy, cz) is the box center, (length, width, height) are the dimensions along the object's local axes, and yaw is the heading angle (rotation around the vertical z-axis). The yaw angle determines the object's orientation in the ground plane, which is critical for driving scenarios where vehicles face different directions.

To convert from center-dimension-yaw to the eight corner points of the box:

corners_local = [+/-l/2, +/-w/2, +/-h/2]   (8 combinations)

R = [[cos(yaw), -sin(yaw), 0],
     [sin(yaw),  cos(yaw), 0],
     [0,         0,         1]]

corners_world = R @ corners_local + [cx, cy, cz]
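
As a quick sanity check of this convention (a minimal NumPy snippet, separate from the pipeline code built later): rotating a 4 m x 2 m x 1.5 m box by yaw = 90 degrees should swap its footprint extents, so the corners span roughly +/-1 m in x and +/-2 m in y.

import numpy as np

# Box at the origin: length 4 m (local x), width 2 m (local y), height 1.5 m.
l, w, h = 4.0, 2.0, 1.5
yaw = np.pi / 2  # 90 degrees

corners_local = np.array([[sx * l / 2, sy * w / 2, sz * h / 2]
                          for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)])
c, s = np.cos(yaw), np.sin(yaw)
R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
corners_world = (R @ corners_local.T).T  # centered at origin, so no translation

# After the 90-degree rotation the length is aligned with y: ~1.0 and ~2.0
print(np.abs(corners_world[:, 0]).max(), np.abs(corners_world[:, 1]).max())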

Detection systems (PointPillars, CenterPoint, etc.) output these boxes with confidence scores and class labels (vehicle, pedestrian, cyclist). For this project, we work with ground-truth or pre-computed detections rather than training a detector -- the focus is on what happens after detection.

Object types matter for decomposition: vehicles are large, rigid, and relatively easy to track; pedestrians are small, often occluded, and more challenging; cyclists are somewhere in between. Your tracker needs to handle all three with appropriate motion models.
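
One common way to handle this is to keep per-class tracker settings, e.g. a wider association gate and larger process noise for vehicles than for pedestrians. The parameter names and values below are hypothetical and only illustrate the idea:

# Hypothetical per-class tracker settings: vehicles move fast and are rigid,
# pedestrians are slow but erratic, cyclists sit in between.
CLASS_TRACKER_PARAMS = {
    "vehicle":    {"max_center_dist": 5.0, "process_noise": 1.0, "min_hits": 3},
    "pedestrian": {"max_center_dist": 1.5, "process_noise": 0.5, "min_hits": 2},
    "cyclist":    {"max_center_dist": 3.0, "process_noise": 0.8, "min_hits": 3},
}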

Multi-Object Tracking (MOT)

Detection gives you a set of bounding boxes per frame, but each frame's detections are independent -- there is no identity linking "the red car in frame 5" to "the red car in frame 6." Multi-object tracking establishes these identities by solving a frame-to-frame data association problem.

The classic tracking pipeline has three stages:

  1. Prediction: Use a motion model (typically a Kalman filter with a constant-velocity assumption) to predict where each existing track will be in the next frame.

  2. Association: Match predicted track positions to new detections. This is formulated as an assignment problem:

    • Compute a cost matrix: cost[i, j] = distance between predicted track i and detection j.
    • Solve with the Hungarian algorithm (optimal assignment in O(n^3) time).
    • Threshold to reject unlikely matches (cost > max_distance).
  3. Track management (a minimal state-machine sketch follows this list):

    • Birth: Unmatched detections start new tentative tracks.
    • Confirmation: Tentative tracks that match detections for N consecutive frames become confirmed.
    • Death: Tracks that go unmatched for M consecutive frames are terminated.
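
Here is a minimal, illustrative state machine for this lifecycle. The N_CONFIRM and M_MISS thresholds are placeholder values; the tracker built in Step 2 uses a simplified version of this bookkeeping (total hits and time since last update).

from enum import Enum

class TrackState(Enum):
    TENTATIVE = 0
    CONFIRMED = 1
    DEAD = 2

class TrackLifecycle:
    """Minimal birth/confirmation/death bookkeeping (illustrative sketch)."""
    N_CONFIRM = 3   # consecutive matched frames before a track is confirmed
    M_MISS = 5      # consecutive missed frames before a track is terminated

    def __init__(self):
        self.state = TrackState.TENTATIVE
        self.consecutive_hits = 0
        self.consecutive_misses = 0

    def on_match(self):
        self.consecutive_hits += 1
        self.consecutive_misses = 0
        if self.state is TrackState.TENTATIVE and self.consecutive_hits >= self.N_CONFIRM:
            self.state = TrackState.CONFIRMED

    def on_miss(self):
        self.consecutive_hits = 0
        self.consecutive_misses += 1
        if self.consecutive_misses >= self.M_MISS:
            self.state = TrackState.DEAD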

The cost metric for association is critical. Common choices:

  • 3D IoU (Intersection over Union): Measures overlap between predicted and detected boxes. Higher overlap means the detection is more likely the same object. Computing IoU for rotated 3D boxes requires intersecting the two rotated rectangles in the bird's-eye view (BEV) plane and multiplying the intersection area by the height overlap.

  • Center distance: Euclidean distance between box centers. Simpler but less discriminative for nearby objects of different sizes.

  • Combined: Use IoU for close matches and center distance for far predictions (objects that moved significantly between frames).
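
As a concrete illustration of the combined strategy, here is a minimal sketch of a cost function that falls back to BEV center distance when the boxes do not overlap. It assumes the BBox3D class and compute_3d_iou() implemented in Step 2; the 1.0 offset and max_center_dist cap are arbitrary choices that simply keep every overlapping pair cheaper than every non-overlapping pair.

import numpy as np

def association_cost(predicted_box, detection, max_center_dist=10.0):
    """Combined cost: IoU when boxes overlap, BEV center distance otherwise.

    Lower cost = better match. Assumes BBox3D and compute_3d_iou() from Step 2.
    """
    iou = compute_3d_iou(predicted_box, detection)
    if iou > 0.0:
        return 1.0 - iou  # overlapping boxes score in [0, 1)
    # No overlap: capped, normalized center distance, scores in [1, 2]
    dist = np.hypot(predicted_box.cx - detection.cx, predicted_box.cy - detection.cy)
    return 1.0 + min(dist, max_center_dist) / max_center_dist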

The Kalman filter maintains a state vector for each track:

state = [cx, cy, cz, vx, vy, vz, length, width, height, yaw]

The constant-velocity model assumes position_next = position_current + velocity * dt. The filter propagates uncertainty, which helps handle noisy detections and brief occlusions -- even when a detection is missed for a frame or two, the filter's prediction keeps the track alive.

Static/Dynamic Segmentation

Once you have tracks (consistent identities across frames), separating static from dynamic content is conceptually simple: anything inside a tracked bounding box is dynamic; everything else is static. In practice, there are several subtleties:

Point cloud segmentation: For each frame, test every 3D point against all tracked bounding boxes. A point is classified as dynamic if it falls inside any active track's box (with a small margin to account for detection noise). Remaining points are static.

is_dynamic[i] = any(point_in_box(points[i], track_box) for track_box in active_tracks)

The point_in_box test for a rotated box transforms the point into the box's local coordinate frame and checks if all coordinates are within [-l/2, l/2] x [-w/2, w/2] x [-h/2, h/2].

Image-space masks: For camera images, project the 3D bounding box corners into the image using the camera intrinsic and extrinsic matrices, then create a 2D mask from the projected polygon. The projection:

p_camera = K @ [R|t] @ p_world
p_image = p_camera[:2] / p_camera[2]

where K is the 3x3 intrinsic matrix, [R|t] is the 3x4 extrinsic (world-to-camera) transformation, and p_world is a homogeneous 4-vector.

Handling ego-motion: The ego vehicle itself is moving, so static background points appear to move in the ego frame. To build a consistent static map, transform all static points into a common reference frame (typically the first frame's ego pose) using the ego-motion transforms:

p_world = T_ego_to_world(frame) @ p_ego(frame)

Background Inpainting and Reconstruction

Removing dynamic objects leaves holes in both images and point clouds -- the regions occluded by the actors. Filling these holes is the background reconstruction problem.

Multi-frame accumulation: The simplest and most effective approach exploits the fact that the ego vehicle moves. A point on the road surface occluded by a car in frame t may be visible in frame t+10 when the ego has moved and the car is elsewhere. By accumulating static points across many frames (all transformed to a common reference frame), you naturally fill in most occluded regions:

background_points = union(transform_to_frame0(static_points[t]) for t in all_frames)

This works remarkably well for ground surfaces and building facades, but can fail for regions that are perpetually occluded (e.g., behind a parked car that never moves).

Image inpainting: For camera images, after removing dynamic objects you need to fill the resulting holes. Approaches range from simple (nearest-neighbor fill, Gaussian blur, Navier-Stokes inpainting from OpenCV) to sophisticated (deep learning-based inpainting with models like LaMa). For this project, we use OpenCV's built-in inpainting methods, which work well for small-to-medium holes:

import cv2
result = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)  # args: image, mask, inpaint radius, method

The Telea method propagates known pixel values inward from the mask boundary using a fast marching approach, weighted by distance and gradient direction.

Quality evaluation: Measure reconstruction quality on regions where you have ground truth -- the unoccluded background pixels. PSNR (peak signal-to-noise ratio, 10 * log10(255^2 / MSE) for 8-bit images) between the reconstructed background and the original (before actor insertion) gives a quantitative measure of how well the pipeline preserves background content.

Per-Actor Reconstruction

Beyond separating the background, a complete decomposition also isolates each dynamic actor individually. This enables:

  • Actor re-rendering: Render individual actors at new positions or orientations.
  • Actor removal: Remove specific actors without affecting others.
  • Actor insertion: Place an actor from one scene into a different scene.

Per-actor reconstruction involves:

  1. Using the track's bounding box sequence to crop the actor from each frame.
  2. Building a per-actor point cloud by accumulating points inside the track's boxes (in the actor's local reference frame).
  3. Optionally fitting a simple mesh or texture model to the accumulated observations.

The quality of per-actor reconstruction depends heavily on observation coverage -- an actor seen from many angles across the sequence will have much better reconstruction than one seen briefly from a single viewpoint.
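
To make step 2 of the list above concrete, here is a minimal sketch of per-actor point accumulation. It assumes a frames list of dicts with a 'points' (N, 3) array (in the same coordinate frame as the boxes), a mapping from frame index to that actor's BBox3D, and the points_in_box() helper defined later in Step 3.

import numpy as np

def accumulate_actor_points(frames, actor_boxes, margin=0.1):
    """Accumulate one actor's points in its box-centered, heading-aligned frame.

    `actor_boxes` maps frame index -> this actor's BBox3D for that frame
    (frames where the actor is unobserved are simply omitted).
    """
    actor_points = []
    for t, box in actor_boxes.items():
        points = frames[t]['points']
        inside = points_in_box(points, box, margin)
        # Shift to the box center, then rotate by -yaw into the actor's local frame
        centered = points[inside] - np.array([box.cx, box.cy, box.cz])
        c, s = np.cos(-box.yaw), np.sin(-box.yaw)
        R_inv = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
        actor_points.append((R_inv @ centered.T).T)

    return np.vstack(actor_points) if actor_points else np.empty((0, 3))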

Step-by-Step Implementation Guide

Step 1: Environment Setup (30 min)

Create and activate a virtual environment:

mkdir -p dynamic-scene-decomposition && cd dynamic-scene-decomposition
python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install torch torchvision        # For potential neural inpainting extensions
pip install numpy                     # Array operations
pip install matplotlib                # Visualization
pip install opencv-python             # Image inpainting, projections
pip install scipy                     # Hungarian algorithm, spatial operations
pip install shapely                   # Polygon intersection for rotated-box IoU
pip install open3d                    # 3D point cloud visualization
pip install filterpy                  # Kalman filter implementation
pip install scikit-learn              # Clustering, nearest neighbors
pip install tqdm                      # Progress bars
pip install ipykernel jupyter         # Notebook support

Or create a requirements.txt:

torch>=2.0
torchvision>=0.15
numpy>=1.24
matplotlib>=3.7
opencv-python>=4.8
scipy>=1.10
shapely>=2.0
open3d>=0.17
filterpy>=1.4
scikit-learn>=1.3
tqdm>=4.65
ipykernel>=6.0
jupyter>=1.0

Prepare synthetic data:

For this project, we generate synthetic driving scenes rather than requiring a large dataset download. This keeps the project self-contained and lets you control the complexity. The first notebook sets up a synthetic scene generator that creates:

  • A ground plane with lane markings
  • Static objects (buildings, poles) at fixed positions
  • Dynamic actors (vehicles, pedestrians) following predefined trajectories
  • Simulated 3D detections with controllable noise levels
  • Simulated point clouds from a virtual LiDAR sensor
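
As a rough illustration of the detection side (the helper name and trajectories below are hypothetical, and the real notebook generator also emits lane markings, static obstacles, and LiDAR sweeps), a single frame of ground-truth boxes and noisy detections might be produced like this, using the BBox3D dataclass defined in Step 2:

import numpy as np

def make_synthetic_frame(t, dt=0.1, noise_sigma=0.3, rng=None):
    """Toy frame generator: two vehicles on straight constant-velocity paths,
    plus detections with Gaussian noise added to the box centers."""
    rng = np.random.default_rng(t) if rng is None else rng
    gt_boxes = [
        BBox3D(cx=5.0 + 8.0 * t * dt, cy=-1.8, cz=0.9,
               length=4.5, width=1.9, height=1.6, yaw=0.0),
        BBox3D(cx=40.0 - 6.0 * t * dt, cy=1.8, cz=0.9,
               length=4.5, width=1.9, height=1.6, yaw=np.pi),
    ]
    detections = [
        BBox3D(b.cx + rng.normal(0, noise_sigma), b.cy + rng.normal(0, noise_sigma),
               b.cz, b.length, b.width, b.height, b.yaw)
        for b in gt_boxes
    ]
    return gt_boxes, detections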

Verify the setup:

# verify_setup.py
import numpy as np
import matplotlib.pyplot as plt
import cv2
from scipy.optimize import linear_sum_assignment
from filterpy.kalman import KalmanFilter
import open3d as o3d

# Quick test: create a rotation matrix
yaw = np.pi / 4  # 45 degrees
R = np.array([
    [np.cos(yaw), -np.sin(yaw), 0],
    [np.sin(yaw),  np.cos(yaw), 0],
    [0, 0, 1]
])
print(f"Rotation matrix for yaw={np.degrees(yaw):.0f} deg:")
print(R)

# Quick test: Hungarian assignment
cost = np.array([[1, 3], [2, 1]], dtype=np.float64)
row_ind, col_ind = linear_sum_assignment(cost)
print(f"\nHungarian assignment: rows={row_ind}, cols={col_ind}, cost={cost[row_ind, col_ind].sum()}")

# Quick test: Kalman filter
kf = KalmanFilter(dim_x=4, dim_z=2)
print(f"\nKalman filter created with state dim={kf.dim_x}, measurement dim={kf.dim_z}")

# Quick test: OpenCV inpainting
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 20:40] = 255
inpainted = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
print(f"\nInpainting test: input {img.shape} -> output {inpainted.shape}")

print("\nSetup verified successfully!")

Run it:

python verify_setup.py

You should see the rotation matrix, Hungarian assignment results, Kalman filter confirmation, and inpainting test complete without errors.

Step 2: 3D Detection and Tracking (Notebook 01, ~2 hours)

Goal: Implement 3D bounding box operations and build a multi-object tracker from scratch.

3D bounding box representation:

import numpy as np
from dataclasses import dataclass

@dataclass
class BBox3D:
    """3D bounding box in world coordinates."""
    cx: float      # Center x
    cy: float      # Center y
    cz: float      # Center z
    length: float  # Extent along local x-axis
    width: float   # Extent along local y-axis
    height: float  # Extent along local z-axis
    yaw: float     # Heading angle (radians, around z-axis)
    label: str = "vehicle"
    score: float = 1.0

    def corners(self) -> np.ndarray:
        """Compute the 8 corner points of the box. Returns (8, 3) array."""
        l, w, h = self.length / 2, self.width / 2, self.height / 2
        # Local corners (before rotation and translation)
        corners_local = np.array([
            [ l,  w,  h], [ l, -w,  h], [-l, -w,  h], [-l,  w,  h],  # top
            [ l,  w, -h], [ l, -w, -h], [-l, -w, -h], [-l,  w, -h],  # bottom
        ])
        # Rotation matrix (around z-axis)
        c, s = np.cos(self.yaw), np.sin(self.yaw)
        R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
        # Rotate and translate
        corners_world = (R @ corners_local.T).T + np.array([self.cx, self.cy, self.cz])
        return corners_world

3D IoU computation:

Computing IoU for oriented 3D boxes requires:

  1. Project both boxes onto the bird's-eye view (BEV) plane (x-y).
  2. Compute the intersection area of the two rotated rectangles (using the Sutherland-Hodgman polygon clipping algorithm or shapely).
  3. Compute the height overlap (1D interval intersection on the z-axis).
  4. IoU = (BEV intersection area * height overlap) / (volume_A + volume_B - intersection volume).

from shapely.geometry import Polygon

def compute_bev_iou(box_a: BBox3D, box_b: BBox3D) -> float:
    """Compute BEV (bird's-eye view) IoU between two 3D boxes."""
    corners_a = box_a.corners()[:4, :2]  # Top-4 corners, x-y only
    corners_b = box_b.corners()[:4, :2]

    poly_a = Polygon(corners_a)
    poly_b = Polygon(corners_b)

    if not poly_a.is_valid or not poly_b.is_valid:
        return 0.0

    inter = poly_a.intersection(poly_b).area
    union = poly_a.area + poly_b.area - inter

    return inter / union if union > 0 else 0.0


def compute_3d_iou(box_a: BBox3D, box_b: BBox3D) -> float:
    """Compute full 3D IoU between two oriented boxes."""
    bev_iou_val = compute_bev_iou(box_a, box_b)
    if bev_iou_val == 0:
        return 0.0

    # BEV intersection area
    corners_a = box_a.corners()[:4, :2]
    corners_b = box_b.corners()[:4, :2]
    poly_a = Polygon(corners_a)
    poly_b = Polygon(corners_b)
    bev_inter_area = poly_a.intersection(poly_b).area

    # Height overlap
    za_min = box_a.cz - box_a.height / 2
    za_max = box_a.cz + box_a.height / 2
    zb_min = box_b.cz - box_b.height / 2
    zb_max = box_b.cz + box_b.height / 2
    h_overlap = max(0, min(za_max, zb_max) - max(za_min, zb_min))

    inter_vol = bev_inter_area * h_overlap
    vol_a = box_a.length * box_a.width * box_a.height
    vol_b = box_b.length * box_b.width * box_b.height
    union_vol = vol_a + vol_b - inter_vol

    return inter_vol / union_vol if union_vol > 0 else 0.0

Hungarian assignment:

from scipy.optimize import linear_sum_assignment

def associate_detections_to_tracks(
    detections: list[BBox3D],
    predictions: list[BBox3D],
    iou_threshold: float = 0.1,
) -> tuple[list[tuple[int, int]], list[int], list[int]]:
    """
    Match detections to predicted track positions using Hungarian algorithm.

    Returns:
        matches: List of (detection_idx, track_idx) pairs.
        unmatched_dets: Detection indices with no matching track.
        unmatched_trks: Track indices with no matching detection.
    """
    if len(detections) == 0 or len(predictions) == 0:
        return [], list(range(len(detections))), list(range(len(predictions)))

    n_det = len(detections)
    n_trk = len(predictions)

    # Build cost matrix (negative IoU for minimization)
    cost_matrix = np.zeros((n_det, n_trk))
    for d in range(n_det):
        for t in range(n_trk):
            cost_matrix[d, t] = -compute_3d_iou(detections[d], predictions[t])

    # Solve assignment
    det_indices, trk_indices = linear_sum_assignment(cost_matrix)

    # Filter by IoU threshold
    matches = []
    unmatched_dets = list(range(n_det))
    unmatched_trks = list(range(n_trk))

    for d, t in zip(det_indices, trk_indices):
        iou = -cost_matrix[d, t]
        if iou >= iou_threshold:
            matches.append((d, t))
            unmatched_dets.remove(d)
            unmatched_trks.remove(t)

    return matches, unmatched_dets, unmatched_trks

Kalman filter tracker:

from filterpy.kalman import KalmanFilter

class Track:
    """A single tracked object with Kalman filter state estimation."""
    _next_id = 0

    def __init__(self, detection: BBox3D):
        self.id = Track._next_id
        Track._next_id += 1

        # State: [cx, cy, cz, vx, vy, vz]
        self.kf = KalmanFilter(dim_x=6, dim_z=3)
        dt = 0.1  # 10 Hz assumption

        # State transition (constant velocity)
        self.kf.F = np.array([
            [1, 0, 0, dt, 0,  0],
            [0, 1, 0, 0,  dt, 0],
            [0, 0, 1, 0,  0, dt],
            [0, 0, 0, 1,  0,  0],
            [0, 0, 0, 0,  1,  0],
            [0, 0, 0, 0,  0,  1],
        ])
        # Measurement: observe position only
        self.kf.H = np.array([
            [1, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0],
        ])
        # Initial state
        self.kf.x[:3, 0] = [detection.cx, detection.cy, detection.cz]
        # Store box dimensions (assumed constant)
        self.length = detection.length
        self.width = detection.width
        self.height = detection.height
        self.yaw = detection.yaw
        self.label = detection.label

        self.hits = 1
        self.age = 0
        self.time_since_update = 0
        self.history = []

    def predict(self) -> BBox3D:
        """Predict next state and return predicted bounding box."""
        self.kf.predict()
        self.age += 1
        self.time_since_update += 1
        cx, cy, cz = self.kf.x[:3, 0]
        return BBox3D(cx, cy, cz, self.length, self.width, self.height,
                      self.yaw, self.label)

    def update(self, detection: BBox3D):
        """Update state with a matched detection."""
        self.kf.update(np.array([detection.cx, detection.cy, detection.cz]))
        self.yaw = detection.yaw
        self.length = detection.length
        self.width = detection.width
        self.height = detection.height
        self.hits += 1
        self.time_since_update = 0

    def get_box(self) -> BBox3D:
        """Return current estimated bounding box."""
        cx, cy, cz = self.kf.x[:3, 0]
        return BBox3D(cx, cy, cz, self.length, self.width, self.height,
                      self.yaw, self.label)

Multi-object tracker:

class MultiObjectTracker:
    """Simple multi-object tracker with Hungarian assignment and Kalman filter."""

    def __init__(self, max_age: int = 5, min_hits: int = 3, iou_threshold: float = 0.1):
        self.max_age = max_age
        self.min_hits = min_hits
        self.iou_threshold = iou_threshold
        self.tracks: list[Track] = []

    def update(self, detections: list[BBox3D]) -> list[Track]:
        """Process one frame of detections. Returns list of confirmed tracks."""
        # Step 1: Predict existing tracks
        predictions = [t.predict() for t in self.tracks]

        # Step 2: Associate detections to tracks
        matches, unmatched_dets, unmatched_trks = associate_detections_to_tracks(
            detections, predictions, self.iou_threshold
        )

        # Step 3: Update matched tracks
        for det_idx, trk_idx in matches:
            self.tracks[trk_idx].update(detections[det_idx])

        # Step 4: Create new tracks for unmatched detections
        for det_idx in unmatched_dets:
            self.tracks.append(Track(detections[det_idx]))

        # Step 5: Remove dead tracks
        self.tracks = [t for t in self.tracks if t.time_since_update <= self.max_age]

        # Return confirmed tracks only
        return [t for t in self.tracks if t.hits >= self.min_hits]

Bird's-eye view visualization:

def plot_bev(
    boxes: list[BBox3D],
    ax=None,
    color="blue",
    label_prefix="",
    xlim=(-50, 50),
    ylim=(-50, 50),
):
    """Plot 3D bounding boxes in bird's-eye view."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 8))

    for i, box in enumerate(boxes):
        corners = box.corners()[:4, :2]  # Top 4 corners, BEV
        polygon = plt.Polygon(corners, fill=False, edgecolor=color, linewidth=1.5)
        ax.add_patch(polygon)
        ax.text(box.cx, box.cy, f"{label_prefix}{i}", fontsize=7,
                ha="center", va="center", color=color)

    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.set_aspect("equal")
    ax.grid(True, alpha=0.3)
    ax.set_xlabel("X (m)")
    ax.set_ylabel("Y (m)")
    return ax

Exercise -- Track visualization: Generate a 20-frame sequence where 3 vehicles move at constant velocity with noisy detections (add Gaussian noise to center positions with sigma=0.3m). Run the tracker and plot:

  1. Ground-truth trajectories (dashed lines).
  2. Tracked trajectories (solid lines).
  3. Raw detection positions (scattered dots).

Measure the average position error between tracked and ground-truth trajectories.

Step 3: Scene Decomposition Pipeline (Notebook 02, ~2.5 hours)

Goal: Use tracks from the previous step to create per-frame dynamic masks and segment point clouds into static and dynamic components.

Point-in-box test:

The core operation for decomposition is testing whether a 3D point lies inside a rotated bounding box. Transform the point to the box's local frame, then check axis-aligned bounds:

def points_in_box(points: np.ndarray, box: BBox3D, margin: float = 0.1) -> np.ndarray:
    """
    Test which points fall inside a 3D bounding box.

    Args:
        points: (N, 3) array of 3D points.
        box: The bounding box to test against.
        margin: Extra margin (meters) around the box to account for detection noise.

    Returns:
        Boolean array of shape (N,) -- True for points inside the box.
    """
    # Translate points to box center
    centered = points - np.array([box.cx, box.cy, box.cz])

    # Rotate points into box-local frame (inverse rotation)
    c, s = np.cos(-box.yaw), np.sin(-box.yaw)
    R_inv = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    local = (R_inv @ centered.T).T

    # Check axis-aligned bounds with margin
    half_l = box.length / 2 + margin
    half_w = box.width / 2 + margin
    half_h = box.height / 2 + margin

    inside = (
        (np.abs(local[:, 0]) <= half_l) &
        (np.abs(local[:, 1]) <= half_w) &
        (np.abs(local[:, 2]) <= half_h)
    )
    return inside

Per-frame dynamic mask:

def create_dynamic_mask(
    points: np.ndarray,
    tracked_boxes: list[BBox3D],
    margin: float = 0.15,
) -> np.ndarray:
    """
    Create a boolean mask indicating which points belong to dynamic objects.

    Args:
        points: (N, 3) point cloud in the ego frame.
        tracked_boxes: List of tracked bounding boxes for this frame.
        margin: Extra margin around boxes.

    Returns:
        Boolean array (N,) -- True for dynamic points.
    """
    mask = np.zeros(len(points), dtype=bool)
    for box in tracked_boxes:
        mask |= points_in_box(points, box, margin)
    return mask

3D box to image projection:

For camera-based decomposition, project 3D bounding box corners into the image:

def project_box_to_image(
    box: BBox3D,
    K: np.ndarray,         # (3, 3) intrinsic matrix
    T_cam_world: np.ndarray,  # (4, 4) world-to-camera transform
    image_shape: tuple,    # (H, W)
) -> np.ndarray | None:
    """
    Project a 3D bounding box into image space.

    Returns:
        (N, 2) array of projected 2D corner points, or None if behind camera.
    """
    corners_3d = box.corners()  # (8, 3)
    # To homogeneous
    corners_h = np.hstack([corners_3d, np.ones((8, 1))])  # (8, 4)
    # Transform to camera frame
    corners_cam = (T_cam_world @ corners_h.T).T  # (8, 4)
    corners_cam = corners_cam[:, :3]  # (8, 3)

    # Check if any corner is behind camera
    if np.any(corners_cam[:, 2] <= 0):
        return None

    # Project to image
    corners_2d = (K @ corners_cam.T).T  # (8, 3)
    corners_2d = corners_2d[:, :2] / corners_2d[:, 2:3]  # (8, 2)

    # Clip to image bounds
    H, W = image_shape
    corners_2d[:, 0] = np.clip(corners_2d[:, 0], 0, W - 1)
    corners_2d[:, 1] = np.clip(corners_2d[:, 1], 0, H - 1)

    return corners_2d


def create_image_mask(
    boxes: list[BBox3D],
    K: np.ndarray,
    T_cam_world: np.ndarray,
    image_shape: tuple,
) -> np.ndarray:
    """Create a binary mask of dynamic objects in image space."""
    H, W = image_shape
    mask = np.zeros((H, W), dtype=np.uint8)

    for box in boxes:
        corners_2d = project_box_to_image(box, K, T_cam_world, image_shape)
        if corners_2d is None:
            continue
        # Use convex hull of projected corners
        hull = cv2.convexHull(corners_2d.astype(np.int32))
        cv2.fillConvexPoly(mask, hull, 255)

    return mask

Ego-motion compensation:

To accumulate static points across frames into a common reference frame, you need the ego-vehicle's pose at each timestep:

def transform_points(
    points: np.ndarray,
    T_target_source: np.ndarray,
) -> np.ndarray:
    """
    Transform (N, 3) points from source frame to target frame.

    Args:
        points: (N, 3) array of 3D points.
        T_target_source: (4, 4) homogeneous transformation matrix.

    Returns:
        (N, 3) transformed points.
    """
    # To homogeneous coordinates
    points_h = np.hstack([points, np.ones((len(points), 1))])  # (N, 4)
    # Transform
    transformed = (T_target_source @ points_h.T).T  # (N, 4)
    return transformed[:, :3]


def accumulate_static_points(
    frames: list[dict],  # Each has 'points', 'dynamic_mask', 'ego_pose'
    reference_frame: int = 0,
) -> np.ndarray:
    """
    Accumulate static points from all frames into a common reference frame.

    Args:
        frames: List of frame dicts with keys 'points' (N,3), 'dynamic_mask' (N,),
                and 'ego_pose' (4,4) world-to-ego transform.
        reference_frame: Which frame's coordinate system to use as reference.

    Returns:
        (M, 3) accumulated static point cloud.
    """
    T_ref_world = frames[reference_frame]['ego_pose']
    all_static = []

    for frame in frames:
        static_mask = ~frame['dynamic_mask']
        static_points = frame['points'][static_mask]

        # Transform: ego -> world -> reference
        T_world_ego = np.linalg.inv(frame['ego_pose'])
        T_ref_ego = T_ref_world @ T_world_ego
        static_ref = transform_points(static_points, T_ref_ego)
        all_static.append(static_ref)

    return np.vstack(all_static)

Visualization:

def visualize_decomposition_bev(
    static_points: np.ndarray,
    dynamic_points: np.ndarray,
    tracked_boxes: list[BBox3D],
    title: str = "Scene Decomposition (BEV)",
):
    """Visualize static vs dynamic point cloud separation in bird's-eye view."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Full scene
    all_points = np.vstack([static_points, dynamic_points])
    axes[0].scatter(all_points[:, 0], all_points[:, 1], s=0.5, c='gray', alpha=0.3)
    axes[0].set_title("Full Scene")

    # Static only
    axes[1].scatter(static_points[:, 0], static_points[:, 1], s=0.5, c='steelblue', alpha=0.3)
    axes[1].set_title(f"Static ({len(static_points):,} pts)")

    # Dynamic only with boxes
    axes[2].scatter(dynamic_points[:, 0], dynamic_points[:, 1], s=1.0, c='coral', alpha=0.5)
    for box in tracked_boxes:
        corners = box.corners()[:4, :2]
        polygon = plt.Polygon(corners, fill=False, edgecolor='red', linewidth=1.5)
        axes[2].add_patch(polygon)
    axes[2].set_title(f"Dynamic ({len(dynamic_points):,} pts)")

    for ax in axes:
        ax.set_xlim(-60, 60)
        ax.set_ylim(-60, 60)
        ax.set_aspect('equal')
        ax.grid(True, alpha=0.2)

    plt.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()

Exercise -- Per-actor point clouds: Extend the decomposition to extract per-actor point clouds. For each track, accumulate only the dynamic points that fall inside that track's bounding box, transformed into the actor's local reference frame (centered at the actor, aligned with the actor's heading). Visualize each actor's accumulated point cloud separately. Which actors have the best reconstruction? Why?

Step 4: Background Reconstruction (Notebook 03, ~2.5 hours)

Goal: Reconstruct the static background by removing dynamic objects and filling holes through multi-frame accumulation and image inpainting.

Multi-frame background accumulation:

The core idea is to voxelize the accumulated static point cloud and use it to create a clean background representation:

def voxelize_points(
    points: np.ndarray,
    voxel_size: float = 0.1,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Voxelize a point cloud by averaging points within each voxel.

    Returns:
        voxel_centers: (M, 3) array of voxel center positions.
        voxel_counts: (M,) number of points per voxel.
    """
    # Quantize to voxel grid
    voxel_indices = np.floor(points / voxel_size).astype(np.int64)

    # Use dictionary to accumulate points per voxel
    voxel_dict = {}
    for i, idx in enumerate(map(tuple, voxel_indices)):
        if idx not in voxel_dict:
            voxel_dict[idx] = []
        voxel_dict[idx].append(points[i])

    centers = []
    counts = []
    for idx, pts in voxel_dict.items():
        centers.append(np.mean(pts, axis=0))
        counts.append(len(pts))

    return np.array(centers), np.array(counts)

Image inpainting pipeline:

import cv2

def remove_dynamic_objects(
    image: np.ndarray,
    mask: np.ndarray,
    method: str = "telea",
    inpaint_radius: int = 5,
) -> np.ndarray:
    """
    Remove dynamic objects from an image using inpainting.

    Args:
        image: (H, W, 3) BGR image.
        mask: (H, W) binary mask where 255 = dynamic object region.
        method: 'telea' or 'ns' (Navier-Stokes).
        inpaint_radius: Radius of the neighborhood for inpainting.

    Returns:
        Inpainted image with dynamic objects removed.
    """
    # Dilate mask slightly to cover edges
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask_dilated = cv2.dilate(mask, kernel, iterations=2)

    if method == "telea":
        result = cv2.inpaint(image, mask_dilated, inpaint_radius, cv2.INPAINT_TELEA)
    elif method == "ns":
        result = cv2.inpaint(image, mask_dilated, inpaint_radius, cv2.INPAINT_NS)
    else:
        raise ValueError(f"Unknown method: {method}")

    return result

Multi-frame background image:

When the ego vehicle moves, previously occluded background regions become visible. We can blend background observations from multiple frames (a full solution would also warp each frame into the target viewpoint before blending):

def reconstruct_background_multiframe(
    frames: list[dict],
    target_frame: int,
    K: np.ndarray,
    image_shape: tuple,
) -> np.ndarray:
    """
    Reconstruct background for a target frame by warping observations from nearby frames.

    This is a simplified version -- it accumulates unmasked pixels from neighboring
    frames, projected into the target frame's viewpoint using a ground-plane homography.

    Args:
        frames: List of dicts with 'image', 'mask', 'ego_pose'.
        target_frame: Index of the frame to reconstruct.
        K: Camera intrinsic matrix (3, 3).
        image_shape: (H, W).

    Returns:
        Reconstructed background image.
    """
    H, W = image_shape
    bg_accum = np.zeros((H, W, 3), dtype=np.float64)
    bg_count = np.zeros((H, W), dtype=np.float64)

    # Note: this simplified version blends unmasked pixels directly; a full
    # implementation would use each frame's ego pose (and the target frame's
    # mask) to warp neighboring frames into the target viewpoint first.

    for i, frame in enumerate(frames):
        if i == target_frame:
            # Use the target frame's unmasked pixels directly
            unmasked = frame['mask'] == 0
            bg_accum[unmasked] += frame['image'][unmasked].astype(np.float64)
            bg_count[unmasked] += 1.0
            continue

        # For other frames, use their unmasked pixels
        # (In a full implementation, you would warp these using a homography
        # or depth-based reprojection. Here we use the simplified assumption
        # that nearby frames have similar viewpoints.)
        unmasked = frame['mask'] == 0
        bg_accum[unmasked] += frame['image'][unmasked].astype(np.float64)
        bg_count[unmasked] += 1.0

    # Average where we have observations
    valid = bg_count > 0
    result = np.zeros((H, W, 3), dtype=np.uint8)
    result[valid] = (bg_accum[valid] / bg_count[valid, np.newaxis]).astype(np.uint8)

    # Inpaint remaining holes
    hole_mask = (~valid).astype(np.uint8) * 255
    if hole_mask.any():
        result = cv2.inpaint(result, hole_mask, 10, cv2.INPAINT_TELEA)

    return result

Quality evaluation:

def evaluate_reconstruction(
    reconstructed: np.ndarray,
    ground_truth: np.ndarray,
    mask: np.ndarray,
) -> dict:
    """
    Evaluate background reconstruction quality on unmasked regions.

    Args:
        reconstructed: The reconstructed background image.
        ground_truth: The original image (with actors present, used as reference
                      for background regions that were never occluded).
        mask: Binary mask of dynamic object regions (these are excluded from evaluation).

    Returns:
        Dictionary with PSNR, MSE, and per-channel metrics.
    """
    # Evaluate only on background (unmasked) regions
    bg_mask = mask == 0

    if not bg_mask.any():
        return {"psnr": 0.0, "mse": float('inf')}

    rec_bg = reconstructed[bg_mask].astype(np.float64)
    gt_bg = ground_truth[bg_mask].astype(np.float64)

    mse = np.mean((rec_bg - gt_bg) ** 2)
    if mse == 0:
        psnr = float('inf')
    else:
        psnr = 10.0 * np.log10(255.0 ** 2 / mse)

    return {
        "psnr": psnr,
        "mse": mse,
        "n_pixels_evaluated": int(bg_mask.sum()),
        "n_pixels_masked": int((~bg_mask).sum()),
        "coverage": float(bg_mask.mean()),
    }

Actor insertion demo:

The ultimate test of decomposition quality -- can you insert a new actor into the reconstructed background and have it look plausible?

def insert_synthetic_actor(
    background: np.ndarray,
    actor_image: np.ndarray,
    actor_mask: np.ndarray,
    position: tuple[int, int],
    scale: float = 1.0,
) -> np.ndarray:
    """
    Insert a synthetic actor into a reconstructed background image.

    Args:
        background: (H, W, 3) reconstructed background.
        actor_image: (h, w, 3) actor image (cropped from another frame/scene).
        actor_mask: (h, w) binary mask of the actor.
        position: (y, x) top-left corner for insertion.
        scale: Resize factor for the actor.

    Returns:
        Composed image with actor inserted.
    """
    if scale != 1.0:
        new_h = int(actor_image.shape[0] * scale)
        new_w = int(actor_image.shape[1] * scale)
        actor_image = cv2.resize(actor_image, (new_w, new_h))
        actor_mask = cv2.resize(actor_mask, (new_w, new_h))

    y, x = position
    h, w = actor_image.shape[:2]
    H, W = background.shape[:2]

    # Clip to image bounds
    y1 = max(0, y)
    x1 = max(0, x)
    y2 = min(H, y + h)
    x2 = min(W, x + w)

    # Corresponding region in actor image
    ay1 = y1 - y
    ax1 = x1 - x
    ay2 = ay1 + (y2 - y1)
    ax2 = ax1 + (x2 - x1)

    result = background.copy()
    mask_region = actor_mask[ay1:ay2, ax1:ax2] > 128
    if mask_region.ndim == 2:
        mask_region = mask_region[:, :, np.newaxis]
    result[y1:y2, x1:x2] = np.where(
        mask_region,
        actor_image[ay1:ay2, ax1:ax2],
        result[y1:y2, x1:x2]
    )

    return result

Visualization -- Full pipeline:

def visualize_full_pipeline(
    original: np.ndarray,
    mask: np.ndarray,
    inpainted: np.ndarray,
    composed: np.ndarray,
):
    """Visualize the complete decomposition-to-composition pipeline."""
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    axes[0, 0].imshow(cv2.cvtColor(original, cv2.COLOR_BGR2RGB))
    axes[0, 0].set_title("Original Scene")

    axes[0, 1].imshow(mask, cmap='gray')
    axes[0, 1].set_title("Dynamic Object Mask")

    axes[1, 0].imshow(cv2.cvtColor(inpainted, cv2.COLOR_BGR2RGB))
    axes[1, 0].set_title("Background Reconstruction")

    axes[1, 1].imshow(cv2.cvtColor(composed, cv2.COLOR_BGR2RGB))
    axes[1, 1].set_title("New Actor Inserted")

    for ax in axes.ravel():
        ax.axis('off')

    plt.suptitle("Dynamic Scene Decomposition Pipeline", fontsize=14)
    plt.tight_layout()
    plt.show()

Exercise -- Compare inpainting methods: Apply both OpenCV inpainting methods (Telea and Navier-Stokes) to the same masked images. Also implement a simple baseline: fill masked regions with the mean color of surrounding pixels. Compare all three using PSNR on the unmasked regions. Create a table and visualization showing which method works best for:

  1. Small masks (single pedestrian)
  2. Medium masks (single vehicle)
  3. Large masks (multiple overlapping vehicles)

Which method handles large occlusions best? Why?

Timeline and Milestones

Week | Focus | Deliverable
1 | Setup + Notebook 01 | Working tracker with BEV visualization
2 | Notebook 02 | Per-frame decomposition with static/dynamic separation
3 | Notebook 03 | Background reconstruction + actor insertion demo
4 | Polish + extensions | Evaluation report, improved inpainting, per-actor models

Total estimated time: 20-25 hours of focused work.

Extensions and Next Steps

After completing the core project, consider these extensions:

  1. Neural inpainting: Replace OpenCV inpainting with a deep learning model (e.g., LaMa) for more realistic background completion, especially for large occluded regions.

  2. Per-actor neural reconstruction: Use the per-actor point clouds and images to build small 3DGS models of individual actors, enabling novel-view rendering of each actor independently.

  3. Real dataset integration: Apply the pipeline to real driving data from nuScenes or Waymo Open Dataset, handling the challenges of real (noisy, incomplete) detections.

  4. Video-consistent decomposition: Extend image inpainting to be temporally consistent across frames, avoiding flickering in the reconstructed background video.

  5. Semantic-aware decomposition: Go beyond bounding boxes by using instance segmentation masks (e.g., from Mask R-CNN) for tighter dynamic masks that better separate actors from background at pixel level.

  6. Integration with 3DGS: Feed the decomposed static background into a 3DGS reconstruction pipeline (connect with the "3DGS Scene Reconstruction" project) to build a fully editable neural scene representation.

References

  • Weng, X. & Kitani, K. (2020). "A Baseline for 3D Multi-Object Tracking." arXiv:1907.03961.
  • Yang, Z. et al. (2023). "UniSim: A Neural Closed-Loop Sensor Simulator." CVPR 2023.
  • Wu, Z. et al. (2023). "MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving." CICAI 2023.
  • Tonderski, A. et al. (2024). "NeuRAD: Neural Rendering for Autonomous Driving." CVPR 2024.
  • Kuhn, H. W. (1955). "The Hungarian Method for the Assignment Problem." Naval Research Logistics Quarterly.
  • Kalman, R. E. (1960). "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering.
  • Bertalmio, M. et al. (2001). "Navier-Stokes, Fluid Dynamics, and Image and Video Inpainting." CVPR 2001.
  • Telea, A. (2004). "An Image Inpainting Technique Based on the Fast Marching Method." Journal of Graphics Tools.