
Multi-Sensor Neural Renderer

Build a unified 3D Gaussian Splatting scene representation that renders both photorealistic camera images and physically plausible lidar point clouds -- enabling multi-modal sensor simulation from a single learned model.

Track: A -- Neural Simulation | Level: Advanced | Total Time: ~30-40 hours


Overview

Modern autonomous driving systems consume data from multiple sensor modalities simultaneously -- typically 6-10 cameras for visual perception and 1-5 lidar sensors for 3D geometry. When building neural simulation pipelines, the standard approach has been to train separate models for each modality: one neural radiance field or Gaussian Splatting model for camera images, and a separate point cloud generation model for lidar. This is wasteful for two reasons. First, cameras and lidar observe the same physical scene -- surfaces that reflect photons also reflect laser pulses. Second, maintaining two models means two separate training pipelines, two sets of hyperparameters, and no guarantee that the rendered camera image and lidar sweep are geometrically consistent with each other.

This project addresses that gap by building a multi-sensor neural renderer on top of 3D Gaussian Splatting (3DGS). You will extend the standard 3DGS representation -- which already stores position, covariance, opacity, and spherical harmonics coefficients for each Gaussian -- with additional per-Gaussian attributes for lidar simulation: surface normal direction, material reflectance (for lidar intensity), and a learned raydrop probability. The same set of Gaussians is used to render camera RGB images via the standard differentiable rasterization pipeline and to produce lidar depth, intensity, and raydrop predictions via a differentiable ray-casting formulation. Both modalities share the scene geometry, ensuring spatial consistency, while sensor-specific parameters allow each output to match its respective ground-truth distribution.

By the end of this project you will have a working PyTorch implementation that takes a set of posed camera images and lidar sweeps as input, jointly optimizes a shared Gaussian scene representation, and renders novel camera views and novel lidar sweeps from arbitrary sensor poses. This is the same architectural pattern used in state-of-the-art multi-sensor neural simulators like UniSim, NeuRAD, and LiDAR-GS, and it is the foundation for production sensor simulation at companies like Waymo, NVIDIA, and Applied Intuition.


Learning Objectives

After completing this project, you will be able to:

  • Formulate lidar ray casting through a Gaussian scene -- derive the intersection equations, accumulate depth along rays using alpha compositing, and compute per-point intensity from surface normals and material reflectance.
  • Implement a differentiable lidar rendering pipeline that produces depth maps, intensity maps, and raydrop masks from a 3DGS representation, with gradients flowing back to Gaussian parameters.
  • Model lidar raydrop (miss) patterns using a learned network conditioned on ray geometry and accumulated features, capturing the physics of beam divergence, retroreflection, and surface absorption.
  • Unify camera and lidar rendering from a shared scene representation, handling the different coordinate systems, projection models, and output formats of each sensor.
  • Implement multi-sensor calibration transforms -- convert between camera, lidar, and ego-vehicle coordinate frames using extrinsic rotation/translation matrices.
  • Design a joint training objective that balances camera losses (L1 + SSIM + LPIPS) with lidar losses (depth MAE + intensity MSE + raydrop BCE), including loss weighting strategies.
  • Evaluate multi-sensor reconstruction quality using both camera metrics (PSNR, SSIM, LPIPS) and lidar metrics (depth error, intensity error, raydrop F1 score, Chamfer distance).
  • Synthesize novel sensor data -- render camera images and lidar sweeps from viewpoints not present in the training set, demonstrating the model's ability to generalize.

Prerequisites

Required

  • 3D Gaussian Splatting fundamentals -- understanding of the core 3DGS representation (position, covariance, opacity, spherical harmonics), the differentiable rasterization pipeline, and the densification/pruning training strategy. If you have not implemented basic 3DGS before, start with the "3DGS Scene Reconstruction" project in this track.
  • PyTorch proficiency -- comfortable writing custom nn.Module classes, autograd-compatible functions, custom training loops, and loss functions.
  • 3D geometry -- rotation matrices, quaternions, homogeneous transforms, ray-plane intersection, coordinate frame conversions.
  • CUDA basics -- while the core implementation is in PyTorch (not raw CUDA), understanding GPU memory layout and parallel execution helps with performance reasoning.
  • Lidar sensor physics -- basic understanding of time-of-flight measurement, beam divergence, retroreflection, and point cloud representation. The project teaches this, but prior exposure accelerates learning.
  • Camera projection model -- pinhole camera intrinsics (focal length, principal point) and extrinsics (rotation, translation).

Deep Dive Reading

Before starting, read the companion deep dives for theoretical background:

  • Neural Rendering for AD Simulation -- covers NeRF, 3DGS, NeuRAD, SplatAD, and multi-sensor rendering architectures. Read the "3D Gaussian Splatting" and "Multi-Sensor Extensions" sections carefully.
  • Sensor Simulation for AD -- covers lidar physics, sensor modeling, intensity/raydrop simulation, and the role of neural methods in bridging the simulation gap.

Key Concepts

Lidar Ray Casting Fundamentals

A spinning lidar sensor emits laser pulses along a set of known directions and measures the time-of-flight for each returned pulse. For a sensor with H elevation channels and W azimuth steps, each frame produces up to H x W range measurements. In our renderer, we need to compute these range (depth) values by casting rays through the Gaussian scene.

Ray definition: Each lidar ray is defined by an origin o (the sensor position in world coordinates) and a direction d (a unit vector determined by the beam's azimuth and elevation angles):

azimuth_angles = linspace(0, 2*pi, W)       # horizontal scan
elevation_angles = [el_0, el_1, ..., el_H]   # vertical beam pattern (non-uniform)

d_x = cos(elevation) * cos(azimuth)
d_y = cos(elevation) * sin(azimuth)
d_z = sin(elevation)

ray(t) = o + t * d,  for t >= 0
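As a quick numeric check, the direction formula can be evaluated directly (a minimal sketch; the azimuth/elevation values below are arbitrary):

import math
import torch

az, el = 0.5, -0.1  # example azimuth / elevation in radians
d = torch.tensor([math.cos(el) * math.cos(az),
                  math.cos(el) * math.sin(az),
                  math.sin(el)])
print(d, d.norm())  # the direction is unit-length by construction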

Ray-Gaussian intersection: Each 3D Gaussian in the scene is parameterized by its mean mu (3D position), covariance Sigma (3x3 positive definite matrix, often stored as a rotation quaternion + 3D scale), opacity alpha, and additional attributes. To compute the contribution of Gaussian k to a ray, we evaluate the Gaussian density along the ray. The key insight is that a 3D Gaussian evaluated along a 1D ray produces a 1D Gaussian:

G_k(t) = alpha_k * exp(-0.5 * (o + t*d - mu_k)^T Sigma_k^{-1} (o + t*d - mu_k))

Let delta_k = mu_k - o be the vector from ray origin to Gaussian center. The depth of maximum contribution along the ray is:

t_k = (d^T Sigma_k^{-1} delta_k) / (d^T Sigma_k^{-1} d)

And the "power" (peak density) at that point determines how much this Gaussian contributes to the depth measurement. The effective 1D variance along the ray direction is:

sigma_ray_k^2 = 1 / (d^T Sigma_k^{-1} d)

Alpha compositing for depth: Just as camera rendering alpha-composites colors along viewing rays, lidar rendering alpha-composites depths along sensor rays. Sort the Gaussians by t_k (front to back) and accumulate:

T_k = prod_{j < k} (1 - w_j)          # transmittance up to Gaussian k
w_k = alpha_k * G_ray_k               # ray-space weight for Gaussian k

depth_rendered = sum_k (T_k * w_k * t_k)

where G_ray_k is the Gaussian evaluated along the ray at its peak. This produces a differentiable depth estimate for each lidar ray.

Implementation in PyTorch:

import torch
import torch.nn as nn

def compute_ray_gaussian_intersection(
    ray_origins: torch.Tensor,      # (N_rays, 3)
    ray_directions: torch.Tensor,   # (N_rays, 3)
    means: torch.Tensor,            # (N_gaussians, 3)
    covariances_inv: torch.Tensor,  # (N_gaussians, 3, 3)
    opacities: torch.Tensor,        # (N_gaussians,)
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute depth contributions of all Gaussians along all rays.

    Returns:
        depths: (N_rays, N_gaussians) -- depth of each Gaussian along each ray
        weights: (N_rays, N_gaussians) -- alpha-compositing weight of each Gaussian
    """
    # delta = mu - o: (N_rays, N_gaussians, 3)
    delta = means.unsqueeze(0) - ray_origins.unsqueeze(1)

    # d^T Sigma^{-1} delta and d^T Sigma^{-1} d for each ray-Gaussian pair
    d = ray_directions.unsqueeze(1).expand(-1, means.shape[0], -1)  # (N_rays, N_gaussians, 3)

    # Sigma^{-1} @ d and Sigma^{-1} @ delta for every ray-Gaussian pair
    # d_Sinv, delta_Sinv: (N_rays, N_gaussians, 3)
    d_Sinv = torch.einsum('nij,rnj->rni', covariances_inv, d)
    delta_Sinv = torch.einsum('nij,rnj->rni', covariances_inv, delta)

    # t_peak = (d^T Sigma^{-1} delta) / (d^T Sigma^{-1} d)
    numerator = torch.sum(d_Sinv * delta, dim=-1)      # (N_rays, N_gaussians)
    denominator = torch.sum(d_Sinv * d, dim=-1)        # (N_rays, N_gaussians)

    t_peak = numerator / (denominator + 1e-8)          # (N_rays, N_gaussians)

    # Effective 1D variance along the ray (sigma_ray_k^2 in the text; kept for reference)
    sigma_ray_sq = 1.0 / (denominator + 1e-8)

    # Gaussian density at the peak along the ray. The exponent is the residual of
    # the quadratic, i.e. the squared Mahalanobis distance from the Gaussian
    # center to the ray (zero only if the ray passes through the center).
    peak_exponent = -0.5 * (
        torch.sum(delta_Sinv * delta, dim=-1) - numerator**2 / (denominator + 1e-8)
    )
    power = opacities.unsqueeze(0) * torch.exp(peak_exponent)

    # Mask out Gaussians behind the ray origin
    power = power * (t_peak > 0).float()

    return t_peak, power


def alpha_composite_depth(
    depths: torch.Tensor,    # (N_rays, N_gaussians)
    weights: torch.Tensor,   # (N_rays, N_gaussians)
) -> torch.Tensor:
    """
    Alpha-composite depths front-to-back along each ray.

    Returns:
        rendered_depth: (N_rays,) -- expected depth for each ray
    """
    # Sort by depth along each ray
    sorted_indices = torch.argsort(depths, dim=1)
    sorted_depths = torch.gather(depths, 1, sorted_indices)
    sorted_weights = torch.gather(weights, 1, sorted_indices)

    # Compute transmittance: T_k = prod_{j<k} (1 - w_j)
    # Use cumprod of (1 - w) shifted by one position
    one_minus_w = 1.0 - sorted_weights
    # Exclusive cumprod: T_k = prod_{j=0}^{k-1} (1 - w_j)
    transmittance = torch.cat([
        torch.ones_like(sorted_weights[:, :1]),
        torch.cumprod(one_minus_w[:, :-1], dim=1)
    ], dim=1)

    # Final weight = T_k * w_k
    final_weights = transmittance * sorted_weights

    # Rendered depth = weighted sum
    rendered_depth = torch.sum(final_weights * sorted_depths, dim=1)

    return rendered_depth
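A minimal usage sketch of the two functions above: a single isotropic Gaussian placed 10 m in front of one ray. The ray passes through the Gaussian center, so the peak weight equals the opacity and the composited depth is alpha * t (here 0.9 * 10 = 9.0; in practice the expected depth is often normalized by the accumulated alpha):

import torch

means = torch.tensor([[0.0, 0.0, 10.0]])
cov_inv = (torch.eye(3) / 0.5**2).unsqueeze(0)     # isotropic Gaussian, sigma = 0.5 m
opacities = torch.tensor([0.9])

origins = torch.zeros(1, 3)
directions = torch.tensor([[0.0, 0.0, 1.0]])

t_peak, weights = compute_ray_gaussian_intersection(
    origins, directions, means, cov_inv, opacities
)
depth = alpha_composite_depth(t_peak, weights)
print(t_peak.item(), weights.item(), depth.item())  # ~10.0, ~0.9, ~9.0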

Intensity Modeling

Lidar sensors do not just measure range -- they also record the intensity of the returned pulse. Intensity depends on several physical factors:

  1. Material reflectance (rho): Different surfaces reflect different fractions of the incident laser energy. Retroreflective materials (lane markings, traffic signs) return very high intensity; dark surfaces (asphalt, black cars) return low intensity.

  2. Incidence angle (theta): A surface struck at a steep angle reflects less energy back to the sensor than one struck head-on. The reflected intensity falls off as cos(theta), where theta is the angle between the incoming ray and the surface normal.

  3. Range attenuation: The returned signal strength follows an inverse-square law with distance: I ~ 1/r^2. Most lidar sensors apply internal gain compensation, but residual range effects remain.

Combining these factors, the physical model for returned intensity is:

I = rho * cos(theta) / r^2 + noise

In our Gaussian representation, we model this by adding per-Gaussian attributes:

  • Reflectance (rho_k): A learned scalar in [0, 1] for each Gaussian, representing the surface reflectance at that location.
  • Surface normal (n_k): A unit vector indicating the local surface orientation. This can be derived from the Gaussian covariance (the shortest axis of the ellipsoid) or learned explicitly.

The incidence angle for ray direction d hitting Gaussian k with normal n_k is:

cos(theta_k) = |dot(d, n_k)|

The rendered intensity for a lidar ray is then alpha-composited analogously to depth:

intensity_rendered = sum_k (T_k * w_k * rho_k * cos(theta_k) / t_k^2)

Implementation:

class GaussianLidarAttributes(nn.Module):
    """Per-Gaussian attributes for lidar rendering."""

    def __init__(self, n_gaussians: int):
        super().__init__()
        # Reflectance: logit-space, sigmoid to [0, 1]
        self.reflectance_logit = nn.Parameter(torch.zeros(n_gaussians))
        # Surface normals: raw 3-vectors, normalized in forward pass
        self.normals_raw = nn.Parameter(torch.randn(n_gaussians, 3))

    @property
    def reflectance(self) -> torch.Tensor:
        return torch.sigmoid(self.reflectance_logit)

    @property
    def normals(self) -> torch.Tensor:
        return nn.functional.normalize(self.normals_raw, dim=-1)

    def compute_intensity(
        self,
        ray_directions: torch.Tensor,   # (N_rays, 3)
        gaussian_indices: torch.Tensor,  # indices of contributing Gaussians
        depths: torch.Tensor,            # (N_rays, N_contributing) depths
        weights: torch.Tensor,           # (N_rays, N_contributing) alpha weights
    ) -> torch.Tensor:
        """Compute rendered lidar intensity for each ray."""
        rho = self.reflectance[gaussian_indices]       # (N_contributing,)
        n = self.normals[gaussian_indices]              # (N_contributing, 3)

        # Incidence angle: |cos(theta)| = |dot(d, n)|
        d = ray_directions.unsqueeze(1)                # (N_rays, 1, 3)
        cos_theta = torch.abs(torch.sum(d * n.unsqueeze(0), dim=-1))  # (N_rays, N_contributing)

        # Range attenuation: 1/r^2, clamped for numerical stability
        range_atten = 1.0 / (depths**2 + 1e-4)

        # Per-Gaussian intensity contribution
        intensity_per_gaussian = rho.unsqueeze(0) * cos_theta * range_atten

        # Alpha-composite
        rendered_intensity = torch.sum(weights * intensity_per_gaussian, dim=1)

        return rendered_intensity

Intuition: Intensity modeling matters because downstream perception algorithms (especially those processing lidar data) use intensity as a feature for classification. Lane markings are bright, road surface is dark, vehicles have moderate reflectance with sharp gradients at edges. If your simulator produces flat, uniform intensity, the perception model will behave differently than on real data.

Raydrop Prediction

Not every lidar ray returns a measurement. In real sensors, rays can be "dropped" (no return) for several reasons:

  • Absorption: The surface absorbs most of the laser energy (e.g., dark, matte surfaces at long range).
  • Specular reflection: A mirror-like surface deflects the beam away from the sensor (e.g., wet roads, glass, polished metal).
  • Max range exceeded: The target is beyond the sensor's maximum detection range.
  • Beam divergence: At long range, the beam footprint is large and the returned energy is spread over a wide area.
  • Occlusion: An object blocks the ray before it reaches the expected target.

Raydrop is critical for realism. Real lidar data has characteristic "holes" -- missing points on dark vehicles, glass surfaces, and at range boundaries. A simulator that returns a point for every ray produces unnaturally dense, clean point clouds that perception models have never seen.

We model raydrop as a learned binary classification problem. For each ray, a small MLP predicts the probability of the ray returning a valid measurement, conditioned on features accumulated along the ray:

p_valid = sigmoid(MLP(f_ray))

where f_ray is a feature vector for the ray, constructed from:

  • Accumulated depth: depth_rendered
  • Accumulated intensity: intensity_rendered
  • Accumulated opacity: sum_k (T_k * w_k) -- total accumulated alpha along the ray
  • Ray geometry: elevation angle, range to nearest surface

During training, we supervise with binary cross-entropy against the ground-truth raydrop mask (1 = valid point, 0 = dropped).

Implementation:

class RaydropPredictor(nn.Module):
    """Predict which lidar rays return valid measurements."""

    def __init__(self, feature_dim: int = 8, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(
        self,
        rendered_depth: torch.Tensor,     # (N_rays,)
        rendered_intensity: torch.Tensor,  # (N_rays,)
        accumulated_alpha: torch.Tensor,   # (N_rays,)
        ray_elevation: torch.Tensor,       # (N_rays,)
    ) -> torch.Tensor:
        """
        Predict raydrop probability for each ray.

        Returns:
            p_valid: (N_rays,) -- probability that the ray returns a valid point.
                     Values near 1 = high confidence the point exists.
                     Values near 0 = likely a dropped ray.
        """
        features = torch.stack([
            rendered_depth,
            rendered_intensity,
            accumulated_alpha,
            ray_elevation,
            torch.log(rendered_depth + 1e-4),   # log depth (helps with scale)
            torch.sin(ray_elevation),            # trigonometric elevation features
            torch.cos(ray_elevation),
            rendered_depth * accumulated_alpha,  # interaction feature
        ], dim=-1)

        logits = self.mlp(features).squeeze(-1)
        return torch.sigmoid(logits)

Training detail: Raydrop masks are extracted from the ground-truth lidar data. In the Waymo Open Dataset, a ray is considered "dropped" if no point exists at the corresponding (azimuth, elevation) bin. The raydrop rate varies from 5-30% depending on the scene (more drops in scenes with lots of glass, vegetation, or long-range content).
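A sketch of how such a mask can be built from a raw point cloud by binning points into the sensor's (elevation, azimuth) grid (the beam-angle table and bin count are assumptions; real datasets often provide a range image or per-point beam indices directly):

import math
import torch

def build_raydrop_mask(
    points: torch.Tensor,            # (P, 3) lidar points in the sensor frame
    elevation_angles: torch.Tensor,  # (H,) beam elevations in radians
    n_azimuth: int = 2048,
) -> torch.Tensor:
    """Return an (H, n_azimuth) mask: 1 where a return exists, 0 where the ray dropped."""
    r = points.norm(dim=-1)
    azimuth = torch.atan2(points[:, 1], points[:, 0]) % (2 * math.pi)
    elevation = torch.asin(points[:, 2] / (r + 1e-8))

    # Assign each point to its nearest beam and its azimuth bin
    beam_idx = torch.argmin(
        (elevation.unsqueeze(1) - elevation_angles.unsqueeze(0)).abs(), dim=1
    )
    az_idx = (azimuth / (2 * math.pi) * n_azimuth).long().clamp(0, n_azimuth - 1)

    mask = torch.zeros(len(elevation_angles), n_azimuth)
    mask[beam_idx, az_idx] = 1.0
    return mask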

Multi-Sensor Calibration

A key requirement for multi-sensor rendering is precise knowledge of how each sensor is positioned relative to the vehicle. This is captured by extrinsic calibration matrices -- 4x4 homogeneous transforms that convert points between coordinate frames.

Coordinate frames in play:

Frame | Origin | Axes | Use
World | Arbitrary map origin | Right-handed, Z-up | Scene representation, Gaussian positions
Ego | Center of rear axle | X-forward, Y-left, Z-up | Vehicle reference frame
Camera | Camera optical center | X-right, Y-down, Z-forward | Image rendering
Lidar | Lidar sensor center | X-forward, Y-left, Z-up | Point cloud generation

The transform chain for converting a world-frame Gaussian position to a camera pixel:

p_ego = T_ego_world @ p_world             # world -> ego
p_cam = T_cam_ego @ p_ego                 # ego -> camera
p_pixel = K @ p_cam[:3] / p_cam[2]        # camera -> pixel (pinhole projection)

For lidar rendering, the chain is:

p_ego = T_ego_world @ p_world             # world -> ego
p_lidar = T_lidar_ego @ p_ego             # ego -> lidar frame
# Then ray casting is performed in lidar frame

Implementation:

class SensorCalibration:
    """Stores and applies sensor extrinsic/intrinsic calibration."""

    def __init__(
        self,
        T_ego_sensor: torch.Tensor,    # (4, 4) transform from sensor to ego
        intrinsics: torch.Tensor = None,  # (3, 3) camera intrinsic matrix (cameras only)
        image_size: tuple[int, int] = None,  # (H, W) for cameras
    ):
        self.T_ego_sensor = T_ego_sensor
        self.T_sensor_ego = torch.inverse(T_ego_sensor)
        self.intrinsics = intrinsics
        self.image_size = image_size

    def world_to_sensor(
        self,
        points_world: torch.Tensor,     # (N, 3)
        T_ego_world: torch.Tensor,       # (4, 4) ego pose in world frame
    ) -> torch.Tensor:
        """Transform points from world frame to sensor frame."""
        T_sensor_world = self.T_sensor_ego @ torch.inverse(T_ego_world)

        # Homogeneous coordinates
        ones = torch.ones(*points_world.shape[:-1], 1, device=points_world.device)
        points_h = torch.cat([points_world, ones], dim=-1)  # (N, 4)

        points_sensor = (T_sensor_world @ points_h.T).T[:, :3]  # (N, 3)
        return points_sensor

    def project_to_image(
        self,
        points_cam: torch.Tensor,   # (N, 3) in camera frame
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Project 3D points in camera frame to 2D pixel coordinates.

        Returns:
            pixels: (N, 2) -- (u, v) pixel coordinates
            valid: (N,) -- boolean mask for points in front of camera and within image bounds
        """
        assert self.intrinsics is not None, "No intrinsics for this sensor"

        z = points_cam[:, 2]
        valid = z > 0.1  # Must be in front of camera

        pixels = (self.intrinsics @ points_cam.T).T  # (N, 3)
        pixels = pixels[:, :2] / (pixels[:, 2:3] + 1e-8)  # (N, 2)

        if self.image_size is not None:
            H, W = self.image_size
            in_bounds = (
                (pixels[:, 0] >= 0) & (pixels[:, 0] < W) &
                (pixels[:, 1] >= 0) & (pixels[:, 1] < H)
            )
            valid = valid & in_bounds

        return pixels, valid

    def generate_lidar_rays(
        self,
        elevation_angles: torch.Tensor,   # (H,) elevation angles in radians
        azimuth_angles: torch.Tensor,      # (W,) azimuth angles in radians
        T_ego_world: torch.Tensor,         # (4, 4) ego pose
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Generate lidar ray origins and directions in world frame.

        Returns:
            origins: (H*W, 3) -- ray origins in world frame
            directions: (H*W, 3) -- unit ray directions in world frame
        """
        # Build direction vectors in sensor frame
        el = elevation_angles.unsqueeze(1).expand(-1, len(azimuth_angles))  # (H, W)
        az = azimuth_angles.unsqueeze(0).expand(len(elevation_angles), -1)  # (H, W)

        dx = torch.cos(el) * torch.cos(az)
        dy = torch.cos(el) * torch.sin(az)
        dz = torch.sin(el)

        dirs_sensor = torch.stack([dx, dy, dz], dim=-1).reshape(-1, 3)  # (H*W, 3)

        # Transform to world frame: T_world_sensor = (ego pose) @ T_ego_sensor
        T_world_sensor = T_ego_world @ self.T_ego_sensor
        R = T_world_sensor[:3, :3]
        t = T_world_sensor[:3, 3]

        dirs_world = (R @ dirs_sensor.T).T   # (H*W, 3)
        origins_world = t.unsqueeze(0).expand(dirs_sensor.shape[0], -1)  # (H*W, 3)

        return origins_world, dirs_world

Calibration consistency is critical: If the camera-lidar extrinsics are even slightly wrong, projecting lidar points onto camera images will show misalignment. In real datasets (Waymo, nuScenes), calibration is provided; in your training pipeline, you should verify alignment before training by overlaying projected lidar depth onto camera images.

Joint Optimization

The power of a multi-sensor neural renderer comes from joint training: camera and lidar supervision signals are complementary. Camera images provide dense color and texture information but limited depth accuracy. Lidar provides precise depth and geometry but is sparse and carries no color. By optimizing a shared scene representation against both modalities, each sensor helps regularize the other.

Multi-sensor loss function:

L_total = lambda_rgb * L_camera + lambda_depth * L_depth + lambda_intensity * L_intensity + lambda_drop * L_raydrop

where:

  • Camera loss (L_camera): A combination of L1 pixel loss and structural similarity:

    L_camera = (1 - lambda_ssim) * |I_rendered - I_gt|_1 + lambda_ssim * (1 - SSIM(I_rendered, I_gt))
    

    This is the standard loss from 3DGS training. Optionally add LPIPS for perceptual quality.

  • Lidar depth loss (L_depth): L1 or Huber loss on rendered vs. ground-truth depth, applied only to rays with valid returns:

    L_depth = (1 / N_valid) * sum_{valid rays} |depth_rendered - depth_gt|
    

    Using Huber loss instead of L1 is more robust to depth outliers at scene boundaries.

  • Lidar intensity loss (L_intensity): MSE between rendered and ground-truth intensity values:

    L_intensity = (1 / N_valid) * sum_{valid rays} (intensity_rendered - intensity_gt)^2
    
  • Raydrop loss (L_raydrop): Binary cross-entropy between predicted and actual raydrop:

    L_raydrop = -sum_k [y_k * log(p_k) + (1 - y_k) * log(1 - p_k)]
    

    where y_k = 1 for rays with valid returns and y_k = 0 for dropped rays.

Loss weighting strategy: The loss weights (lambda_*) control the relative importance of each modality. A common starting point:

Loss | Weight | Rationale
lambda_rgb | 1.0 | Anchor loss -- camera is the densest supervision
lambda_ssim | 0.2 | SSIM component within camera loss
lambda_depth | 0.1 | Lidar depth -- strong geometric constraint
lambda_intensity | 0.05 | Intensity -- secondary lidar signal
lambda_drop | 0.01 | Raydrop -- auxiliary classification task

These weights are tuned empirically. The key insight is that camera loss should dominate early in training (to establish scene structure), and lidar losses become more important later (to refine geometry and add sensor-specific realism).

Implementation:

class MultiSensorLoss(nn.Module):
    """Combined loss for joint camera + lidar training."""

    def __init__(
        self,
        lambda_rgb: float = 1.0,
        lambda_ssim: float = 0.2,
        lambda_depth: float = 0.1,
        lambda_intensity: float = 0.05,
        lambda_raydrop: float = 0.01,
    ):
        super().__init__()
        self.lambda_rgb = lambda_rgb
        self.lambda_ssim = lambda_ssim
        self.lambda_depth = lambda_depth
        self.lambda_intensity = lambda_intensity
        self.lambda_raydrop = lambda_raydrop

    def camera_loss(
        self,
        rendered: torch.Tensor,   # (B, 3, H, W)
        target: torch.Tensor,     # (B, 3, H, W)
    ) -> torch.Tensor:
        """Compute camera rendering loss (L1 + SSIM)."""
        l1 = torch.mean(torch.abs(rendered - target))
        ssim_val = self._ssim(rendered, target)
        return (1 - self.lambda_ssim) * l1 + self.lambda_ssim * (1 - ssim_val)

    def lidar_depth_loss(
        self,
        rendered_depth: torch.Tensor,   # (N_rays,)
        target_depth: torch.Tensor,     # (N_rays,)
        valid_mask: torch.Tensor,       # (N_rays,) bool
    ) -> torch.Tensor:
        """Huber loss on lidar depth for valid rays only."""
        if valid_mask.sum() == 0:
            return torch.tensor(0.0, device=rendered_depth.device)
        return nn.functional.huber_loss(
            rendered_depth[valid_mask],
            target_depth[valid_mask],
            delta=1.0,
        )

    def lidar_intensity_loss(
        self,
        rendered_intensity: torch.Tensor,  # (N_rays,)
        target_intensity: torch.Tensor,    # (N_rays,)
        valid_mask: torch.Tensor,          # (N_rays,)
    ) -> torch.Tensor:
        """MSE loss on lidar intensity for valid rays."""
        if valid_mask.sum() == 0:
            return torch.tensor(0.0, device=rendered_intensity.device)
        return nn.functional.mse_loss(
            rendered_intensity[valid_mask],
            target_intensity[valid_mask],
        )

    def raydrop_loss(
        self,
        predicted_prob: torch.Tensor,  # (N_rays,)
        target_valid: torch.Tensor,    # (N_rays,) float: 1.0 = valid, 0.0 = dropped
    ) -> torch.Tensor:
        """Binary cross-entropy loss for raydrop prediction."""
        return nn.functional.binary_cross_entropy(
            predicted_prob,
            target_valid,
        )

    def forward(
        self,
        camera_rendered: torch.Tensor,
        camera_target: torch.Tensor,
        lidar_depth_rendered: torch.Tensor,
        lidar_depth_target: torch.Tensor,
        lidar_intensity_rendered: torch.Tensor,
        lidar_intensity_target: torch.Tensor,
        raydrop_pred: torch.Tensor,
        raydrop_target: torch.Tensor,
        valid_mask: torch.Tensor,
    ) -> dict[str, torch.Tensor]:
        """Compute all losses and return individually + total."""
        l_cam = self.camera_loss(camera_rendered, camera_target)
        l_depth = self.lidar_depth_loss(lidar_depth_rendered, lidar_depth_target, valid_mask)
        l_intensity = self.lidar_intensity_loss(
            lidar_intensity_rendered, lidar_intensity_target, valid_mask
        )
        l_drop = self.raydrop_loss(raydrop_pred, raydrop_target)

        total = (
            self.lambda_rgb * l_cam +
            self.lambda_depth * l_depth +
            self.lambda_intensity * l_intensity +
            self.lambda_raydrop * l_drop
        )

        return {
            'total': total,
            'camera': l_cam,
            'depth': l_depth,
            'intensity': l_intensity,
            'raydrop': l_drop,
        }

    @staticmethod
    def _ssim(
        img1: torch.Tensor,
        img2: torch.Tensor,
        window_size: int = 11,
    ) -> torch.Tensor:
        """Compute SSIM between two image tensors (B, C, H, W)."""
        C1 = 0.01 ** 2
        C2 = 0.03 ** 2

        # Create Gaussian window
        sigma = 1.5
        coords = torch.arange(window_size, dtype=torch.float32, device=img1.device)
        coords -= window_size // 2
        g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
        window = g.unsqueeze(0) * g.unsqueeze(1)
        window = window / window.sum()
        window = window.unsqueeze(0).unsqueeze(0)

        C = img1.shape[1]
        window = window.expand(C, 1, -1, -1)

        mu1 = nn.functional.conv2d(img1, window, padding=window_size//2, groups=C)
        mu2 = nn.functional.conv2d(img2, window, padding=window_size//2, groups=C)

        mu1_sq = mu1 ** 2
        mu2_sq = mu2 ** 2
        mu12 = mu1 * mu2

        sigma1_sq = nn.functional.conv2d(img1 * img1, window, padding=window_size//2, groups=C) - mu1_sq
        sigma2_sq = nn.functional.conv2d(img2 * img2, window, padding=window_size//2, groups=C) - mu2_sq
        sigma12 = nn.functional.conv2d(img1 * img2, window, padding=window_size//2, groups=C) - mu12

        ssim_map = ((2 * mu12 + C1) * (2 * sigma12 + C2)) / \
                   ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))

        return ssim_map.mean()
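A quick interface smoke test with random tensors (shapes only; all values are dummies, so the printed numbers are meaningless beyond confirming that the shapes line up):

import torch

loss_fn = MultiSensorLoss()
n_rays = 1000
valid = torch.rand(n_rays) > 0.1                  # ~90% valid returns

losses = loss_fn(
    camera_rendered=torch.rand(1, 3, 64, 64),
    camera_target=torch.rand(1, 3, 64, 64),
    lidar_depth_rendered=torch.rand(n_rays) * 50,
    lidar_depth_target=torch.rand(n_rays) * 50,
    lidar_intensity_rendered=torch.rand(n_rays),
    lidar_intensity_target=torch.rand(n_rays),
    raydrop_pred=torch.sigmoid(torch.randn(n_rays)),
    raydrop_target=valid.float(),
    valid_mask=valid,
)
print({k: round(v.item(), 4) for k, v in losses.items()})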

Step-by-Step Implementation Guide

Step 1: Environment Setup (30 min)

Goal: Set up the development environment with all required packages and verify GPU availability.

Create and activate a virtual environment:

mkdir -p multi-sensor-neural-renderer && cd multi-sensor-neural-renderer
python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install torch torchvision              # Core framework
pip install numpy matplotlib               # Numerics and visualization
pip install open3d                         # 3D point cloud visualization
pip install scipy                          # Linear algebra utilities
pip install opencv-python                  # Image I/O and processing
pip install tqdm                           # Progress bars
pip install scikit-image                   # Image quality metrics (SSIM, PSNR)
pip install lpips                          # Perceptual similarity metric
pip install plyfile                        # PLY file I/O for point clouds
pip install ipykernel jupyter              # Notebook support

Or install from requirements.txt:

torch>=2.0.0
torchvision>=0.15.0
numpy>=1.24.0
matplotlib>=3.7.0
open3d>=0.17.0
scipy>=1.10.0
opencv-python>=4.8.0
tqdm>=4.65.0
scikit-image>=0.21.0
lpips>=0.1.4
plyfile>=1.0.0
ipykernel>=6.0.0
jupyter>=1.0.0

Verify the environment:

# verify_setup.py
import torch
import numpy as np
import open3d as o3d
import matplotlib.pyplot as plt
from scipy.spatial.transform import Rotation

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")

# Test basic 3D operations
means = torch.randn(100, 3)
print(f"\nCreated {means.shape[0]} random 3D Gaussians")
print(f"Bounding box: [{means.min(0).values.numpy()}] to [{means.max(0).values.numpy()}]")

# Test Open3D
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(means.numpy())
print(f"Open3D point cloud: {len(pcd.points)} points")

# Test rotation
R = Rotation.from_euler('xyz', [10, 20, 30], degrees=True).as_matrix()
print(f"\nRotation matrix shape: {R.shape}")
print(f"Determinant: {np.linalg.det(R):.6f} (should be 1.0)")

print("\n=== Setup verified successfully! ===")

Prepare synthetic test data:

For initial development and testing, we create a synthetic Gaussian scene rather than loading real driving data. This lets you validate the rendering math before dealing with dataset complexities.

# create_synthetic_scene.py
import torch
import numpy as np
import json

def create_synthetic_gaussian_scene(
    n_gaussians: int = 500,
    scene_extent: float = 20.0,
    save_path: str = "data/synthetic_scene.pt",
):
    """
    Create a synthetic scene with Gaussians arranged to resemble
    a simplified driving environment (ground plane + objects).
    """
    means = []
    scales = []
    reflectances = []
    normals = []

    # Ground plane: flat Gaussians spread across XY plane
    n_ground = n_gaussians // 2
    ground_xy = torch.rand(n_ground, 2) * scene_extent - scene_extent / 2
    ground_z = torch.zeros(n_ground, 1) + torch.randn(n_ground, 1) * 0.02
    ground_means = torch.cat([ground_xy, ground_z], dim=1)
    ground_scales = torch.tensor([0.5, 0.5, 0.01]).unsqueeze(0).expand(n_ground, -1)
    ground_normals = torch.tensor([0.0, 0.0, 1.0]).unsqueeze(0).expand(n_ground, -1)
    ground_reflectances = torch.full((n_ground,), 0.3)  # Asphalt-like

    means.append(ground_means)
    scales.append(ground_scales)
    normals.append(ground_normals)
    reflectances.append(ground_reflectances)

    # Box objects: cubes of Gaussians representing vehicles
    n_objects = 5
    n_per_object = (n_gaussians - n_ground) // n_objects

    for i in range(n_objects):
        center = torch.tensor([
            (torch.rand(1).item() - 0.5) * scene_extent * 0.6,
            (torch.rand(1).item() - 0.5) * scene_extent * 0.6,
            0.8 + torch.rand(1).item() * 0.5,
        ])
        obj_means = center.unsqueeze(0) + torch.randn(n_per_object, 3) * torch.tensor([1.0, 0.5, 0.4])
        obj_scales = torch.rand(n_per_object, 3) * 0.2 + 0.05

        # Normals point outward from center
        obj_normals = obj_means - center.unsqueeze(0)
        obj_normals = obj_normals / (obj_normals.norm(dim=-1, keepdim=True) + 1e-8)

        obj_reflectances = torch.full((n_per_object,), 0.5 + torch.rand(1).item() * 0.3)

        means.append(obj_means)
        scales.append(obj_scales)
        normals.append(obj_normals)
        reflectances.append(obj_reflectances)

    all_means = torch.cat(means, dim=0)
    n_total = all_means.shape[0]  # may differ slightly from n_gaussians due to integer division

    scene = {
        'means': all_means,
        'scales': torch.cat(scales, dim=0),
        'normals': torch.cat(normals, dim=0),
        'reflectances': torch.cat(reflectances, dim=0),
        'opacities': torch.ones(n_total) * 0.9,
        'sh_coeffs': torch.randn(n_total, 3, 16),  # SH degree 3
    }

    import os
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    torch.save(scene, save_path)
    print(f"Saved synthetic scene with {n_total} Gaussians to {save_path}")
    return scene

if __name__ == "__main__":
    create_synthetic_gaussian_scene()

Step 2: Lidar Ray Casting Through Gaussians (Notebook 01, ~3 hours)

Goal: Implement a complete lidar rendering pipeline that casts rays through a Gaussian scene and produces depth, intensity, and raydrop outputs.

Lidar beam pattern: A typical spinning lidar has a fixed set of elevation angles (non-uniformly spaced, denser near the horizon) and sweeps 360 degrees in azimuth:

def create_lidar_beam_pattern(
    n_beams: int = 64,
    n_azimuth: int = 2048,
    fov_up: float = 3.0,      # degrees above horizon
    fov_down: float = -25.0,   # degrees below horizon
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Create a realistic lidar beam pattern.

    Returns:
        elevation_angles: (n_beams,) in radians
        azimuth_angles: (n_azimuth,) in radians
    """
    # Non-uniform elevation spacing: denser near horizon
    # This mimics real Velodyne/Ouster beam patterns
    elevation_deg = torch.linspace(fov_down, fov_up, n_beams)

    # Apply a non-linear warp around the FOV center: the exponent keeps the FOV
    # endpoints fixed while redistributing the beams in between (a rough stand-in
    # for the non-uniform spacing of real sensors)
    center = (fov_up + fov_down) / 2
    max_dev = max(abs(fov_up - center), abs(fov_down - center))
    normalized = (elevation_deg - center) / max_dev                 # in [-1, 1]
    elevation_deg = center + torch.sign(normalized) * torch.abs(normalized) ** 0.7 * max_dev

    elevation_rad = torch.deg2rad(elevation_deg)
    azimuth_rad = torch.linspace(0, 2 * np.pi, n_azimuth + 1)[:-1]  # Exclude 2*pi (same as 0)

    return elevation_rad, azimuth_rad
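A quick way to inspect the resulting (non-uniform) vertical spacing -- a sketch, assuming matplotlib is available in the notebook:

import matplotlib.pyplot as plt
import torch

el, az = create_lidar_beam_pattern(n_beams=64, n_azimuth=2048)
spacing = torch.rad2deg(el[1:] - el[:-1])
plt.plot(torch.rad2deg(el[:-1]).numpy(), spacing.numpy(), marker='.')
plt.xlabel('Elevation (deg)')
plt.ylabel('Spacing to next beam (deg)')
plt.title('Vertical beam spacing')
plt.show()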

Full lidar rendering function:

def render_lidar(
    ray_origins: torch.Tensor,       # (N_rays, 3)
    ray_directions: torch.Tensor,    # (N_rays, 3)
    means: torch.Tensor,             # (M, 3)
    covariances_inv: torch.Tensor,   # (M, 3, 3)
    opacities: torch.Tensor,         # (M,)
    reflectances: torch.Tensor,      # (M,)
    normals: torch.Tensor,           # (M, 3)
    max_range: float = 75.0,
    n_nearest: int = 64,
) -> dict[str, torch.Tensor]:
    """
    Render lidar depth, intensity, and accumulated alpha for all rays.

    For efficiency, only consider the N nearest Gaussians to each ray.

    Returns:
        dict with keys:
            'depth': (N_rays,) rendered depth
            'intensity': (N_rays,) rendered intensity
            'alpha': (N_rays,) accumulated opacity
    """
    N = ray_origins.shape[0]
    M = means.shape[0]

    # Step 1: Find nearest Gaussians to each ray (approximate with distance to ray origin)
    # In production, use spatial data structures; here we use brute force on the subset
    dists = torch.cdist(ray_origins[:1], means)  # (1, M) -- same origin for all rays in a sweep
    _, nearest_idx = torch.topk(dists[0], k=min(n_nearest, M), largest=False)

    # Subset scene
    sub_means = means[nearest_idx]                    # (K, 3)
    sub_cov_inv = covariances_inv[nearest_idx]        # (K, 3, 3)
    sub_opacities = opacities[nearest_idx]            # (K,)
    sub_reflectances = reflectances[nearest_idx]      # (K,)
    sub_normals = normals[nearest_idx]                # (K, 3)

    K = sub_means.shape[0]

    # Step 2: Compute ray-Gaussian intersection depths
    delta = sub_means.unsqueeze(0) - ray_origins.unsqueeze(1)  # (N, K, 3)
    d = ray_directions.unsqueeze(1).expand(-1, K, -1)          # (N, K, 3)

    # d^T Sigma^{-1}: (N, K, 3)
    d_Sinv = torch.einsum('kij,nkj->nki', sub_cov_inv, d)

    # t_peak = (d^T Sigma^{-1} delta) / (d^T Sigma^{-1} d)
    numerator = torch.sum(d_Sinv * delta, dim=-1)      # (N, K)
    denominator = torch.sum(d_Sinv * d, dim=-1)        # (N, K)
    t_peak = numerator / (denominator + 1e-8)           # (N, K)

    # Gaussian density at the peak along the ray: the exponent is the residual of
    # the quadratic, i.e. how far the ray passes from the Gaussian center in
    # Mahalanobis distance
    delta_Sinv = torch.einsum('kij,nkj->nki', sub_cov_inv, delta)   # (N, K, 3)
    peak_exponent = -0.5 * (
        torch.sum(delta_Sinv * delta, dim=-1) - numerator**2 / (denominator + 1e-8)
    )
    power = sub_opacities.unsqueeze(0) * torch.exp(peak_exponent)

    # Mask: only count Gaussians in front of the ray and within max range
    valid_mask = (t_peak > 0.1) & (t_peak < max_range)
    power = power * valid_mask.float()

    # Step 3: Alpha compositing (sort by depth, accumulate front-to-back)
    # Set invalid depths to large value so they sort last
    t_for_sort = t_peak.clone()
    t_for_sort[~valid_mask] = max_range + 1.0

    sorted_indices = torch.argsort(t_for_sort, dim=1)
    sorted_t = torch.gather(t_for_sort, 1, sorted_indices)
    sorted_power = torch.gather(power, 1, sorted_indices)

    # Transmittance
    one_minus_alpha = 1.0 - sorted_power
    transmittance = torch.cat([
        torch.ones(N, 1, device=ray_origins.device),
        torch.cumprod(one_minus_alpha[:, :-1], dim=1),
    ], dim=1)

    final_weights = transmittance * sorted_power  # (N, K)

    # Rendered depth
    rendered_depth = torch.sum(final_weights * sorted_t, dim=1)  # (N,)

    # Step 4: Intensity (gather reflectances and normals in the same sorted order)
    sorted_reflectances = sub_reflectances[sorted_indices]   # (N, K)
    sorted_normals = sub_normals[sorted_indices]             # (N, K, 3)

    # Incidence angle
    cos_theta = torch.abs(torch.sum(
        ray_directions.unsqueeze(1) * sorted_normals, dim=-1
    ))  # (N, K)

    # Range attenuation
    range_atten = 1.0 / (sorted_t ** 2 + 1.0)

    intensity_per_gaussian = sorted_reflectances * cos_theta * range_atten
    rendered_intensity = torch.sum(final_weights * intensity_per_gaussian, dim=1)

    # Accumulated alpha (total opacity along ray)
    accumulated_alpha = torch.sum(final_weights, dim=1)

    return {
        'depth': rendered_depth,
        'intensity': rendered_intensity,
        'alpha': accumulated_alpha,
    }

Visualizing the rendered point cloud:

def visualize_lidar_rendering(
    depth: torch.Tensor,              # (H*W,) or (H, W)
    intensity: torch.Tensor,          # same shape
    elevation_angles: torch.Tensor,   # (H,)
    azimuth_angles: torch.Tensor,     # (W,)
    max_range: float = 75.0,
):
    """Convert rendered depth/intensity to 3D point cloud and visualize."""
    H = len(elevation_angles)
    W = len(azimuth_angles)

    depth_2d = depth.reshape(H, W)
    intensity_2d = intensity.reshape(H, W)

    # Build direction vectors
    el = elevation_angles.unsqueeze(1).expand(H, W)
    az = azimuth_angles.unsqueeze(0).expand(H, W)

    x = depth_2d * torch.cos(el) * torch.cos(az)
    y = depth_2d * torch.cos(el) * torch.sin(az)
    z = depth_2d * torch.sin(el)

    points = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
    intensities = intensity_2d.reshape(-1)

    # Filter by range
    valid = depth.reshape(-1) < max_range * 0.95
    points = points[valid].numpy()
    intensities = intensities[valid].numpy()

    # Create Open3D point cloud
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)

    # Color by intensity
    colors = plt.cm.viridis(intensities / (intensities.max() + 1e-8))[:, :3]
    pcd.colors = o3d.utility.Vector3dVector(colors)

    # Also create matplotlib visualization
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Range image
    axes[0].imshow(depth_2d.numpy(), cmap='turbo', vmin=0, vmax=max_range)
    axes[0].set_title('Range Image (depth)')
    axes[0].set_xlabel('Azimuth')
    axes[0].set_ylabel('Elevation')
    plt.colorbar(axes[0].images[0], ax=axes[0], label='Depth (m)')

    # Intensity image
    axes[1].imshow(intensity_2d.numpy(), cmap='gray')
    axes[1].set_title('Intensity Image')
    axes[1].set_xlabel('Azimuth')
    axes[1].set_ylabel('Elevation')
    plt.colorbar(axes[1].images[0], ax=axes[1], label='Intensity')

    # Bird's eye view
    bev_valid = points[:, 2] > -2.0  # Filter underground points
    axes[2].scatter(
        points[bev_valid, 0], points[bev_valid, 1],
        c=intensities[bev_valid], cmap='viridis', s=0.1, alpha=0.5
    )
    axes[2].set_title("Bird's Eye View")
    axes[2].set_xlabel('X (m)')
    axes[2].set_ylabel('Y (m)')
    axes[2].set_aspect('equal')
    axes[2].set_xlim(-40, 40)
    axes[2].set_ylim(-40, 40)

    plt.tight_layout()
    plt.show()

    return pcd
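To tie Steps 1 and 2 together, a minimal end-to-end sketch using the synthetic scene from Step 1 (the file path, sensor height, and the axis-aligned inverse covariances built from the stored scales are assumptions):

import torch

scene = torch.load("data/synthetic_scene.pt")
cov_inv = torch.diag_embed(1.0 / (scene['scales'] ** 2 + 1e-8))   # (M, 3, 3), axis-aligned

elevation, azimuth = create_lidar_beam_pattern(n_beams=32, n_azimuth=256)
el = elevation.unsqueeze(1).expand(-1, len(azimuth))
az = azimuth.unsqueeze(0).expand(len(elevation), -1)
directions = torch.stack([torch.cos(el) * torch.cos(az),
                          torch.cos(el) * torch.sin(az),
                          torch.sin(el)], dim=-1).reshape(-1, 3)
origins = torch.tensor([0.0, 0.0, 1.8]).expand(directions.shape[0], 3)  # sensor ~1.8 m above ground

out = render_lidar(origins, directions, scene['means'], cov_inv,
                   scene['opacities'], scene['reflectances'], scene['normals'],
                   n_nearest=256)
visualize_lidar_rendering(out['depth'], out['intensity'], elevation, azimuth)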

Exercise 1: Modify the beam pattern to simulate a Velodyne VLP-32C (32 beams, non-uniform spacing concentrated near the horizon, 360-degree azimuth with 0.2-degree resolution). How does reducing beam count from 64 to 32 affect the rendered point cloud density?

Exercise 2: Add Gaussian noise to the rendered depth values (sigma = 0.02m) and random dropout (2% of points). How does this compare visually to the clean rendering?

Step 3: Multi-Sensor Joint Rendering (Notebook 02, ~4 hours)

Goal: Build a unified scene representation that renders both camera RGB images and lidar point clouds, with proper coordinate frame handling.

Unified Gaussian scene representation:

class MultiSensorGaussianScene(nn.Module):
    """
    A 3D Gaussian Splatting scene with attributes for both
    camera (color) and lidar (reflectance, normal) rendering.
    """

    def __init__(self, n_gaussians: int):
        super().__init__()
        self.n_gaussians = n_gaussians

        # Shared geometry
        self.means = nn.Parameter(torch.randn(n_gaussians, 3))
        self.scales_raw = nn.Parameter(torch.zeros(n_gaussians, 3))  # log-scale
        self.rotations = nn.Parameter(torch.zeros(n_gaussians, 4))   # quaternions
        self.rotations.data[:, 0] = 1.0  # Initialize as identity
        self.opacities_raw = nn.Parameter(torch.zeros(n_gaussians))  # logit-space

        # Camera-specific: spherical harmonics for view-dependent color
        self.sh_coeffs = nn.Parameter(torch.zeros(n_gaussians, 3, 16))  # SH degree 3

        # Lidar-specific
        self.reflectances_raw = nn.Parameter(torch.zeros(n_gaussians))  # logit-space
        self.normals_raw = nn.Parameter(torch.randn(n_gaussians, 3))

        # Raydrop predictor
        self.raydrop_net = RaydropPredictor(feature_dim=8, hidden_dim=64)

    @property
    def scales(self) -> torch.Tensor:
        return torch.exp(self.scales_raw)

    @property
    def opacities(self) -> torch.Tensor:
        return torch.sigmoid(self.opacities_raw)

    @property
    def reflectances(self) -> torch.Tensor:
        return torch.sigmoid(self.reflectances_raw)

    @property
    def normals(self) -> torch.Tensor:
        return nn.functional.normalize(self.normals_raw, dim=-1)

    def get_covariance_matrices(self) -> torch.Tensor:
        """
        Build 3x3 covariance matrices from quaternion rotations and scales.

        Sigma = R @ S @ S^T @ R^T

        Returns:
            covariances: (N, 3, 3)
        """
        R = quaternion_to_rotation_matrix(self.rotations)  # (N, 3, 3)
        S = torch.diag_embed(self.scales)                  # (N, 3, 3)
        return R @ S @ S.transpose(-1, -2) @ R.transpose(-1, -2)

    def get_covariance_inv(self) -> torch.Tensor:
        """Inverse covariance matrices."""
        R = quaternion_to_rotation_matrix(self.rotations)
        S_inv = torch.diag_embed(1.0 / (self.scales + 1e-8))
        return R @ S_inv @ S_inv.transpose(-1, -2) @ R.transpose(-1, -2)

    def render_camera(
        self,
        camera_calib: 'SensorCalibration',
        ego_pose: torch.Tensor,  # (4, 4)
    ) -> torch.Tensor:
        """
        Render an RGB image using standard 3DGS rasterization.

        This is a simplified version. In production, use the cuda-based
        differentiable rasterizer from the original 3DGS codebase.

        Returns:
            image: (3, H, W) rendered RGB image
        """
        H, W = camera_calib.image_size

        # Transform Gaussians to camera frame
        points_cam = camera_calib.world_to_sensor(self.means, ego_pose)

        # Project to pixels
        pixels, valid = camera_calib.project_to_image(points_cam)

        # For each valid Gaussian, compute its 2D projected covariance
        # and splat its color onto the image
        # (Simplified: use point splatting instead of full Gaussian splatting)

        image = torch.zeros(3, H, W, device=self.means.device)
        depth_buffer = torch.full((H, W), float('inf'), device=self.means.device)

        valid_indices = torch.where(valid)[0]
        if len(valid_indices) == 0:
            return image

        # Sort back-to-front (painter's algorithm) so nearer splats end up on top
        depths = points_cam[valid_indices, 2]
        sort_idx = torch.argsort(depths, descending=True)
        valid_indices = valid_indices[sort_idx]

        for idx in valid_indices:
            u, v = int(pixels[idx, 0].item()), int(pixels[idx, 1].item())
            if 0 <= u < W and 0 <= v < H:
                # Evaluate SH at viewing direction for color
                color = self._eval_sh(idx, -points_cam[idx] / (points_cam[idx].norm() + 1e-8))
                alpha = self.opacities[idx]

                # Simple alpha blending (not full Gaussian splatting)
                image[:, v, u] = image[:, v, u] * (1 - alpha) + color * alpha

        return image

    def render_lidar(
        self,
        lidar_calib: 'SensorCalibration',
        ego_pose: torch.Tensor,
        elevation_angles: torch.Tensor,
        azimuth_angles: torch.Tensor,
        max_range: float = 75.0,
    ) -> dict[str, torch.Tensor]:
        """
        Render lidar depth, intensity, and raydrop prediction.

        Returns:
            dict with 'depth', 'intensity', 'raydrop_prob', 'alpha'
        """
        # Generate rays in world frame
        origins, directions = lidar_calib.generate_lidar_rays(
            elevation_angles, azimuth_angles, ego_pose
        )

        # Cast rays through Gaussian scene
        result = render_lidar(
            ray_origins=origins,
            ray_directions=directions,
            means=self.means,
            covariances_inv=self.get_covariance_inv(),
            opacities=self.opacities,
            reflectances=self.reflectances,
            normals=self.normals,
            max_range=max_range,
        )

        # Predict raydrop
        H = len(elevation_angles)
        W = len(azimuth_angles)
        ray_el = elevation_angles.unsqueeze(1).expand(H, W).reshape(-1)

        raydrop_prob = self.raydrop_net(
            rendered_depth=result['depth'],
            rendered_intensity=result['intensity'],
            accumulated_alpha=result['alpha'],
            ray_elevation=ray_el,
        )

        result['raydrop_prob'] = raydrop_prob
        return result

    def _eval_sh(self, idx: int, view_dir: torch.Tensor) -> torch.Tensor:
        """Evaluate spherical harmonics for Gaussian idx at view direction."""
        # Simplified: return first SH coefficient (DC term = base color)
        return torch.sigmoid(self.sh_coeffs[idx, :, 0])


def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
    """
    Convert quaternions to rotation matrices.

    Args:
        q: (N, 4) quaternions [w, x, y, z]

    Returns:
        R: (N, 3, 3) rotation matrices
    """
    q = nn.functional.normalize(q, dim=-1)
    w, x, y, z = q[:, 0], q[:, 1], q[:, 2], q[:, 3]

    R = torch.stack([
        1 - 2*y*y - 2*z*z,  2*x*y - 2*w*z,      2*x*z + 2*w*y,
        2*x*y + 2*w*z,      1 - 2*x*x - 2*z*z,  2*y*z - 2*w*x,
        2*x*z - 2*w*y,      2*y*z + 2*w*x,      1 - 2*x*x - 2*y*y,
    ], dim=-1).reshape(-1, 3, 3)

    return R
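A small sanity check on the conversion (a sketch): random quaternions should map to orthonormal matrices with determinant +1.

import torch

q = torch.randn(8, 4)
R = quaternion_to_rotation_matrix(q)
print(torch.allclose(R @ R.transpose(-1, -2), torch.eye(3).expand(8, 3, 3), atol=1e-5))
print(torch.allclose(torch.linalg.det(R), torch.ones(8), atol=1e-4))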

Side-by-side rendering demonstration:

def demonstrate_joint_rendering(scene, camera_calib, lidar_calib, ego_pose):
    """Render both camera and lidar from the same scene and display side by side."""

    # Camera rendering
    camera_image = scene.render_camera(camera_calib, ego_pose)

    # Lidar rendering
    elevation_angles, azimuth_angles = create_lidar_beam_pattern(
        n_beams=64, n_azimuth=1024
    )
    lidar_result = scene.render_lidar(
        lidar_calib, ego_pose, elevation_angles, azimuth_angles
    )

    # Visualize
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))

    # Camera image
    cam_img = camera_image.detach().permute(1, 2, 0).clamp(0, 1).numpy()
    axes[0, 0].imshow(cam_img)
    axes[0, 0].set_title('Camera RGB Rendering')
    axes[0, 0].axis('off')

    # Lidar range image
    H, W = len(elevation_angles), len(azimuth_angles)
    depth_img = lidar_result['depth'].detach().reshape(H, W).numpy()
    im1 = axes[0, 1].imshow(depth_img, cmap='turbo', vmin=0, vmax=75)
    axes[0, 1].set_title('Lidar Range Image')
    plt.colorbar(im1, ax=axes[0, 1])

    # Lidar intensity image
    int_img = lidar_result['intensity'].detach().reshape(H, W).numpy()
    im2 = axes[1, 0].imshow(int_img, cmap='gray')
    axes[1, 0].set_title('Lidar Intensity Image')
    plt.colorbar(im2, ax=axes[1, 0])

    # Raydrop probability
    drop_img = lidar_result['raydrop_prob'].detach().reshape(H, W).numpy()
    im3 = axes[1, 1].imshow(drop_img, cmap='RdYlGn', vmin=0, vmax=1)
    axes[1, 1].set_title('Raydrop Probability (green=valid)')
    plt.colorbar(im3, ax=axes[1, 1])

    plt.suptitle('Multi-Sensor Neural Rendering -- Joint Output', fontsize=14)
    plt.tight_layout()
    plt.show()

Projecting lidar onto camera for consistency check:

def project_lidar_onto_camera(
    lidar_depth: torch.Tensor,
    elevation_angles: torch.Tensor,
    azimuth_angles: torch.Tensor,
    lidar_calib: 'SensorCalibration',
    camera_calib: 'SensorCalibration',
    ego_pose: torch.Tensor,
    camera_image: torch.Tensor,  # (3, H, W)
):
    """
    Project lidar points onto the camera image to verify calibration consistency.
    Overlays colored depth points on the camera rendering.
    """
    H_l = len(elevation_angles)
    W_l = len(azimuth_angles)

    # Reconstruct 3D points from lidar depth
    el = elevation_angles.unsqueeze(1).expand(H_l, W_l)
    az = azimuth_angles.unsqueeze(0).expand(H_l, W_l)
    depth_2d = lidar_depth.reshape(H_l, W_l)

    x = depth_2d * torch.cos(el) * torch.cos(az)
    y = depth_2d * torch.cos(el) * torch.sin(az)
    z = depth_2d * torch.sin(el)

    points_lidar = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
    depths = lidar_depth.reshape(-1)

    # Transform: lidar frame -> ego frame -> camera frame
    # (both sensors share the same ego pose, so the world frame cancels out)
    T_ego_lidar = lidar_calib.T_ego_sensor
    T_cam_ego = camera_calib.T_sensor_ego

    ones = torch.ones(points_lidar.shape[0], 1)
    p_h = torch.cat([points_lidar, ones], dim=1)

    # lidar -> ego -> camera
    p_ego = (T_ego_lidar @ p_h.T).T[:, :3]
    p_cam = (T_cam_ego @ torch.cat([p_ego, ones], dim=1).T).T[:, :3]

    # Project to image
    pixels, valid = camera_calib.project_to_image(p_cam)

    # Overlay on camera image
    fig, ax = plt.subplots(1, 1, figsize=(12, 8))
    cam_img = camera_image.detach().permute(1, 2, 0).clamp(0, 1).numpy()
    ax.imshow(cam_img)

    valid_pixels = pixels[valid].detach().numpy()
    valid_depths = depths[valid].detach().numpy()

    scatter = ax.scatter(
        valid_pixels[:, 0], valid_pixels[:, 1],
        c=valid_depths, cmap='turbo', s=1, alpha=0.7,
        vmin=0, vmax=75,
    )
    plt.colorbar(scatter, ax=ax, label='Depth (m)')
    ax.set_title('Lidar Points Projected onto Camera Image')
    ax.axis('off')
    plt.tight_layout()
    plt.show()

Exercise 3: Create a scene with a retroreflective traffic sign (reflectance = 0.95) and a matte black vehicle (reflectance = 0.15). Render lidar from the same viewpoint and verify that the intensity values differ by the expected ratio. Overlay both the camera image and lidar points to confirm geometric alignment.

Exercise 4: Rotate the ego vehicle by 45 degrees and re-render both sensors. Verify that the same scene objects appear in consistent positions across both modalities.

Step 4: Joint Training and Evaluation (Notebook 03, ~4 hours)

Goal: Train the multi-sensor Gaussian scene from paired camera images and lidar sweeps, evaluate quality with modality-specific metrics, and demonstrate novel view/sweep synthesis.

Training loop:

def train_multi_sensor(
    scene: MultiSensorGaussianScene,
    camera_data: list[dict],     # [{'image': Tensor, 'calib': SensorCalibration, 'ego_pose': Tensor}, ...]
    lidar_data: list[dict],      # [{'depth': Tensor, 'intensity': Tensor, 'valid': Tensor, 'calib': ..., 'ego_pose': ...}, ...]
    n_iterations: int = 5000,
    lr: float = 1e-3,
    lr_means: float = 1e-4,
) -> dict:
    """
    Joint training loop for multi-sensor Gaussian scene.

    Returns:
        Training history dict with per-iteration losses.
    """
    # Separate learning rates for different parameter groups
    optimizer = torch.optim.Adam([
        {'params': [scene.means], 'lr': lr_means},
        {'params': [scene.scales_raw, scene.rotations, scene.opacities_raw], 'lr': lr},
        {'params': [scene.sh_coeffs], 'lr': lr},
        {'params': [scene.reflectances_raw, scene.normals_raw], 'lr': lr * 0.5},
        {'params': scene.raydrop_net.parameters(), 'lr': lr * 0.1},
    ])

    loss_fn = MultiSensorLoss(
        lambda_rgb=1.0,
        lambda_depth=0.1,
        lambda_intensity=0.05,
        lambda_raydrop=0.01,
    )

    # Loss weight schedule: ramp the lidar loss terms up from zero
    def get_lidar_weight_multiplier(iteration: int) -> float:
        """Ramp up lidar loss contribution over first 1000 iterations."""
        return min(1.0, iteration / 1000.0)

    elevation_angles, azimuth_angles = create_lidar_beam_pattern(n_beams=32, n_azimuth=512)

    history = {'total': [], 'camera': [], 'depth': [], 'intensity': [], 'raydrop': []}

    for it in tqdm(range(n_iterations), desc="Training"):
        optimizer.zero_grad()

        # Sample a random training frame
        cam_idx = np.random.randint(len(camera_data))
        lidar_idx = np.random.randint(len(lidar_data))

        cam = camera_data[cam_idx]
        lid = lidar_data[lidar_idx]

        # Render camera
        rendered_image = scene.render_camera(cam['calib'], cam['ego_pose'])

        # Render lidar
        lidar_result = scene.render_lidar(
            lid['calib'], lid['ego_pose'],
            elevation_angles, azimuth_angles,
        )

        # Compute losses
        lidar_mult = get_lidar_weight_multiplier(it)

        losses = loss_fn(
            camera_rendered=rendered_image.unsqueeze(0),
            camera_target=cam['image'].unsqueeze(0),
            lidar_depth_rendered=lidar_result['depth'],
            lidar_depth_target=lid['depth'],
            lidar_intensity_rendered=lidar_result['intensity'],
            lidar_intensity_target=lid['intensity'],
            raydrop_pred=lidar_result['raydrop_prob'],
            raydrop_target=lid['valid'].float(),
            valid_mask=lid['valid'],
        )

        # Apply lidar weight ramp
        total = (
            losses['camera'] +
            lidar_mult * loss_fn.lambda_depth * losses['depth'] +
            lidar_mult * loss_fn.lambda_intensity * losses['intensity'] +
            lidar_mult * loss_fn.lambda_raydrop * losses['raydrop']
        )

        total.backward()

        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(scene.parameters(), max_norm=1.0)

        optimizer.step()

        # Log
        for key in history:
            history[key].append(losses[key].item() if key != 'total' else total.item())

        if (it + 1) % 500 == 0:
            print(f"[{it+1:>5d}] total={total.item():.4f}  "
                  f"cam={losses['camera'].item():.4f}  "
                  f"depth={losses['depth'].item():.4f}  "
                  f"intensity={losses['intensity'].item():.4f}  "
                  f"raydrop={losses['raydrop'].item():.4f}")

    return history
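
A minimal usage sketch, assuming camera_frames and lidar_frames are the paired frame lists assembled earlier and numpy is imported as np (names and the constructor arguments are assumptions, not part of the reference implementation):

scene = MultiSensorGaussianScene(n_gaussians=50_000)  # constructor signature is an assumption
history = train_multi_sensor(scene, camera_frames, lidar_frames, n_iterations=5000)

# Sanity check: the total loss should drop substantially from its starting level.
start = float(np.mean(history['total'][:100]))
end = float(np.mean(history['total'][-100:]))
print(f"total loss {start:.4f} -> {end:.4f} ({100 * (1 - end / start):.1f}% reduction)")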

Evaluation metrics:

def evaluate_camera_metrics(
    scene: MultiSensorGaussianScene,
    test_frames: list[dict],
) -> dict:
    """
    Evaluate camera rendering quality on held-out test views.

    Returns:
        dict with mean PSNR, SSIM, LPIPS across test set.
    """
    import lpips as lpips_module
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    lpips_fn = lpips_module.LPIPS(net='vgg')

    psnrs, ssims, lpips_vals = [], [], []

    for frame in test_frames:
        with torch.no_grad():
            rendered = scene.render_camera(frame['calib'], frame['ego_pose'])

        # Convert to numpy HWC for skimage metrics
        rendered_np = rendered.detach().permute(1, 2, 0).clamp(0, 1).numpy()
        target_np = frame['image'].permute(1, 2, 0).numpy()

        psnr = peak_signal_noise_ratio(target_np, rendered_np, data_range=1.0)
        ssim = structural_similarity(target_np, rendered_np, channel_axis=2, data_range=1.0)

        # LPIPS
        r_tensor = rendered.unsqueeze(0) * 2 - 1  # [-1, 1]
        t_tensor = frame['image'].unsqueeze(0) * 2 - 1
        with torch.no_grad():
            lpips_val = lpips_fn(r_tensor, t_tensor).item()

        psnrs.append(psnr)
        ssims.append(ssim)
        lpips_vals.append(lpips_val)

    return {
        'psnr_mean': np.mean(psnrs),
        'psnr_std': np.std(psnrs),
        'ssim_mean': np.mean(ssims),
        'ssim_std': np.std(ssims),
        'lpips_mean': np.mean(lpips_vals),
        'lpips_std': np.std(lpips_vals),
    }


def evaluate_lidar_metrics(
    scene: MultiSensorGaussianScene,
    test_frames: list[dict],
    elevation_angles: torch.Tensor,
    azimuth_angles: torch.Tensor,
) -> dict:
    """
    Evaluate lidar rendering quality on held-out test sweeps.

    Returns:
        dict with depth MAE, intensity MAE, raydrop F1 score.
    """
    depth_errors = []
    intensity_errors = []
    raydrop_tp, raydrop_fp, raydrop_fn = 0, 0, 0

    for frame in test_frames:
        with torch.no_grad():
            result = scene.render_lidar(
                frame['calib'], frame['ego_pose'],
                elevation_angles, azimuth_angles,
            )

        valid = frame['valid']

        # Depth MAE (only on valid rays)
        if valid.sum() > 0:
            depth_mae = torch.abs(
                result['depth'][valid] - frame['depth'][valid]
            ).mean().item()
            depth_errors.append(depth_mae)

            # Intensity MAE
            int_mae = torch.abs(
                result['intensity'][valid] - frame['intensity'][valid]
            ).mean().item()
            intensity_errors.append(int_mae)

        # Raydrop classification metrics.
        # 'raydrop_prob' is trained against the valid-return mask, so thresholding
        # at 0.5 gives the predicted "ray returns" mask.
        pred_valid = result['raydrop_prob'] > 0.5
        gt_valid = frame['valid']

        raydrop_tp += (pred_valid & gt_valid).sum().item()
        raydrop_fp += (pred_valid & ~gt_valid).sum().item()
        raydrop_fn += (~pred_valid & gt_valid).sum().item()

    # F1 score
    precision = raydrop_tp / (raydrop_tp + raydrop_fp + 1e-8)
    recall = raydrop_tp / (raydrop_tp + raydrop_fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)

    return {
        'depth_mae_mean': np.mean(depth_errors),
        'depth_mae_std': np.std(depth_errors),
        'intensity_mae_mean': np.mean(intensity_errors),
        'intensity_mae_std': np.std(intensity_errors),
        'raydrop_precision': precision,
        'raydrop_recall': recall,
        'raydrop_f1': f1,
    }
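
A small evaluation driver tying the two metric functions together (test_camera_frames and test_lidar_frames are placeholder names for held-out frames in the same dict format as the training data):

elevation_angles, azimuth_angles = create_lidar_beam_pattern(n_beams=32, n_azimuth=512)

cam_metrics = evaluate_camera_metrics(scene, test_camera_frames)
lidar_metrics = evaluate_lidar_metrics(scene, test_lidar_frames, elevation_angles, azimuth_angles)

print(f"Camera | PSNR {cam_metrics['psnr_mean']:.2f} ± {cam_metrics['psnr_std']:.2f}  "
      f"SSIM {cam_metrics['ssim_mean']:.3f}  LPIPS {cam_metrics['lpips_mean']:.3f}")
print(f"Lidar  | depth MAE {lidar_metrics['depth_mae_mean']:.3f} m  "
      f"intensity MAE {lidar_metrics['intensity_mae_mean']:.3f}  "
      f"raydrop F1 {lidar_metrics['raydrop_f1']:.3f}")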

Visualization of training progress:

def plot_training_curves(history: dict):
    """Plot loss curves for all components."""
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    iterations = range(len(history['total']))

    # Total loss: raw curve plus a 50-iteration moving average
    axes[0, 0].plot(iterations, history['total'], 'b-', alpha=0.3)
    smoothed = np.convolve(history['total'], np.ones(50) / 50, mode='valid')
    axes[0, 0].plot(
        np.arange(len(smoothed)) + 49,  # 'valid' convolution shortens the series by 49 samples
        smoothed,
        'b-', linewidth=2,
    )
    axes[0, 0].set_title('Total Loss')
    axes[0, 0].set_xlabel('Iteration')
    axes[0, 0].set_yscale('log')
    axes[0, 0].grid(True, alpha=0.3)

    # Camera loss
    axes[0, 1].plot(iterations, history['camera'], 'g-', alpha=0.3)
    axes[0, 1].set_title('Camera Loss (L1 + SSIM)')
    axes[0, 1].set_xlabel('Iteration')
    axes[0, 1].grid(True, alpha=0.3)

    # Depth loss
    axes[0, 2].plot(iterations, history['depth'], 'r-', alpha=0.3)
    axes[0, 2].set_title('Lidar Depth Loss')
    axes[0, 2].set_xlabel('Iteration')
    axes[0, 2].grid(True, alpha=0.3)

    # Intensity loss
    axes[1, 0].plot(iterations, history['intensity'], 'm-', alpha=0.3)
    axes[1, 0].set_title('Lidar Intensity Loss')
    axes[1, 0].set_xlabel('Iteration')
    axes[1, 0].grid(True, alpha=0.3)

    # Raydrop loss
    axes[1, 1].plot(iterations, history['raydrop'], 'c-', alpha=0.3)
    axes[1, 1].set_title('Raydrop BCE Loss')
    axes[1, 1].set_xlabel('Iteration')
    axes[1, 1].grid(True, alpha=0.3)

    # Leave last subplot for metrics summary
    axes[1, 2].axis('off')

    plt.suptitle('Multi-Sensor Training Progress', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

Novel view and novel sweep synthesis:

def synthesize_novel_views(
    scene: MultiSensorGaussianScene,
    camera_calib: 'SensorCalibration',
    lidar_calib: 'SensorCalibration',
    base_ego_pose: torch.Tensor,
    n_views: int = 8,
    orbit_radius: float = 5.0,
):
    """
    Generate camera images and lidar sweeps from novel viewpoints
    by orbiting around the base ego pose.
    """
    elevation_angles, azimuth_angles = create_lidar_beam_pattern(n_beams=32, n_azimuth=512)

    angles = torch.linspace(0, 2 * np.pi, n_views + 1)[:-1]

    fig, axes = plt.subplots(2, n_views, figsize=(4 * n_views, 8))

    for i, angle in enumerate(angles):
        # Create novel ego pose (orbit around scene center)
        offset = torch.stack([
            orbit_radius * torch.cos(angle),
            orbit_radius * torch.sin(angle),
            torch.tensor(0.0),
        ])
        novel_pose = base_ego_pose.clone()
        novel_pose[:3, 3] += offset

        with torch.no_grad():
            # Camera
            cam_img = scene.render_camera(camera_calib, novel_pose)
            cam_np = cam_img.permute(1, 2, 0).clamp(0, 1).numpy()

            # Lidar
            lidar_out = scene.render_lidar(
                lidar_calib, novel_pose,
                elevation_angles, azimuth_angles,
            )

        axes[0, i].imshow(cam_np)
        axes[0, i].set_title(f'Camera {i+1}')
        axes[0, i].axis('off')

        H = len(elevation_angles)
        W = len(azimuth_angles)
        depth_img = lidar_out['depth'].reshape(H, W).numpy()
        axes[1, i].imshow(depth_img, cmap='turbo', vmin=0, vmax=50)
        axes[1, i].set_title(f'Lidar {i+1}')
        axes[1, i].axis('off')

    plt.suptitle('Novel View Synthesis -- Camera + Lidar', fontsize=14)
    plt.tight_layout()
    plt.show()

Exercise 5: Train the model for 5000 iterations and plot the loss curves. At which iteration does each loss component converge? Does the lidar loss ramp-up strategy help compared to constant weighting?

Exercise 6: Compare the evaluation metrics when training with camera-only loss vs. joint camera+lidar loss. Does adding lidar supervision improve the camera metrics (PSNR, SSIM)? Does adding camera supervision improve the lidar metrics (depth MAE)?

Exercise 7: Generate 8 novel views at increasing distances from the training viewpoints. How do the metrics degrade as you move further from the training distribution? Plot PSNR and depth MAE as a function of distance from the nearest training view.
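
For Exercise 7, a convenient x-axis is the distance from each novel sensor origin to the nearest training-pose origin. A minimal helper sketch (train_ego_poses is a placeholder for the list of 4x4 training ego poses):

def distance_to_nearest_training_view(
    novel_pose: torch.Tensor,       # (4, 4) novel ego pose
    train_ego_poses: list,          # list of (4, 4) training ego poses
) -> float:
    """Euclidean distance from the novel pose origin to the closest training pose origin."""
    novel_origin = novel_pose[:3, 3]
    train_origins = torch.stack([p[:3, 3] for p in train_ego_poses])  # (N, 3)
    return torch.linalg.norm(train_origins - novel_origin, dim=1).min().item()

# Pair each novel view's PSNR / depth MAE with this distance, then plot metric vs. distance.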


Notebook Exercises

| # | Notebook | Focus | Time |
|---|----------|-------|------|
| 1 | 01_lidar_ray_casting.ipynb | Lidar beam pattern; ray-Gaussian intersection math; depth rendering via alpha compositing; intensity from normals/reflectance; raydrop modeling; point cloud visualization. | 90 min |
| 2 | 02_multi_sensor_rendering.ipynb | Unified scene class; camera RGB rendering from SH; lidar rendering (depth, intensity, raydrop); sensor calibration and coordinate transforms; consistency verification via lidar-camera projection. | 90 min |
| 3 | 03_joint_training.ipynb | Multi-sensor loss function; joint training loop with loss ramp; camera metrics (PSNR/SSIM/LPIPS); lidar metrics (depth MAE, intensity MAE, raydrop F1); novel view + novel sweep synthesis. | 90 min |

Expected Deliverables

  1. Multi-sensor Gaussian scene module: A well-structured PyTorch nn.Module containing the shared Gaussian representation with both camera (SH coefficients) and lidar (reflectance, normals) attributes, plus the raydrop predictor network.

  2. Lidar rendering pipeline: A differentiable lidar renderer that casts rays through Gaussian scenes and produces depth, intensity, and accumulated alpha outputs with proper gradients for training.

  3. Joint training script: A training loop that optimizes the shared scene from both camera images and lidar sweeps simultaneously, with configurable loss weights and a lidar loss ramp-up schedule.

  4. Evaluation pipeline: Functions to compute camera metrics (PSNR, SSIM, LPIPS) and lidar metrics (depth MAE, intensity MAE, raydrop precision/recall/F1) on held-out test data.

  5. Novel sensor synthesis demonstration: Rendered camera images and lidar sweeps from viewpoints not in the training set, with qualitative and quantitative analysis of extrapolation quality.

  6. Completed Jupyter notebooks (3 notebooks): All code cells executed, exercises completed with inline commentary, and visualizations showing the progression from individual components to the full joint system.


Evaluation Criteria

| Criteria | Weight | Description |
|----------|--------|-------------|
| Correctness | 30% | Ray-Gaussian intersection math is correct. Alpha compositing produces valid depths. Intensity model accounts for incidence angle and reflectance. Coordinate transforms are invertible and consistent. |
| Code Quality | 20% | Clean, modular PyTorch code. Functions have docstrings and type hints. Scene module follows nn.Module conventions. No code duplication between camera and lidar paths. |
| Analysis Depth | 25% | Written analysis demonstrates understanding of the camera-lidar complementarity. Includes comparison of joint vs. single-modality training. Discusses failure modes (e.g., thin structures, glass, vegetation). |
| Visualization | 15% | Range images and point clouds are correctly rendered. Camera-lidar projection overlay demonstrates calibration consistency. Training curves are smooth and show expected convergence patterns. |
| Documentation | 10% | README with clear installation and usage instructions. Notebooks include markdown cells explaining each step. Exercise solutions include written interpretation of results. |

Related Reading

  • Neural Rendering for AD Simulation: Covers the foundational neural rendering techniques (NeRF, 3DGS) and their extensions for multi-sensor AD simulation. The "Multi-Sensor Extensions" section directly informs this project's architecture.

  • Sensor Simulation for AD: Deep dive into lidar physics, intensity modeling, raydrop patterns, and the role of neural methods in achieving sensor-realistic simulation. Essential reading for understanding why each component of the lidar pipeline matters.


Next Steps

After completing this project, consider these natural follow-ups:

  1. Dynamic Scene Decomposition (Track A, Advanced): Extend the multi-sensor renderer to handle dynamic objects -- separate Gaussians for static background and moving actors, with per-actor rigid-body transforms over time.

  2. Physics-Based Sensor Simulator (Track B, Advanced): Move from learned sensor models to physics-based ones -- implement a full lidar sensor simulator with beam divergence, atmospheric scattering, and material BRDF models.

  3. 3DGS Scene Reconstruction (Track A, Intermediate): To deepen your understanding of the underlying 3DGS training pipeline, this project covers the camera-only case with the full densification and pruning strategies.