Multi-Sensor Neural Renderer
Build a unified 3D Gaussian Splatting scene representation that renders both photorealistic camera images and physically plausible lidar point clouds -- enabling multi-modal sensor simulation from a single learned model.
Track: A -- Neural Simulation | Level: Advanced | Total Time: ~30-40 hours
Overview
Modern autonomous driving systems consume data from multiple sensor modalities simultaneously -- typically 6-10 cameras for visual perception and 1-5 lidar sensors for 3D geometry. When building neural simulation pipelines, the standard approach has been to train separate models for each modality: one neural radiance field or Gaussian Splatting model for camera images, and a separate point cloud generation model for lidar. This is wasteful for two reasons. First, cameras and lidar observe the same physical scene -- surfaces that reflect photons also reflect laser pulses. Second, maintaining two models means two separate training pipelines, two sets of hyperparameters, and no guarantee that the rendered camera image and lidar sweep are geometrically consistent with each other.
This project addresses that gap by building a multi-sensor neural renderer on top of 3D Gaussian Splatting (3DGS). You will extend the standard 3DGS representation -- which already stores position, covariance, opacity, and spherical harmonics coefficients for each Gaussian -- with additional per-Gaussian attributes for lidar simulation: surface normal direction, material reflectance (for lidar intensity), and a learned raydrop probability. The same set of Gaussians is used to render camera RGB images via the standard differentiable rasterization pipeline and to produce lidar depth, intensity, and raydrop predictions via a differentiable ray-casting formulation. Both modalities share the scene geometry, ensuring spatial consistency, while sensor-specific parameters allow each output to match its respective ground-truth distribution.
By the end of this project you will have a working PyTorch implementation that takes a set of posed camera images and lidar sweeps as input, jointly optimizes a shared Gaussian scene representation, and renders novel camera views and novel lidar sweeps from arbitrary sensor poses. This is the same architectural pattern used in state-of-the-art multi-sensor neural simulators like UniSim, NeuRAD, and LiDAR-GS, and it is the foundation for production sensor simulation at companies like Waymo, NVIDIA, and Applied Intuition.
Learning Objectives
After completing this project, you will be able to:
- Formulate lidar ray casting through a Gaussian scene -- derive the intersection equations, accumulate depth along rays using alpha compositing, and compute per-point intensity from surface normals and material reflectance.
- Implement a differentiable lidar rendering pipeline that produces depth maps, intensity maps, and raydrop masks from a 3DGS representation, with gradients flowing back to Gaussian parameters.
- Model lidar raydrop (miss) patterns using a learned network conditioned on ray geometry and accumulated features, capturing the physics of beam divergence, retroreflection, and surface absorption.
- Unify camera and lidar rendering from a shared scene representation, handling the different coordinate systems, projection models, and output formats of each sensor.
- Implement multi-sensor calibration transforms -- convert between camera, lidar, and ego-vehicle coordinate frames using extrinsic rotation/translation matrices.
- Design a joint training objective that balances camera losses (L1 + SSIM + LPIPS) with lidar losses (depth MAE + intensity MSE + raydrop BCE), including loss weighting strategies.
- Evaluate multi-sensor reconstruction quality using both camera metrics (PSNR, SSIM, LPIPS) and lidar metrics (depth error, intensity error, raydrop F1 score, Chamfer distance).
- Synthesize novel sensor data -- render camera images and lidar sweeps from viewpoints not present in the training set, demonstrating the model's ability to generalize.
Prerequisites
Required
- 3D Gaussian Splatting fundamentals -- understanding of the core 3DGS representation (position, covariance, opacity, spherical harmonics), the differentiable rasterization pipeline, and the densification/pruning training strategy. If you have not implemented basic 3DGS before, start with the "3DGS Scene Reconstruction" project in this track.
- PyTorch proficiency -- comfortable writing custom nn.Module classes, autograd-compatible functions, custom training loops, and loss functions.
- 3D geometry -- rotation matrices, quaternions, homogeneous transforms, ray-plane intersection, coordinate frame conversions.
Recommended
- CUDA basics -- while the core implementation is in PyTorch (not raw CUDA), understanding GPU memory layout and parallel execution helps with performance reasoning.
- Lidar sensor physics -- basic understanding of time-of-flight measurement, beam divergence, retroreflection, and point cloud representation. The project teaches this, but prior exposure accelerates learning.
- Camera projection model -- pinhole camera intrinsics (focal length, principal point) and extrinsics (rotation, translation).
Deep Dive Reading
Before starting, read the companion deep dives for theoretical background:
- Neural Rendering for AD Simulation -- covers NeRF, 3DGS, NeuRAD, SplatAD, and multi-sensor rendering architectures. Read the "3D Gaussian Splatting" and "Multi-Sensor Extensions" sections carefully.
- Sensor Simulation for AD -- covers lidar physics, sensor modeling, intensity/raydrop simulation, and the role of neural methods in bridging the simulation gap.
Key Concepts
Lidar Ray Casting Fundamentals
A spinning lidar sensor emits laser pulses along a set of known directions and measures the time-of-flight for each returned pulse. For a sensor with H elevation channels and W azimuth steps, each frame produces up to H x W range measurements. In our renderer, we need to compute these range (depth) values by casting rays through the Gaussian scene.
Ray definition: Each lidar ray is defined by an origin o (the sensor position in world coordinates) and a direction d (a unit vector determined by the beam's azimuth and elevation angles):
azimuth_angles = linspace(0, 2*pi, W, endpoint=False) # horizontal scan (exclude duplicate 2*pi)
elevation_angles = [el_0, el_1, ..., el_{H-1}] # vertical beam pattern (non-uniform)
d_x = cos(elevation) * cos(azimuth)
d_y = cos(elevation) * sin(azimuth)
d_z = sin(elevation)
ray(t) = o + t * d, for t >= 0
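The azimuth/elevation-to-direction mapping above can be sanity-checked numerically -- every direction it produces should be a unit vector. A small sketch with hypothetical beam angles:

```python
import math
import torch

# Hypothetical beam grid: 4 elevation channels, 8 azimuth steps
elevation = torch.deg2rad(torch.tensor([-25.0, -10.0, 0.0, 3.0]))  # (H,)
azimuth = torch.linspace(0, 2 * math.pi, 9)[:-1]                   # (W,), drop duplicate 2*pi

el = elevation[:, None]  # (H, 1), broadcasts against azimuth (W,)
d = torch.stack([
    torch.cos(el) * torch.cos(azimuth),          # d_x
    torch.cos(el) * torch.sin(azimuth),          # d_y
    torch.sin(el).expand(-1, azimuth.shape[0]),  # d_z
], dim=-1)               # (H, W, 3)

# cos^2(el)cos^2(az) + cos^2(el)sin^2(az) + sin^2(el) = 1 for every ray
assert torch.allclose(d.norm(dim=-1), torch.ones(4, 8), atol=1e-6)
```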
Ray-Gaussian intersection: Each 3D Gaussian in the scene is parameterized by its mean mu (3D position), covariance Sigma (3x3 positive definite matrix, often stored as a rotation quaternion + 3D scale), opacity alpha, and additional attributes. To compute the contribution of Gaussian k to a ray, we evaluate the Gaussian density along the ray. The key insight is that a 3D Gaussian evaluated along a 1D ray produces a 1D Gaussian:
G_k(t) = alpha_k * exp(-0.5 * (o + t*d - mu_k)^T Sigma_k^{-1} (o + t*d - mu_k))
Let delta_k = mu_k - o be the vector from ray origin to Gaussian center. The depth of maximum contribution along the ray is:
t_k = (d^T Sigma_k^{-1} delta_k) / (d^T Sigma_k^{-1} d)
And the "power" (peak density) at that point determines how much this Gaussian contributes to the depth measurement. The effective 1D variance along the ray direction is:
sigma_ray_k^2 = 1 / (d^T Sigma_k^{-1} d)
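As a quick consistency check of these two formulas (not part of the project code): for an isotropic Gaussian with Sigma = sigma^2 * I, t_k reduces to the scalar projection of delta onto d, and sigma_ray reduces to sigma:

```python
import torch

sigma = 0.5
mu = torch.tensor([2.0, 1.0, 0.0])   # Gaussian mean
o = torch.zeros(3)                   # ray origin
d = torch.tensor([1.0, 0.0, 0.0])    # unit ray direction

Sigma_inv = torch.eye(3) / sigma**2
delta = mu - o

t_peak = (d @ Sigma_inv @ delta) / (d @ Sigma_inv @ d)
sigma_ray_sq = 1.0 / (d @ Sigma_inv @ d)

# Isotropic case: t_peak = dot(d, delta) and sigma_ray^2 = sigma^2
assert torch.isclose(t_peak, d @ delta)
assert torch.isclose(sigma_ray_sq, torch.tensor(sigma**2))
```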
Alpha compositing for depth: Just as camera rendering alpha-composites colors along viewing rays, lidar rendering alpha-composites depths along sensor rays. Sort the Gaussians by t_k (front to back) and accumulate:
T_k = prod_{j < k} (1 - w_j) # transmittance up to Gaussian k
w_k = alpha_k * G_ray_k # ray-space weight for Gaussian k
depth_rendered = sum_k (T_k * w_k * t_k)
where G_ray_k is the Gaussian evaluated along the ray at its peak. This produces a differentiable depth estimate for each lidar ray.
Implementation in PyTorch:
import torch
import torch.nn as nn
def compute_ray_gaussian_intersection(
ray_origins: torch.Tensor, # (N_rays, 3)
ray_directions: torch.Tensor, # (N_rays, 3)
means: torch.Tensor, # (N_gaussians, 3)
covariances_inv: torch.Tensor, # (N_gaussians, 3, 3)
opacities: torch.Tensor, # (N_gaussians,)
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Compute depth contributions of all Gaussians along all rays.
Returns:
depths: (N_rays, N_gaussians) -- depth of each Gaussian along each ray
weights: (N_rays, N_gaussians) -- alpha-compositing weight of each Gaussian
"""
# delta = mu - o: (N_rays, N_gaussians, 3)
delta = means.unsqueeze(0) - ray_origins.unsqueeze(1)
# d^T Sigma^{-1} delta and d^T Sigma^{-1} d for each ray-Gaussian pair
# Sigma_inv @ d: (N_gaussians, 3, 3) @ (N_rays, 1, 3, 1) -> needs broadcasting
d = ray_directions.unsqueeze(1) # (N_rays, 1, 3)
# For each Gaussian, compute Sigma_inv @ d
# d_Sinv: (N_rays, N_gaussians, 3)
d_Sinv = torch.einsum('gij,rgj->rgi', covariances_inv, d.expand(-1, means.shape[0], -1))
# t_peak = (d^T Sigma^{-1} delta) / (d^T Sigma^{-1} d)
numerator = torch.sum(d_Sinv * delta, dim=-1) # (N_rays, N_gaussians)
denominator = torch.sum(d_Sinv * d.expand(-1, means.shape[0], -1), dim=-1) # (N_rays, N_gaussians)
t_peak = numerator / (denominator + 1e-8) # (N_rays, N_gaussians)
# Compute ray-space variance: sigma_ray^2 = 1 / (d^T Sigma^{-1} d)
sigma_ray_sq = 1.0 / (denominator + 1e-8)
    # Peak Gaussian density along the ray: completing the square in t gives
    # G_k(t_peak) = exp(-0.5 * (delta^T Sigma^{-1} delta - (d^T Sigma^{-1} delta)^2 / (d^T Sigma^{-1} d)))
    delta_Sinv_delta = torch.einsum('rgi,gij,rgj->rg', delta, covariances_inv, delta)
    power = opacities.unsqueeze(0) * torch.exp(
        -0.5 * (delta_Sinv_delta - numerator**2 / (denominator + 1e-8))
    )
# Mask out Gaussians behind the ray origin
power = power * (t_peak > 0).float()
return t_peak, power
def alpha_composite_depth(
depths: torch.Tensor, # (N_rays, N_gaussians)
weights: torch.Tensor, # (N_rays, N_gaussians)
) -> torch.Tensor:
"""
Alpha-composite depths front-to-back along each ray.
Returns:
rendered_depth: (N_rays,) -- expected depth for each ray
"""
# Sort by depth along each ray
sorted_indices = torch.argsort(depths, dim=1)
sorted_depths = torch.gather(depths, 1, sorted_indices)
sorted_weights = torch.gather(weights, 1, sorted_indices)
# Compute transmittance: T_k = prod_{j<k} (1 - w_j)
# Use cumprod of (1 - w) shifted by one position
one_minus_w = 1.0 - sorted_weights
# Exclusive cumprod: T_k = prod_{j=0}^{k-1} (1 - w_j)
transmittance = torch.cat([
torch.ones_like(sorted_weights[:, :1]),
torch.cumprod(one_minus_w[:, :-1], dim=1)
], dim=1)
# Final weight = T_k * w_k
final_weights = transmittance * sorted_weights
# Rendered depth = weighted sum
rendered_depth = torch.sum(final_weights * sorted_depths, dim=1)
return rendered_depth
Intensity Modeling
Lidar sensors do not just measure range -- they also record the intensity of the returned pulse. Intensity depends on several physical factors:
- Material reflectance (rho): Different surfaces reflect different fractions of the incident laser energy. Retroreflective materials (lane markings, traffic signs) return very high intensity; dark surfaces (asphalt, black cars) return low intensity.
- Incidence angle (theta): A surface struck at a steep angle reflects less energy back to the sensor than one struck head-on. The reflected intensity falls off as cos(theta), where theta is the angle between the incoming ray and the surface normal.
- Range attenuation: The returned signal strength follows an inverse-square law with distance: I ~ 1/r^2. Most lidar sensors apply internal gain compensation, but residual range effects remain.
Combining these factors, the physical model for returned intensity is:
I = rho * cos(theta) / r^2 + noise
In our Gaussian representation, we model this by adding per-Gaussian attributes:
- Reflectance (rho_k): A learned scalar in [0, 1] for each Gaussian, representing the surface reflectance at that location.
- Surface normal (n_k): A unit vector indicating the local surface orientation. This can be derived from the Gaussian covariance (the shortest axis of the ellipsoid) or learned explicitly.
The incidence angle for ray direction d hitting Gaussian k with normal n_k is:
cos(theta_k) = |dot(d, n_k)|
The rendered intensity for a lidar ray is then alpha-composited analogously to depth:
intensity_rendered = sum_k (T_k * w_k * rho_k * cos(theta_k) / t_k^2)
Implementation:
class GaussianLidarAttributes(nn.Module):
"""Per-Gaussian attributes for lidar rendering."""
def __init__(self, n_gaussians: int):
super().__init__()
# Reflectance: logit-space, sigmoid to [0, 1]
self.reflectance_logit = nn.Parameter(torch.zeros(n_gaussians))
# Surface normals: raw 3-vectors, normalized in forward pass
self.normals_raw = nn.Parameter(torch.randn(n_gaussians, 3))
@property
def reflectance(self) -> torch.Tensor:
return torch.sigmoid(self.reflectance_logit)
@property
def normals(self) -> torch.Tensor:
return nn.functional.normalize(self.normals_raw, dim=-1)
def compute_intensity(
self,
ray_directions: torch.Tensor, # (N_rays, 3)
gaussian_indices: torch.Tensor, # (N_contributing,) indices of contributing Gaussians (shared across rays)
depths: torch.Tensor, # (N_rays, N_contributing) depths
weights: torch.Tensor, # (N_rays, N_contributing) alpha weights
) -> torch.Tensor:
"""Compute rendered lidar intensity for each ray."""
rho = self.reflectance[gaussian_indices] # (N_contributing,)
n = self.normals[gaussian_indices] # (N_contributing, 3)
# Incidence angle: |cos(theta)| = |dot(d, n)|
d = ray_directions.unsqueeze(1) # (N_rays, 1, 3)
cos_theta = torch.abs(torch.sum(d * n.unsqueeze(0), dim=-1)) # (N_rays, N_contributing)
# Range attenuation: 1/r^2, clamped for numerical stability
range_atten = 1.0 / (depths**2 + 1e-4)
# Per-Gaussian intensity contribution
intensity_per_gaussian = rho.unsqueeze(0) * cos_theta * range_atten
# Alpha-composite
rendered_intensity = torch.sum(weights * intensity_per_gaussian, dim=1)
return rendered_intensity
Intuition: Intensity modeling matters because downstream perception algorithms (especially those processing lidar data) use intensity as a feature for classification. Lane markings are bright, road surface is dark, vehicles have moderate reflectance with sharp gradients at edges. If your simulator produces flat, uniform intensity, the perception model will behave differently than on real data.
Raydrop Prediction
Not every lidar ray returns a measurement. In real sensors, rays can be "dropped" (no return) for several reasons:
- Absorption: The surface absorbs most of the laser energy (e.g., dark, matte surfaces at long range).
- Specular reflection: A mirror-like surface deflects the beam away from the sensor (e.g., wet roads, glass, polished metal).
- Max range exceeded: The target is beyond the sensor's maximum detection range.
- Beam divergence: At long range, the beam footprint is large and the returned energy is spread over a wide area.
- Occlusion: An object blocks the ray before it reaches the expected target.
Raydrop is critical for realism. Real lidar data has characteristic "holes" -- missing points on dark vehicles, glass surfaces, and at range boundaries. A simulator that returns a point for every ray produces unnaturally dense, clean point clouds that perception models have never seen.
We model raydrop as a learned binary classification problem. For each ray, a small MLP predicts the probability of the ray returning a valid measurement, conditioned on features accumulated along the ray:
p_valid = sigmoid(MLP(f_ray)), with p_drop = 1 - p_valid
where f_ray is a feature vector for the ray, constructed from:
- Accumulated depth: depth_rendered
- Accumulated intensity: intensity_rendered
- Accumulated opacity: sum_k (T_k * w_k) -- total accumulated alpha along the ray
- Ray geometry: elevation angle, range to nearest surface
During training, we supervise with binary cross-entropy against the ground-truth raydrop mask (1 = valid point, 0 = dropped).
Implementation:
class RaydropPredictor(nn.Module):
"""Predict which lidar rays return valid measurements."""
def __init__(self, feature_dim: int = 8, hidden_dim: int = 64):
super().__init__()
self.mlp = nn.Sequential(
nn.Linear(feature_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1),
)
def forward(
self,
rendered_depth: torch.Tensor, # (N_rays,)
rendered_intensity: torch.Tensor, # (N_rays,)
accumulated_alpha: torch.Tensor, # (N_rays,)
ray_elevation: torch.Tensor, # (N_rays,)
) -> torch.Tensor:
"""
Predict raydrop probability for each ray.
Returns:
p_valid: (N_rays,) -- probability that the ray returns a valid point.
Values near 1 = high confidence the point exists.
Values near 0 = likely a dropped ray.
"""
features = torch.stack([
rendered_depth,
rendered_intensity,
accumulated_alpha,
ray_elevation,
torch.log(rendered_depth + 1e-4), # log depth (helps with scale)
torch.sin(ray_elevation), # trigonometric elevation features
torch.cos(ray_elevation),
rendered_depth * accumulated_alpha, # interaction feature
], dim=-1)
logits = self.mlp(features).squeeze(-1)
return torch.sigmoid(logits)
Training detail: Raydrop masks are extracted from the ground-truth lidar data. In the Waymo Open Dataset, a ray is considered "dropped" if no point exists at the corresponding (azimuth, elevation) bin. The raydrop rate varies from 5-30% depending on the scene (more drops in scenes with lots of glass, vegetation, or long-range content).
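The binning described above can be sketched as follows -- a minimal mask extractor, assuming points are already in the lidar frame (this is an illustration, not the Waymo toolkit's own range-image API):

```python
import math
import torch

def raydrop_mask_from_points(
    points: torch.Tensor,            # (N, 3) points in lidar frame
    elevation_angles: torch.Tensor,  # (H,) beam elevations in radians
    n_azimuth: int = 2048,
) -> torch.Tensor:
    """Return an (H, n_azimuth) bool mask: True = valid return, False = dropped."""
    r_xy = torch.linalg.norm(points[:, :2], dim=-1)
    azimuth = torch.atan2(points[:, 1], points[:, 0]) % (2 * math.pi)
    elevation = torch.atan2(points[:, 2], r_xy)
    # Azimuth: uniform bins over [0, 2*pi)
    az_bin = (azimuth / (2 * math.pi) * n_azimuth).long().clamp(0, n_azimuth - 1)
    # Elevation: nearest beam in the (possibly non-uniform) beam table
    el_bin = torch.argmin((elevation[:, None] - elevation_angles[None, :]).abs(), dim=-1)
    mask = torch.zeros(len(elevation_angles), n_azimuth, dtype=torch.bool)
    mask[el_bin, az_bin] = True
    return mask
```

Bins left False become the y_k = 0 targets for the raydrop BCE loss.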
Multi-Sensor Calibration
A key requirement for multi-sensor rendering is precise knowledge of how each sensor is positioned relative to the vehicle. This is captured by extrinsic calibration matrices -- 4x4 homogeneous transforms that convert points between coordinate frames.
Coordinate frames in play:
| Frame | Origin | Axes | Use |
|---|---|---|---|
| World | Arbitrary map origin | Right-handed, Z-up | Scene representation, Gaussian positions |
| Ego | Center of rear axle | X-forward, Y-left, Z-up | Vehicle reference frame |
| Camera | Camera optical center | X-right, Y-down, Z-forward | Image rendering |
| Lidar | Lidar sensor center | X-forward, Y-left, Z-up | Point cloud generation |
The transform chain for converting a world-frame Gaussian position to a camera pixel:
p_ego = T_ego_world @ p_world # world -> ego
p_cam = T_cam_ego @ p_ego # ego -> camera
p_pixel = K @ p_cam[:3] / p_cam[2] # camera -> pixel (pinhole projection)
For lidar rendering, the chain is:
p_ego = T_ego_world @ p_world # world -> ego
p_lidar = T_lidar_ego @ p_ego # ego -> lidar frame
# Then ray casting is performed in lidar frame
Implementation:
class SensorCalibration:
"""Stores and applies sensor extrinsic/intrinsic calibration."""
def __init__(
self,
T_ego_sensor: torch.Tensor, # (4, 4) transform from sensor to ego
intrinsics: torch.Tensor = None, # (3, 3) camera intrinsic matrix (cameras only)
image_size: tuple[int, int] = None, # (H, W) for cameras
):
self.T_ego_sensor = T_ego_sensor
self.T_sensor_ego = torch.inverse(T_ego_sensor)
self.intrinsics = intrinsics
self.image_size = image_size
    def world_to_sensor(
        self,
        points_world: torch.Tensor, # (N, 3)
        T_world_ego: torch.Tensor, # (4, 4) ego pose: maps ego frame to world frame
    ) -> torch.Tensor:
        """Transform points from world frame to sensor frame."""
        T_sensor_world = self.T_sensor_ego @ torch.inverse(T_world_ego)
# Homogeneous coordinates
ones = torch.ones(*points_world.shape[:-1], 1, device=points_world.device)
points_h = torch.cat([points_world, ones], dim=-1) # (N, 4)
points_sensor = (T_sensor_world @ points_h.T).T[:, :3] # (N, 3)
return points_sensor
def project_to_image(
self,
points_cam: torch.Tensor, # (N, 3) in camera frame
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Project 3D points in camera frame to 2D pixel coordinates.
Returns:
pixels: (N, 2) -- (u, v) pixel coordinates
valid: (N,) -- boolean mask for points in front of camera and within image bounds
"""
assert self.intrinsics is not None, "No intrinsics for this sensor"
z = points_cam[:, 2]
valid = z > 0.1 # Must be in front of camera
pixels = (self.intrinsics @ points_cam.T).T # (N, 3)
pixels = pixels[:, :2] / (pixels[:, 2:3] + 1e-8) # (N, 2)
if self.image_size is not None:
H, W = self.image_size
in_bounds = (
(pixels[:, 0] >= 0) & (pixels[:, 0] < W) &
(pixels[:, 1] >= 0) & (pixels[:, 1] < H)
)
valid = valid & in_bounds
return pixels, valid
    def generate_lidar_rays(
        self,
        elevation_angles: torch.Tensor, # (H,) elevation angles in radians
        azimuth_angles: torch.Tensor, # (W,) azimuth angles in radians
        T_world_ego: torch.Tensor, # (4, 4) ego pose: maps ego frame to world frame
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Generate lidar ray origins and directions in world frame.
        Returns:
            origins: (H*W, 3) -- ray origins in world frame
            directions: (H*W, 3) -- unit ray directions in world frame
        """
        # Build direction vectors in sensor frame
        el = elevation_angles.unsqueeze(1).expand(-1, len(azimuth_angles)) # (H, W)
        az = azimuth_angles.unsqueeze(0).expand(len(elevation_angles), -1) # (H, W)
        dx = torch.cos(el) * torch.cos(az)
        dy = torch.cos(el) * torch.sin(az)
        dz = torch.sin(el)
        dirs_sensor = torch.stack([dx, dy, dz], dim=-1).reshape(-1, 3) # (H*W, 3)
        # Transform to world frame: sensor -> ego -> world
        T_world_sensor = T_world_ego @ self.T_ego_sensor
        R = T_world_sensor[:3, :3]
        t = T_world_sensor[:3, 3]
        dirs_world = (R @ dirs_sensor.T).T # (H*W, 3)
        origins_world = t.unsqueeze(0).expand(dirs_sensor.shape[0], -1) # (H*W, 3)
        return origins_world, dirs_world
Calibration consistency is critical: If the camera-lidar extrinsics are even slightly wrong, projecting lidar points onto camera images will show misalignment. In real datasets (Waymo, nuScenes), calibration is provided; in your training pipeline, you should verify alignment before training by overlaying projected lidar depth onto camera images.
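A minimal version of that alignment check, assuming you already have points in the lidar frame, a lidar-to-camera extrinsic T_cam_lidar, and intrinsics K (the names here are illustrative, not from any specific dataset API):

```python
import torch

def project_lidar_to_image(
    points_lidar: torch.Tensor,   # (N, 3) points in lidar frame
    T_cam_lidar: torch.Tensor,    # (4, 4) lidar -> camera extrinsic
    K: torch.Tensor,              # (3, 3) pinhole intrinsics
    image_size: tuple[int, int],  # (H, W)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Project lidar points into the image; return (u, v) pixels and depths of valid points."""
    ones = torch.ones(points_lidar.shape[0], 1)
    p_cam = (T_cam_lidar @ torch.cat([points_lidar, ones], dim=1).T).T[:, :3]
    z = p_cam[:, 2]
    uv = (K @ p_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-8)
    H, W = image_size
    valid = (z > 0.1) & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv[valid], z[valid]

# Overlay with matplotlib: plt.imshow(image); plt.scatter(uv[:, 0], uv[:, 1], c=depth, s=1)
# Projected depth should hug object silhouettes; a consistent pixel offset
# usually indicates an extrinsic (or timestamp synchronization) error.
```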
Joint Optimization
The power of a multi-sensor neural renderer comes from joint training: camera and lidar supervisions are complementary. Camera images provide dense color and texture information but have limited depth accuracy. Lidar provides precise depth and geometry but is sparse and has no color. By optimizing a shared scene representation against both modalities, each sensor helps regularize the other.
Multi-sensor loss function:
L_total = lambda_rgb * L_camera + lambda_depth * L_depth + lambda_intensity * L_intensity + lambda_drop * L_raydrop
where:
- Camera loss (L_camera): A combination of L1 pixel loss and structural similarity:
  L_camera = (1 - lambda_ssim) * |I_rendered - I_gt|_1 + lambda_ssim * (1 - SSIM(I_rendered, I_gt))
  This is the standard loss from 3DGS training. Optionally add LPIPS for perceptual quality.
- Lidar depth loss (L_depth): L1 or Huber loss on rendered vs. ground-truth depth, applied only to rays with valid returns:
  L_depth = (1 / N_valid) * sum_{valid rays} |depth_rendered - depth_gt|
  Using Huber loss instead of L1 is more robust to depth outliers at scene boundaries.
- Lidar intensity loss (L_intensity): MSE between rendered and ground-truth intensity values:
  L_intensity = (1 / N_valid) * sum_{valid rays} (intensity_rendered - intensity_gt)^2
- Raydrop loss (L_raydrop): Binary cross-entropy between predicted and actual raydrop:
  L_raydrop = -sum_k [y_k * log(p_k) + (1 - y_k) * log(1 - p_k)]
  where y_k = 1 for rays with valid returns and y_k = 0 for dropped rays.
Loss weighting strategy: The loss weights (lambda_*) control the relative importance of each modality. A common starting point:
| Loss | Weight | Rationale |
|---|---|---|
| lambda_rgb | 1.0 | Anchor loss -- camera is the densest supervision |
| lambda_ssim | 0.2 | SSIM component within camera loss |
| lambda_depth | 0.1 | Lidar depth -- strong geometric constraint |
| lambda_intensity | 0.05 | Intensity -- secondary lidar signal |
| lambda_drop | 0.01 | Raydrop -- auxiliary classification task |
These weights are tuned empirically. The key insight is that camera loss should dominate early in training (to establish scene structure), and lidar losses become more important later (to refine geometry and add sensor-specific realism).
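One simple way to realize that schedule -- an illustrative linear ramp on the lidar weights, not a prescribed recipe (the step counts here are hypothetical):

```python
def lidar_weight_schedule(
    step: int,
    ramp_start: int = 2000,
    ramp_end: int = 10000,
    lambda_depth_max: float = 0.1,
    lambda_intensity_max: float = 0.05,
) -> tuple[float, float]:
    """Ramp lidar loss weights linearly from 0 to their target values."""
    frac = min(max((step - ramp_start) / (ramp_end - ramp_start), 0.0), 1.0)
    return lambda_depth_max * frac, lambda_intensity_max * frac

# Early steps: the camera loss dominates and establishes scene structure
assert lidar_weight_schedule(0) == (0.0, 0.0)
# After the ramp: lidar losses at full strength refine geometry
assert lidar_weight_schedule(10000) == (0.1, 0.05)
```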
Implementation:
class MultiSensorLoss(nn.Module):
"""Combined loss for joint camera + lidar training."""
def __init__(
self,
lambda_rgb: float = 1.0,
lambda_ssim: float = 0.2,
lambda_depth: float = 0.1,
lambda_intensity: float = 0.05,
lambda_raydrop: float = 0.01,
):
super().__init__()
self.lambda_rgb = lambda_rgb
self.lambda_ssim = lambda_ssim
self.lambda_depth = lambda_depth
self.lambda_intensity = lambda_intensity
self.lambda_raydrop = lambda_raydrop
def camera_loss(
self,
rendered: torch.Tensor, # (B, 3, H, W)
target: torch.Tensor, # (B, 3, H, W)
) -> torch.Tensor:
"""Compute camera rendering loss (L1 + SSIM)."""
l1 = torch.mean(torch.abs(rendered - target))
ssim_val = self._ssim(rendered, target)
return (1 - self.lambda_ssim) * l1 + self.lambda_ssim * (1 - ssim_val)
def lidar_depth_loss(
self,
rendered_depth: torch.Tensor, # (N_rays,)
target_depth: torch.Tensor, # (N_rays,)
valid_mask: torch.Tensor, # (N_rays,) bool
) -> torch.Tensor:
"""Huber loss on lidar depth for valid rays only."""
if valid_mask.sum() == 0:
return torch.tensor(0.0, device=rendered_depth.device)
diff = rendered_depth[valid_mask] - target_depth[valid_mask]
return nn.functional.huber_loss(
rendered_depth[valid_mask],
target_depth[valid_mask],
delta=1.0,
)
def lidar_intensity_loss(
self,
rendered_intensity: torch.Tensor, # (N_rays,)
target_intensity: torch.Tensor, # (N_rays,)
valid_mask: torch.Tensor, # (N_rays,)
) -> torch.Tensor:
"""MSE loss on lidar intensity for valid rays."""
if valid_mask.sum() == 0:
return torch.tensor(0.0, device=rendered_intensity.device)
return nn.functional.mse_loss(
rendered_intensity[valid_mask],
target_intensity[valid_mask],
)
def raydrop_loss(
self,
predicted_prob: torch.Tensor, # (N_rays,)
target_valid: torch.Tensor, # (N_rays,) float: 1.0 = valid, 0.0 = dropped
) -> torch.Tensor:
"""Binary cross-entropy loss for raydrop prediction."""
return nn.functional.binary_cross_entropy(
predicted_prob,
target_valid,
)
def forward(
self,
camera_rendered: torch.Tensor,
camera_target: torch.Tensor,
lidar_depth_rendered: torch.Tensor,
lidar_depth_target: torch.Tensor,
lidar_intensity_rendered: torch.Tensor,
lidar_intensity_target: torch.Tensor,
raydrop_pred: torch.Tensor,
raydrop_target: torch.Tensor,
valid_mask: torch.Tensor,
) -> dict[str, torch.Tensor]:
"""Compute all losses and return individually + total."""
l_cam = self.camera_loss(camera_rendered, camera_target)
l_depth = self.lidar_depth_loss(lidar_depth_rendered, lidar_depth_target, valid_mask)
l_intensity = self.lidar_intensity_loss(
lidar_intensity_rendered, lidar_intensity_target, valid_mask
)
l_drop = self.raydrop_loss(raydrop_pred, raydrop_target)
total = (
self.lambda_rgb * l_cam +
self.lambda_depth * l_depth +
self.lambda_intensity * l_intensity +
self.lambda_raydrop * l_drop
)
return {
'total': total,
'camera': l_cam,
'depth': l_depth,
'intensity': l_intensity,
'raydrop': l_drop,
}
@staticmethod
def _ssim(
img1: torch.Tensor,
img2: torch.Tensor,
window_size: int = 11,
) -> torch.Tensor:
"""Compute SSIM between two image tensors (B, C, H, W)."""
C1 = 0.01 ** 2
C2 = 0.03 ** 2
# Create Gaussian window
sigma = 1.5
coords = torch.arange(window_size, dtype=torch.float32, device=img1.device)
coords -= window_size // 2
g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
window = g.unsqueeze(0) * g.unsqueeze(1)
window = window / window.sum()
window = window.unsqueeze(0).unsqueeze(0)
C = img1.shape[1]
window = window.expand(C, 1, -1, -1)
mu1 = nn.functional.conv2d(img1, window, padding=window_size//2, groups=C)
mu2 = nn.functional.conv2d(img2, window, padding=window_size//2, groups=C)
mu1_sq = mu1 ** 2
mu2_sq = mu2 ** 2
mu12 = mu1 * mu2
sigma1_sq = nn.functional.conv2d(img1 * img1, window, padding=window_size//2, groups=C) - mu1_sq
sigma2_sq = nn.functional.conv2d(img2 * img2, window, padding=window_size//2, groups=C) - mu2_sq
sigma12 = nn.functional.conv2d(img1 * img2, window, padding=window_size//2, groups=C) - mu12
ssim_map = ((2 * mu12 + C1) * (2 * sigma12 + C2)) / \
((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))
return ssim_map.mean()
Step-by-Step Implementation Guide
Step 1: Environment Setup (30 min)
Goal: Set up the development environment with all required packages and verify GPU availability.
Create and activate a virtual environment:
mkdir -p multi-sensor-neural-renderer && cd multi-sensor-neural-renderer
python -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install torch torchvision # Core framework
pip install numpy matplotlib # Numerics and visualization
pip install open3d # 3D point cloud visualization
pip install scipy # Linear algebra utilities
pip install opencv-python # Image I/O and processing
pip install tqdm # Progress bars
pip install scikit-image # Image quality metrics (SSIM, PSNR)
pip install lpips # Perceptual similarity metric
pip install plyfile # PLY file I/O for point clouds
pip install ipykernel jupyter # Notebook support
Or install from requirements.txt:
torch>=2.0.0
torchvision>=0.15.0
numpy>=1.24.0
matplotlib>=3.7.0
open3d>=0.17.0
scipy>=1.10.0
opencv-python>=4.8.0
tqdm>=4.65.0
scikit-image>=0.21.0
lpips>=0.1.4
plyfile>=1.0.0
ipykernel>=6.0.0
jupyter>=1.0.0
Verify the environment:
# verify_setup.py
import torch
import numpy as np
import open3d as o3d
import matplotlib.pyplot as plt
from scipy.spatial.transform import Rotation
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Test basic 3D operations
means = torch.randn(100, 3)
print(f"\nCreated {means.shape[0]} random 3D Gaussians")
print(f"Bounding box: [{means.min(0).values.numpy()}] to [{means.max(0).values.numpy()}]")
# Test Open3D
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(means.numpy())
print(f"Open3D point cloud: {len(pcd.points)} points")
# Test rotation
R = Rotation.from_euler('xyz', [10, 20, 30], degrees=True).as_matrix()
print(f"\nRotation matrix shape: {R.shape}")
print(f"Determinant: {np.linalg.det(R):.6f} (should be 1.0)")
print("\n=== Setup verified successfully! ===")
Prepare synthetic test data:
For initial development and testing, we create a synthetic Gaussian scene rather than loading real driving data. This lets you validate the rendering math before dealing with dataset complexities.
# create_synthetic_scene.py
import torch
import numpy as np
import json
def create_synthetic_gaussian_scene(
n_gaussians: int = 500,
scene_extent: float = 20.0,
save_path: str = "data/synthetic_scene.pt",
):
"""
Create a synthetic scene with Gaussians arranged to resemble
a simplified driving environment (ground plane + objects).
"""
    means = []
    scales = []
    reflectances = []
    normals = []
# Ground plane: flat Gaussians spread across XY plane
n_ground = n_gaussians // 2
ground_xy = torch.rand(n_ground, 2) * scene_extent - scene_extent / 2
ground_z = torch.zeros(n_ground, 1) + torch.randn(n_ground, 1) * 0.02
ground_means = torch.cat([ground_xy, ground_z], dim=1)
ground_scales = torch.tensor([0.5, 0.5, 0.01]).unsqueeze(0).expand(n_ground, -1)
ground_normals = torch.tensor([0.0, 0.0, 1.0]).unsqueeze(0).expand(n_ground, -1)
ground_reflectances = torch.full((n_ground,), 0.3) # Asphalt-like
means.append(ground_means)
scales.append(ground_scales)
normals.append(ground_normals)
reflectances.append(ground_reflectances)
# Box objects: cubes of Gaussians representing vehicles
n_objects = 5
n_per_object = (n_gaussians - n_ground) // n_objects
for i in range(n_objects):
center = torch.tensor([
(torch.rand(1).item() - 0.5) * scene_extent * 0.6,
(torch.rand(1).item() - 0.5) * scene_extent * 0.6,
0.8 + torch.rand(1).item() * 0.5,
])
obj_means = center.unsqueeze(0) + torch.randn(n_per_object, 3) * torch.tensor([1.0, 0.5, 0.4])
obj_scales = torch.rand(n_per_object, 3) * 0.2 + 0.05
# Normals point outward from center
obj_normals = obj_means - center.unsqueeze(0)
obj_normals = obj_normals / (obj_normals.norm(dim=-1, keepdim=True) + 1e-8)
obj_reflectances = torch.full((n_per_object,), 0.5 + torch.rand(1).item() * 0.3)
means.append(obj_means)
scales.append(obj_scales)
normals.append(obj_normals)
reflectances.append(obj_reflectances)
    all_means = torch.cat(means, dim=0)
    # Derive the final count from what was actually created: the ground/object
    # split only sums to n_gaussians when the counts divide evenly
    n_total = all_means.shape[0]
    scene = {
        'means': all_means,
        'scales': torch.cat(scales, dim=0),
        'normals': torch.cat(normals, dim=0),
        'reflectances': torch.cat(reflectances, dim=0),
        'opacities': torch.ones(n_total) * 0.9,
        'sh_coeffs': torch.randn(n_total, 3, 16),  # SH degree 3: (3+1)^2 = 16 coefficients
    }
    import os
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    torch.save(scene, save_path)
    print(f"Saved synthetic scene with {n_total} Gaussians to {save_path}")
    return scene
if __name__ == "__main__":
create_synthetic_gaussian_scene()
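One subtlety worth noticing before moving on: the ground/object split only reproduces `n_gaussians` exactly when the counts divide evenly. A quick standalone check (the helper name is ours) makes the arithmetic concrete:

```python
def actual_count(n_gaussians: int, n_objects: int = 5) -> int:
    """Number of Gaussians the synthetic scene actually creates."""
    n_ground = n_gaussians // 2
    n_per_object = (n_gaussians - n_ground) // n_objects
    return n_ground + n_objects * n_per_object

print(actual_count(500))  # 500 -- the split divides evenly
print(actual_count(503))  # 501 -- two Gaussians are silently dropped
```

This is why the scene dict should size `opacities` and `sh_coeffs` from the concatenated means rather than from `n_gaussians`.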
Step 2: Lidar Ray Casting Through Gaussians (Notebook 01, ~3 hours)
Goal: Implement a complete lidar rendering pipeline that casts rays through a Gaussian scene and produces depth, intensity, and raydrop outputs.
Lidar beam pattern: A typical spinning lidar has a fixed set of elevation angles (non-uniformly spaced, denser near the horizon) and sweeps 360 degrees in azimuth:
def create_lidar_beam_pattern(
n_beams: int = 64,
n_azimuth: int = 2048,
fov_up: float = 3.0, # degrees above horizon
fov_down: float = -25.0, # degrees below horizon
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Create a realistic lidar beam pattern.
Returns:
elevation_angles: (n_beams,) in radians
azimuth_angles: (n_azimuth,) in radians
"""
# Non-uniform elevation spacing: denser near horizon
# This mimics real Velodyne/Ouster beam patterns
elevation_deg = torch.linspace(fov_down, fov_up, n_beams)
    # Apply non-linear spacing: an exponent < 1 keeps the FOV endpoints fixed
    # while pushing intermediate beams toward the edges of the FOV, roughly
    # mimicking the denser-near-horizon layout of real sensors
    center = (fov_up + fov_down) / 2
    half_span = max(abs(fov_up - center), abs(fov_down - center))
    delta = elevation_deg - center
    elevation_deg = center + torch.sign(delta) * half_span * (delta.abs() / half_span) ** 0.7
elevation_rad = torch.deg2rad(elevation_deg)
azimuth_rad = torch.linspace(0, 2 * np.pi, n_azimuth + 1)[:-1] # Exclude 2*pi (same as 0)
return elevation_rad, azimuth_rad
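The elevation/azimuth grid defines one ray per (beam, azimuth) pair. A small numpy sketch (the helper name is ours) of turning the two angle vectors into unit ray directions in the sensor frame, matching the `x = cos(el)cos(az)` convention used later when reconstructing point clouds from depth:

```python
import numpy as np

def beam_directions(elevation_rad: np.ndarray, azimuth_rad: np.ndarray) -> np.ndarray:
    """(H, W, 3) unit ray directions in the sensor frame."""
    el = elevation_rad[:, None]   # (H, 1)
    az = azimuth_rad[None, :]     # (1, W)
    x = np.cos(el) * np.cos(az)
    y = np.cos(el) * np.sin(az)
    z = np.broadcast_to(np.sin(el), x.shape)  # elevation only depends on the beam
    return np.stack([x, y, z], axis=-1)

dirs = beam_directions(np.deg2rad(np.linspace(-25.0, 3.0, 64)),
                       np.linspace(0.0, 2 * np.pi, 2048, endpoint=False))
print(dirs.shape)                                        # (64, 2048, 3)
print(np.allclose(np.linalg.norm(dirs, axis=-1), 1.0))   # True -- all unit length
```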
Full lidar rendering function:
def render_lidar(
ray_origins: torch.Tensor, # (N_rays, 3)
ray_directions: torch.Tensor, # (N_rays, 3)
means: torch.Tensor, # (M, 3)
covariances_inv: torch.Tensor, # (M, 3, 3)
opacities: torch.Tensor, # (M,)
reflectances: torch.Tensor, # (M,)
normals: torch.Tensor, # (M, 3)
max_range: float = 75.0,
n_nearest: int = 64,
) -> dict[str, torch.Tensor]:
"""
Render lidar depth, intensity, and accumulated alpha for all rays.
For efficiency, only consider the N nearest Gaussians to each ray.
Returns:
dict with keys:
'depth': (N_rays,) rendered depth
'intensity': (N_rays,) rendered intensity
'alpha': (N_rays,) accumulated opacity
"""
N = ray_origins.shape[0]
M = means.shape[0]
# Step 1: Find nearest Gaussians to each ray (approximate with distance to ray origin)
# In production, use spatial data structures; here we use brute force on the subset
dists = torch.cdist(ray_origins[:1], means) # (1, M) -- same origin for all rays in a sweep
_, nearest_idx = torch.topk(dists[0], k=min(n_nearest, M), largest=False)
# Subset scene
sub_means = means[nearest_idx] # (K, 3)
sub_cov_inv = covariances_inv[nearest_idx] # (K, 3, 3)
sub_opacities = opacities[nearest_idx] # (K,)
sub_reflectances = reflectances[nearest_idx] # (K,)
sub_normals = normals[nearest_idx] # (K, 3)
K = sub_means.shape[0]
# Step 2: Compute ray-Gaussian intersection depths
delta = sub_means.unsqueeze(0) - ray_origins.unsqueeze(1) # (N, K, 3)
d = ray_directions.unsqueeze(1).expand(-1, K, -1) # (N, K, 3)
# d^T Sigma^{-1}: (N, K, 3)
d_Sinv = torch.einsum('kij,nkj->nki', sub_cov_inv, d)
# t_peak = (d^T Sigma^{-1} delta) / (d^T Sigma^{-1} d)
numerator = torch.sum(d_Sinv * delta, dim=-1) # (N, K)
denominator = torch.sum(d_Sinv * d, dim=-1) # (N, K)
t_peak = numerator / (denominator + 1e-8) # (N, K)
    # Ray-space variance (useful for modeling beam divergence; unused below)
    sigma_ray_sq = 1.0 / (denominator + 1e-8)  # (N, K)
    # Gaussian response at the peak: the exponent is the minimum Mahalanobis
    # distance between ray and mean,
    #   f(t_peak) = delta^T Sigma^{-1} delta - (d^T Sigma^{-1} delta)^2 / (d^T Sigma^{-1} d)
    # so rays that pass far from a Gaussian receive little opacity from it
    delta_Sinv = torch.einsum('kij,nkj->nki', sub_cov_inv, delta)  # (N, K, 3)
    maha_origin = torch.sum(delta_Sinv * delta, dim=-1)            # (N, K)
    peak_maha = (maha_origin - numerator ** 2 / (denominator + 1e-8)).clamp(min=0.0)
    power = sub_opacities.unsqueeze(0) * torch.exp(-0.5 * peak_maha)  # (N, K)
    # Mask: only count Gaussians in front of the ray and within max range
    valid_mask = (t_peak > 0.1) & (t_peak < max_range)
    power = power * valid_mask.float()
# Step 3: Alpha compositing (sort by depth, accumulate front-to-back)
# Set invalid depths to large value so they sort last
t_for_sort = t_peak.clone()
t_for_sort[~valid_mask] = max_range + 1.0
sorted_indices = torch.argsort(t_for_sort, dim=1)
sorted_t = torch.gather(t_for_sort, 1, sorted_indices)
sorted_power = torch.gather(power, 1, sorted_indices)
# Transmittance
one_minus_alpha = 1.0 - sorted_power
transmittance = torch.cat([
torch.ones(N, 1, device=ray_origins.device),
torch.cumprod(one_minus_alpha[:, :-1], dim=1),
], dim=1)
final_weights = transmittance * sorted_power # (N, K)
    # Rendered depth: expected depth under the compositing weights. The weights
    # may sum to < 1 for grazing rays; optionally normalize by the accumulated
    # alpha computed below to get a true expectation over hit surfaces.
    rendered_depth = torch.sum(final_weights * sorted_t, dim=1)  # (N,)
    # Step 4: Intensity -- gather reflectances and normals in the same sorted
    # order; advanced indexing with the (N, K) index tensor does this directly
    sorted_reflectances = sub_reflectances[sorted_indices]  # (N, K)
    sorted_normals = sub_normals[sorted_indices]            # (N, K, 3)
# Incidence angle
cos_theta = torch.abs(torch.sum(
ray_directions.unsqueeze(1) * sorted_normals, dim=-1
)) # (N, K)
# Range attenuation
range_atten = 1.0 / (sorted_t ** 2 + 1.0)
intensity_per_gaussian = sorted_reflectances * cos_theta * range_atten
rendered_intensity = torch.sum(final_weights * intensity_per_gaussian, dim=1)
# Accumulated alpha (total opacity along ray)
accumulated_alpha = torch.sum(final_weights, dim=1)
return {
'depth': rendered_depth,
'intensity': rendered_intensity,
'alpha': accumulated_alpha,
}
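Before trusting `render_lidar` on full sweeps, it helps to hand-check the peak-depth formula on a case with a known answer. For an isotropic Gaussian (Sigma = sigma^2 I) the covariance cancels and t_peak reduces to the orthogonal projection of the mean onto the ray. A small numpy check (the values are arbitrary):

```python
import numpy as np

o = np.array([0.0, 0.0, 0.0])        # ray origin
d = np.array([1.0, 0.0, 0.0])        # unit ray direction along +x
mu = np.array([10.0, 2.0, 0.0])      # Gaussian mean, offset 2 m from the ray
Sigma_inv = np.eye(3) / 0.5 ** 2     # isotropic, sigma = 0.5

delta = mu - o
numerator = d @ Sigma_inv @ delta
denominator = d @ Sigma_inv @ d
t_peak = numerator / denominator
print(t_peak)  # 10.0 -- the point of closest approach to the mean along the ray
```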
Visualizing the rendered point cloud (this helper assumes open3d and matplotlib are installed):
import open3d as o3d
import matplotlib.pyplot as plt

def visualize_lidar_rendering(
depth: torch.Tensor, # (H*W,) or (H, W)
intensity: torch.Tensor, # same shape
elevation_angles: torch.Tensor, # (H,)
azimuth_angles: torch.Tensor, # (W,)
max_range: float = 75.0,
):
"""Convert rendered depth/intensity to 3D point cloud and visualize."""
H = len(elevation_angles)
W = len(azimuth_angles)
depth_2d = depth.reshape(H, W)
intensity_2d = intensity.reshape(H, W)
# Build direction vectors
el = elevation_angles.unsqueeze(1).expand(H, W)
az = azimuth_angles.unsqueeze(0).expand(H, W)
x = depth_2d * torch.cos(el) * torch.cos(az)
y = depth_2d * torch.cos(el) * torch.sin(az)
z = depth_2d * torch.sin(el)
points = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
intensities = intensity_2d.reshape(-1)
# Filter by range
valid = depth.reshape(-1) < max_range * 0.95
points = points[valid].numpy()
intensities = intensities[valid].numpy()
# Create Open3D point cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
# Color by intensity
colors = plt.cm.viridis(intensities / (intensities.max() + 1e-8))[:, :3]
pcd.colors = o3d.utility.Vector3dVector(colors)
# Also create matplotlib visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Range image
axes[0].imshow(depth_2d.numpy(), cmap='turbo', vmin=0, vmax=max_range)
axes[0].set_title('Range Image (depth)')
axes[0].set_xlabel('Azimuth')
axes[0].set_ylabel('Elevation')
plt.colorbar(axes[0].images[0], ax=axes[0], label='Depth (m)')
# Intensity image
axes[1].imshow(intensity_2d.numpy(), cmap='gray')
axes[1].set_title('Intensity Image')
axes[1].set_xlabel('Azimuth')
axes[1].set_ylabel('Elevation')
plt.colorbar(axes[1].images[0], ax=axes[1], label='Intensity')
# Bird's eye view
bev_valid = points[:, 2] > -2.0 # Filter underground points
axes[2].scatter(
points[bev_valid, 0], points[bev_valid, 1],
c=intensities[bev_valid], cmap='viridis', s=0.1, alpha=0.5
)
axes[2].set_title("Bird's Eye View")
axes[2].set_xlabel('X (m)')
axes[2].set_ylabel('Y (m)')
axes[2].set_aspect('equal')
axes[2].set_xlim(-40, 40)
axes[2].set_ylim(-40, 40)
plt.tight_layout()
plt.show()
return pcd
Exercise 1: Modify the beam pattern to simulate a Velodyne VLP-32C (32 beams, non-uniform spacing concentrated near the horizon, 360-degree azimuth with 0.2-degree resolution). How does reducing beam count from 64 to 32 affect the rendered point cloud density?
Exercise 2: Add Gaussian noise to the rendered depth values (sigma = 0.02m) and random dropout (2% of points). How does this compare visually to the clean rendering?
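For Exercise 2, a minimal noise model could look like the following sketch (the helper name and the zero-depth convention for dropped returns are our choices, not part of the pipeline above):

```python
import numpy as np

def add_lidar_noise(depth: np.ndarray, sigma: float = 0.02,
                    dropout: float = 0.02, seed: int = 0) -> np.ndarray:
    """Additive Gaussian range noise plus random raydrop; dropped returns
    are marked with depth 0 (a common range-image convention)."""
    rng = np.random.default_rng(seed)
    noisy = depth + rng.normal(0.0, sigma, size=depth.shape)
    drop = rng.random(depth.shape) < dropout
    noisy[drop] = 0.0
    return noisy

depth = np.full(100_000, 30.0)
noisy = add_lidar_noise(depth)
print(abs((noisy == 0.0).mean() - 0.02) < 0.005)  # True -- dropout rate is ~2%
```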
Step 3: Multi-Sensor Joint Rendering (Notebook 02, ~4 hours)
Goal: Build a unified scene representation that renders both camera RGB images and lidar point clouds, with proper coordinate frame handling.
Unified Gaussian scene representation:
import torch
import torch.nn as nn

class MultiSensorGaussianScene(nn.Module):
"""
A 3D Gaussian Splatting scene with attributes for both
camera (color) and lidar (reflectance, normal) rendering.
"""
def __init__(self, n_gaussians: int):
super().__init__()
self.n_gaussians = n_gaussians
# Shared geometry
self.means = nn.Parameter(torch.randn(n_gaussians, 3))
self.scales_raw = nn.Parameter(torch.zeros(n_gaussians, 3)) # log-scale
self.rotations = nn.Parameter(torch.zeros(n_gaussians, 4)) # quaternions
self.rotations.data[:, 0] = 1.0 # Initialize as identity
self.opacities_raw = nn.Parameter(torch.zeros(n_gaussians)) # logit-space
# Camera-specific: spherical harmonics for view-dependent color
self.sh_coeffs = nn.Parameter(torch.zeros(n_gaussians, 3, 16)) # SH degree 3
# Lidar-specific
self.reflectances_raw = nn.Parameter(torch.zeros(n_gaussians)) # logit-space
self.normals_raw = nn.Parameter(torch.randn(n_gaussians, 3))
# Raydrop predictor
self.raydrop_net = RaydropPredictor(feature_dim=8, hidden_dim=64)
@property
def scales(self) -> torch.Tensor:
return torch.exp(self.scales_raw)
@property
def opacities(self) -> torch.Tensor:
return torch.sigmoid(self.opacities_raw)
@property
def reflectances(self) -> torch.Tensor:
return torch.sigmoid(self.reflectances_raw)
@property
def normals(self) -> torch.Tensor:
return nn.functional.normalize(self.normals_raw, dim=-1)
def get_covariance_matrices(self) -> torch.Tensor:
"""
Build 3x3 covariance matrices from quaternion rotations and scales.
Sigma = R @ S @ S^T @ R^T
Returns:
covariances: (N, 3, 3)
"""
R = quaternion_to_rotation_matrix(self.rotations) # (N, 3, 3)
S = torch.diag_embed(self.scales) # (N, 3, 3)
return R @ S @ S.transpose(-1, -2) @ R.transpose(-1, -2)
def get_covariance_inv(self) -> torch.Tensor:
"""Inverse covariance matrices."""
R = quaternion_to_rotation_matrix(self.rotations)
S_inv = torch.diag_embed(1.0 / (self.scales + 1e-8))
return R @ S_inv @ S_inv.transpose(-1, -2) @ R.transpose(-1, -2)
def render_camera(
self,
camera_calib: 'SensorCalibration',
ego_pose: torch.Tensor, # (4, 4)
) -> torch.Tensor:
"""
Render an RGB image using standard 3DGS rasterization.
This is a simplified version. In production, use the cuda-based
differentiable rasterizer from the original 3DGS codebase.
Returns:
image: (3, H, W) rendered RGB image
"""
H, W = camera_calib.image_size
# Transform Gaussians to camera frame
points_cam = camera_calib.world_to_sensor(self.means, ego_pose)
# Project to pixels
pixels, valid = camera_calib.project_to_image(points_cam)
# For each valid Gaussian, compute its 2D projected covariance
# and splat its color onto the image
# (Simplified: use point splatting instead of full Gaussian splatting)
        image = torch.zeros(3, H, W, device=self.means.device)
        valid_indices = torch.where(valid)[0]
        if len(valid_indices) == 0:
            return image
        # Sort back-to-front (farthest first) so the "over" blend below
        # composites nearer Gaussians on top of farther ones
        depths = points_cam[valid_indices, 2]
        sort_idx = torch.argsort(depths, descending=True)
        valid_indices = valid_indices[sort_idx]
        for idx in valid_indices:
            u, v = int(pixels[idx, 0].item()), int(pixels[idx, 1].item())
            if 0 <= u < W and 0 <= v < H:
                # Evaluate SH at viewing direction for color
                color = self._eval_sh(idx, -points_cam[idx] / (points_cam[idx].norm() + 1e-8))
                alpha = self.opacities[idx]
                # Simple "over" alpha blending (not full Gaussian splatting)
                image[:, v, u] = image[:, v, u] * (1 - alpha) + color * alpha
        return image
def render_lidar(
self,
lidar_calib: 'SensorCalibration',
ego_pose: torch.Tensor,
elevation_angles: torch.Tensor,
azimuth_angles: torch.Tensor,
max_range: float = 75.0,
) -> dict[str, torch.Tensor]:
"""
Render lidar depth, intensity, and raydrop prediction.
Returns:
dict with 'depth', 'intensity', 'raydrop_prob', 'alpha'
"""
# Generate rays in world frame
origins, directions = lidar_calib.generate_lidar_rays(
elevation_angles, azimuth_angles, ego_pose
)
# Cast rays through Gaussian scene
result = render_lidar(
ray_origins=origins,
ray_directions=directions,
means=self.means,
covariances_inv=self.get_covariance_inv(),
opacities=self.opacities,
reflectances=self.reflectances,
normals=self.normals,
max_range=max_range,
)
# Predict raydrop
H = len(elevation_angles)
W = len(azimuth_angles)
ray_el = elevation_angles.unsqueeze(1).expand(H, W).reshape(-1)
raydrop_prob = self.raydrop_net(
rendered_depth=result['depth'],
rendered_intensity=result['intensity'],
accumulated_alpha=result['alpha'],
ray_elevation=ray_el,
)
result['raydrop_prob'] = raydrop_prob
return result
def _eval_sh(self, idx: int, view_dir: torch.Tensor) -> torch.Tensor:
"""Evaluate spherical harmonics for Gaussian idx at view direction."""
# Simplified: return first SH coefficient (DC term = base color)
return torch.sigmoid(self.sh_coeffs[idx, :, 0])
def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
"""
Convert quaternions to rotation matrices.
Args:
q: (N, 4) quaternions [w, x, y, z]
Returns:
R: (N, 3, 3) rotation matrices
"""
q = nn.functional.normalize(q, dim=-1)
w, x, y, z = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
R = torch.stack([
1 - 2*y*y - 2*z*z, 2*x*y - 2*w*z, 2*x*z + 2*w*y,
2*x*y + 2*w*z, 1 - 2*x*x - 2*z*z, 2*y*z - 2*w*x,
2*x*z - 2*w*y, 2*y*z + 2*w*x, 1 - 2*x*x - 2*y*y,
], dim=-1).reshape(-1, 3, 3)
return R
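Two quick sanity checks for the conversion, written in plain numpy so they run independently of the scene class: the identity quaternion must give the identity matrix, and a 90-degree yaw about z must map +x to +y.

```python
import numpy as np

def quat_to_R(w: float, x: float, y: float, z: float) -> np.ndarray:
    """Same [w, x, y, z] convention and formula as the torch version."""
    n = np.sqrt(w*w + x*x + y*y + z*z)
    w, x, y, z = w/n, x/n, y/n, z/n
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*w*z,     2*x*z + 2*w*y],
        [2*x*y + 2*w*z,     1 - 2*x*x - 2*z*z, 2*y*z - 2*w*x],
        [2*x*z - 2*w*y,     2*y*z + 2*w*x,     1 - 2*x*x - 2*y*y],
    ])

print(np.allclose(quat_to_R(1, 0, 0, 0), np.eye(3)))          # True: identity
s = np.sin(np.pi / 4)
R_yaw90 = quat_to_R(np.cos(np.pi / 4), 0.0, 0.0, s)
print(np.allclose(R_yaw90 @ np.array([1, 0, 0]), [0, 1, 0]))  # True: +x -> +y
```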
Side-by-side rendering demonstration:
def demonstrate_joint_rendering(scene, camera_calib, lidar_calib, ego_pose):
"""Render both camera and lidar from the same scene and display side by side."""
# Camera rendering
camera_image = scene.render_camera(camera_calib, ego_pose)
# Lidar rendering
elevation_angles, azimuth_angles = create_lidar_beam_pattern(
n_beams=64, n_azimuth=1024
)
lidar_result = scene.render_lidar(
lidar_calib, ego_pose, elevation_angles, azimuth_angles
)
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
# Camera image
cam_img = camera_image.detach().permute(1, 2, 0).clamp(0, 1).numpy()
axes[0, 0].imshow(cam_img)
axes[0, 0].set_title('Camera RGB Rendering')
axes[0, 0].axis('off')
# Lidar range image
H, W = len(elevation_angles), len(azimuth_angles)
depth_img = lidar_result['depth'].detach().reshape(H, W).numpy()
im1 = axes[0, 1].imshow(depth_img, cmap='turbo', vmin=0, vmax=75)
axes[0, 1].set_title('Lidar Range Image')
plt.colorbar(im1, ax=axes[0, 1])
# Lidar intensity image
int_img = lidar_result['intensity'].detach().reshape(H, W).numpy()
im2 = axes[1, 0].imshow(int_img, cmap='gray')
axes[1, 0].set_title('Lidar Intensity Image')
plt.colorbar(im2, ax=axes[1, 0])
# Raydrop probability
drop_img = lidar_result['raydrop_prob'].detach().reshape(H, W).numpy()
im3 = axes[1, 1].imshow(drop_img, cmap='RdYlGn', vmin=0, vmax=1)
axes[1, 1].set_title('Raydrop Probability (green=valid)')
plt.colorbar(im3, ax=axes[1, 1])
plt.suptitle('Multi-Sensor Neural Rendering -- Joint Output', fontsize=14)
plt.tight_layout()
plt.show()
Projecting lidar onto camera for consistency check:
def project_lidar_onto_camera(
lidar_depth: torch.Tensor,
elevation_angles: torch.Tensor,
azimuth_angles: torch.Tensor,
lidar_calib: 'SensorCalibration',
camera_calib: 'SensorCalibration',
ego_pose: torch.Tensor,
camera_image: torch.Tensor, # (3, H, W)
):
"""
Project lidar points onto the camera image to verify calibration consistency.
Overlays colored depth points on the camera rendering.
"""
H_l = len(elevation_angles)
W_l = len(azimuth_angles)
# Reconstruct 3D points from lidar depth
el = elevation_angles.unsqueeze(1).expand(H_l, W_l)
az = azimuth_angles.unsqueeze(0).expand(H_l, W_l)
depth_2d = lidar_depth.reshape(H_l, W_l)
x = depth_2d * torch.cos(el) * torch.cos(az)
y = depth_2d * torch.cos(el) * torch.sin(az)
z = depth_2d * torch.sin(el)
points_lidar = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
depths = lidar_depth.reshape(-1)
    # Transform: lidar frame -> ego frame -> camera frame. Both sensors ride
    # on the same ego vehicle, so the ego->world transform cancels and
    # ego_pose is not actually needed here.
    T_ego_lidar = lidar_calib.T_ego_sensor
    T_cam_ego = camera_calib.T_sensor_ego
    ones = torch.ones(points_lidar.shape[0], 1)
    p_h = torch.cat([points_lidar, ones], dim=1)
    p_ego = (T_ego_lidar @ p_h.T).T[:, :3]
    p_cam = (T_cam_ego @ torch.cat([p_ego, ones], dim=1).T).T[:, :3]
# Project to image
pixels, valid = camera_calib.project_to_image(p_cam)
# Overlay on camera image
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
cam_img = camera_image.detach().permute(1, 2, 0).clamp(0, 1).numpy()
ax.imshow(cam_img)
valid_pixels = pixels[valid].detach().numpy()
valid_depths = depths[valid].detach().numpy()
scatter = ax.scatter(
valid_pixels[:, 0], valid_pixels[:, 1],
c=valid_depths, cmap='turbo', s=1, alpha=0.7,
vmin=0, vmax=75,
)
plt.colorbar(scatter, ax=ax, label='Depth (m)')
ax.set_title('Lidar Points Projected onto Camera Image')
ax.axis('off')
plt.tight_layout()
plt.show()
Exercise 3: Create a scene with a retroreflective traffic sign (reflectance = 0.95) and a matte black vehicle (reflectance = 0.15). Render lidar from the same viewpoint and verify that the intensity values differ by the expected ratio. Overlay both the camera image and lidar points to confirm geometric alignment.
Exercise 4: Rotate the ego vehicle by 45 degrees and re-render both sensors. Verify that the same scene objects appear in consistent positions across both modalities.
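A sketch of the pose construction for Exercise 4 (numpy; the helper name is ours): left-multiplying the 4x4 ego pose by a yaw rotation about the world z axis rotates both the heading and the position about the world origin, while right-multiplying would rotate in place about the ego frame.

```python
import numpy as np

def yaw_pose(base_pose: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Return base_pose rotated by yaw_deg about the world z axis."""
    t = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t),  np.cos(t), 0.0],
                  [0.0,        0.0,       1.0]])
    T = np.eye(4)
    T[:3, :3] = R
    return T @ base_pose

pose = yaw_pose(np.eye(4), 45.0)
# Rotation block stays orthonormal, i.e. still a valid rigid transform
print(np.allclose(pose[:3, :3] @ pose[:3, :3].T, np.eye(3)))  # True
```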
Step 4: Joint Training and Evaluation (Notebook 03, ~4 hours)
Goal: Train the multi-sensor Gaussian scene from paired camera images and lidar sweeps, evaluate quality with modality-specific metrics, and demonstrate novel view/sweep synthesis.
Training loop:
import numpy as np
import torch
from tqdm import tqdm

def train_multi_sensor(
scene: MultiSensorGaussianScene,
camera_data: list[dict], # [{'image': Tensor, 'calib': SensorCalibration, 'ego_pose': Tensor}, ...]
lidar_data: list[dict], # [{'depth': Tensor, 'intensity': Tensor, 'valid': Tensor, 'calib': ..., 'ego_pose': ...}, ...]
n_iterations: int = 5000,
lr: float = 1e-3,
lr_means: float = 1e-4,
) -> dict:
"""
Joint training loop for multi-sensor Gaussian scene.
Returns:
Training history dict with per-iteration losses.
"""
# Separate learning rates for different parameter groups
optimizer = torch.optim.Adam([
{'params': [scene.means], 'lr': lr_means},
{'params': [scene.scales_raw, scene.rotations, scene.opacities_raw], 'lr': lr},
{'params': [scene.sh_coeffs], 'lr': lr},
{'params': [scene.reflectances_raw, scene.normals_raw], 'lr': lr * 0.5},
{'params': scene.raydrop_net.parameters(), 'lr': lr * 0.1},
])
loss_fn = MultiSensorLoss(
lambda_rgb=1.0,
lambda_depth=0.1,
lambda_intensity=0.05,
lambda_raydrop=0.01,
)
    # Loss-weight schedule (not a learning-rate schedule): warm up lidar losses
def get_lidar_weight_multiplier(iteration: int) -> float:
"""Ramp up lidar loss contribution over first 1000 iterations."""
return min(1.0, iteration / 1000.0)
elevation_angles, azimuth_angles = create_lidar_beam_pattern(n_beams=32, n_azimuth=512)
history = {'total': [], 'camera': [], 'depth': [], 'intensity': [], 'raydrop': []}
for it in tqdm(range(n_iterations), desc="Training"):
optimizer.zero_grad()
# Sample a random training frame
cam_idx = np.random.randint(len(camera_data))
lidar_idx = np.random.randint(len(lidar_data))
cam = camera_data[cam_idx]
lid = lidar_data[lidar_idx]
# Render camera
rendered_image = scene.render_camera(cam['calib'], cam['ego_pose'])
# Render lidar
lidar_result = scene.render_lidar(
lid['calib'], lid['ego_pose'],
elevation_angles, azimuth_angles,
)
# Compute losses
lidar_mult = get_lidar_weight_multiplier(it)
losses = loss_fn(
camera_rendered=rendered_image.unsqueeze(0),
camera_target=cam['image'].unsqueeze(0),
lidar_depth_rendered=lidar_result['depth'],
lidar_depth_target=lid['depth'],
lidar_intensity_rendered=lidar_result['intensity'],
lidar_intensity_target=lid['intensity'],
raydrop_pred=lidar_result['raydrop_prob'],
raydrop_target=lid['valid'].float(),
valid_mask=lid['valid'],
)
# Apply lidar weight ramp
total = (
losses['camera'] +
lidar_mult * loss_fn.lambda_depth * losses['depth'] +
lidar_mult * loss_fn.lambda_intensity * losses['intensity'] +
lidar_mult * loss_fn.lambda_raydrop * losses['raydrop']
)
total.backward()
# Gradient clipping for stability
torch.nn.utils.clip_grad_norm_(scene.parameters(), max_norm=1.0)
optimizer.step()
# Log
for key in history:
history[key].append(losses[key].item() if key != 'total' else total.item())
if (it + 1) % 500 == 0:
print(f"[{it+1:>5d}] total={total.item():.4f} "
f"cam={losses['camera'].item():.4f} "
f"depth={losses['depth'].item():.4f} "
f"intensity={losses['intensity'].item():.4f} "
f"raydrop={losses['raydrop'].item():.4f}")
return history
Evaluation metrics:
def evaluate_camera_metrics(
scene: MultiSensorGaussianScene,
test_frames: list[dict],
) -> dict:
"""
Evaluate camera rendering quality on held-out test views.
Returns:
dict with mean PSNR, SSIM, LPIPS across test set.
"""
import lpips as lpips_module
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
lpips_fn = lpips_module.LPIPS(net='vgg')
psnrs, ssims, lpips_vals = [], [], []
for frame in test_frames:
with torch.no_grad():
rendered = scene.render_camera(frame['calib'], frame['ego_pose'])
# Convert to numpy HWC for skimage metrics
rendered_np = rendered.detach().permute(1, 2, 0).clamp(0, 1).numpy()
target_np = frame['image'].permute(1, 2, 0).numpy()
psnr = peak_signal_noise_ratio(target_np, rendered_np, data_range=1.0)
ssim = structural_similarity(target_np, rendered_np, channel_axis=2, data_range=1.0)
# LPIPS
r_tensor = rendered.unsqueeze(0) * 2 - 1 # [-1, 1]
t_tensor = frame['image'].unsqueeze(0) * 2 - 1
with torch.no_grad():
lpips_val = lpips_fn(r_tensor, t_tensor).item()
psnrs.append(psnr)
ssims.append(ssim)
lpips_vals.append(lpips_val)
return {
'psnr_mean': np.mean(psnrs),
'psnr_std': np.std(psnrs),
'ssim_mean': np.mean(ssims),
'ssim_std': np.std(ssims),
'lpips_mean': np.mean(lpips_vals),
'lpips_std': np.std(lpips_vals),
}
def evaluate_lidar_metrics(
scene: MultiSensorGaussianScene,
test_frames: list[dict],
elevation_angles: torch.Tensor,
azimuth_angles: torch.Tensor,
) -> dict:
"""
Evaluate lidar rendering quality on held-out test sweeps.
Returns:
dict with depth MAE, intensity MAE, raydrop F1 score.
"""
depth_errors = []
intensity_errors = []
raydrop_tp, raydrop_fp, raydrop_fn = 0, 0, 0
for frame in test_frames:
with torch.no_grad():
result = scene.render_lidar(
frame['calib'], frame['ego_pose'],
elevation_angles, azimuth_angles,
)
valid = frame['valid']
# Depth MAE (only on valid rays)
if valid.sum() > 0:
depth_mae = torch.abs(
result['depth'][valid] - frame['depth'][valid]
).mean().item()
depth_errors.append(depth_mae)
# Intensity MAE
int_mae = torch.abs(
result['intensity'][valid] - frame['intensity'][valid]
).mean().item()
intensity_errors.append(int_mae)
# Raydrop classification metrics
pred_valid = result['raydrop_prob'] > 0.5
gt_valid = frame['valid']
raydrop_tp += (pred_valid & gt_valid).sum().item()
raydrop_fp += (pred_valid & ~gt_valid).sum().item()
raydrop_fn += (~pred_valid & gt_valid).sum().item()
# F1 score
precision = raydrop_tp / (raydrop_tp + raydrop_fp + 1e-8)
recall = raydrop_tp / (raydrop_tp + raydrop_fn + 1e-8)
f1 = 2 * precision * recall / (precision + recall + 1e-8)
return {
'depth_mae_mean': np.mean(depth_errors),
'depth_mae_std': np.std(depth_errors),
'intensity_mae_mean': np.mean(intensity_errors),
'intensity_mae_std': np.std(intensity_errors),
'raydrop_precision': precision,
'raydrop_recall': recall,
'raydrop_f1': f1,
}
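The range-image MAE above assumes a one-to-one beam correspondence between rendered and ground-truth sweeps. A set-based metric such as the symmetric chamfer distance is a useful complement when poses or beam patterns differ slightly. A brute-force numpy sketch (fine for tutorial-sized clouds; use a KD-tree for real sweeps):

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance in each direction, summed."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a + np.array([0.0, 0.1, 0.0])   # same cloud shifted 0.1 m in y
print(chamfer_distance(a, b))       # 0.2 -- 0.1 m in each direction
```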
Visualization of training progress:
def plot_training_curves(history: dict):
"""Plot loss curves for all components."""
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
iterations = range(len(history['total']))
    # Total loss: raw curve plus a 50-iteration moving average. Note that
    # 'valid'-mode convolution yields fewer points than iterations, so the
    # smoothed curve gets its own (shorter) x axis.
    axes[0, 0].plot(iterations, history['total'], 'b-', alpha=0.3)
    smoothed = np.convolve(history['total'], np.ones(50) / 50, mode='valid')
    axes[0, 0].plot(range(len(smoothed)), smoothed, 'b-', linewidth=2)
axes[0, 0].set_title('Total Loss')
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_yscale('log')
axes[0, 0].grid(True, alpha=0.3)
# Camera loss
axes[0, 1].plot(iterations, history['camera'], 'g-', alpha=0.3)
axes[0, 1].set_title('Camera Loss (L1 + SSIM)')
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].grid(True, alpha=0.3)
# Depth loss
axes[0, 2].plot(iterations, history['depth'], 'r-', alpha=0.3)
axes[0, 2].set_title('Lidar Depth Loss')
axes[0, 2].set_xlabel('Iteration')
axes[0, 2].grid(True, alpha=0.3)
# Intensity loss
axes[1, 0].plot(iterations, history['intensity'], 'm-', alpha=0.3)
axes[1, 0].set_title('Lidar Intensity Loss')
axes[1, 0].set_xlabel('Iteration')
axes[1, 0].grid(True, alpha=0.3)
# Raydrop loss
axes[1, 1].plot(iterations, history['raydrop'], 'c-', alpha=0.3)
axes[1, 1].set_title('Raydrop BCE Loss')
axes[1, 1].set_xlabel('Iteration')
axes[1, 1].grid(True, alpha=0.3)
# Leave last subplot for metrics summary
axes[1, 2].axis('off')
plt.suptitle('Multi-Sensor Training Progress', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Novel view and novel sweep synthesis:
def synthesize_novel_views(
scene: MultiSensorGaussianScene,
camera_calib: 'SensorCalibration',
lidar_calib: 'SensorCalibration',
base_ego_pose: torch.Tensor,
n_views: int = 8,
orbit_radius: float = 5.0,
):
"""
Generate camera images and lidar sweeps from novel viewpoints
by orbiting around the base ego pose.
"""
elevation_angles, azimuth_angles = create_lidar_beam_pattern(n_beams=32, n_azimuth=512)
angles = torch.linspace(0, 2 * np.pi, n_views + 1)[:-1]
fig, axes = plt.subplots(2, n_views, figsize=(4 * n_views, 8))
for i, angle in enumerate(angles):
# Create novel ego pose (orbit around scene center)
offset = torch.tensor([
orbit_radius * torch.cos(angle),
orbit_radius * torch.sin(angle),
0.0,
])
novel_pose = base_ego_pose.clone()
novel_pose[:3, 3] += offset
with torch.no_grad():
# Camera
cam_img = scene.render_camera(camera_calib, novel_pose)
cam_np = cam_img.permute(1, 2, 0).clamp(0, 1).numpy()
# Lidar
lidar_out = scene.render_lidar(
lidar_calib, novel_pose,
elevation_angles, azimuth_angles,
)
axes[0, i].imshow(cam_np)
axes[0, i].set_title(f'Camera {i+1}')
axes[0, i].axis('off')
H = len(elevation_angles)
W = len(azimuth_angles)
depth_img = lidar_out['depth'].reshape(H, W).numpy()
axes[1, i].imshow(depth_img, cmap='turbo', vmin=0, vmax=50)
axes[1, i].set_title(f'Lidar {i+1}')
axes[1, i].axis('off')
plt.suptitle('Novel View Synthesis -- Camera + Lidar', fontsize=14)
plt.tight_layout()
plt.show()
Exercise 5: Train the model for 5000 iterations and plot the loss curves. At which iteration does each loss component converge? Does the lidar loss ramp-up strategy help compared to constant weighting?
Exercise 6: Compare the evaluation metrics when training with camera-only loss vs. joint camera+lidar loss. Does adding lidar supervision improve the camera metrics (PSNR, SSIM)? Does adding camera supervision improve the lidar metrics (depth MAE)?
Exercise 7: Generate 8 novel views at increasing distances from the training viewpoints. How do the metrics degrade as you move further from the training distribution? Plot PSNR and depth MAE as a function of distance from the nearest training view.
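For the x axis of the Exercise 7 plot, one simple choice is the Euclidean distance from the novel sensor position to the nearest training-view position (numpy sketch; the helper name is ours):

```python
import numpy as np

def dist_to_nearest_train(novel_pose: np.ndarray, train_poses: list) -> float:
    """Distance from a novel 4x4 pose's translation to the closest
    training-pose translation."""
    p = novel_pose[:3, 3]
    centers = np.stack([T[:3, 3] for T in train_poses])  # (n_train, 3)
    return float(np.linalg.norm(centers - p, axis=1).min())

train = [np.eye(4)]
novel = np.eye(4)
novel[:3, 3] = [3.0, 4.0, 0.0]
print(dist_to_nearest_train(novel, train))  # 5.0 -- the 3-4-5 triangle
```

Bin novel views by this distance, then plot mean PSNR and depth MAE per bin.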
Notebook Exercises
| # | Notebook | Focus | Time |
|---|---|---|---|
| 1 | 01_lidar_ray_casting.ipynb | Lidar beam pattern; ray-Gaussian intersection math; depth rendering via alpha compositing; intensity from normals/reflectance; raydrop modeling; point cloud visualization. | 90 min |
| 2 | 02_multi_sensor_rendering.ipynb | Unified scene class; camera RGB rendering from SH; lidar rendering (depth, intensity, raydrop); sensor calibration and coordinate transforms; consistency verification via lidar-camera projection. | 90 min |
| 3 | 03_joint_training.ipynb | Multi-sensor loss function; joint training loop with loss ramp; camera metrics (PSNR/SSIM/LPIPS); lidar metrics (depth MAE, intensity MAE, raydrop F1); novel view + novel sweep synthesis. | 90 min |
Expected Deliverables
- Multi-sensor Gaussian scene module: A well-structured PyTorch nn.Module containing the shared Gaussian representation with both camera (SH coefficients) and lidar (reflectance, normals) attributes, plus the raydrop predictor network.
- Lidar rendering pipeline: A differentiable lidar renderer that casts rays through Gaussian scenes and produces depth, intensity, and accumulated alpha outputs with proper gradients for training.
- Joint training script: A training loop that optimizes the shared scene from both camera images and lidar sweeps simultaneously, with configurable loss weights and a lidar loss ramp-up schedule.
- Evaluation pipeline: Functions to compute camera metrics (PSNR, SSIM, LPIPS) and lidar metrics (depth MAE, intensity MAE, raydrop precision/recall/F1) on held-out test data.
- Novel sensor synthesis demonstration: Rendered camera images and lidar sweeps from viewpoints not in the training set, with qualitative and quantitative analysis of extrapolation quality.
- Completed Jupyter notebooks (3 notebooks): All code cells executed, exercises completed with inline commentary, and visualizations showing the progression from individual components to the full joint system.
Evaluation Criteria
| Criteria | Weight | Description |
|---|---|---|
| Correctness | 30% | Ray-Gaussian intersection math is correct. Alpha compositing produces valid depths. Intensity model accounts for incidence angle and reflectance. Coordinate transforms are invertible and consistent. |
| Code Quality | 20% | Clean, modular PyTorch code. Functions have docstrings and type hints. Scene module follows nn.Module conventions. No code duplication between camera and lidar paths. |
| Analysis Depth | 25% | Written analysis demonstrates understanding of the camera-lidar complementarity. Includes comparison of joint vs. single-modality training. Discusses failure modes (e.g., thin structures, glass, vegetation). |
| Visualization | 15% | Range images and point clouds are correctly rendered. Camera-lidar projection overlay demonstrates calibration consistency. Training curves are smooth and show expected convergence patterns. |
| Documentation | 10% | README with clear installation and usage instructions. Notebooks include markdown cells explaining each step. Exercise solutions include written interpretation of results. |
Related Deep Dives
- Neural Rendering for AD Simulation: Covers the foundational neural rendering techniques (NeRF, 3DGS) and their extensions for multi-sensor AD simulation. The "Multi-Sensor Extensions" section directly informs this project's architecture.
- Sensor Simulation for AD: Deep dive into lidar physics, intensity modeling, raydrop patterns, and the role of neural methods in achieving sensor-realistic simulation. Essential reading for understanding why each component of the lidar pipeline matters.
Next Steps
After completing this project, consider these natural follow-ups:
- Dynamic Scene Decomposition (Track A, Advanced): Extend the multi-sensor renderer to handle dynamic objects -- separate Gaussians for static background and moving actors, with per-actor rigid-body transforms over time.
- Physics-Based Sensor Simulator (Track B, Advanced): Move from learned sensor models to physics-based ones -- implement a full lidar sensor simulator with beam divergence, atmospheric scattering, and material BRDF models.
- 3DGS Scene Reconstruction (Track A, Intermediate): If you want to deepen your understanding of the underlying 3DGS training pipeline, this project focuses on the camera-only case with full densification and pruning strategies.