Track A: Neural Simulation · Intermediate · 20-30 hours

PyTorch · 3D Vision · Differentiable Rendering

3DGS Scene Reconstruction

Reconstruct driving scenes from multi-view images using 3D Gaussian Splatting -- learn the representation, build the renderer, and train a full reconstruction pipeline with quantitative evaluation.

Overview

3D Gaussian Splatting (3DGS) has rapidly become the dominant approach for real-time, photorealistic scene reconstruction. Unlike Neural Radiance Fields (NeRFs) that require expensive per-ray sampling through a volumetric MLP, 3DGS represents scenes as collections of explicit 3D Gaussian primitives that can be rasterized in real time. Each Gaussian carries its own position, shape (via an anisotropic covariance matrix), opacity, and view-dependent color (via spherical harmonics). This explicit representation enables rendering speeds of 100+ FPS at high resolution -- a critical requirement for autonomous driving simulation where you need to generate sensor data in real time for hardware-in-the-loop testing.

For autonomous driving, 3DGS offers a compelling path to sensor-realistic simulation. Companies like Waymo (with SplatAD), NVIDIA, and Applied Intuition are actively exploring Gaussian-based scene representations because they can faithfully capture complex real-world geometry -- road surfaces, vehicles, vegetation, lane markings -- while rendering fast enough to close the simulation loop. The ability to render novel viewpoints (what would the camera see if the ego vehicle were 2 meters to the left?) is what makes 3DGS invaluable for safety-critical scenario generation and testing.

In this project, you will build a 3DGS pipeline from the ground up. Starting from the mathematical foundations of 3D Gaussians and their projection onto 2D image planes, you will implement a differentiable splatting renderer, and then train a complete reconstruction system on multi-view driving scene data. Along the way, you will implement adaptive density control (the mechanism that grows and prunes Gaussians during training), evaluate reconstruction quality with standard metrics (PSNR, SSIM, LPIPS), and render novel views of reconstructed scenes. By the end, you will have deep intuition for how 3DGS works, why it produces such high-quality results, and where its limitations lie for AD applications.

Learning Objectives

  • Understand the 3D Gaussian representation -- position, covariance (decomposed into scale and rotation), opacity, and spherical harmonics color, and why each component matters.
  • Implement the projection of 3D Gaussians to 2D splats using the Jacobian of the perspective projection and the viewing transformation.
  • Build a differentiable tile-based rasterizer that sorts Gaussians by depth, alpha-blends them front-to-back, and produces a rendered image.
  • Implement the full training loop including photometric loss (L1 + SSIM), learning rate scheduling, and adaptive density control (splitting, cloning, and pruning of Gaussians).
  • Initialize a scene from a sparse point cloud (from Structure-from-Motion) and understand why initialization quality matters.
  • Evaluate reconstruction quality using PSNR, SSIM, and LPIPS, and interpret what each metric reveals about different failure modes.
  • Render novel views by interpolating camera poses and generating images from viewpoints not seen during training.
  • Analyze failure modes specific to driving scenes: moving objects, reflective surfaces, thin structures, and sky regions.

Prerequisites

  • Required: Python 3.9+, PyTorch basics (tensors, autograd, optimizers, GPU usage), linear algebra fundamentals (matrix multiplication, eigendecomposition, rotation matrices).
  • Recommended: Basic 3D geometry (homogeneous coordinates, camera intrinsics/extrinsics, perspective projection). Familiarity with differentiable rendering concepts is helpful but not required -- we build everything from scratch.
  • Deep Dive Reading: Neural Rendering for AD Simulation -- read the sections on 3D Gaussian Splatting and the comparison with NeRF to understand the broader context of why this representation matters for autonomous driving.

Key Concepts

3D Gaussians and Their Properties

A 3D Gaussian is a probability distribution in three-dimensional space, parameterized by a mean (center position) and a covariance matrix (shape). In 3DGS, each Gaussian primitive represents a small "blob" of scene content -- a patch of road surface, part of a vehicle body, a leaf on a tree.

Position (mean): A 3D vector mu = [x, y, z] specifying where the Gaussian is centered in world coordinates.

Covariance matrix: A 3x3 positive semi-definite matrix Sigma that defines the shape and orientation of the Gaussian:

G(x) = exp(-0.5 * (x - mu)^T * Sigma^{-1} * (x - mu))

The covariance matrix encodes three geometric properties simultaneously:

  • Size (how large the Gaussian is along each axis)
  • Orientation (how the Gaussian is rotated in 3D)
  • Anisotropy (whether the Gaussian is elongated or spherical)

Covariance decomposition: Directly optimizing a 3x3 covariance matrix is problematic because we must ensure it remains positive semi-definite (otherwise the Gaussian is invalid). The key insight from the 3DGS paper is to decompose it as:

Sigma = R * S * S^T * R^T

where R is a 3x3 rotation matrix (parameterized by a unit quaternion q = [w, x, y, z]) and S = diag(s_x, s_y, s_z) is a diagonal scaling matrix. This decomposition guarantees positive semi-definiteness by construction, and we can freely optimize the quaternion and scale parameters with gradient descent.

import torch

def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
    """Convert unit quaternion [w, x, y, z] to 3x3 rotation matrix."""
    w, x, y, z = q[..., 0], q[..., 1], q[..., 2], q[..., 3]

    R = torch.stack([
        1 - 2*(y*y + z*z),   2*(x*y - w*z),       2*(x*z + w*y),
        2*(x*y + w*z),       1 - 2*(x*x + z*z),   2*(y*z - w*x),
        2*(x*z - w*y),       2*(y*z + w*x),       1 - 2*(x*x + y*y),
    ], dim=-1).reshape(*q.shape[:-1], 3, 3)

    return R


def build_covariance_3d(scales: torch.Tensor, quats: torch.Tensor) -> torch.Tensor:
    """Build 3D covariance matrix from scale and quaternion parameters.

    Args:
        scales: (N, 3) log-scale parameters (we exponentiate for positivity).
        quats: (N, 4) quaternion parameters [w, x, y, z].

    Returns:
        (N, 3, 3) covariance matrices.
    """
    # Ensure unit quaternions
    quats = quats / quats.norm(dim=-1, keepdim=True)
    R = quaternion_to_rotation_matrix(quats)        # (N, 3, 3)
    S = torch.diag_embed(torch.exp(scales))         # (N, 3, 3)
    # Sigma = R @ S @ S^T @ R^T
    M = R @ S                                        # (N, 3, 3)
    Sigma = M @ M.transpose(-1, -2)                  # (N, 3, 3)
    return Sigma

Opacity: A scalar alpha in [0, 1] controlling how opaque the Gaussian is. Stored as a logit (inverse sigmoid) for unconstrained optimization: alpha = sigmoid(alpha_raw).

Color via Spherical Harmonics (SH): Rather than storing a single RGB color, each Gaussian stores spherical harmonic coefficients that encode view-dependent appearance. For a given viewing direction d, the color is:

c(d) = sum_{l=0}^{L} sum_{m=-l}^{l} c_{l,m} * Y_{l,m}(d)

where Y_{l,m} are the real spherical harmonic basis functions and c_{l,m} are the per-Gaussian, per-color-channel coefficients. Using degree L=0 gives view-independent (constant) color (1 coefficient per channel = 3 total). Degree L=3 gives 16 coefficients per channel (48 total) and can capture specular highlights and view-dependent effects.
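
To make the SH evaluation concrete, here is a minimal sketch for degrees 0 and 1. The constants are the standard real SH basis values; the sign ordering and the +0.5 shift follow the convention common in 3DGS implementations and match the initialization code later in this project:

import torch

# Real SH basis constants for degree 0 and degree 1.
SH_C0 = 0.28209479177387814          # Y_0^0
SH_C1 = 0.48860251190291987          # magnitude of the degree-1 prefactor

def evaluate_sh_color(sh_coeffs: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Evaluate view-dependent color from SH coefficients (degrees 0-1 only).

    Args:
        sh_coeffs: (N, K, 3) coefficients, K = 1 (degree 0) or 4 (degree 1).
        directions: (N, 3) unit viewing directions from camera to Gaussian.

    Returns:
        (N, 3) RGB colors, clamped to be non-negative.
    """
    color = SH_C0 * sh_coeffs[:, 0]                       # DC (view-independent) term
    if sh_coeffs.shape[1] >= 4:                           # degree-1 (linear) terms
        x, y, z = directions[:, 0:1], directions[:, 1:2], directions[:, 2:3]
        color = color \
            - SH_C1 * y * sh_coeffs[:, 1] \
            + SH_C1 * z * sh_coeffs[:, 2] \
            - SH_C1 * x * sh_coeffs[:, 3]
    # Shift by 0.5 (a common 3DGS convention) and clamp to a valid range.
    return (color + 0.5).clamp(min=0.0)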

When it works well: 3D Gaussians excel at representing smooth, continuous surfaces (roads, building facades) and complex fine detail (vegetation, fences) because the adaptive density control can allocate more Gaussians where detail is needed. The explicit representation also makes it easy to edit scenes (remove objects, add new elements).

When it fails: Gaussians struggle with perfectly flat, textureless surfaces (where many redundant Gaussians accumulate), extremely thin structures (wires, poles -- require very elongated Gaussians that can cause rendering artifacts), and transparent or semi-transparent materials (glass, fog) where the single-opacity-per-Gaussian model is too simplistic.

Splatting: Projecting 3D Gaussians to 2D

The core rendering operation in 3DGS is "splatting" -- projecting each 3D Gaussian onto the 2D image plane. A remarkable mathematical property makes this efficient: the projection of a 3D Gaussian through a perspective camera is approximately a 2D Gaussian on the image plane.

The projection process:

  1. Transform to camera space: Apply the world-to-camera transformation W (a 4x4 matrix combining rotation and translation):

    mu_cam = W @ [mu; 1]     (homogeneous coordinates)
    
  2. Project to image space: Apply the camera intrinsic matrix K:

    mu_2d = K @ mu_cam / mu_cam.z    (perspective division)
    
  3. Project the covariance: This is the key step. The 3D covariance Sigma_3D is projected to a 2D covariance Sigma_2D using the Jacobian of the projection:

    J = d(project) / d(x_cam)     (2x3 Jacobian of perspective projection)
    Sigma_2D = J @ W[:3,:3] @ Sigma_3D @ W[:3,:3]^T @ J^T
    

The Jacobian of the perspective projection at a point [x, y, z] in camera coordinates is:

J = | f_x/z    0      -f_x*x/z^2 |
    |  0      f_y/z   -f_y*y/z^2 |

where f_x, f_y are the focal lengths.

def compute_projection_jacobian(
    means_cam: torch.Tensor,    # (N, 3) points in camera space
    fx: float, fy: float,       # focal lengths
) -> torch.Tensor:
    """Compute the Jacobian of perspective projection.

    Returns:
        (N, 2, 3) Jacobian matrices.
    """
    x, y, z = means_cam[:, 0], means_cam[:, 1], means_cam[:, 2]
    z2 = z * z

    J = torch.zeros(means_cam.shape[0], 2, 3, device=means_cam.device)
    J[:, 0, 0] = fx / z
    J[:, 0, 2] = -fx * x / z2
    J[:, 1, 1] = fy / z
    J[:, 1, 2] = -fy * y / z2

    return J


def project_covariance_to_2d(
    cov3d: torch.Tensor,        # (N, 3, 3)
    means_cam: torch.Tensor,    # (N, 3)
    W: torch.Tensor,            # (4, 4) world-to-camera
    fx: float, fy: float,
) -> torch.Tensor:
    """Project 3D covariance to 2D screen-space covariance.

    Returns:
        (N, 2, 2) 2D covariance matrices.
    """
    R_w2c = W[:3, :3]  # (3, 3)
    J = compute_projection_jacobian(means_cam, fx, fy)  # (N, 2, 3)

    # Transform covariance to camera space first
    cov_cam = R_w2c @ cov3d @ R_w2c.T   # (N, 3, 3) broadcast over N

    # Project to 2D
    cov2d = J @ cov_cam @ J.transpose(-1, -2)  # (N, 2, 2)

    # Add a small value to diagonal for numerical stability
    cov2d[:, 0, 0] += 0.3
    cov2d[:, 1, 1] += 0.3

    return cov2d

The resulting 2D Gaussian is characterized by its center mu_2d and 2x2 covariance Sigma_2D. The covariance can be eigendecomposed to find the semi-axes of the ellipse that the Gaussian forms on screen.
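
For completeness, here is a minimal sketch of projecting the Gaussian centers themselves (steps 1 and 2 above), using the same world-to-camera and intrinsics conventions as the covariance code:

def project_means_to_2d(
    means_world: torch.Tensor,   # (N, 3) Gaussian centers in world space
    W: torch.Tensor,             # (4, 4) world-to-camera transform
    K: torch.Tensor,             # (3, 3) camera intrinsics
):
    """Project 3D Gaussian centers to pixel coordinates.

    Returns:
        means_2d: (N, 2) pixel coordinates.
        means_cam: (N, 3) camera-space points (z is the depth used for sorting).
    """
    N = means_world.shape[0]
    ones = torch.ones(N, 1, device=means_world.device)
    homo = torch.cat([means_world, ones], dim=-1)        # (N, 4) homogeneous coordinates
    means_cam = (W @ homo.T).T[:, :3]                    # (N, 3) camera space

    proj = (K @ means_cam.T).T                           # (N, 3)
    means_2d = proj[:, :2] / proj[:, 2:3]                # perspective division
    return means_2d, means_cam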

When it works well: The local linear approximation (using the Jacobian) is accurate when the Gaussian is small relative to its distance from the camera. This holds for the vast majority of Gaussians in typical driving scenes.

When it fails: Very large Gaussians close to the camera (e.g., the road surface directly under the ego vehicle) may have significant nonlinear distortion that the first-order approximation misses. In practice, this is handled by splitting such Gaussians during training.

Differentiable Rendering and Optimization

The rendering equation for 3DGS uses alpha compositing with depth-sorted Gaussians. For each pixel p, the rendered color is:

C(p) = sum_{i=1}^{N} c_i * alpha_i * G_i(p) * T_i

where:

  • c_i is the color of Gaussian i (from SH evaluation)
  • alpha_i is the opacity of Gaussian i
  • G_i(p) is the 2D Gaussian evaluated at pixel p
  • T_i = prod_{j=1}^{i-1} (1 - alpha_j * G_j(p)) is the transmittance (how much light passes through all Gaussians in front of i)

This is rendered front-to-back: Gaussians closer to the camera contribute first, and the transmittance T_i attenuates contributions from Gaussians behind them.

def alpha_composite_sorted(
    pixel: torch.Tensor,         # (2,) pixel coordinate
    means_2d: torch.Tensor,      # (N, 2) projected centers
    covs_2d_inv: torch.Tensor,   # (N, 2, 2) inverse 2D covariances
    colors: torch.Tensor,        # (N, 3) RGB colors
    opacities: torch.Tensor,     # (N,) opacity values in [0, 1]
    sorted_indices: torch.Tensor,  # (N,) indices sorted front-to-back
) -> torch.Tensor:
    """Alpha-composite sorted Gaussians at a single pixel.

    Returns:
        (3,) RGB color for this pixel.
    """
    C = torch.zeros(3, device=pixel.device)
    T = 1.0  # accumulated transmittance

    for idx in sorted_indices:
        diff = pixel - means_2d[idx]                       # (2,)
        power = -0.5 * diff @ covs_2d_inv[idx] @ diff      # scalar
        if power < -4.0:
            continue

        G = torch.exp(power)
        alpha = opacities[idx] * G

        if alpha < 1.0 / 255.0:
            continue

        C = C + T * alpha * colors[idx]
        T = T * (1.0 - alpha)

        if T < 1e-4:
            break

    return C

The loss function combines L1 (photometric) loss with a structural similarity term:

L = (1 - lambda) * L1(I_rendered, I_gt) + lambda * (1 - SSIM(I_rendered, I_gt))

where lambda = 0.2 is the default weighting. The L1 loss encourages pixel-accurate reconstruction, while the SSIM loss encourages structural fidelity (edges, textures).

Because every operation in the rendering pipeline (projection, Gaussian evaluation, alpha compositing) is differentiable, PyTorch's autograd can compute gradients of the loss with respect to all Gaussian parameters. This enables end-to-end optimization: we adjust positions, shapes, colors, and opacities to minimize the reconstruction error.
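
As a quick sanity check of that claim, the following sketch (reusing build_covariance_3d and project_covariance_to_2d from above, with arbitrary values) backpropagates through the covariance projection and confirms that gradients reach the raw scale and rotation parameters:

# One anisotropic Gaussian, 5 units in front of an identity camera (values are arbitrary).
log_scales = torch.tensor([[0.0, -1.0, -2.0]], requires_grad=True)
quats = torch.tensor([[1.0, 0.0, 0.0, 0.0]], requires_grad=True)
means_cam = torch.tensor([[0.0, 0.0, 5.0]])
W = torch.eye(4)                               # identity world-to-camera transform

cov3d = build_covariance_3d(log_scales, quats)
cov2d = project_covariance_to_2d(cov3d, means_cam, W, fx=100.0, fy=100.0)

cov2d.sum().backward()
print(log_scales.grad)                         # populated: gradients flow to the scales
print(quats.grad)                              # populated: gradients flow to the rotation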

When it works well: The combination of L1 + SSIM produces sharp, detailed reconstructions. The SSIM term is particularly important for preventing the "blurry average" failure mode that pure L1/L2 losses can produce.

When it fails: The loss operates only on observed viewpoints. For driving scenes with limited viewpoint diversity (cameras pointing forward), the reconstruction may be poor for side-facing or rear-facing views. This is the view extrapolation problem that techniques like regularization and depth supervision help address.

Point Cloud Initialization from Structure-from-Motion

3DGS does not start from nothing -- it requires an initial set of 3D points to seed the Gaussian positions. The standard approach uses Structure-from-Motion (SfM), which simultaneously estimates camera poses and a sparse 3D point cloud from a set of images.

The SfM pipeline (typically COLMAP):

  1. Feature detection: Extract keypoints (e.g., SIFT, SuperPoint) from each image.
  2. Feature matching: Find correspondences between image pairs.
  3. Triangulation: Compute 3D positions of matched features from multiple views.
  4. Bundle adjustment: Jointly optimize camera poses and 3D points to minimize reprojection error.
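
If you are working from your own image sequences, the sparse point cloud and camera poses are typically produced with COLMAP's command-line tools. A rough sketch of the standard invocation (paths are placeholders; consult the COLMAP documentation for the full option set):

colmap feature_extractor --database_path ./colmap/database.db --image_path ./images
colmap exhaustive_matcher --database_path ./colmap/database.db
mkdir -p ./colmap/sparse
colmap mapper --database_path ./colmap/database.db --image_path ./images --output_path ./colmap/sparse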

The resulting sparse point cloud (typically thousands to tens of thousands of points) provides the initial positions for Gaussians. Each initial Gaussian is assigned:

  • Position: The 3D point location from SfM.
  • Scale: Set based on the average distance to the nearest neighbors.
  • Rotation: Identity quaternion [1, 0, 0, 0].
  • Opacity: A moderate initial value (e.g., sigmoid inverse of 0.1).
  • Color: The SH degree-0 coefficient is set to the average observed color of the point.

def initialize_gaussians_from_pointcloud(
    points: torch.Tensor,       # (N, 3) 3D positions
    colors: torch.Tensor,       # (N, 3) RGB colors in [0, 1]
    k_nearest: int = 3,
) -> dict:
    """Initialize Gaussian parameters from an SfM point cloud.

    Returns:
        Dict of parameter tensors ready for optimization.
    """
    N = points.shape[0]

    # Compute initial scale from nearest-neighbor distances
    from scipy.spatial import KDTree
    tree = KDTree(points.detach().cpu().numpy())
    dists, _ = tree.query(points.detach().cpu().numpy(), k=k_nearest + 1)
    avg_dist = torch.tensor(dists[:, 1:].mean(axis=1), dtype=torch.float32)
    log_scales = torch.log(avg_dist).unsqueeze(-1).repeat(1, 3)  # isotropic initial

    # Identity rotation
    quats = torch.zeros(N, 4)
    quats[:, 0] = 1.0   # w = 1, x = y = z = 0

    # Opacity (logit space, initialized to sigmoid^{-1}(0.1))
    opacity_logits = torch.logit(torch.full((N,), 0.1))

    # SH coefficients: degree 0 only for initialization.
    # The degree-0 basis function is Y_0^0 = 0.2821..., and with the common
    # convention rendered_color = Y_0^0 * c_0 + 0.5, the DC coefficient is
    # c_0 = (color - 0.5) / Y_0^0.
    C0 = 0.28209479177387814
    sh_dc = (colors - 0.5) / C0   # (N, 3)

    return {
        'means': points.clone().requires_grad_(True),
        'log_scales': log_scales.requires_grad_(True),
        'quats': quats.requires_grad_(True),
        'opacity_logits': opacity_logits.requires_grad_(True),
        'sh_coeffs': sh_dc.unsqueeze(1).requires_grad_(True),  # (N, 1, 3)
    }

Why initialization matters: Good initialization dramatically affects convergence speed and final quality. With a dense, accurate SfM point cloud, the Gaussians start close to the true surface and primarily need to refine their shapes and colors. With a sparse or noisy point cloud, the adaptive density control must work much harder to fill in gaps, and the optimization may get stuck in local minima.

For driving scenes: SfM can struggle with driving data because the camera motion is predominantly forward-facing (low parallax for distant objects) and the scene contains many dynamic objects (other vehicles, pedestrians). Techniques like using LiDAR point clouds for initialization, or multi-sensor SfM, significantly improve initialization quality.

Adaptive Density Control: Split, Clone, and Prune

The initial SfM point cloud is sparse -- it might have 50,000 points for a scene that ultimately needs 500,000+ Gaussians. Adaptive density control is the mechanism that grows (and shrinks) the set of Gaussians during training to match the scene complexity.

The three operations:

Splitting: When a Gaussian is too large (covers too much of the scene) and has high gradient magnitude (indicating the optimizer is struggling to fit that region), it is replaced by two smaller Gaussians. Each child has the same center but reduced scale (divided by a factor, typically 1.6), and they are slightly offset along the major axis.

def split_gaussians(
    params: dict,
    grad_threshold: float = 0.0002,
    scale_threshold: float = 0.01,
    scene_extent: float = 1.0,
) -> dict:
    """Split large Gaussians with high positional gradients.

    Large Gaussians that the optimizer is struggling with are replaced
    by two smaller Gaussians.
    """
    grads = params['means'].grad.norm(dim=-1)
    scales = torch.exp(params['log_scales']).max(dim=-1).values

    mask = (grads > grad_threshold) & (scales > scale_threshold * scene_extent)

    if mask.sum() == 0:
        return params

    selected_means = params['means'][mask]
    selected_scales = params['log_scales'][mask]

    # Reduce scale by factor of 1.6
    new_log_scales = selected_scales - torch.log(torch.tensor(1.6))

    # Offset children along random direction scaled by current size
    stds = torch.exp(selected_scales)
    offsets = stds * torch.randn_like(stds) * 0.5

    child1_means = selected_means + offsets
    child2_means = selected_means - offsets

    # ... (concatenate new Gaussians, remove parents, rebuild optimizer)
    return params

Cloning: When a Gaussian is small but has high gradient magnitude (indicating under-reconstruction -- the region needs more coverage), it is duplicated. The clone has the same parameters but is offset slightly in the direction of the gradient.

def clone_gaussians(
    params: dict,
    grad_threshold: float = 0.0002,
    scale_threshold: float = 0.01,
    scene_extent: float = 1.0,
) -> dict:
    """Clone small Gaussians with high positional gradients.

    Small Gaussians in under-reconstructed regions are duplicated.
    """
    grads = params['means'].grad.norm(dim=-1)
    scales = torch.exp(params['log_scales']).max(dim=-1).values

    mask = (grads > grad_threshold) & (scales <= scale_threshold * scene_extent)

    if mask.sum() == 0:
        return params

    clone_means = params['means'][mask].clone()
    grad_dir = params['means'].grad[mask]
    grad_dir = grad_dir / (grad_dir.norm(dim=-1, keepdim=True) + 1e-8)
    offset_scale = torch.exp(params['log_scales'][mask]).mean(dim=-1, keepdim=True)
    clone_means = clone_means + grad_dir * offset_scale * 0.1

    # ... (concatenate clones, rebuild optimizer)
    return params

Pruning: Gaussians that become nearly transparent (opacity below a threshold, e.g., 0.005) or excessively large are removed. Pruning prevents the scene from accumulating "ghost" Gaussians that consume memory without contributing to the rendered image.

def prune_gaussians(params: dict, opacity_threshold: float = 0.005) -> dict:
    """Remove Gaussians that have become nearly transparent."""
    opacities = torch.sigmoid(params['opacity_logits'])
    keep_mask = opacities > opacity_threshold

    for key in params:
        params[key] = params[key][keep_mask].detach().requires_grad_(True)

    return params

The density control operations are typically applied every 100-500 training iterations, with pruning happening more frequently than splitting/cloning. After 15,000-30,000 iterations, the density control is usually disabled and the optimizer fine-tunes the existing Gaussians.
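
A sketch of how this schedule can be wired into the training loop, using the split/clone/prune functions above (the specific intervals are assumptions consistent with the ranges just quoted):

DENSIFY_START, DENSIFY_END = 500, 15_000   # iterations during which density control runs
DENSIFY_INTERVAL = 300                      # split/clone every 300 iterations
PRUNE_INTERVAL = 100                        # prune more frequently than split/clone

def maybe_densify(params: dict, iteration: int) -> dict:
    """Apply split/clone/prune on the schedule described above (sketch)."""
    if not (DENSIFY_START <= iteration < DENSIFY_END):
        return params
    if iteration % PRUNE_INTERVAL == 0:
        params = prune_gaussians(params)
    if iteration % DENSIFY_INTERVAL == 0:
        params = clone_gaussians(params)
        params = split_gaussians(params)
    return params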

When it works well: Adaptive density control elegantly handles scenes with varying complexity -- allocating many small Gaussians for detailed areas (a car's license plate, tree leaves) and few large Gaussians for simple areas (the sky, flat road surfaces).

When it fails: The gradient-based heuristics can be noisy, leading to over-splitting in some regions (wasting memory) or under-cloning in others (leaving holes). Some implementations add explicit regularization to encourage uniform density or penalize Gaussians that are too elongated.

Step-by-Step Implementation Guide

Step 1: Environment Setup (30 min)

Create and activate a virtual environment:

mkdir -p 3dgs-scene-reconstruction && cd 3dgs-scene-reconstruction
python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install torch torchvision              # PyTorch (GPU recommended but CPU works)
pip install numpy matplotlib               # Numerics and visualization
pip install scipy                          # Spatial data structures (KDTree)
pip install plyfile                        # Reading PLY point cloud files
pip install opencv-python                  # Image I/O and camera utilities
pip install scikit-image                   # PSNR/SSIM reference implementations
pip install lpips                          # Learned perceptual metric
pip install open3d                         # 3D point cloud visualization
pip install tqdm                           # Progress bars
pip install ipykernel jupyter              # Notebook support

Or install from requirements.txt:

pip install -r requirements.txt

Register the Jupyter kernel:

python -m ipykernel install --user --name 3dgs-reconstruction --display-name "3DGS Reconstruction"

Verify the installation:

# verify_setup.py
import torch
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import KDTree
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
import lpips

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Quick tensor test
q = torch.tensor([1.0, 0.0, 0.0, 0.0])  # identity quaternion
print(f"Quaternion norm: {q.norm().item():.4f} (should be 1.0)")

# Quick LPIPS test
loss_fn = lpips.LPIPS(net='vgg')
img1 = torch.randn(1, 3, 64, 64)
img2 = torch.randn(1, 3, 64, 64)
d = loss_fn(img1, img2)
print(f"LPIPS distance (random images): {d.item():.4f}")

print("\nSetup verified successfully!")

Run it:

python verify_setup.py

You should see the PyTorch version, GPU information (if available), and a successful LPIPS computation. If CUDA is not available, all code in this project will still work on CPU -- just slower for training.

Prepare sample data:

For the notebooks, we create synthetic scenes procedurally (no external data needed for Notebooks 1 and 2). For Notebook 3, you have several options:

  1. Synthetic multi-view data (easiest, used in the notebook): We generate a simple 3D scene and render ground-truth images from multiple viewpoints using our own renderer.
  2. KITTI or nuScenes: Real driving data with calibrated multi-camera setups. Requires downloading a subset.
  3. Tanks and Temples / Mip-NeRF 360: Standard benchmarks used in 3DGS papers, with COLMAP-processed camera poses.

Step 2: Understanding 3D Gaussians (Notebook 01, ~60 min)

Goal: Build deep intuition for the 3D Gaussian representation by implementing the data structure, exploring how parameters affect appearance, and visualizing projections from 3D to 2D.

The Gaussian data structure:

We represent a collection of N Gaussians as a set of parameter tensors. Each Gaussian has 59 parameters at SH degree 3 (3 position + 3 scale + 4 quaternion + 1 opacity + 48 SH coefficients), though we start with degree 0 (14 parameters total, since the degree-0 color contributes 3 coefficients):

import torch

class GaussianModel:
    """A collection of 3D Gaussians representing a scene."""

    def __init__(self, n_gaussians: int, sh_degree: int = 0):
        self.n_sh_coeffs = (sh_degree + 1) ** 2

        # Learnable parameters
        self.means = torch.randn(n_gaussians, 3) * 0.5
        self.log_scales = torch.zeros(n_gaussians, 3) - 2.0
        self.quats = torch.zeros(n_gaussians, 4)
        self.quats[:, 0] = 1.0  # identity rotation
        self.opacity_logits = torch.zeros(n_gaussians)
        self.sh_coeffs = torch.zeros(n_gaussians, self.n_sh_coeffs, 3)

    @property
    def scales(self) -> torch.Tensor:
        return torch.exp(self.log_scales)

    @property
    def opacities(self) -> torch.Tensor:
        return torch.sigmoid(self.opacity_logits)

    @property
    def rotations(self) -> torch.Tensor:
        return self.quats / self.quats.norm(dim=-1, keepdim=True)

    def get_covariance_3d(self) -> torch.Tensor:
        R = quaternion_to_rotation_matrix(self.rotations)
        S = torch.diag_embed(self.scales)
        M = R @ S
        return M @ M.transpose(-1, -2)

Key mathematical insight -- why the decomposition works:

The covariance matrix Sigma = R S S^T R^T is guaranteed positive semi-definite because for any vector v:

v^T Sigma v = v^T R S S^T R^T v = ||S^T R^T v||^2 >= 0

This means we never need to worry about invalid covariance matrices during optimization -- the parameterization makes it impossible.

Visualizing 3D Gaussians: Use the eigendecomposition of the covariance matrix to extract the principal axes and draw ellipsoids. The eigenvalues give the squared semi-axis lengths, and the eigenvectors give the orientation.

def covariance_to_ellipsoid_params(cov: torch.Tensor, n_std: float = 2.0):
    """Extract ellipsoid axes and rotation from a 3x3 covariance matrix."""
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)
    radii = n_std * torch.sqrt(torch.clamp(eigenvalues, min=1e-8))
    return radii, eigenvectors

Exercise: Create a set of 5 Gaussians arranged to form the letter "T" when viewed from the front. Adjust scales to make some elongated (for the horizontal bar) and some compact (for the vertical stroke). Render the 2D projections from front, side, and top viewpoints and verify the geometry.

Step 3: Implementing the Splatting Pipeline (Notebook 02, ~90 min)

Goal: Build a complete differentiable renderer that takes 3D Gaussians and camera parameters and produces an image through alpha-composited splatting.

Camera model:

class Camera:
    """Pinhole camera model with extrinsic pose."""

    def __init__(self, fx, fy, cx, cy, width, height, R, t):
        self.fx, self.fy = fx, fy
        self.cx, self.cy = cx, cy
        self.width, self.height = width, height

        self.W = torch.eye(4)
        self.W[:3, :3] = R
        self.W[:3, 3] = t

        self.K = torch.tensor([
            [fx, 0, cx],
            [0, fy, cy],
            [0,  0,  1],
        ], dtype=torch.float32)

Tile-based rasterization: The original 3DGS uses a CUDA-accelerated tile-based rasterizer. We implement a simplified Python version that captures the same algorithm:

  1. Divide the image into tiles (e.g., 16x16 pixels).
  2. For each tile, find Gaussians whose 2D footprint overlaps it.
  3. Sort overlapping Gaussians by depth (front to back).
  4. Alpha-blend each pixel in the tile.

This pure-Python rasterizer is designed for learning with small scenes (64x64 images, <200 Gaussians). For practical training, you would use a CUDA implementation.
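
A minimal pure-Python version of that tile loop, reusing alpha_composite_sorted from the previous section (a learning sketch only; it assumes the per-camera projections have already been computed):

def rasterize_tiles(
    means_2d: torch.Tensor,      # (N, 2) projected centers (pixels)
    covs_2d: torch.Tensor,       # (N, 2, 2) 2D screen-space covariances
    colors: torch.Tensor,        # (N, 3) RGB colors
    opacities: torch.Tensor,     # (N,) opacities in [0, 1]
    depths: torch.Tensor,        # (N,) camera-space z values (e.g., means_cam[:, 2])
    height: int, width: int, tile_size: int = 16,
) -> torch.Tensor:
    """Simplified tile-based rasterizer (pure Python, small CPU scenes only)."""
    image = torch.zeros(height, width, 3)
    covs_2d_inv = torch.linalg.inv(covs_2d)

    # Conservative screen-space radius: 3 standard deviations along the larger axis.
    radii = 3.0 * torch.sqrt(torch.linalg.eigvalsh(covs_2d).max(dim=-1).values)

    for ty in range(0, height, tile_size):
        for tx in range(0, width, tile_size):
            # Step 2: keep only Gaussians whose footprint overlaps this tile.
            overlaps = (
                (means_2d[:, 0] + radii > tx) & (means_2d[:, 0] - radii < tx + tile_size) &
                (means_2d[:, 1] + radii > ty) & (means_2d[:, 1] - radii < ty + tile_size)
            )
            idx = torch.nonzero(overlaps).squeeze(-1)
            if idx.numel() == 0:
                continue
            # Step 3: sort the survivors front-to-back by depth.
            order = idx[torch.argsort(depths[idx])]
            # Step 4: alpha-blend every pixel in the tile.
            for py in range(ty, min(ty + tile_size, height)):
                for px in range(tx, min(tx + tile_size, width)):
                    pixel = torch.tensor([float(px), float(py)])
                    image[py, px] = alpha_composite_sorted(
                        pixel, means_2d, covs_2d_inv, colors, opacities, order)
    return image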

Loss computation:

def compute_loss(rendered, target, lambda_ssim=0.2):
    """Compute 3DGS training loss: (1-lambda)*L1 + lambda*(1-SSIM)."""
    l1_loss = torch.abs(rendered - target).mean()

    # Simplified SSIM (global statistics)
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_r, mu_t = rendered.mean(dim=(0, 1)), target.mean(dim=(0, 1))
    sigma_r = ((rendered - mu_r) ** 2).mean(dim=(0, 1))
    sigma_t = ((target - mu_t) ** 2).mean(dim=(0, 1))
    sigma_rt = ((rendered - mu_r) * (target - mu_t)).mean(dim=(0, 1))

    ssim = ((2 * mu_r * mu_t + C1) * (2 * sigma_rt + C2)) / \
           ((mu_r ** 2 + mu_t ** 2 + C1) * (sigma_r + sigma_t + C2))
    ssim_loss = 1.0 - ssim.mean()

    return (1.0 - lambda_ssim) * l1_loss + lambda_ssim * ssim_loss

Exercise: Create a synthetic scene with 20 colored Gaussians. Render the scene, then perturb the positions and use gradient descent to recover them. Plot the loss curve and render intermediate frames to watch convergence.

Step 4: Training on Driving Scenes (Notebook 03, ~90 min)

Goal: Implement the full training pipeline with initialization, adaptive density control, evaluation, and novel view synthesis.

Training loop overview:

def train_3dgs(cameras, initial_points, initial_colors, n_iterations=5000):
    params = initialize_gaussians_from_pointcloud(initial_points, initial_colors)

    optimizer = torch.optim.Adam([
        {'params': [params['means']], 'lr': 0.00016},
        {'params': [params['log_scales']], 'lr': 0.005},
        {'params': [params['quats']], 'lr': 0.001},
        {'params': [params['opacity_logits']], 'lr': 0.05},
        {'params': [params['sh_coeffs']], 'lr': 0.0025},
    ])

    for iteration in range(n_iterations):
        cam_idx = torch.randint(len(cameras), (1,)).item()
        camera = cameras[cam_idx]

        rendered = render(params, camera)
        loss = compute_loss(rendered, camera.image)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(list(params.values()), 1.0)
        optimizer.step()

        # Adaptive density control (every 100 iterations, iterations 500-3000)
        if 500 <= iteration < 3000 and iteration % 100 == 0:
            params = clone_gaussians(params)
            params = split_gaussians(params)
            params = prune_gaussians(params)
            optimizer = create_optimizer(params)  # rebuild with new params

    return params

Evaluation with PSNR/SSIM/LPIPS:

def evaluate_reconstruction(params, test_cameras):
    """Compute PSNR, SSIM, LPIPS on held-out views."""
    lpips_fn = lpips.LPIPS(net='vgg').eval()
    results = {'psnr': [], 'ssim': [], 'lpips': []}

    for camera in test_cameras:
        with torch.no_grad():
            rendered = render(params, camera).clamp(0, 1)
        gt = camera.image

        rendered_np = rendered.cpu().numpy()
        gt_np = gt.cpu().numpy()

        results['psnr'].append(peak_signal_noise_ratio(gt_np, rendered_np, data_range=1.0))
        results['ssim'].append(structural_similarity(gt_np, rendered_np, channel_axis=2, data_range=1.0))

        r_t = rendered.permute(2, 0, 1).unsqueeze(0) * 2 - 1
        g_t = gt.permute(2, 0, 1).unsqueeze(0) * 2 - 1
        with torch.no_grad():
            results['lpips'].append(lpips_fn(r_t, g_t).item())

    return {k: (np.mean(v), np.std(v)) for k, v in results.items()}

Novel view rendering via camera pose interpolation:

def interpolate_cameras(cam1, cam2, n_steps=30):
    """Interpolate between two camera poses using SLERP for rotation."""
    from scipy.spatial.transform import Rotation, Slerp

    R1 = Rotation.from_matrix(cam1.W[:3, :3].numpy())
    R2 = Rotation.from_matrix(cam2.W[:3, :3].numpy())
    slerp = Slerp([0, 1], Rotation.concatenate([R1, R2]))

    t1, t2 = cam1.W[:3, 3].numpy(), cam2.W[:3, 3].numpy()

    cameras = []
    for i in range(n_steps):
        alpha = i / (n_steps - 1)
        R_interp = torch.tensor(slerp(alpha).as_matrix(), dtype=torch.float32)
        t_interp = torch.tensor((1 - alpha) * t1 + alpha * t2, dtype=torch.float32)
        cameras.append(Camera(cam1.fx, cam1.fy, cam1.cx, cam1.cy,
                              cam1.width, cam1.height, R_interp, t_interp))
    return cameras

Exercise: Train the pipeline on a synthetic multi-view scene, then:

  1. Plot loss and PSNR curves over training iterations.
  2. Plot the number of Gaussians over time (showing density control effects).
  3. Render novel views by interpolating between training cameras.
  4. Create an evaluation table with PSNR/SSIM/LPIPS on held-out test views.
  5. Generate an SSIM error map and identify which scene regions reconstruct poorly.

Common Pitfalls and Debugging Tips

Training diverges (loss goes to NaN):

  • Check that quaternions are normalized before building the covariance matrix.
  • Ensure scales are positive (use exp(log_scale) parameterization).
  • Reduce learning rates, especially for position and scale.
  • Add gradient clipping (torch.nn.utils.clip_grad_norm_).

Rendered images are blank or all black:

  • Verify camera pose: are the Gaussians in front of the camera? Check that depths are positive.
  • Check opacity initialization: if all Gaussians start with very low opacity, nothing will be visible.
  • Verify the projection: plot the 2D Gaussian centers on the image to check they fall within bounds.
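
For the last check, a minimal plotting sketch (assumes the projected centers means_2d and a ground-truth image tensor are available):

import matplotlib.pyplot as plt

def plot_projected_centers(means_2d: torch.Tensor, image: torch.Tensor) -> None:
    """Overlay projected Gaussian centers on a training view for debugging.

    Args:
        means_2d: (N, 2) projected centers in pixel coordinates.
        image: (H, W, 3) ground-truth image tensor in [0, 1].
    """
    plt.imshow(image.cpu().numpy())
    plt.scatter(means_2d[:, 0].detach().cpu().numpy(),
                means_2d[:, 1].detach().cpu().numpy(), s=1, c='red')
    plt.title("Projected Gaussian centers (should land on scene content)")
    plt.show()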

Floater artifacts (blobs floating in empty space):

  • Increase pruning frequency or lower the opacity threshold.
  • Add a depth distortion loss that penalizes Gaussians far from surfaces.
  • Check if the SfM initialization has outlier points.

Blurry reconstruction:

  • Ensure the SSIM loss component is enabled (not just L1).
  • Verify that density control is splitting large Gaussians in detailed regions.
  • Increase the number of training iterations.
  • Check that camera intrinsics (especially focal length) are correct.

Memory issues:

  • Monitor the number of Gaussians; it can grow unboundedly without proper pruning.
  • Use lower-resolution images for initial experiments.
  • Implement periodic opacity reset (set all opacities to a low value, then let the optimizer recover).
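
A minimal sketch of the opacity reset mentioned in the last point (the reset value, and how often to apply it, are assumptions):

def reset_opacities(params: dict, new_opacity: float = 0.01) -> dict:
    """Periodically reset all opacities to a low value so the optimizer can recover them."""
    with torch.no_grad():
        new_logit = torch.logit(torch.tensor(new_opacity))
        params['opacity_logits'].fill_(new_logit.item())
    return params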

Further Reading and Extensions

  • Original 3DGS paper: Kerbl et al., "3D Gaussian Splatting for Real-Time Radiance Field Rendering" (SIGGRAPH 2023). The foundational paper.
  • SplatAD: Hao et al., "SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving" (CVPR 2025). Extends 3DGS for multi-sensor AD simulation.
  • Street Gaussians: Yan et al., "Street Gaussians for Modeling Dynamic Urban Scenes" (2024). Handles dynamic objects in driving scenes.
  • GaussianPro: Cheng et al., "GaussianPro: 3D Gaussian Splatting with Progressive Propagation" (2024). Improved initialization for textureless regions.
  • Mip-Splatting: Yu et al., "Mip-Splatting: Alias-free 3D Gaussian Splatting" (CVPR 2024). Fixes aliasing artifacts when rendering at different scales.

Possible extensions of this project:

  1. Add spherical harmonics up to degree 3 for view-dependent color and implement the SH evaluation.
  2. Implement a CUDA kernel for the rasterizer using PyTorch custom extensions or Numba.
  3. Add LiDAR depth supervision to improve reconstruction in textureless regions.
  4. Implement per-object Gaussian tracking for dynamic scene reconstruction.
  5. Build an interactive viewer using OpenGL or WebGL to explore the reconstructed scene in real time.