Deep Dive #9 · 50 min read

Neural Rendering for AD

3D Gaussian Splatting, NeRF, NeuRAD, SplatAD, and differentiable rendering for photorealistic sensor simulation.

Neural Rendering for Autonomous Driving Simulation: A Deep Dive

Focus: Neural rendering techniques (NeRF, 3D Gaussian Splatting) for photorealistic, closed-loop AD sensor simulation
Key Papers: NeuRAD (CVPR 2024), SplatAD (2025), HUGSIM (2024), AutoSplat (2024), Industrial-Grade GS (2025)
Read Time: 60 min


Table of Contents

  1. Executive Summary
  2. Background & Motivation
  3. Core Technologies
  4. Key Papers for AD Simulation
  5. Applied Intuition's Neural Sim Architecture
  6. Technical Deep Dive
  7. Code Examples
  8. Mental Models & Diagrams
  9. Hands-On Exercises
  10. Interview Questions
  11. References

Executive Summary

What Is Neural Rendering for AD Simulation?

Neural rendering replaces traditional computer graphics pipelines (hand-authored 3D assets, rasterization engines, ray tracers) with learned scene representations reconstructed directly from real sensor data. For autonomous driving, this means converting raw drive logs -- camera images, lidar point clouds, poses -- into photorealistic, re-renderable 3D scenes that can simulate novel viewpoints, new actor configurations, and multi-sensor outputs in closed loop.

Why It Matters

Traditional simulation suffers from the sim-to-real gap: synthetic scenes look different enough from reality that perception stacks trained or tested in simulation may not transfer. Neural rendering closes this gap by reconstructing scenes from reality rather than approximating it.

Traditional Pipeline:          Neural Rendering Pipeline:

  Artists --> 3D Assets          Drive Logs --> Neural Scene
      |                              |
  Game Engine --> Rendered       Differentiable Renderer --> Rendered
      |           Images             |                      Images
  Sim-to-Real Gap: LARGE         Sim-to-Real Gap: SMALL

Key Insight

The field has rapidly converged on 3D Gaussian Splatting (3DGS) as the preferred representation for AD simulation, overtaking NeRF due to its real-time rendering speed, explicit scene representation (enabling actor manipulation), and natural compatibility with lidar simulation. Applied Intuition's Neural Sim product exemplifies this trend: combining Gaussian Splatting for static backgrounds with physics-based rendering (PBR) for dynamic actors to achieve photorealistic closed-loop sensor simulation at fleet scale.


Background & Motivation

The Simulation Imperative

Autonomous driving companies rely on simulation for:

| Use Case | Why Simulation | Scale |
| --- | --- | --- |
| Safety validation | Cannot test every edge case on roads | Billions of miles needed |
| Regression testing | Every software change needs validation | Thousands of scenarios per commit |
| Long-tail mining | Rare events are hard to encounter naturally | Targeted scenario generation |
| Perception testing | Sensor behavior in novel conditions | Weather, lighting, occlusions |
| Closed-loop replay | "What if the ego had done X instead?" | Every logged drive becomes reusable |

Waymo has reported driving 100M+ real miles but 10B+ simulated miles. This 100:1 ratio only makes sense if simulation is trustworthy.

Limitations of Traditional Approaches

1. Artist-Authored Worlds (Game Engine Style)

Pros:                          Cons:
+ Full control over scene      - Expensive asset creation ($$$)
+ Deterministic rendering      - Cartoon-like appearance
+ Easy actor manipulation      - Material/lighting mismatch
+ Fast iteration               - Never matches real sensor output
                               - Perception models don't transfer

Simulators like CARLA and LGSVL are built on Unreal Engine or Unity. Despite PBR materials and ray tracing, the domain gap remains significant -- a detector trained on real data drops 15-30% mAP when evaluated on synthetic images.

2. Log Replay (Replay Recorded Data)

Pros:                          Cons:
+ Perfect sensor realism       - Cannot change ego trajectory
+ No domain gap                - Cannot add/remove actors
+ Simple infrastructure        - Fixed viewpoint only
                               - Not closed-loop

Log replay is the gold standard for realism but fundamentally limited: you can only replay exactly what happened. If the ego car had braked 0.5 seconds earlier, you cannot generate the sensor data for that alternative trajectory.

3. Neural Rendering (The New Paradigm)

Pros:                          Cons:
+ Photorealistic (from data)   - Bounded to training distribution
+ Novel viewpoints             - Reconstruction artifacts
+ Multi-sensor capable         - Compute-intensive training
+ Closed-loop compatible       - Dynamic scene handling is hard
+ Scalable via ML pipelines    - Limited extrapolation range

Neural rendering sits at the intersection: near-real sensor fidelity with the flexibility to change viewpoints and scene configurations.

The Evolution Timeline

2020 ─── NeRF (Mildenhall et al.)
  |        First neural radiance field; 30+ hours to train, minutes to render
  |
2021 ─── Instant-NGP (Mueller et al.)
  |        Hash encoding cuts NeRF training to minutes
  |
2022 ─── Block-NeRF, Urban Radiance Fields
  |        NeRF scaled to city-level scenes
  |        Panoptic Neural Fields for scene understanding
  |
2023 ─── 3D Gaussian Splatting (Kerbl et al.)
  |        Real-time neural rendering via explicit Gaussians
  |        UniSim (Waabi) for AD simulation
  |        MARS, EmerNeRF for dynamic driving scenes
  |
2024 ─── NeuRAD (CVPR 2024), HUGSIM, AutoSplat
  |        GS-based AD simulation matures
  |        Street Gaussians, DrivingGaussian
  |        Applied Intuition Neural Sim launches
  |
2025 ─── SplatAD, Industrial-Grade GS
           Real-time camera+lidar from GS
           Fleet-scale neural sim pipelines

Core Technologies

Neural Radiance Fields (NeRF)

The Core Idea

NeRF represents a 3D scene as a continuous volumetric function that maps a 5D coordinate (3D position + 2D viewing direction) to color and density:

F_theta: (x, y, z, theta, phi) --> (r, g, b, sigma)

where:
  (x, y, z)       = 3D position in space
  (theta, phi)    = viewing direction (azimuth, elevation)
  (r, g, b)       = emitted color at that point from that direction
  sigma            = volume density (opacity) at that point

The function F_theta is parameterized by a multi-layer perceptron (MLP).

Volume Rendering

To render a pixel, NeRF casts a ray from the camera through that pixel and accumulates color along the ray:

Camera ----ray----> * ---- * ---- * ---- * ---- * ----> (background)
                    |      |      |      |      |
                   s_1    s_2    s_3    s_4    s_5   (sample points)
                    |      |      |      |      |
                  query  query  query  query  query  (MLP forward pass)
                    |      |      |      |      |
                 (c_1,   (c_2,  (c_3,  (c_4,  (c_5,
                  σ_1)    σ_2)   σ_3)   σ_4)   σ_5)

The final pixel color C(r) is computed via numerical quadrature of the rendering equation:

C(r) = sum_{i=1}^{N} T_i * alpha_i * c_i

where:
  alpha_i = 1 - exp(-sigma_i * delta_i)      (opacity of sample i)
  T_i     = prod_{j=1}^{i-1} (1 - alpha_j)   (transmittance to sample i)
  delta_i = t_{i+1} - t_i                     (distance between samples)
  c_i     = color at sample i

This is differentiable end-to-end, so we can optimize the MLP weights by minimizing the photometric loss between rendered and observed pixels.
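
A minimal PyTorch sketch of this quadrature for a single ray (just the compositing math above, not an actual NeRF implementation):

import torch

def composite_ray(colors: torch.Tensor,   # (N, 3) sample colors c_i
                  sigmas: torch.Tensor,   # (N,)  sample densities sigma_i
                  t_vals: torch.Tensor    # (N,)  sample distances t_i along the ray
                  ) -> torch.Tensor:
    """Numerical quadrature of the volume rendering equation for one ray."""
    # delta_i = t_{i+1} - t_i; pad the final interval with a large value
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])
    # alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    # T_i = prod_{j<i} (1 - alpha_j), via an exclusive cumulative product
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = trans * alphas                               # (N,)
    return (weights.unsqueeze(-1) * colors).sum(dim=0)     # (3,) pixel color

# Example: 64 random samples along one ray
t = torch.linspace(0.5, 10.0, 64)
pixel_rgb = composite_ray(torch.rand(64, 3), torch.rand(64) * 2.0, t)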

Positional Encoding

Raw (x, y, z) coordinates fed to an MLP produce over-smooth results because MLPs are biased toward low-frequency functions. NeRF uses positional encoding to lift inputs into a higher-dimensional space:

gamma(p) = [sin(2^0 * pi * p), cos(2^0 * pi * p),
            sin(2^1 * pi * p), cos(2^1 * pi * p),
            ...
            sin(2^{L-1} * pi * p), cos(2^{L-1} * pi * p)]

For position (L=10), the 3D input becomes 60D. For the viewing direction (L=4), the direction expressed as a 3D unit vector becomes 24D. This allows the MLP to learn high-frequency detail like sharp edges and fine textures.
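
A small sketch of this encoding for a batch of coordinates (matching the L=10 position setting above):

import torch

def positional_encoding(p: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Map (..., D) coordinates to (..., 2 * num_freqs * D) Fourier features."""
    freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi        # 2^k * pi
    angles = p.unsqueeze(-1) * freqs                           # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., D, 2L)
    return enc.flatten(start_dim=-2)                           # (..., 2*L*D)

xyz = torch.rand(1024, 3)
print(positional_encoding(xyz, num_freqs=10).shape)   # torch.Size([1024, 60])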

NeRF Limitations for AD Simulation

| Limitation | Description | Impact on AD Sim |
| --- | --- | --- |
| Slow rendering | Hundreds of MLP queries per ray | Cannot achieve real-time for closed-loop |
| Slow training | Hours to days per scene | Cannot scale to fleet data |
| Static scenes | Original NeRF assumes fixed geometry | Driving scenes are dynamic |
| Bounded scenes | Works best in object-centric settings | Driving scenes are unbounded |
| No explicit geometry | Implicit density field | Hard to manipulate individual objects |
| Per-ray computation | Each pixel requires marching | No lidar simulation without ray marching |

These limitations motivated the development of 3D Gaussian Splatting.

3D Gaussian Splatting (3DGS)

The Core Idea

Instead of an implicit function queried along rays, 3DGS represents the scene as a collection of explicit 3D Gaussian primitives -- millions of small, colored, semi-transparent ellipsoids scattered through space:

Each Gaussian is defined by:
  - Position (mean):       mu in R^3
  - Covariance:            Sigma in R^{3x3}  (stored as rotation q + scale s)
  - Opacity:               alpha in [0, 1]
  - Color:                 c (via spherical harmonics for view-dependence)

Total parameters per Gaussian: 3 + 4 + 3 + 1 + 48 = 59 floats
                               (pos  rot  scale opacity  SH coeffs)

A typical driving scene might use 1-5 million Gaussians.

How Rendering Works: Differentiable Rasterization

Unlike NeRF's ray marching, 3DGS uses a splatting (forward rasterization) approach:

Step 1: Project each 3D Gaussian onto the 2D image plane
        ┌──────────────────────────────┐
        │  3D Gaussian (ellipsoid)     │
        │       ╱ ╲                    │
        │      ╱   ╲    project        │
        │     ╱     ╲  ─────────►  2D Gaussian (ellipse)
        │    ╱       ╲                 │
        │   ╱─────────╲               │
        └──────────────────────────────┘

Step 2: Sort Gaussians by depth (front to back)

Step 3: For each pixel, alpha-composite overlapping Gaussians:
        C(pixel) = sum_{i in overlapping} c_i * alpha_i * T_i
        T_i = prod_{j=1}^{i-1} (1 - alpha_j)

This is implemented as a tile-based rasterizer on the GPU:

┌─────┬─────┬─────┬─────┐
│Tile │Tile │Tile │Tile │  Image divided into 16x16 pixel tiles
│ 0,0 │ 0,1 │ 0,2 │ 0,3 │
├─────┼─────┼─────┼─────┤  Each tile processed by one GPU thread block
│Tile │Tile │Tile │Tile │
│ 1,0 │ 1,1 │ 1,2 │ 1,3 │  Gaussians assigned to tiles they overlap
├─────┼─────┼─────┼─────┤
│Tile │Tile │Tile │Tile │  Within each tile: sorted alpha compositing
│ 2,0 │ 2,1 │ 2,2 │ 2,3 │
└─────┴─────┴─────┴─────┘

Key advantage: the rasterizer processes all pixels in parallel, achieving 100+ FPS at HD resolution on a single GPU.
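
Step 1 above depends on projecting each 3D covariance into a 2D screen-space covariance via the local Jacobian of the perspective projection (the EWA splatting approximation). A simplified sketch, omitting the view-frustum culling and low-pass filtering used by real rasterizers:

import torch

def project_covariance(cov3d: torch.Tensor,      # (N, 3, 3) world-space covariances
                       means_cam: torch.Tensor,  # (N, 3) Gaussian centers in camera frame
                       R_wc: torch.Tensor,       # (3, 3) world-to-camera rotation
                       fx: float, fy: float) -> torch.Tensor:
    """Approximate 2D screen-space covariance: Sigma' = J W Sigma W^T J^T."""
    x, y, z = means_cam.unbind(-1)
    zeros = torch.zeros_like(x)
    # Jacobian of the perspective projection, evaluated at each Gaussian center
    J = torch.stack([
        fx / z, zeros, -fx * x / z**2,
        zeros, fy / z, -fy * y / z**2,
    ], dim=-1).reshape(-1, 2, 3)                                # (N, 2, 3)
    W = R_wc.expand(cov3d.shape[0], 3, 3)                       # (N, 3, 3)
    JW = torch.bmm(J, W)                                        # (N, 2, 3)
    return torch.bmm(torch.bmm(JW, cov3d), JW.transpose(1, 2))  # (N, 2, 2)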

Adaptive Density Control

During training, 3DGS dynamically adjusts the number and distribution of Gaussians:

Densification (every N iterations):
  1. CLONE: Small Gaussians with high gradient --> duplicate and perturb
  2. SPLIT: Large Gaussians with high gradient --> split into two
  3. PRUNE:  Gaussians with very low opacity   --> remove

  Before:   ●  ●      ●         ●  ●
             (sparse, gaps)

  After:    ●●●●●●●●●●●●●●●●●●●●●
             (dense where needed, pruned where redundant)

This allows the representation to allocate capacity where scene detail is highest (e.g., object edges, fine textures) and remain sparse in empty or uniform regions.

Why 3DGS is Better for AD Simulation

  1. Real-time rendering: 100+ FPS vs. seconds per frame for NeRF
  2. Explicit geometry: Each Gaussian has a position -- can be moved, deleted, grouped
  3. Scene editing: Remove a car by removing its Gaussians; insert new actors
  4. Lidar-friendly: Gaussians have physical extent -- can trace lidar rays through them
  5. Fast training: Minutes to train (vs. hours for NeRF) with similar quality
  6. Memory-efficient rendering: Forward pass only, no per-ray MLP evaluation

Comparison Table: NeRF vs 3DGS for AD Simulation

| Aspect | NeRF | 3D Gaussian Splatting |
| --- | --- | --- |
| Representation | Implicit (MLP weights) | Explicit (point cloud of Gaussians) |
| Rendering | Volume ray marching | Tile-based rasterization (splatting) |
| Render Speed | 0.1 - 5 FPS | 100 - 300 FPS |
| Training Time | Hours - days | Minutes - hours |
| Image Quality | Excellent (PSNR ~31 dB) | Excellent (PSNR ~32 dB) |
| Scene Editing | Very difficult | Natural (move/remove Gaussians) |
| Dynamic Scenes | Requires deformation fields | Group Gaussians per object |
| Lidar Sim | Ray march through density | Intersect rays with Gaussians |
| Memory (render) | Low (MLP weights only) | Higher (millions of Gaussians) |
| Memory (train) | High (per-ray samples) | Moderate (sorted splatting) |
| Unbounded Scenes | Needs contraction (mip-NeRF 360) | Natural with scale parameters |
| View Extrapolation | Poor (overfits training views) | Poor (same fundamental issue) |
| Closed-Loop Sim | Too slow for real-time | Viable at real-time rates |
| Industry Adoption | Declining for AD sim | Dominant and growing |

Verdict for AD Simulation: 3DGS is the clear winner. Its real-time performance, explicit representation, and natural compatibility with multi-sensor simulation make it the foundation of modern neural sim systems.


Key Papers for AD Simulation

NeuRAD (CVPR 2024)

Paper: "NeuRAD: Neural Rendering for Autonomous Driving" Authors: Tonderski et al. (Zenseact) Link: arxiv.org/abs/2311.15260

Key Contributions

  1. Unified multi-sensor rendering: Single neural scene representation that renders both camera images and lidar point clouds
  2. Sensor-specific modeling: Accounts for rolling shutter, beam divergence, ray dropping, and per-sensor exposure
  3. State-of-the-art on multiple benchmarks: Outperforms prior methods on nuScenes, PandaSet, and Argoverse2
  4. Practical design decisions: Extensive ablation study of what matters for AD-specific neural rendering

Architecture

                        Drive Log Input
                    ┌───────────────────────┐
                    │ Camera Images (6 cams) │
                    │ Lidar Scans            │
                    │ Ego Poses              │
                    │ Actor Bounding Boxes   │
                    └──────────┬────────────┘
                               │
                    ┌──────────▼────────────┐
                    │   Scene Decomposition  │
                    │  Static   │  Dynamic   │
                    │  (hash    │  (per-actor │
                    │   grid)   │   model)   │
                    └──────────┬────────────┘
                               │
                    ┌──────────▼────────────┐
                    │   Volume Rendering     │
                    │  with sensor models    │
                    │  - Rolling shutter     │
                    │  - Beam divergence     │
                    │  - Lidar intensity     │
                    └──────────┬────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                 ▼
        Camera RGB      Lidar Points      Lidar Intensity

Key Technical Details

  • Hash-grid backbone: Uses Instant-NGP style multi-resolution hash encoding for the static scene, giving fast training while maintaining detail
  • Actor models: Each dynamic actor gets its own small NeRF, conditioned on a learned latent code per actor instance
  • Rolling shutter: Models each image row at a different timestamp, interpolating ego pose accordingly -- critical for high-speed driving
  • Lidar modeling: Simulates beam divergence (lidar rays are not infinitely thin), ray drop probability, and intensity based on material and incidence angle
  • Losses: Photometric (L1 + LPIPS) for cameras, Chamfer distance + intensity loss for lidar

Results

| Dataset | Camera PSNR | Camera SSIM | Lidar Chamfer (m) |
| --- | --- | --- | --- |
| nuScenes | 28.5 dB | 0.87 | 0.041 |
| PandaSet | 29.1 dB | 0.89 | 0.038 |
| Argoverse2 | 27.8 dB | 0.85 | 0.044 |

Why It Matters

NeuRAD demonstrated that a single neural representation can serve both camera and lidar simulation with high fidelity. Its extensive ablation study became a practical guide for the field -- showing, for example, that rolling shutter modeling alone improves camera PSNR by 1.5 dB in highway scenes.


SplatAD (2025)

Paper: "SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving" Authors: Hultman et al. (Zenseact) Link: arxiv.org/abs/2411.16816

Key Contributions

  1. First 3DGS method for joint real-time camera AND lidar: Prior GS works for AD focused on camera only
  2. Lidar rendering via Gaussian ray tracing: Novel approach to generate lidar point clouds from Gaussian scenes
  3. 14x faster than NeuRAD: Real-time performance enabling closed-loop simulation
  4. State-of-the-art quality: Matches or exceeds NeuRAD on all benchmarks

How Lidar Rendering Works with Gaussians

The key innovation of SplatAD is extending 3DGS to lidar. Since Gaussians are explicit primitives with spatial extent, lidar rays can be intersected with them:

Lidar Ray Intersection with Gaussians:

   Lidar
   Sensor        Ray direction
     ●─────────────────────────────────────►
     │            │           │
     │      ┌─────┼──┐  ┌────┼───┐
     │      │  G1 │  │  │ G2 │   │
     │      │  ●  │  │  │  ● │   │     Gaussians along the ray
     │      │     │  │  │    │   │
     │      └─────┘──┘  └────┘───┘
     │         t1            t2
     │
     ▼  Compute intersection weights and accumulate:
        depth = sum(w_i * t_i)  where w_i = alpha_i * T_i
        intensity = sum(w_i * I_i)

For each lidar ray:

  1. Find Gaussians whose influence region intersects the ray
  2. Evaluate each Gaussian's contribution along the ray (using the Gaussian's 3D shape)
  3. Alpha-composite depth and intensity (same as color compositing, but for range)
  4. Apply sensor-specific ray drop model
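
A minimal sketch of step 3, assuming the opacities, depths, and intensity estimates of the Gaussians hit by one ray have already been gathered and sorted by depth (the function and argument names are illustrative, not SplatAD's API):

import torch

def composite_lidar_ray(alphas: torch.Tensor,       # (K,) opacity per hit Gaussian, sorted by depth
                        depths: torch.Tensor,       # (K,) distance t_i to each Gaussian along the ray
                        intensities: torch.Tensor   # (K,) per-Gaussian intensity estimate
                        ):
    """Alpha-composite range and intensity for one lidar ray (w_i = alpha_i * T_i)."""
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * trans
    depth = (weights * depths).sum()
    intensity = (weights * intensities).sum()
    hit_prob = weights.sum()   # total opacity; a low value suggests the ray should be dropped
    return depth, intensity, hit_prob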

Architecture

┌─────────────────────────────────────────┐
│            SplatAD Pipeline              │
├─────────────────────────────────────────┤
│                                          │
│  Input: Drive log (images, lidar, poses) │
│                                          │
│  ┌──────────────┐   ┌────────────────┐  │
│  │ Static Scene  │   │ Dynamic Actors │  │
│  │ (3D Gaussians)│   │ (3D Gaussians  │  │
│  │               │   │  per actor)    │  │
│  └──────┬───────┘   └───────┬────────┘  │
│         │                    │           │
│         └────────┬───────────┘           │
│                  │                        │
│         ┌────────▼────────┐              │
│         │  Composed Scene  │              │
│         └────────┬────────┘              │
│                  │                        │
│     ┌────────────┼──────────────┐        │
│     ▼            ▼              ▼        │
│  Camera       Lidar          Lidar      │
│  Splatting    Ray Tracing    Intensity   │
│  (rasterize)  (ray-Gaussian  Estimation │
│               intersection)              │
└─────────────────────────────────────────┘

Performance Comparison

| Method | Camera PSNR | Lidar Chamfer | Camera FPS | Lidar FPS |
| --- | --- | --- | --- | --- |
| NeuRAD | 28.5 dB | 0.041 m | 1.5 | 0.8 |
| SplatAD | 28.8 dB | 0.039 m | 22 | 18 |
| Speedup | -- | -- | 14.7x | 22.5x |

Why It Matters

SplatAD proved that 3DGS can simultaneously handle camera and lidar simulation at real-time rates without sacrificing quality. This is the critical capability needed for closed-loop simulation where the ego vehicle's perception stack runs online.


HUGSIM (2024)

Paper: "HUGSIM: A Real-Time, Photo-Realistic Closed-Loop Simulator for Autonomous Driving" Authors: Chen et al. Link: arxiv.org/abs/2403.17712

Key Contributions

  1. End-to-end closed-loop simulation: First system to combine 3DGS reconstruction with a full closed-loop driving simulator
  2. Real-time performance: Achieves interactive frame rates for online perception-planning loops
  3. Scene editing: Supports inserting, removing, and repositioning actors in the neural scene
  4. Downstream evaluation: Tests actual AD planners in the simulator and measures driving quality

Closed-Loop Architecture

┌──────────────────────────────────────────────────────────┐
│                    HUGSIM Closed Loop                      │
│                                                            │
│   ┌───────────┐    ┌────────────┐    ┌──────────────┐    │
│   │  Neural    │───►│ Perception │───►│  Planning    │    │
│   │  Scene     │    │  Stack     │    │  Module      │    │
│   │  (3DGS)   │    │            │    │              │    │
│   └─────┬─────┘    └────────────┘    └──────┬───────┘    │
│         ▲                                     │           │
│         │              ┌──────────┐           │           │
│         │              │ Behavior │           │           │
│         │              │ Model    │◄──────────┘           │
│         │              │ (agents) │                        │
│         │              └────┬─────┘                        │
│         │                   │                              │
│         │    ┌──────────────▼──────────────┐              │
│         └────┤  Scene State Update          │              │
│              │  - Move ego to new pose      │              │
│              │  - Update actor positions    │              │
│              │  - Re-render from new view   │              │
│              └─────────────────────────────┘              │
└──────────────────────────────────────────────────────────┘

Scene Decomposition Strategy

HUGSIM separates the scene into three layers:

Layer 1: STATIC BACKGROUND
  - Roads, buildings, trees, signs
  - Reconstructed as a single 3D Gaussian field
  - Stays fixed during simulation

Layer 2: DYNAMIC ACTORS (from original log)
  - Each vehicle/pedestrian is a separate Gaussian model
  - Can be moved, removed, or have trajectory edited
  - Actor Gaussians transformed by rigid body pose

Layer 3: INSERTED ACTORS
  - New vehicles/pedestrians not in original scene
  - Uses pre-built Gaussian asset library
  - Placed with correct scale, lighting, shadows

Why It Matters

HUGSIM was one of the first systems to demonstrate that neural rendering can be embedded in a full closed-loop simulation pipeline at interactive rates. It showed that 3DGS-based simulation can actually be used to evaluate and improve AD planners, not just produce pretty pictures.


AutoSplat (2024)

Paper: "AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction" Authors: Khan et al. Link: arxiv.org/abs/2407.02598

Key Contributions

  1. Geometry-aware Gaussian placement: Uses road surface priors and structural constraints to improve reconstruction quality, especially for roads and flat surfaces
  2. Road surface decomposition: Dedicated handling of road surfaces with planar constraints
  3. Appearance consistency: Cross-view appearance regularization for consistent rendering under viewpoint changes
  4. Vehicle reconstruction: Improved dynamic vehicle reconstruction via symmetric priors

Road Surface Innovation

Driving scenes are dominated by road surfaces, which are notoriously hard for unconstrained 3DGS (Gaussians tend to float above or below the road plane):

Problem with unconstrained GS:        AutoSplat's approach:

     ● ●    ● ●  ●                    ●●●●●●●●●●●●●●●●●●●
   ●     ●●     ●    ●                 (Gaussians constrained
  ──────────────────────── Road         to road plane with
     ●  ●    ●                          normal alignment)
  (Gaussians float randomly)

AutoSplat constrains road Gaussians to:

  • Lie on the estimated road surface (from lidar ground segmentation)
  • Have their shortest axis aligned with the surface normal (flat ellipsoids)
  • Maintain consistent appearance across different viewing angles
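
A hedged sketch of the first two constraints: project each road Gaussian's mean onto the estimated ground plane and flatten its scale along the normal (the helper names, and the assumption that each road Gaussian's third scale axis is aligned with the plane normal, are illustrative rather than AutoSplat's actual code):

import torch

def constrain_road_gaussians(means: torch.Tensor,        # (N, 3) road Gaussian centers
                             log_scales: torch.Tensor,   # (N, 3) log-scales
                             plane_normal: torch.Tensor, # (3,) unit normal of the estimated road plane
                             plane_point: torch.Tensor,  # (3,) any point on the plane
                             flat_log_scale: float = -8.0):
    """Snap road Gaussians onto the ground plane and flatten them along the normal."""
    # Signed distance of each mean to the plane, then remove the off-plane component
    dist = (means - plane_point) @ plane_normal                # (N,)
    means_on_plane = means - dist.unsqueeze(-1) * plane_normal
    # Flatten: clamp the scale of the axis assumed to be aligned with the normal
    flat_scales = log_scales.clone()
    flat_scales[:, 2] = torch.clamp(flat_scales[:, 2], max=flat_log_scale)
    return means_on_plane, flat_scales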

Why It Matters

AutoSplat addressed one of the most practical challenges in AD neural rendering: making road surfaces look correct. Roads occupy a huge portion of driving images, and artifacts on the road surface (floating blobs, inconsistent color) are immediately noticeable and can confuse lane detection and drivable area estimation.


Industrial-Grade Sensor Simulation via Gaussian Splatting (2025)

Paper: "Sensor Simulation via Gaussian Splatting: Industrial-Grade Driving Scene Reconstruction" Authors: Various (industry research)

Key Contributions

  1. Fleet-scale pipeline: Automated reconstruction pipeline processing thousands of drive logs
  2. Quality assurance: Automated metric-based quality gating for reconstructed scenes
  3. Multi-sensor fidelity: Camera, lidar, and radar simulation from a single reconstruction
  4. Production deployment: Designed for integration into commercial simulation platforms

Industrial Requirements vs Academic Methods

| Requirement | Academic Methods | Industrial Grade |
| --- | --- | --- |
| Scale | 10-100 scenes | 10,000+ scenes |
| Automation | Manual tuning per scene | Fully automated pipeline |
| Quality | Average PSNR reported | Per-scene quality gating |
| Robustness | Fails on hard cases | Graceful degradation |
| Latency | Hours per scene | Minutes per scene |
| Integration | Standalone demo | API-driven, CI/CD compatible |

Why It Matters

This line of work bridges the gap between academic neural rendering research and production simulation systems. It addresses the "last mile" problems: how do you go from a research prototype that works on 10 curated scenes to a system that reliably reconstructs 10,000 diverse scenes from fleet data?


Applied Intuition's Neural Sim Architecture

Applied Intuition's Neural Sim product represents the current state of the art in commercial neural rendering for AD simulation. Based on public information, patents, and technical presentations, we can reconstruct its architecture.

High-Level Pipeline

┌──────────────────────────────────────────────────────────────────────┐
│                  Applied Intuition Neural Sim Pipeline                 │
│                                                                        │
│  ┌─────────────┐    ┌──────────────┐    ┌───────────────────────┐    │
│  │  Fleet Data  │    │ Reconstruction│    │  Scenario Authoring   │    │
│  │  (Drive Logs)│───►│   Pipeline   │───►│  (Edit, Insert, Move) │    │
│  │  - Cameras   │    │              │    │                        │    │
│  │  - Lidar     │    │ Automated ML │    │  ┌─────────────────┐  │    │
│  │  - Radar     │    │ at scale     │    │  │ Static Neural   │  │    │
│  │  - Poses     │    │              │    │  │ Scene (GS)      │  │    │
│  │  - Labels    │    │              │    │  ├─────────────────┤  │    │
│  └─────────────┘    └──────────────┘    │  │ Dynamic Actors  │  │    │
│                                          │  │ (PBR models)    │  │    │
│                                          │  └─────────────────┘  │    │
│                                          └───────────┬───────────┘    │
│                                                       │               │
│                                          ┌───────────▼───────────┐    │
│                                          │   Closed-Loop Sim     │    │
│                                          │   Runtime             │    │
│                                          │   - Camera rendering  │    │
│                                          │   - Lidar generation  │    │
│                                          │   - Radar simulation  │    │
│                                          └───────────┬───────────┘    │
│                                                       │               │
│                                          ┌───────────▼───────────┐    │
│                                          │   Validation          │    │
│                                          │   - Reconstruction    │    │
│                                          │     quality metrics   │    │
│                                          │   - Downstream percep │    │
│                                          │     performance       │    │
│                                          └───────────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘

Scene Reconstruction Pipeline

The reconstruction pipeline converts raw drive logs into re-renderable neural scenes:

Drive Log ──► Preprocessing ──► Neural Reconstruction ──► Quality Check ──► Scene DB
                  │                      │                       │
                  ▼                      ▼                       ▼
           - Pose refinement      - Static background       - PSNR > threshold?
           - Lidar accumulation     (Gaussian Splatting)    - SSIM > threshold?
           - Sky segmentation     - Per-actor models        - Lidar error < threshold?
           - Dynamic masking      - Sky dome model          - Visual inspection (sampled)
           - Ground plane est     - Exposure compensation

Step 1: Preprocessing

  • Pose refinement: SfM or lidar SLAM to get centimeter-accurate poses (even small pose errors cause blurry reconstructions)
  • Dynamic object masking: Segment and mask moving objects in training images so the static model does not try to explain them
  • Sky segmentation: Separate sky pixels for special handling (sky has no real 3D geometry)
  • Lidar accumulation: Aggregate multiple lidar sweeps into a dense static point cloud for initialization

Step 2: Static Background Reconstruction

The static scene (everything except moving actors) is reconstructed as a 3D Gaussian field:

Initialization:
  - Start from accumulated lidar point cloud
  - Each lidar point becomes an initial Gaussian
  - Scale and color initialized from nearest image patches

Training (per scene, ~10-30 minutes):
  - Render training views via differentiable splatting
  - Compare to ground-truth images (masked for dynamic objects)
  - Backpropagate gradients to Gaussian parameters
  - Adaptive density control: split/clone/prune
  - Also supervise with lidar depth where available
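
One way to combine the photometric term with the lidar depth supervision mentioned above (a sketch; the rendered-depth input and the weighting are assumptions, not a specific production recipe):

import torch

def reconstruction_loss(rgb_pred, rgb_gt,           # (H, W, 3) rendered vs. ground-truth image
                        depth_pred, depth_lidar,    # (H, W) rendered depth, projected lidar depth
                        lidar_mask,                 # (H, W) bool, True where a lidar return projects
                        lambda_depth: float = 0.05):
    """Photometric L1 plus an L1 depth term on pixels with projected lidar returns."""
    photo = torch.abs(rgb_pred - rgb_gt).mean()
    if lidar_mask.any():
        depth = torch.abs(depth_pred[lidar_mask] - depth_lidar[lidar_mask]).mean()
    else:
        depth = torch.tensor(0.0, device=rgb_pred.device)
    return photo + lambda_depth * depth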

Step 3: Dynamic Actor Models

Dynamic actors (vehicles, pedestrians, cyclists) are handled separately from the static scene:

Option A: Per-Actor Gaussian Models
  - Reconstruct each actor across frames where visible
  - Gaussian positions in actor-local coordinates
  - At sim time: transform by desired actor pose

Option B: PBR Asset Insertion (Applied Intuition's approach)
  - Detect and classify actors in log data
  - Match to high-quality PBR 3D asset library
  - Render PBR actors with matched lighting/material
  - Composite into neural background

  Advantage: PBR actors can have arbitrary new poses,
  animations, and interactions -- not limited to
  reconstructed appearances

Multi-Sensor Support

┌────────────────────────────────────────────────────┐
│              Multi-Sensor Rendering                 │
│                                                      │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐      │
│  │  Camera    │  │  Lidar    │  │  Radar    │      │
│  │  Rendering │  │  Rendering│  │  Rendering│      │
│  ├───────────┤  ├───────────┤  ├───────────┤      │
│  │ GS raster- │  │ Ray-      │  │ Learned   │      │
│  │ ization    │  │ Gaussian  │  │ radar     │      │
│  │ + rolling  │  │ intersect │  │ cross-    │      │
│  │ shutter    │  │ + beam    │  │ section   │      │
│  │ + exposure │  │ model     │  │ model     │      │
│  │ + lens     │  │ + ray     │  │ + multi-  │      │
│  │ distortion │  │ drop      │  │ path      │      │
│  │            │  │ + intens. │  │           │      │
│  └───────────┘  └───────────┘  └───────────┘      │
│        │              │              │              │
│        ▼              ▼              ▼              │
│   RGB Images     Point Clouds   Radar Returns     │
│   (per camera)   (range, intens)  (range, vel)    │
└────────────────────────────────────────────────────┘

Automated ML Pipelines for Fleet-Scale Reconstruction

At fleet scale, reconstruction must be fully automated:

Fleet Data Lake (petabytes)
         │
         ▼
┌──────────────────────────┐
│  Scene Selection          │
│  - Filter by scenario tag │
│  - Diversity sampling     │
│  - Geographic coverage    │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│  Distributed Training     │
│  - GPU cluster (cloud)    │
│  - One GPU per scene      │
│  - Batch of 100s parallel │
│  - Automated hyperparams  │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│  Quality Gating           │
│  - Auto PSNR/SSIM check   │
│  - Lidar fidelity check   │
│  - Artifact detection     │
│  - Human review (sampled) │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│  Scene Database            │
│  - Versioned neural assets │
│  - Searchable metadata     │
│  - Scenario annotations    │
└──────────────────────────┘

Validation: Two Levels of Metrics

Neural Sim validation operates at two levels:

Level 1: Upstream Reconstruction Quality

How well does the neural scene match the original sensor data?

| Metric | What It Measures | Target |
| --- | --- | --- |
| PSNR | Pixel-level accuracy (dB) | > 27 dB |
| SSIM | Structural similarity | > 0.85 |
| LPIPS | Perceptual similarity (learned) | < 0.15 |
| Lidar Chamfer | Point cloud geometric accuracy | < 0.05 m |
| Lidar Intensity MAE | Reflectance accuracy | < 0.1 |
| FID | Distribution-level realism | < 50 |
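
These thresholds can be turned into an automated per-scene gate; a sketch using the example targets from the table (the dataclass and threshold values are illustrative):

from dataclasses import dataclass

@dataclass
class SceneQuality:
    psnr: float           # dB, higher is better
    ssim: float           # 0-1, higher is better
    lpips: float          # lower is better
    lidar_chamfer: float  # meters, lower is better

def passes_quality_gate(q: SceneQuality) -> bool:
    """Return True if a reconstructed scene meets the example thresholds above."""
    return (q.psnr > 27.0 and
            q.ssim > 0.85 and
            q.lpips < 0.15 and
            q.lidar_chamfer < 0.05)

# Example: a scene slightly below the SSIM bar is rejected
print(passes_quality_gate(SceneQuality(psnr=28.2, ssim=0.84, lpips=0.12, lidar_chamfer=0.04)))  # False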

Level 2: Downstream Perception Performance

Does the perception stack perform equally well on neural-rendered vs. real data?

                Real Data                Neural-Rendered Data
                    │                            │
                    ▼                            ▼
            ┌──────────────┐            ┌──────────────┐
            │  Perception  │            │  Perception  │
            │  Stack       │            │  Stack       │
            │  (same model)│            │  (same model)│
            └──────┬───────┘            └──────┬───────┘
                   │                            │
                   ▼                            ▼
            Detection mAP: 72.3         Detection mAP: 71.8
            Lane F1: 0.94               Lane F1: 0.93
            Tracking MOTA: 68.1         Tracking MOTA: 67.5

  Gap should be < 1-2% for the simulation to be trustworthy

This is the ultimate validation: if a perception model produces nearly identical outputs on neural-rendered data vs. real data, then the simulation is faithful enough for testing and development.


Technical Deep Dive

Scene Decomposition: Static vs Dynamic

Scene decomposition is the foundational step in AD neural rendering. The scene must be separated into static elements (which can be reconstructed once) and dynamic elements (which must be modeled separately to allow trajectory editing).

Decomposition Pipeline

Input Frame ────────────────────────────────────────────────────
     │
     ├──► 2D Instance Segmentation (Mask R-CNN, SAM, etc.)
     │         │
     │         ▼
     │    Dynamic Object Masks ──────────────────────┐
     │                                                │
     ├──► 3D Bounding Box Detection/Tracking          │
     │         │                                      │
     │         ▼                                      │
     │    Per-Object Tracks ─────────────────────┐   │
     │    (position, size, heading per frame)     │   │
     │                                            │   │
     ▼                                            ▼   ▼
  Static Reconstruction               Dynamic Object Models
  (inpaint masked regions,            (per-actor Gaussians
   train on static pixels only)        in local coordinates)

Challenges in Decomposition

  1. Shadow handling: A car's shadow is static-looking but moves with the car. Including shadows in the static model creates ghosting artifacts when the car is moved.
Original:    Car casts shadow on road
             ┌───┐
             │ C │  ░░░░ (shadow)
             └───┘
            ══════════════ road

Naive move:  Car moved, but shadow stays!
                        ┌───┐
             ░░░░       │ C │
                        └───┘
            ══════════════ road

Correct:     Shadow masked and inpainted in static scene,
             re-rendered with car in new position
  2. Occlusion hallucination: When a dynamic object is removed, the static model must fill in the region behind it. This is typically handled by:

    • Lidar-guided depth completion (we know the approximate geometry behind the car)
    • Learned inpainting (neural network fills in plausible texture)
    • Multi-frame aggregation (the occluded area may be visible in other frames)
  3. Semi-static objects: Parked cars, construction barriers -- static in the current log but potentially movable. These are often included in the static model for simplicity but can be separated if needed.

Novel View Synthesis for Ego Pose Changes

The primary use case for neural sim is closed-loop replay: the ego vehicle takes a different action, resulting in a different pose, and we need to render what the sensors would see from that new pose.

Original ego trajectory:    ● ─── ● ─── ● ─── ● ─── ●
                           t=0   t=1   t=2   t=3   t=4

Novel ego trajectory:       ● ─── ● ─── ●
                           t=0   t=1   t=2  ╲
                                              ╲
                                               ● ─── ●
                                              t=3   t=4
                                              (ego braked and
                                               turned right)

At each new pose, render:
  - 6 camera images (surround view)
  - 1 lidar sweep (full 360 degrees)
  - radar returns

View Extrapolation Limits

Neural rendering works well for interpolation (rendering from a viewpoint between training views) but poorly for extrapolation (rendering from a viewpoint far from any training view):

Training views:    ▼       ▼       ▼       ▼       ▼
                   ●───────●───────●───────●───────●

Good quality zone: ◄───────────────────────────────►
  (within ~1-2m lateral, ~5m longitudinal of training trajectory)

Degraded quality:                               ╱
                                              ╱   (> 2m lateral
                                            ╱      deviation)
                                          ●
                                     artifacts here

Practical systems limit the deviation of the simulated ego trajectory from the original logged trajectory. Typical limits:

  • Lateral: +/- 1-3 meters
  • Longitudinal: +/- 5-10 meters
  • Heading: +/- 15-30 degrees

Beyond these limits, the neural scene produces artifacts (blurring, floaters, hallucinated geometry).
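
In practice this is enforced with a deviation check against the logged trajectory before rendering; a sketch using the typical limits above (the frame convention of x forward, y left is an assumption):

import torch

def within_extrapolation_limits(novel_pose: torch.Tensor,    # (4, 4) proposed ego pose, world frame
                                logged_pose: torch.Tensor,   # (4, 4) nearest logged ego pose
                                max_lateral: float = 2.0,     # meters
                                max_longitudinal: float = 7.5,
                                max_heading_deg: float = 20.0) -> bool:
    """Check that a simulated ego pose stays close enough to the logged trajectory."""
    # Express the novel pose in the logged pose's frame (x forward, y left assumed)
    rel = torch.linalg.inv(logged_pose) @ novel_pose
    longitudinal, lateral = rel[0, 3].abs(), rel[1, 3].abs()
    heading = torch.rad2deg(torch.atan2(rel[1, 0], rel[0, 0])).abs()
    return bool(lateral < max_lateral and
                longitudinal < max_longitudinal and
                heading < max_heading_deg)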

Lidar Point Cloud Generation from Neural Scenes

Generating realistic lidar point clouds from neural scenes requires modeling the full lidar sensing pipeline:

Step 1: Ray Generation
  ┌──────────────────────────────────┐
  │ For each lidar beam (e.g., 128): │
  │   For each azimuth angle:         │
  │     Compute ray origin and dir    │
  │     Account for sensor rotation   │
  │     during sweep (~100ms for      │
  │     full 360-degree rotation)     │
  └──────────────────────────────────┘

Step 2: Ray-Scene Intersection
  ┌──────────────────────────────────┐
  │ NeRF: March along ray, accumulate│
  │       density to find depth      │
  │                                   │
  │ 3DGS: Intersect ray with nearby  │
  │       Gaussians, alpha-composite │
  │       depth values               │
  └──────────────────────────────────┘

Step 3: Sensor Modeling
  ┌──────────────────────────────────┐
  │ - Beam divergence: lidar beams   │
  │   have non-zero width (~3 mrad)  │
  │ - Intensity: depends on surface  │
  │   material and incidence angle   │
  │ - Ray dropping: some rays return │
  │   no measurement (absorption,    │
  │   out of range, specular reflect)│
  │ - Noise: range noise ~1-3 cm     │
  │ - Multi-return: some beams hit   │
  │   multiple surfaces (vegetation) │
  └──────────────────────────────────┘

Step 4: Point Cloud Assembly
  ┌──────────────────────────────────┐
  │ Assemble (x, y, z, intensity,    │
  │ ring_id, timestamp) per point    │
  │ Transform to ego frame           │
  │ Apply motion compensation        │
  └──────────────────────────────────┘
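
A minimal sketch of the Step 3 sensor modeling, applied after depth and intensity have been composited per ray (the noise level and drop model are illustrative):

import torch

def apply_lidar_sensor_model(depths: torch.Tensor,        # (N_rays,) composited range
                             intensities: torch.Tensor,   # (N_rays,) composited intensity
                             drop_prob: torch.Tensor,     # (N_rays,) predicted ray-drop probability
                             max_range: float = 120.0,
                             range_noise_std: float = 0.02):
    """Add range noise and drop rays that return no measurement."""
    noisy_depth = depths + torch.randn_like(depths) * range_noise_std
    # A ray is kept if it is in range and not dropped (one Bernoulli sample per ray)
    kept = (noisy_depth > 0) & (noisy_depth < max_range) & \
           (torch.rand_like(drop_prob) > drop_prob)
    return noisy_depth[kept], intensities[kept], kept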

Handling Sky, Road Surfaces, and Distant Geometry

Sky Modeling

Sky has no real 3D geometry -- it is infinitely far away. Naive reconstruction places Gaussians at arbitrary large distances, creating artifacts when the viewpoint changes:

Problem:                         Solution:
  * * *   (sky Gaussians at       Dedicated sky model:
  random depths cause              - Segment sky pixels
  parallax artifacts)              - Render sky with environment
        \  |  /                      map (no parallax)
         \ | /                     - Blend sky with scene at
          \|/                        boundary
         [cam]

Common approaches:

  • Environment map: Fit a learnable HDR environment map for the sky
  • Sky segmentation + separate model: Train a view-direction-only sky network
  • Infinite-distance Gaussians: Place sky Gaussians at a very large fixed distance with zero opacity gradient w.r.t. depth
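
A hedged sketch of the second option, a small view-direction-only sky network (the architecture and sizes are assumptions):

import torch
import torch.nn as nn

class SkyModel(nn.Module):
    """Maps a unit view direction to an RGB sky color (no geometry, hence no parallax)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, view_dirs: torch.Tensor) -> torch.Tensor:  # (N, 3) unit vectors
        return self.mlp(view_dirs)

# Usage: query sky color only for pixels whose rendered alpha is near zero
sky = SkyModel()
dirs = nn.functional.normalize(torch.randn(4096, 3), dim=-1)
sky_rgb = sky(dirs)   # (4096, 3)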

Road Surface Handling

Roads are problematic because:

  1. They are viewed at extreme grazing angles
  2. They cover a large portion of the image
  3. Lidar returns are sparse on flat surfaces at distance
  4. Road markings require high-frequency detail
                  Camera
                    │
                    │\
                    │ \  Extreme grazing angle
                    │  \
                    │   \
                    │    \
    ════════════════│═════\══════════════════ Road surface
                    │      \
                    │       \  Very few pixels per unit area
                    │        \  at distance

Solutions:

  • Planar constraints (AutoSplat): Force road Gaussians onto the estimated ground plane
  • Multi-resolution: Use smaller, denser Gaussians for nearby road, larger ones for distant
  • Lidar supervision: Use lidar depth as hard constraint for road surface geometry
  • Separate road model: Dedicated parameterization for the road surface

Distant Geometry

Objects at great distance (buildings on the horizon, mountains, far trees) pose challenges:

Near objects: Many views, good triangulation
  ●────────────●────────────●
  (view 1)     (view 2)     (view 3)
                 \  |  /
                  \ | /        Good reconstruction
                   \|/
                  [bldg]

Far objects: All views see nearly the same angle
  ●────●────●
  (v1) (v2) (v3)
        |
        |               Poor triangulation
        |
        |
     [far mountain]

Solutions:

  • Multi-scale representation: Coarse Gaussians for distant geometry
  • Depth regularization: Use lidar or monocular depth to constrain far geometry
  • Level-of-detail: Render distant objects at lower resolution

Rolling Shutter and Sensor-Specific Artifacts

Rolling Shutter

Most automotive cameras use rolling shutter sensors, where each row of pixels is exposed at a slightly different time:

Time ──►
                    ┌─────────────────┐
Row 0    ──────────►│  Exposed first   │
Row 1     ─────────►│                  │
Row 2      ────────►│                  │
...                 │  Each row at a   │
Row N-1        ────►│  different time  │
                    └─────────────────┘

If the car is moving at 30 m/s and exposure takes 33ms:
  Total ego motion during one frame: ~1 meter
  Different rows "see" different ego poses

To model this correctly:

  1. Compute the ego pose at the timestamp of each image row
  2. Render that row from its specific pose
  3. Assemble the full image from per-row renders (or approximate with a few groups of rows)

NeuRAD showed this is critical: neglecting rolling shutter degrades PSNR by 1-2 dB in highway scenes.

Other Sensor Artifacts

| Artifact | Sensor | How to Model |
| --- | --- | --- |
| Auto-exposure | Camera | Per-frame learnable exposure scaling |
| Lens flare | Camera | Learned post-processing or physics model |
| Chromatic aberration | Camera | Per-channel distortion model |
| Motion blur | Camera | Multi-sample temporal averaging |
| Beam divergence | Lidar | Integrate Gaussian over beam cross-section |
| Ray dropping | Lidar | Learned drop probability model |
| Multi-path reflections | Radar | Physics-based reflection model |
| Blooming | Lidar | Intensity-dependent range bias |
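
As one example from the table, per-frame auto-exposure can be modeled as a learnable affine correction applied to each rendered training image (a sketch; the parameterization is an assumption):

import torch
import torch.nn as nn

class ExposureCompensation(nn.Module):
    """Learnable per-frame exposure gain and bias, applied to rendered images during training."""
    def __init__(self, num_frames: int):
        super().__init__()
        self.log_gain = nn.Parameter(torch.zeros(num_frames, 3))  # per-channel log gain
        self.bias = nn.Parameter(torch.zeros(num_frames, 3))

    def forward(self, rendered: torch.Tensor, frame_idx: int) -> torch.Tensor:
        """rendered: (H, W, 3) image in [0, 1]; returns the exposure-corrected image."""
        gain = torch.exp(self.log_gain[frame_idx])
        return (rendered * gain + self.bias[frame_idx]).clamp(0.0, 1.0)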

Code Examples

Basic 3DGS Training Loop

import torch
import torch.nn as nn
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GaussianParams:
    """Parameters for a set of 3D Gaussians."""
    means: torch.Tensor       # (N, 3) - positions
    scales: torch.Tensor      # (N, 3) - scale in each axis (log space)
    rotations: torch.Tensor   # (N, 4) - quaternions
    opacities: torch.Tensor   # (N, 1) - sigmoid space
    sh_coeffs: torch.Tensor   # (N, K, 3) - spherical harmonics for color

    def num_gaussians(self) -> int:
        return self.means.shape[0]


def init_gaussians_from_pointcloud(
    points: torch.Tensor,    # (N, 3) from lidar
    colors: torch.Tensor,    # (N, 3) initial RGB
    device: str = "cuda"
) -> GaussianParams:
    """Initialize Gaussians from a lidar point cloud."""
    N = points.shape[0]

    # Position: directly from point cloud
    means = points.clone().to(device).requires_grad_(True)

    # Scale: small initial size, in log space
    # Use average nearest-neighbor distance as initial scale
    from pytorch3d.ops import knn_points
    dists, _, _ = knn_points(points.unsqueeze(0), points.unsqueeze(0), K=4)
    avg_dist = dists[0, :, 1:].mean(dim=-1).sqrt()  # knn_points returns squared distances; skip self-match
    scales = torch.log(avg_dist.unsqueeze(-1).repeat(1, 3)).to(device)
    scales.requires_grad_(True)

    # Rotation: identity quaternion
    rotations = torch.zeros(N, 4, device=device)
    rotations[:, 0] = 1.0  # w=1, x=y=z=0
    rotations.requires_grad_(True)

    # Opacity: moderate initial value (in logit space)
    opacities = torch.full((N, 1), 0.5, device=device)  # sigmoid(0.5) ~ 0.62
    opacities.requires_grad_(True)

    # Spherical harmonics: degree 0 initialized from colors
    # SH degree 0 coefficient = color * C0, where C0 = 0.28209479...
    C0 = 0.28209479177387814
    sh_dc = (colors / C0).unsqueeze(1).to(device)  # (N, 1, 3)
    num_sh_extra = 15  # degrees 1-3: 15 additional coefficients
    sh_rest = torch.zeros(N, num_sh_extra, 3, device=device)
    sh_coeffs = torch.cat([sh_dc, sh_rest], dim=1).requires_grad_(True)

    return GaussianParams(
        means=means,
        scales=scales,
        rotations=rotations,
        opacities=opacities,
        sh_coeffs=sh_coeffs,
    )


def build_covariance_3d(scales: torch.Tensor, rotations: torch.Tensor) -> torch.Tensor:
    """Build 3D covariance matrices from scale and rotation.

    Covariance = R @ S @ S^T @ R^T
    where S = diag(exp(scales)), R = quaternion_to_matrix(rotations)
    """
    # Convert log-scale to actual scale
    S = torch.diag_embed(torch.exp(scales))  # (N, 3, 3)

    # Quaternion to rotation matrix
    R = quaternion_to_rotation_matrix(rotations)  # (N, 3, 3)

    # Covariance = R S S^T R^T
    RS = torch.bmm(R, S)  # (N, 3, 3)
    cov = torch.bmm(RS, RS.transpose(1, 2))  # (N, 3, 3)
    return cov


def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
    """Convert quaternion (w, x, y, z) to 3x3 rotation matrix."""
    q = nn.functional.normalize(q, dim=-1)
    w, x, y, z = q.unbind(-1)

    R = torch.stack([
        1 - 2*(y*y + z*z),     2*(x*y - w*z),     2*(x*z + w*y),
            2*(x*y + w*z), 1 - 2*(x*x + z*z),     2*(y*z - w*x),
            2*(x*z - w*y),     2*(y*z + w*x), 1 - 2*(x*x + y*y),
    ], dim=-1).reshape(-1, 3, 3)

    return R


class GaussianRasterizer:
    """Simplified tile-based Gaussian rasterizer (pseudocode).

    In practice, use the CUDA implementation from the original 3DGS paper
    or libraries like gsplat, nerfstudio, or diff-gaussian-rasterization.
    """

    def __init__(self, image_width: int, image_height: int, tile_size: int = 16):
        self.W = image_width
        self.H = image_height
        self.tile_size = tile_size

    def forward(
        self,
        gaussians: GaussianParams,
        camera_intrinsics: torch.Tensor,  # (3, 3)
        camera_extrinsics: torch.Tensor,  # (4, 4) world-to-camera
        camera_direction: torch.Tensor,   # (3,) view direction for SH
    ) -> torch.Tensor:
        """Render an image from the Gaussians.

        Returns: (H, W, 3) rendered RGB image.

        NOTE: This is pseudocode. The actual implementation requires
        a custom CUDA kernel for the tile-based rasterization.
        """
        # Step 1: Transform Gaussians to camera space
        means_cam = transform_points(gaussians.means, camera_extrinsics)

        # Step 2: Project 3D covariances to 2D
        cov_3d = build_covariance_3d(gaussians.scales, gaussians.rotations)
        means_2d, cov_2d = project_gaussians(
            means_cam, cov_3d, camera_intrinsics
        )

        # Step 3: Evaluate SH to get view-dependent colors
        colors = eval_spherical_harmonics(
            gaussians.sh_coeffs, camera_direction
        )

        # Step 4: Tile-based rasterization
        # (In practice, this is a fused CUDA kernel)
        image = tile_based_rasterize(
            means_2d, cov_2d, colors,
            torch.sigmoid(gaussians.opacities),
            self.H, self.W, self.tile_size
        )

        return image


def training_loop(
    train_dataset,     # provides (image, camera_params) pairs
    lidar_points,      # initial point cloud
    lidar_colors,      # initial colors from nearest images
    num_iterations: int = 30_000,
    lr_means: float = 1.6e-4,
    lr_scales: float = 5e-3,
    lr_rotations: float = 1e-3,
    lr_opacities: float = 5e-2,
    lr_sh: float = 2.5e-3,
    densify_interval: int = 100,
    densify_start: int = 500,
    densify_stop: int = 15_000,
):
    """Main 3DGS training loop for a driving scene."""

    # Initialize Gaussians from lidar
    gaussians = init_gaussians_from_pointcloud(lidar_points, lidar_colors)
    rasterizer = GaussianRasterizer(
        image_width=1920, image_height=1080
    )

    # Separate optimizers for different parameter groups
    optimizer = torch.optim.Adam([
        {"params": [gaussians.means],     "lr": lr_means},
        {"params": [gaussians.scales],    "lr": lr_scales},
        {"params": [gaussians.rotations], "lr": lr_rotations},
        {"params": [gaussians.opacities], "lr": lr_opacities},
        {"params": [gaussians.sh_coeffs], "lr": lr_sh},
    ])

    # Learning rate scheduler: exponential decay (3DGS decays only the position LR;
    # applied to all parameter groups here for simplicity)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(
        optimizer, gamma=0.01 ** (1.0 / num_iterations)
    )

    for iteration in range(num_iterations):
        # Sample a random training view
        image_gt, camera_params = train_dataset.random_sample()

        # Render
        image_pred = rasterizer.forward(
            gaussians,
            camera_params.intrinsics,
            camera_params.extrinsics,
            camera_params.view_direction,
        )

        # Loss: L1 + D-SSIM (as in original 3DGS paper)
        l1_loss = torch.abs(image_pred - image_gt).mean()
        ssim_loss = 1.0 - compute_ssim(image_pred, image_gt)
        loss = 0.8 * l1_loss + 0.2 * ssim_loss

        # Backprop
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()

        # Adaptive density control
        if densify_start < iteration < densify_stop:
            if iteration % densify_interval == 0:
                gaussians = adaptive_density_control(
                    gaussians,
                    grad_threshold=0.0002,
                    min_opacity=0.005,
                    max_screen_size=20,
                )

        if iteration % 1000 == 0:
            print(f"Iter {iteration}: L1={l1_loss:.4f}, "
                  f"SSIM={1-ssim_loss:.4f}, "
                  f"N_gaussians={gaussians.num_gaussians()}")


def adaptive_density_control(
    gaussians: GaussianParams,
    grad_threshold: float,
    min_opacity: float,
    max_screen_size: float,
) -> GaussianParams:
    """Split, clone, and prune Gaussians based on gradients and opacity.

    Pseudocode -- actual implementation tracks accumulated gradients
    over multiple iterations.
    """
    # Accumulated position gradients (tracked externally in practice)
    grads = gaussians.means.grad.norm(dim=-1)  # (N,)
    scales = torch.exp(gaussians.scales)        # (N, 3)
    max_scale = scales.max(dim=-1).values       # (N,)
    opacities = torch.sigmoid(gaussians.opacities.squeeze())  # (N,)

    # CLONE: small Gaussians with high gradient (under-reconstruction)
    clone_mask = (grads > grad_threshold) & (max_scale < 0.01)

    # SPLIT: large Gaussians with high gradient (over-reconstruction)
    split_mask = (grads > grad_threshold) & (max_scale >= 0.01)

    # PRUNE: transparent Gaussians
    prune_mask = opacities < min_opacity

    # Apply operations (pseudocode)
    # gaussians = clone(gaussians, clone_mask)
    # gaussians = split(gaussians, split_mask)
    # gaussians = prune(gaussians, prune_mask)

    return gaussians

Novel View Rendering

def render_novel_view(
    gaussians: GaussianParams,
    novel_pose: torch.Tensor,       # (4, 4) new ego pose
    camera_calibration: dict,       # intrinsics, distortion, etc.
    original_pose: torch.Tensor,    # (4, 4) original ego pose
    dynamic_actors: list,           # list of (actor_gaussians, actor_pose)
    sky_model: nn.Module,           # environment map for sky
) -> torch.Tensor:
    """Render a camera image from a novel ego pose.

    This combines the static neural scene, dynamic actors,
    and sky model into a final rendered image.
    """
    rasterizer = GaussianRasterizer(
        image_width=camera_calibration["width"],
        image_height=camera_calibration["height"],
    )

    # Camera-to-world = ego-to-world (novel_pose) composed with the camera-to-ego calibration
    camera_to_ego = camera_calibration["extrinsics"]  # fixed calibration (camera frame -> ego frame)
    world_to_camera = torch.inverse(novel_pose @ camera_to_ego)

    # 1. Render static background
    static_rgb, static_depth, static_alpha = rasterizer.forward_with_depth(
        gaussians,
        camera_calibration["intrinsics"],
        world_to_camera,
        compute_view_direction(novel_pose, camera_calibration),
    )

    # 2. Render each dynamic actor
    composed_rgb = static_rgb.clone()
    composed_depth = static_depth.clone()

    for actor_gs, actor_pose in dynamic_actors:
        # Transform actor Gaussians to world space
        actor_world_gs = transform_gaussians(actor_gs, actor_pose)

        actor_rgb, actor_depth, actor_alpha = rasterizer.forward_with_depth(
            actor_world_gs,
            camera_calibration["intrinsics"],
            world_to_camera,
            compute_view_direction(novel_pose, camera_calibration),
        )

        # Composite: actor in front where closer
        actor_closer = actor_depth < composed_depth
        mask = actor_closer & (actor_alpha > 0.5)
        composed_rgb[mask] = actor_rgb[mask]
        composed_depth[mask] = actor_depth[mask]

    # 3. Fill sky regions
    sky_mask = static_alpha < 0.1  # transparent = sky
    if sky_mask.any():
        view_dirs = compute_pixel_directions(
            camera_calibration, novel_pose
        )
        sky_color = sky_model(view_dirs[sky_mask])
        composed_rgb[sky_mask] = sky_color

    # 4. Apply camera effects
    composed_rgb = apply_camera_effects(
        composed_rgb,
        camera_calibration,
        effects=["lens_distortion", "vignetting", "auto_exposure"],
    )

    return composed_rgb


def render_with_rolling_shutter(
    gaussians: GaussianParams,
    ego_poses_interp: callable,  # function: timestamp -> (4, 4) pose
    camera_calibration: dict,
    frame_timestamp: float,
    readout_time: float = 0.033,  # 33ms typical rolling shutter
    num_row_groups: int = 8,      # approximate with 8 sub-renders
) -> torch.Tensor:
    """Render with rolling shutter simulation.

    Each group of rows is rendered from a slightly different ego pose,
    corresponding to the pose at that row's exposure timestamp.
    """
    H = camera_calibration["height"]
    W = camera_calibration["width"]
    rows_per_group = H // num_row_groups

    full_image = torch.zeros(H, W, 3, device="cuda")

    for g in range(num_row_groups):
        row_start = g * rows_per_group
        row_end = min((g + 1) * rows_per_group, H)

        # Timestamp for this row group
        row_fraction = (row_start + row_end) / (2 * H)
        row_timestamp = frame_timestamp + row_fraction * readout_time

        # Ego pose at this timestamp
        ego_pose = ego_poses_interp(row_timestamp)

        # Render full image from this pose
        rendered = render_novel_view(
            gaussians, ego_pose, camera_calibration, ...
        )

        # Take only the relevant rows
        full_image[row_start:row_end] = rendered[row_start:row_end]

    return full_image

Lidar Ray Casting Through a Neural Scene

import torch
import torch.nn as nn
import numpy as np
from typing import Tuple, Optional


def generate_lidar_rays(
    lidar_pose: torch.Tensor,         # (4, 4) lidar pose in world frame
    lidar_config: dict,               # sensor configuration
    sweep_duration: float = 0.1,      # 100ms for full rotation
    ego_poses_interp: Optional[callable] = None,  # for motion compensation
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Generate lidar ray origins and directions for one full sweep.

    Returns:
        origins: (N_rays, 3) ray origins in world frame
        directions: (N_rays, 3) ray directions in world frame
    """
    num_beams = lidar_config["num_beams"]        # e.g., 128
    beam_angles = lidar_config["beam_angles"]    # vertical angles per beam
    azimuth_resolution = lidar_config["azimuth_resolution"]  # e.g., 0.1 deg
    fov_azimuth = lidar_config.get("fov_azimuth", 360.0)

    num_azimuths = int(fov_azimuth / azimuth_resolution)

    # Build ray directions in lidar frame
    # Exclude the endpoint so a 360-degree sweep does not duplicate the 0-degree column
    azimuths = torch.arange(num_azimuths, dtype=torch.float32) * azimuth_resolution * np.pi / 180
    elevations = torch.tensor(beam_angles) * np.pi / 180

    # Create grid of (azimuth, elevation) pairs
    az_grid, el_grid = torch.meshgrid(azimuths, elevations, indexing="ij")
    az_flat = az_grid.flatten()
    el_flat = el_grid.flatten()

    # Spherical to Cartesian
    dirs_lidar = torch.stack([
        torch.cos(el_flat) * torch.cos(az_flat),
        torch.cos(el_flat) * torch.sin(az_flat),
        torch.sin(el_flat),
    ], dim=-1)  # (N_rays, 3)

    N_rays = dirs_lidar.shape[0]

    if ego_poses_interp is not None:
        # Motion-compensated: each azimuth has a different pose
        origins = torch.zeros(N_rays, 3, device="cuda")
        directions = torch.zeros(N_rays, 3, device="cuda")

        for az_idx in range(num_azimuths):
            t_fraction = az_idx / num_azimuths
            timestamp = t_fraction * sweep_duration
            pose_at_t = ego_poses_interp(timestamp)
            lidar_world_at_t = pose_at_t @ lidar_config["lidar_to_ego"]

            beam_slice = slice(az_idx * num_beams, (az_idx + 1) * num_beams)
            origins[beam_slice] = lidar_world_at_t[:3, 3].unsqueeze(0)
            directions[beam_slice] = (
                lidar_world_at_t[:3, :3] @ dirs_lidar[beam_slice].T
            ).T
    else:
        # Simple: all rays from a single pose
        origins = lidar_pose[:3, 3].unsqueeze(0).expand(N_rays, -1)
        directions = (lidar_pose[:3, :3] @ dirs_lidar.T).T

    directions = directions / directions.norm(dim=-1, keepdim=True)
    return origins.cuda(), directions.cuda()


def lidar_ray_cast_gaussians(
    origins: torch.Tensor,        # (N_rays, 3)
    directions: torch.Tensor,     # (N_rays, 3)
    gaussians: GaussianParams,
    max_range: float = 120.0,     # max lidar range in meters
    beam_divergence: float = 3e-3, # ~3 mrad typical
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Cast lidar rays through a Gaussian scene.

    Returns:
        depths: (N_rays,) depth per ray (0 if no hit)
        intensities: (N_rays,) intensity per ray
        hit_mask: (N_rays,) bool, True if ray returned a measurement
    """
    N_rays = origins.shape[0]
    N_gaussians = gaussians.num_gaussians()

    # For each ray, find nearby Gaussians (spatial hashing or BVH in practice)
    # Here we use a simplified brute-force approach for clarity
    means = gaussians.means  # (N_gs, 3)
    scales = torch.exp(gaussians.scales)  # (N_gs, 3)
    max_extent = scales.max(dim=-1).values * 3  # 3-sigma cutoff

    depths = torch.zeros(N_rays, device="cuda")
    intensities = torch.zeros(N_rays, device="cuda")
    hit_mask = torch.zeros(N_rays, dtype=torch.bool, device="cuda")

    # Process in chunks to manage memory
    chunk_size = 4096
    for start in range(0, N_rays, chunk_size):
        end = min(start + chunk_size, N_rays)
        ray_o = origins[start:end]      # (C, 3)
        ray_d = directions[start:end]   # (C, 3)
        C = ray_o.shape[0]

        # Compute ray-Gaussian distances
        # For each ray and Gaussian, find closest point on ray to Gaussian mean
        diff = means.unsqueeze(0) - ray_o.unsqueeze(1)  # (C, N_gs, 3)
        t_closest = (diff * ray_d.unsqueeze(1)).sum(-1)  # (C, N_gs)
        t_closest = t_closest.clamp(min=0, max=max_range)

        closest_point = ray_o.unsqueeze(1) + t_closest.unsqueeze(-1) * ray_d.unsqueeze(1)
        dist_to_mean = (closest_point - means.unsqueeze(0)).norm(dim=-1)  # (C, N_gs)

        # Filter: only consider Gaussians within their extent
        within_range = dist_to_mean < max_extent.unsqueeze(0)

        # For qualifying Gaussians, compute an alpha weight along the ray.
        # A full implementation would build each Gaussian's 3D covariance
        # (build_covariance_3d from scales + rotations) and evaluate the
        # Mahalanobis distance from the ray to the Gaussian; this simplified
        # version uses an isotropic falloff based on the largest per-axis scale.
        opacities = torch.sigmoid(gaussians.opacities.squeeze())
        gaussian_weight = torch.exp(-0.5 * (dist_to_mean ** 2) /
                                     (scales.max(dim=-1).values.unsqueeze(0) ** 2))
        gaussian_weight = gaussian_weight * opacities.unsqueeze(0)
        gaussian_weight[~within_range] = 0

        # Alpha compositing along depth-sorted Gaussians
        # Sort by t_closest for each ray
        sorted_t, sort_idx = t_closest.sort(dim=-1)
        sorted_weights = gaussian_weight.gather(1, sort_idx)

        # Compute transmittance and alpha
        alpha = sorted_weights.clamp(0, 0.99)
        transmittance = torch.cumprod(1 - alpha + 1e-10, dim=-1)
        transmittance = torch.cat([
            torch.ones(C, 1, device="cuda"),
            transmittance[:, :-1]
        ], dim=-1)

        weights = alpha * transmittance  # (C, N_gs)

        # Accumulated depth
        ray_depth = (weights * sorted_t).sum(dim=-1)  # (C,)
        total_weight = weights.sum(dim=-1)

        # Hit detection
        ray_hit = total_weight > 0.5  # sufficient accumulated opacity

        depths[start:end] = ray_depth
        hit_mask[start:end] = ray_hit & (ray_depth > 0.5) & (ray_depth < max_range)

        # Intensity (simplified: based on normal and distance)
        intensities[start:end] = estimate_lidar_intensity(
            ray_depth, ray_d, gaussians, sort_idx, weights
        )

    return depths, intensities, hit_mask


def simulate_ray_dropping(
    depths: torch.Tensor,
    intensities: torch.Tensor,
    hit_mask: torch.Tensor,
    drop_model: Optional[nn.Module] = None,
) -> torch.Tensor:
    """Simulate realistic lidar ray dropping.

    Real lidars drop rays due to:
    - Specular reflections (puddles, glass)
    - Out-of-range returns
    - Dark surfaces (low reflectance)
    - Atmospheric effects (rain, fog, dust)
    """
    if drop_model is not None:
        # Learned ray drop model
        features = torch.stack([depths, intensities], dim=-1)
        drop_prob = drop_model(features).squeeze(-1)
        keep = torch.bernoulli(1 - drop_prob).bool()
    else:
        # Simple heuristic model
        # Higher drop probability at long range and low intensity
        range_factor = (depths / 120.0).clamp(0, 1)
        intensity_factor = (1 - intensities).clamp(0, 1)
        drop_prob = 0.05 + 0.1 * range_factor + 0.1 * intensity_factor
        keep = torch.bernoulli(1 - drop_prob).bool()

    final_mask = hit_mask & keep
    return final_mask
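
Taken together, the pieces above can be chained into a single simulated sweep. The sketch below is illustrative only: the beam angles, azimuth resolution, and extrinsic are placeholder values (not a real sensor spec), and `gaussians` is assumed to be the trained scene from the reconstruction step.

# Illustrative call chain for one simulated lidar sweep.
# All sensor configuration values below are placeholders, not a vendor spec.
lidar_config = {
    "num_beams": 128,
    "beam_angles": np.linspace(-25.0, 15.0, 128).tolist(),  # placeholder vertical angles
    "azimuth_resolution": 0.2,                               # degrees per column
    "lidar_to_ego": torch.eye(4),                            # placeholder extrinsic
}
lidar_pose = torch.eye(4)  # lidar at the world origin for this sketch

origins, directions = generate_lidar_rays(lidar_pose, lidar_config)
depths, intensities, hit_mask = lidar_ray_cast_gaussians(
    origins, directions, gaussians,  # `gaussians` = trained scene parameters
)
final_mask = simulate_ray_dropping(depths, intensities, hit_mask)

# Convert the surviving rays into a point cloud in world coordinates
points = origins[final_mask] + depths[final_mask].unsqueeze(-1) * directions[final_mask]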

Quality Metric Computation

import torch
import torch.nn.functional as F
from torchmetrics.image import (
    PeakSignalNoiseRatio,
    StructuralSimilarityIndexMeasure,
)
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity


class NeuralSimMetrics:
    """Compute reconstruction quality metrics for neural sim validation.

    Two levels of metrics:
      1. Upstream: how well does the reconstruction match ground truth?
      2. Downstream: does the perception stack perform the same?
    """

    def __init__(self, device: str = "cuda"):
        self.device = device

        # Image quality metrics
        self.psnr = PeakSignalNoiseRatio(data_range=1.0).to(device)
        self.ssim = StructuralSimilarityIndexMeasure(data_range=1.0).to(device)
        self.lpips = LearnedPerceptualImagePatchSimilarity(
            net_type="alex",  # AlexNet backbone
            normalize=True,
        ).to(device)

    def compute_image_metrics(
        self,
        rendered: torch.Tensor,   # (B, 3, H, W) predicted images [0, 1]
        ground_truth: torch.Tensor,  # (B, 3, H, W) real images [0, 1]
    ) -> dict:
        """Compute upstream image quality metrics."""
        metrics = {}

        # PSNR: Peak Signal-to-Noise Ratio (higher is better)
        # Measures pixel-level accuracy. PSNR > 27 dB is typically good.
        metrics["psnr"] = self.psnr(rendered, ground_truth).item()

        # SSIM: Structural Similarity Index (higher is better, max 1.0)
        # Measures structural patterns. SSIM > 0.85 is typically good.
        metrics["ssim"] = self.ssim(rendered, ground_truth).item()

        # LPIPS: Learned Perceptual Image Patch Similarity (lower is better)
        # Uses deep features to measure perceptual quality. LPIPS < 0.15 is good.
        metrics["lpips"] = self.lpips(rendered, ground_truth).item()

        return metrics

    def compute_lidar_metrics(
        self,
        rendered_points: torch.Tensor,    # (N, 3) rendered point cloud
        ground_truth_points: torch.Tensor,  # (M, 3) real point cloud
        rendered_intensity: torch.Tensor,   # (N,) rendered intensity
        gt_intensity: torch.Tensor,         # (M,) real intensity
    ) -> dict:
        """Compute upstream lidar quality metrics."""
        metrics = {}

        # Chamfer Distance: average bidirectional nearest-neighbor distance
        # Lower is better. < 0.05m is typically good.
        from pytorch3d.loss import chamfer_distance
        cd, _ = chamfer_distance(
            rendered_points.unsqueeze(0),
            ground_truth_points.unsqueeze(0),
        )
        metrics["chamfer_distance_m"] = cd.item()

        # Median Absolute Depth Error
        # Ideally, match rendered and GT points by lidar beam ID and compare
        # per-beam ranges; simplified here by truncating both clouds to a
        # common length and comparing range statistics.
        n_common = min(rendered_points.shape[0], ground_truth_points.shape[0])
        metrics["median_depth_error_m"] = torch.median(
            torch.abs(rendered_points[:n_common].norm(dim=-1) -
                      ground_truth_points[:n_common].norm(dim=-1))
        ).item()

        # Intensity MAE
        min_len = min(len(rendered_intensity), len(gt_intensity))
        metrics["intensity_mae"] = torch.abs(
            rendered_intensity[:min_len] - gt_intensity[:min_len]
        ).mean().item()

        return metrics

    def compute_downstream_metrics(
        self,
        perception_model: torch.nn.Module,
        rendered_data: dict,       # sensor data from neural sim
        real_data: dict,           # real sensor data
        ground_truth_labels: dict, # 3D bounding boxes, lanes, etc.
    ) -> dict:
        """Compute downstream perception metrics.

        The key question: does the perception stack produce the same
        outputs on neural-rendered data vs. real data?
        """
        metrics = {}

        # Run perception on real data
        with torch.no_grad():
            real_detections = perception_model(real_data)
            rendered_detections = perception_model(rendered_data)

        # Detection mAP on real vs rendered
        real_map = compute_detection_map(
            real_detections, ground_truth_labels
        )
        rendered_map = compute_detection_map(
            rendered_detections, ground_truth_labels
        )

        metrics["real_detection_map"] = real_map
        metrics["rendered_detection_map"] = rendered_map
        metrics["detection_map_gap"] = abs(real_map - rendered_map)

        # The gap should be small (< 1-2%) for trustworthy simulation
        metrics["sim_trustworthy"] = metrics["detection_map_gap"] < 0.02

        return metrics


def compute_psnr_manual(
    rendered: torch.Tensor,
    ground_truth: torch.Tensor,
    max_val: float = 1.0,
) -> float:
    """Manual PSNR computation for understanding.

    PSNR = 10 * log10(MAX^2 / MSE)
         = 20 * log10(MAX / RMSE)

    Higher PSNR = less error = better reconstruction.
    Typical values for neural rendering:
      - 25-28 dB: decent quality
      - 28-32 dB: good quality
      - 32+ dB:   excellent quality
    """
    mse = F.mse_loss(rendered, ground_truth)
    if mse == 0:
        return float("inf")
    psnr = 20 * torch.log10(torch.tensor(max_val)) - 10 * torch.log10(mse)
    return psnr.item()


def compute_ssim(
    img1: torch.Tensor,  # (B, C, H, W)
    img2: torch.Tensor,  # (B, C, H, W)
    window_size: int = 11,
    C1: float = 0.01 ** 2,
    C2: float = 0.03 ** 2,
) -> torch.Tensor:
    """Structural Similarity Index (simplified).

    SSIM compares luminance, contrast, and structure:
      SSIM(x, y) = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) /
                   (mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2)

    Values lie in [-1, 1] in theory (typically near [0, 1] in practice).
    Higher = more similar.
    """
    # Create Gaussian window
    coords = torch.arange(window_size, dtype=torch.float32) - window_size // 2
    gauss = torch.exp(-coords ** 2 / (2 * 1.5 ** 2))
    gauss = gauss / gauss.sum()
    window = gauss.unsqueeze(0) * gauss.unsqueeze(1)
    window = window.unsqueeze(0).unsqueeze(0)  # (1, 1, K, K)
    window = window.expand(img1.shape[1], -1, -1, -1).to(img1.device)

    mu1 = F.conv2d(img1, window, groups=img1.shape[1], padding=window_size // 2)
    mu2 = F.conv2d(img2, window, groups=img2.shape[1], padding=window_size // 2)

    mu1_sq = mu1 ** 2
    mu2_sq = mu2 ** 2
    mu1_mu2 = mu1 * mu2

    sigma1_sq = F.conv2d(img1 * img1, window, groups=img1.shape[1],
                         padding=window_size // 2) - mu1_sq
    sigma2_sq = F.conv2d(img2 * img2, window, groups=img2.shape[1],
                         padding=window_size // 2) - mu2_sq
    sigma12 = F.conv2d(img1 * img2, window, groups=img1.shape[1],
                       padding=window_size // 2) - mu1_mu2

    ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) / \
               ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))

    return ssim_map.mean()
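
A quick smoke test of the metric helpers might look like the following; the tensors are random stand-ins for rendered and real frames (and a CUDA device is assumed, matching the class default), so the absolute numbers are meaningless and only the call pattern matters.

# Smoke test of the metric helpers on random stand-in images.
metrics_engine = NeuralSimMetrics(device="cuda")
rendered = torch.rand(4, 3, 256, 512, device="cuda")       # stand-in renders
ground_truth = torch.rand(4, 3, 256, 512, device="cuda")   # stand-in real frames

print(metrics_engine.compute_image_metrics(rendered, ground_truth))
print("manual PSNR:", compute_psnr_manual(rendered, ground_truth))
print("manual SSIM:", compute_ssim(rendered, ground_truth).item())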

Mental Models & Diagrams

Neural Sim Pipeline (End-to-End)

┌─────────────────────────────────────────────────────────────────────────┐
│                    NEURAL SIM: END-TO-END PIPELINE                       │
│                                                                          │
│  PHASE 1: DATA COLLECTION                                                │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  Fleet Vehicle                                                      │  │
│  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                    │  │
│  │  │Cam 0 │ │Cam 1 │ │Cam 2 │ │LiDAR │ │ IMU/ │                    │  │
│  │  │Front │ │Left  │ │Right │ │360   │ │ GPS  │                    │  │
│  │  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘                    │  │
│  │     └────────┴────────┴────────┴────────┘                          │  │
│  │                      │                                              │  │
│  │                 Drive Log                                           │  │
│  │          (images, points, poses, timestamps)                       │  │
│  └────────────────────────┬───────────────────────────────────────────┘  │
│                           │                                              │
│  PHASE 2: RECONSTRUCTION  ▼                                              │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                                                                      │  │
│  │  ┌────────────┐  ┌──────────────┐  ┌─────────────────────────┐    │  │
│  │  │Pose Refine │  │Dynamic Mask  │  │Sky Segmentation         │    │  │
│  │  │(SfM/SLAM)  │  │(Track+Seg)   │  │(Semantic Seg)           │    │  │
│  │  └─────┬──────┘  └──────┬───────┘  └────────────┬────────────┘    │  │
│  │        │                │                        │                 │  │
│  │        └────────────────┼────────────────────────┘                 │  │
│  │                         ▼                                          │  │
│  │           ┌─────────────────────────────┐                          │  │
│  │           │  Gaussian Splatting Training │                          │  │
│  │           │  (static background)         │                          │  │
│  │           │  + Actor Model Training      │                          │  │
│  │           │  + Sky Model Fitting         │                          │  │
│  │           └──────────────┬──────────────┘                          │  │
│  │                          │                                          │  │
│  │                    Neural Scene                                     │  │
│  └──────────────────────────┬─────────────────────────────────────────┘  │
│                             │                                            │
│  PHASE 3: SIMULATION        ▼                                            │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                                                                      │  │
│  │  Scenario Definition:                                                │  │
│  │  "Ego brakes 0.5s later; lead vehicle cuts in from left"            │  │
│  │        │                                                             │  │
│  │        ▼                                                             │  │
│  │  ┌──────────┐    ┌──────────┐    ┌──────────┐                      │  │
│  │  │ Update   │───►│ Render   │───►│ Run      │──┐                   │  │
│  │  │ Scene    │    │ Sensors  │    │ Percep.  │  │                   │  │
│  │  │ State    │    │ (cam+lid)│    │ Stack    │  │                   │  │
│  │  └──────────┘    └──────────┘    └──────────┘  │                   │  │
│  │       ▲                                         │                   │  │
│  │       │          ┌──────────┐    ┌──────────┐  │                   │  │
│  │       └──────────┤ Update   │◄───┤ Run      │◄─┘                   │  │
│  │                  │ Actors   │    │ Planner  │                       │  │
│  │                  └──────────┘    └──────────┘                       │  │
│  │                                                                      │  │
│  │  Output: Planner decisions, safety metrics, perception KPIs         │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  PHASE 4: VALIDATION                                                     │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  Upstream:  PSNR=29.1 dB | SSIM=0.88 | LPIPS=0.11 | Chamfer=0.03m│  │
│  │  Downstream: det mAP gap=0.8% | lane F1 gap=0.5% | PASS          │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

NeRF vs Gaussian Splatting Rendering Comparison

┌─────────────────────────────────────────────────────────────────────────┐
│                    RENDERING APPROACH COMPARISON                          │
│                                                                          │
│  NeRF (Backward / Ray Marching):                                        │
│                                                                          │
│    For each PIXEL:                                                       │
│      Cast a ray through the scene                                       │
│      Sample N points along the ray (e.g., 64 + 128 = 192)              │
│      For each sample: query MLP --> (color, density)                    │
│      Accumulate via volume rendering                                    │
│                                                                          │
│    Camera         Scene (implicit MLP)                                  │
│      │                                                                   │
│      │  ray    ●──●──●──●──●──●──●  (sample points)                    │
│      ├────────►│  │  │  │  │  │  │  Each ● = MLP forward pass          │
│      │         ●──●──●──●──●──●──●                                      │
│      │  ray    │  │  │  │  │  │  │                                      │
│      ├────────►●──●──●──●──●──●──●                                      │
│      │                                                                   │
│    Cost: H x W x N_samples x MLP_cost                                  │
│    For 1920x1080: ~400M MLP evaluations per frame                      │
│    Result: 0.1 - 5 FPS                                                  │
│                                                                          │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  │
│                                                                          │
│  3DGS (Forward / Splatting):                                            │
│                                                                          │
│    For each GAUSSIAN:                                                    │
│      Project onto image plane (one matrix multiply)                     │
│      Determine which tiles it overlaps                                  │
│    Sort all Gaussians by depth                                          │
│    For each TILE (16x16 pixels):                                        │
│      Alpha-composite sorted Gaussians                                   │
│                                                                          │
│    Gaussians          Camera / Image                                    │
│      ●                ┌─────┬─────┐                                     │
│        ●    project   │░░░░░│     │  ░ = contributions                  │
│      ●   ──────────►  │░░░░░│░░   │      from projected                │
│        ●              ├─────┼─────┤      Gaussians                      │
│      ●                │     │░░░░░│                                      │
│                       └─────┴─────┘                                     │
│                       (tile-based compositing)                           │
│                                                                          │
│    Cost: N_gaussians x (project + sort + composite)                    │
│    Highly parallelizable on GPU (one thread block per tile)             │
│    Result: 100 - 300 FPS                                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
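
To make the cost gap concrete, a quick back-of-the-envelope calculation (the Gaussian count is an illustrative figure for an urban scene, not a measurement):

# Back-of-the-envelope per-frame cost comparison (illustrative numbers).
H, W = 1080, 1920
nerf_samples_per_ray = 192                 # e.g., 64 coarse + 128 fine samples
nerf_mlp_evals = H * W * nerf_samples_per_ray
print(f"NeRF: {nerf_mlp_evals / 1e6:.0f}M MLP evaluations per frame")   # ~398M

n_gaussians = 3_000_000                    # illustrative count for an urban scene
print(f"3DGS: {n_gaussians / 1e6:.1f}M projections + one depth sort per frame")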

Scene Decomposition Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                    SCENE DECOMPOSITION FOR AD                            │
│                                                                          │
│  Input: Single Drive Log Sequence (10-20 seconds)                       │
│                                                                          │
│  ┌───────────────────────────────────────────────────────────────┐      │
│  │  Frame t=0        Frame t=1        Frame t=2                  │      │
│  │  ┌──────────┐    ┌──────────┐    ┌──────────┐               │      │
│  │  │ road     │    │ road     │    │ road     │               │      │
│  │  │  ┌──┐   │    │   ┌──┐  │    │    ┌──┐  │               │      │
│  │  │  │A │   │    │   │A │  │    │    │A │  │  (A = car)     │      │
│  │  │  └──┘   │    │   └──┘  │    │    └──┘  │               │      │
│  │  │ bldg    │    │ bldg    │    │ bldg    │               │      │
│  │  └──────────┘    └──────────┘    └──────────┘               │      │
│  └───────────────────────────────────────────────────────────────┘      │
│         │                                                                │
│         ▼                                                                │
│  ┌────────────────────────────────────────────────┐                     │
│  │          DETECTION + TRACKING                    │                     │
│  │  3D bounding boxes per actor per frame          │                     │
│  │  Actor A: [(x0,y0,z0,w,h,l,yaw), ...]         │                     │
│  └────────────────────┬───────────────────────────┘                     │
│                       │                                                  │
│         ┌─────────────┴──────────────┐                                  │
│         ▼                            ▼                                   │
│  ┌──────────────────┐    ┌───────────────────────┐                      │
│  │  STATIC LAYER     │    │  DYNAMIC LAYER         │                      │
│  │                    │    │                         │                      │
│  │  All pixels NOT    │    │  Per-actor cropped      │                      │
│  │  belonging to      │    │  observations:          │                      │
│  │  tracked actors    │    │                         │                      │
│  │                    │    │  ┌──┐  ┌──┐  ┌──┐     │                      │
│  │  ┌──────────┐     │    │  │A │  │A │  │A │     │                      │
│  │  │ road     │     │    │  │t0│  │t1│  │t2│     │                      │
│  │  │  ████   │     │    │  └──┘  └──┘  └──┘     │                      │
│  │  │ (hole   │     │    │                         │                      │
│  │  │  filled │     │    │  Train per-actor model  │                      │
│  │  │  by     │     │    │  in actor-local coords  │                      │
│  │  │  inpaint│     │    │                         │                      │
│  │  │  + lidar│     │    │  OR match to PBR asset  │                      │
│  │  │  depth) │     │    │  from library           │                      │
│  │  │ bldg    │     │    │                         │                      │
│  │  └──────────┘     │    └───────────────────────┘                      │
│  │                    │                                                    │
│  │  Train 3DGS on    │                                                    │
│  │  masked frames    │                                                    │
│  └──────────────────┘                                                    │
│                                                                          │
│         ┌─────────────┴──────────────┐                                  │
│         ▼                            ▼                                   │
│  AT SIMULATION TIME:                                                     │
│  ┌──────────────────────────────────────────────────┐                   │
│  │                                                    │                   │
│  │  Static BG (fixed)  +  Actor A at new_pose_A      │                   │
│  │                     +  Actor B at new_pose_B      │                   │
│  │                     +  New Actor C (from library) │                   │
│  │                     +  Sky dome                    │                   │
│  │                     =  Composed Scene Render      │                   │
│  │                                                    │                   │
│  └──────────────────────────────────────────────────┘                   │
└─────────────────────────────────────────────────────────────────────────┘
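
An alternative to the per-actor compositing passes in render_novel_view is to concatenate the static background with the world-space actor Gaussians and rasterize everything in a single pass (as SplatAD-style systems do). A minimal sketch, assuming GaussianParams exposes means, scales, rotations, colors, and opacities tensors (field names are an assumption based on the earlier snippets) and reusing the transform_gaussians helper from above:

# Sketch: single-pass scene composition by concatenating Gaussian sets.
# The GaussianParams field names used here are assumptions, not a fixed API.
def compose_scene(static_gs, dynamic_actors):
    """dynamic_actors: list of (actor_gaussians, (4, 4) actor-to-world pose)."""
    parts = [static_gs]
    for actor_gs, actor_pose in dynamic_actors:
        parts.append(transform_gaussians(actor_gs, actor_pose))  # same helper as above

    return GaussianParams(
        means=torch.cat([p.means for p in parts], dim=0),
        scales=torch.cat([p.scales for p in parts], dim=0),
        rotations=torch.cat([p.rotations for p in parts], dim=0),
        colors=torch.cat([p.colors for p in parts], dim=0),
        opacities=torch.cat([p.opacities for p in parts], dim=0),
    )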

Hands-On Exercises

Exercise 1: Implement a Minimal 2D Gaussian Splatting Renderer

Goal: Build intuition for how Gaussian splatting works by implementing a 2D version from scratch.

Task:

  1. Create a set of 2D Gaussians (position, scale, rotation, color, opacity)
  2. Implement the forward rendering pass: project, sort by depth, alpha-composite
  3. Implement a training loop that optimizes Gaussian parameters to match a target image
  4. Visualize the optimization process
# Starter code
import torch
import matplotlib.pyplot as plt

class Gaussian2D:
    def __init__(self, n_gaussians=1000, image_size=256):
        self.means = torch.randn(n_gaussians, 2) * image_size / 4 + image_size / 2
        self.means.requires_grad_(True)
        self.scales = torch.full((n_gaussians, 2), 3.0).requires_grad_(True)
        self.colors = torch.rand(n_gaussians, 3).requires_grad_(True)
        self.opacities = torch.zeros(n_gaussians, 1).requires_grad_(True)
        # TODO: implement render() and training loop

# Expected outcome: reproduce a target image using ~5000 Gaussians
# in under 1000 optimization steps

What you will learn: The core splatting algorithm, alpha compositing, gradient-based optimization of explicit primitives.


Exercise 2: Compare NeRF vs 3DGS on a Driving Scene

Goal: Understand the practical trade-offs by training both representations on the same data.

Task:

  1. Use the nerfstudio framework (supports both NeRF and 3DGS)
  2. Download a driving scene from the nuScenes mini dataset
  3. Train both nerfacto (NeRF variant) and splatfacto (3DGS variant)
  4. Compare: training time, render speed (FPS), PSNR, SSIM, LPIPS
  5. Try rendering from novel viewpoints 1m, 2m, 5m off the original trajectory
# Setup
pip install nerfstudio
ns-install-cli

# Process nuScenes data
ns-process-data nuscenes --data /path/to/nuscenes-mini --output-dir data/nuscenes

# Train NeRF
ns-train nerfacto --data data/nuscenes --experiment-name nerf_driving

# Train 3DGS
ns-train splatfacto --data data/nuscenes --experiment-name gs_driving

# Compare metrics
ns-eval --load-config outputs/nerf_driving/config.yml
ns-eval --load-config outputs/gs_driving/config.yml

What you will learn: First-hand experience with training and rendering speed differences, quality trade-offs, and extrapolation behavior.


Exercise 3: Implement Scene Decomposition

Goal: Separate a driving sequence into static background and dynamic actors.

Task:

  1. Take a sequence of driving images with 2D bounding box labels
  2. Create binary masks for dynamic objects in each frame
  3. Inpaint the masked regions using OpenCV or a diffusion model
  4. Train 3DGS on the masked (static-only) images
  5. Compare the static reconstruction with and without masking
import cv2
import numpy as np

def create_dynamic_mask(image, bounding_boxes, expansion_px=10):
    """Create a binary mask of dynamic objects.

    Args:
        image: (H, W, 3) input image
        bounding_boxes: list of (x1, y1, x2, y2) for each dynamic object
        expansion_px: expand each box by this many pixels to cover shadows
    """
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for (x1, y1, x2, y2) in bounding_boxes:
        x1 = max(0, x1 - expansion_px)
        y1 = max(0, y1 - expansion_px)
        x2 = min(image.shape[1], x2 + expansion_px)
        y2 = min(image.shape[0], y2 + expansion_px)
        mask[y1:y2, x1:x2] = 255
    return mask

def inpaint_static(image, mask):
    """Inpaint dynamic object regions for static scene training."""
    return cv2.inpaint(image, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)

# TODO: process entire sequence, then train 3DGS on inpainted images
# (one possible processing loop is sketched below)

What you will learn: Why decomposition matters, how masking quality affects reconstruction, the challenge of filling in occluded regions.
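
As a hint for steps 1-3, one possible sequence-processing loop is sketched below; the file layout and label format are hypothetical.

import cv2
import glob
import os

def preprocess_sequence(image_dir, boxes_per_frame, output_dir):
    """Mask and inpaint every frame of a sequence before static 3DGS training.

    boxes_per_frame: dict mapping filename -> list of integer (x1, y1, x2, y2).
    """
    os.makedirs(output_dir, exist_ok=True)
    for path in sorted(glob.glob(os.path.join(image_dir, "*.jpg"))):
        image = cv2.imread(path)
        boxes = boxes_per_frame.get(os.path.basename(path), [])
        mask = create_dynamic_mask(image, boxes)
        static_only = inpaint_static(image, mask)
        cv2.imwrite(os.path.join(output_dir, os.path.basename(path)), static_only)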


Exercise 4: Build a Lidar Simulator from a Gaussian Scene

Goal: Generate synthetic lidar point clouds from a trained 3DGS model.

Task:

  1. Train a 3DGS model on a scene (can reuse from Exercise 2)
  2. Implement lidar ray generation for a Velodyne VLP-128 configuration
  3. Intersect rays with the Gaussian scene to produce depth and intensity
  4. Compare generated point cloud with real lidar ground truth
  5. Implement a basic ray drop model

Expected output: Side-by-side visualization of real vs. neural-simulated lidar point clouds, with Chamfer distance < 0.1m.

What you will learn: How lidar simulation works with Gaussians, the importance of sensor modeling (beam divergence, ray dropping), and where the current quality limits are.


Exercise 5: Closed-Loop Replay with Neural Rendering

Goal: Implement a minimal closed-loop replay system where the ego trajectory is modified and sensor data is re-rendered.

Task:

  1. Take a trained neural scene from Exercise 2
  2. Define an alternative ego trajectory (e.g., lateral offset of 1m)
  3. Render camera images from the new trajectory
  4. Run a pretrained object detector on both original and re-rendered images (a stand-in detector is sketched after this exercise)
  5. Compare detection outputs (are the same objects detected?)
def modify_ego_trajectory(
    original_poses: list,       # list of (4, 4) poses
    lateral_offset_m: float,    # how far to shift laterally
) -> list:
    """Create a modified ego trajectory with a lateral offset."""
    modified = []
    for pose in original_poses:
        new_pose = pose.clone()
        # Shift in the vehicle's lateral direction (y-axis in ego frame)
        lateral_dir = pose[:3, 1]  # second column = y-axis
        new_pose[:3, 3] += lateral_offset_m * lateral_dir
        modified.append(new_pose)
    return modified

# TODO: render from modified trajectory, run detector, compare

What you will learn: The complete closed-loop neural sim pipeline, how rendering quality degrades with trajectory deviation, and the practical limits of novel view synthesis.
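
For step 4, a pretrained torchvision detector can serve as the stand-in perception stack. This is a minimal sketch; the detector choice and score threshold are arbitrary.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in perception stack for comparing original vs. re-rendered frames.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect(image_chw: torch.Tensor, score_threshold: float = 0.5) -> dict:
    """image_chw: (3, H, W) float tensor in [0, 1]."""
    pred = detector([image_chw])[0]  # dict with boxes, labels, scores
    keep = pred["scores"] > score_threshold
    return {k: v[keep] for k, v in pred.items()}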


Exercise 6: Implement and Compare Quality Metrics

Goal: Build a comprehensive evaluation pipeline for neural sim quality.

Task:

  1. Implement PSNR, SSIM, LPIPS, and FID computation (an FID sketch follows this exercise)
  2. Evaluate your 3DGS model from Exercise 2 on held-out test views
  3. Compute per-image metrics and visualize the distribution
  4. Identify which image regions have the worst quality (hint: sky boundaries, thin structures, distant objects)
  5. Compute downstream metrics: run a detection model and compare mAP on real vs. rendered

What you will learn: How to evaluate neural rendering quality at both the pixel level (upstream metrics) and the perception level (downstream metrics). Understanding which metrics matter most for AD simulation.
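
For the FID part of step 1, torchmetrics ships an implementation (it depends on the torch-fidelity extra). A minimal sketch; note that FID estimates are only meaningful for reasonably large image sets (hundreds of views).

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_images: torch.Tensor, rendered_images: torch.Tensor) -> float:
    """Compute FID between real and rendered image sets.

    Both inputs: (N, 3, H, W) float tensors in [0, 1]; normalize=True tells
    torchmetrics to expect that range instead of uint8 images.
    """
    fid = FrechetInceptionDistance(feature=2048, normalize=True)
    fid.update(real_images, real=True)
    fid.update(rendered_images, real=False)
    return fid.compute().item()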


Interview Questions

1. Why is 3D Gaussian Splatting preferred over NeRF for autonomous driving simulation?

Answer hint: Real-time rendering (100+ FPS vs. <5 FPS), explicit representation (each Gaussian has a position/shape that can be manipulated for scene editing), natural scene decomposition (group Gaussians per object), lidar compatibility (rays can intersect Gaussian primitives), and faster training (minutes vs. hours). For closed-loop simulation, the ego vehicle's perception stack needs to run at real-time rates, which NeRF cannot support.

2. Explain the difference between splatting (forward rendering) and ray marching (backward rendering).

Answer hint: Ray marching (NeRF) starts from each pixel, casts a ray into the scene, and samples the volumetric representation along the ray -- cost is proportional to pixels x samples. Splatting (3DGS) starts from each primitive, projects it onto the image plane, and accumulates contributions -- cost is proportional to primitives. Splatting is more GPU-friendly because it avoids per-pixel ray marching and enables tile-based parallelism. The alpha compositing formula is mathematically equivalent in both cases.
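
For reference, the shared compositing rule both renderers evaluate is

C = \sum_{i=1}^{N} T_i \, \alpha_i \, c_i, \qquad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)

where NeRF obtains \alpha_i = 1 - \exp(-\sigma_i \delta_i) from density samples along each pixel's ray, and 3DGS obtains \alpha_i = o_i \, G_i(p) from a Gaussian's opacity times its projected 2D value at pixel p; only the iteration order (per-pixel samples vs. depth-sorted primitives) differs.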

3. How does scene decomposition work in neural rendering for AD, and why is it necessary?

Answer hint: The scene is split into static (roads, buildings, vegetation) and dynamic (vehicles, pedestrians) components. Static elements are reconstructed as a single background model, while dynamic actors get individual models in local coordinates. This is necessary because: (1) dynamic objects must be independently movable for scenario editing, (2) training the static model requires masking out dynamic objects to avoid ghosting, and (3) actors may need to be replaced with PBR assets for flexibility. Shadow handling is particularly tricky -- car shadows must be masked from the static model and re-rendered appropriately.

4. What are the key metrics for validating a neural sim system, and which ones matter most?

Answer hint: Two levels: upstream (PSNR, SSIM, LPIPS, Chamfer distance) measure reconstruction fidelity, and downstream (perception mAP gap, tracking MOTA gap) measure whether the perception stack produces the same output on rendered vs. real data. Downstream metrics matter more -- a rendering could have mediocre PSNR but still produce identical perception outputs if the differences are in regions the detector ignores. Conversely, a high-PSNR rendering could have artifacts exactly in critical regions. The ultimate metric is: "Does the planner make the same decision on neural-rendered data as it would on real data?"

5. How does Applied Intuition's Neural Sim handle dynamic actors differently from purely neural approaches like NeuRAD?

Answer hint: Applied Intuition uses a hybrid approach: Gaussian Splatting for the static background but physics-based rendering (PBR) for dynamic actors. This means dynamic actors are high-quality 3D assets with physically-based materials, not neural reconstructions. Advantage: PBR actors can be placed in arbitrary new poses, animated, and lit correctly -- they are not limited to appearances seen in the training data. The trade-off is that you need a library of matched PBR assets, but this provides much greater flexibility for scenario editing. Purely neural approaches (NeuRAD, SplatAD) reconstruct actors as neural primitives, which are faithful to the training data but limited in how much they can be manipulated.

6. What causes the "novel view synthesis quality cliff" when deviating from the training trajectory, and how can it be mitigated?

Answer hint: Neural rendering works by interpolating between training views. When the novel viewpoint deviates significantly, the system must extrapolate, revealing: (1) regions never observed in training (e.g., behind parked cars), (2) under-constrained geometry (floaters, collapsed surfaces), and (3) view-dependent appearance not captured by limited SH coefficients. Mitigation strategies include: multi-traversal data (driving the same route multiple times from slightly different lanes), lidar depth supervision (constrains geometry even without visual coverage), diffusion-based inpainting (fills hallucinated regions), and conservative trajectory deviation limits (stay within 1-3m laterally). Some systems also use learned priors about common scene structures.

7. How do you generate realistic lidar point clouds from a 3D Gaussian scene?

Answer hint: For each lidar beam, generate a ray from the sensor origin in the beam's direction (accounting for the sensor's rotation during the sweep). Intersect the ray with nearby Gaussians by evaluating each Gaussian's contribution along the ray (using the Mahalanobis distance from the ray to the Gaussian center). Alpha-composite depth values to get the final range measurement. Then apply sensor modeling: beam divergence (the ray has non-zero width), intensity estimation (based on surface normal and material), ray dropping (learned or heuristic model for missing returns), and range noise. Motion compensation is critical -- the lidar rotates over ~100ms, so each azimuth angle corresponds to a slightly different ego pose.

8. Why is rolling shutter modeling important for neural rendering in driving scenes?

Answer hint: Automotive cameras use rolling shutters where each row is exposed at a slightly different time. At highway speeds (30 m/s), the ego vehicle moves ~1m during a single frame's readout time. If the neural renderer assumes a single pose per frame (global shutter), it produces a blurred, misaligned rendering. NeuRAD showed that modeling rolling shutter improves PSNR by 1-2 dB. The solution is to interpolate the ego pose for each row (or group of rows) and render each from its correct pose, then assemble the final image from these sub-renders.

9. What are the main challenges in scaling neural sim to fleet-level data (thousands of scenes)?

Answer hint: (1) Automation: every step (pose refinement, segmentation, masking, training, quality checking) must be fully automated with no manual tuning. (2) Robustness: some scenes have degenerate geometry, poor lighting, or unusual configurations that cause training to diverge. (3) Quality gating: automated metrics must identify failed reconstructions without human review of every scene. (4) Compute cost: training thousands of scenes requires efficient GPU scheduling and resource management. (5) Versioning: as the reconstruction pipeline improves, all scenes must be re-trained and re-validated. (6) Data diversity: the pipeline must handle highways, intersections, parking lots, construction zones, adverse weather, and night scenes.

10. Compare the trade-offs between using neural-reconstructed actors vs. PBR asset-matched actors for dynamic objects in simulation.

Answer hint:

| Aspect | Neural Actors | PBR Asset Actors |
|---|---|---|
| Fidelity to original | Very high | Approximate (matched) |
| Novel poses | Limited to training data | Arbitrary |
| Animation | Difficult | Standard 3D animation |
| Lighting consistency | Baked into reconstruction | Physically correct |
| Asset creation cost | Automated (from data) | Requires library + matching |
| Scenario editing | Limited | Full flexibility |
| Scalability | Easy (reconstruct from data) | Needs large asset library |

The industry trend is toward hybrid approaches: use neural backgrounds (which are hard to build by hand) and PBR actors (which need to be fully controllable). Applied Intuition exemplifies this with GS backgrounds + PBR actors.


References

Foundational Methods

  1. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Mildenhall et al., ECCV 2020 arxiv.org/abs/2003.08934

  2. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding Mueller et al., SIGGRAPH 2022 arxiv.org/abs/2201.05989

  3. 3D Gaussian Splatting for Real-Time Radiance Field Rendering Kerbl et al., SIGGRAPH 2023 arxiv.org/abs/2308.04079

  4. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields Barron et al., CVPR 2022 arxiv.org/abs/2111.12077

AD-Specific Neural Rendering

  1. NeuRAD: Neural Rendering for Autonomous Driving Tonderski et al., CVPR 2024 arxiv.org/abs/2311.15260

  2. SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving Hultman et al., 2025 arxiv.org/abs/2411.16816

  3. HUGSIM: A Real-Time, Photo-Realistic Closed-Loop Simulator for Autonomous Driving Chen et al., 2024 arxiv.org/abs/2403.17712

  4. AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction Khan et al., 2024 arxiv.org/abs/2407.02598

Earlier AD Neural Rendering

  1. Block-NeRF: Scalable Large Scene Neural View Synthesis Tancik et al., CVPR 2022 arxiv.org/abs/2202.05263

  2. UniSim: A Neural Closed-Loop Sensor Simulator Yang et al. (Waabi), CVPR 2023 arxiv.org/abs/2308.01898

  3. EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision Yang et al., ICLR 2024 arxiv.org/abs/2311.02077

  4. MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving Wu et al., CICAI 2023 arxiv.org/abs/2307.15058

  5. Street Gaussians for Modeling Dynamic Urban Scenes Yan et al., 2024 arxiv.org/abs/2401.01339

  6. DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes Zhou et al., CVPR 2024 arxiv.org/abs/2312.07920

Lidar-Specific Neural Rendering

  1. LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields Tao et al., 2024 arxiv.org/abs/2304.10406

  2. Neural LiDAR Fields for Novel LiDAR View Synthesis Huang et al., ICCV 2023 arxiv.org/abs/2305.01643

Surveys and Overviews

  1. Neural Rendering for Autonomous Driving: A Survey Various authors, 2024 (Multiple survey papers covering the rapidly evolving landscape)

  2. A Survey on 3D Gaussian Splatting Chen et al., 2024 arxiv.org/abs/2401.03890


This deep dive covers the core technologies, key papers, and practical considerations for neural rendering in autonomous driving simulation. The field is evolving rapidly -- the transition from NeRF to 3DGS happened in under two years, and production systems like Applied Intuition's Neural Sim are already deploying these techniques at fleet scale. For engineers entering this space, hands-on experience with 3DGS training and rendering is the most valuable starting point.