Neural Rendering for Autonomous Driving Simulation: A Deep Dive
Focus: Neural rendering techniques (NeRF, 3D Gaussian Splatting) for photorealistic, closed-loop AD sensor simulation
Key Papers: NeuRAD (CVPR 2024), SplatAD (2025), HUGSIM (2024), AutoSplat (2024), Industrial-Grade GS (2025)
Read Time: 60 min
Table of Contents
- Executive Summary
- Background & Motivation
- Core Technologies
- Key Papers for AD Simulation
- Applied Intuition's Neural Sim Architecture
- Technical Deep Dive
- Code Examples
- Mental Models & Diagrams
- Hands-On Exercises
- Interview Questions
- References
Executive Summary
What Is Neural Rendering for AD Simulation?
Neural rendering replaces traditional computer graphics pipelines (hand-authored 3D assets, rasterization engines, ray tracers) with learned scene representations reconstructed directly from real sensor data. For autonomous driving, this means converting raw drive logs -- camera images, lidar point clouds, poses -- into photorealistic, re-renderable 3D scenes that can simulate novel viewpoints, new actor configurations, and multi-sensor outputs in closed loop.
Why It Matters
Traditional simulation suffers from the sim-to-real gap: synthetic scenes look different enough from reality that perception stacks trained or tested in simulation may not transfer. Neural rendering closes this gap by reconstructing scenes from reality rather than approximating it.
Traditional Pipeline:
    Artists --> 3D Assets --> Game Engine --> Rendered Images
    Sim-to-Real Gap: LARGE

Neural Rendering Pipeline:
    Drive Logs --> Neural Scene --> Differentiable Renderer --> Rendered Images
    Sim-to-Real Gap: SMALL
Key Insight
The field has rapidly converged on 3D Gaussian Splatting (3DGS) as the preferred representation for AD simulation, overtaking NeRF due to its real-time rendering speed, explicit scene representation (enabling actor manipulation), and natural compatibility with lidar simulation. Applied Intuition's Neural Sim product exemplifies this trend: combining Gaussian Splatting for static backgrounds with physics-based rendering (PBR) for dynamic actors to achieve photorealistic closed-loop sensor simulation at fleet scale.
Background & Motivation
The Simulation Imperative
Autonomous driving companies rely on simulation for:
| Use Case | Why Simulation | Scale |
|---|---|---|
| Safety validation | Cannot test every edge case on roads | Billions of miles needed |
| Regression testing | Every software change needs validation | Thousands of scenarios per commit |
| Long-tail mining | Rare events are hard to encounter naturally | Targeted scenario generation |
| Perception testing | Sensor behavior in novel conditions | Weather, lighting, occlusions |
| Closed-loop replay | "What if the ego had done X instead?" | Every logged drive becomes reusable |
Waymo has reported driving 100M+ real miles but 10B+ simulated miles. This 100:1 ratio only makes sense if simulation is trustworthy.
Limitations of Traditional Approaches
1. Artist-Authored Worlds (Game Engine Style)
Pros:
+ Full control over scene
+ Deterministic rendering
+ Easy actor manipulation
+ Fast iteration

Cons:
- Expensive asset creation ($$$)
- Cartoon-like appearance
- Material/lighting mismatch
- Never matches real sensor output
- Perception models don't transfer
Open-source simulators such as CARLA and LGSVL are built on Unreal Engine and Unity. Despite PBR materials and ray tracing, the domain gap remains significant -- a detector trained on real data can drop 15-30% mAP when evaluated on synthetic images.
2. Log Replay (Replay Recorded Data)
Pros:
+ Perfect sensor realism
+ No domain gap
+ Simple infrastructure

Cons:
- Cannot change ego trajectory
- Cannot add/remove actors
- Fixed viewpoint only
- Not closed-loop
Log replay is the gold standard for realism but fundamentally limited: you can only replay exactly what happened. If the ego car had braked 0.5 seconds earlier, you cannot generate the sensor data for that alternative trajectory.
3. Neural Rendering (The New Paradigm)
Pros:
+ Photorealistic (from data)
+ Novel viewpoints
+ Multi-sensor capable
+ Closed-loop compatible
+ Scalable via ML pipelines

Cons:
- Bounded to training distribution
- Reconstruction artifacts
- Compute-intensive training
- Dynamic scene handling is hard
- Limited extrapolation range
Neural rendering sits at the intersection: near-real sensor fidelity with the flexibility to change viewpoints and scene configurations.
The Evolution Timeline
2020 ─── NeRF (Mildenhall et al.)
| First neural radiance field; 30+ hours to train, minutes to render
|
2021 ─── Instant-NGP (Mueller et al.)
| Hash encoding cuts NeRF training to minutes
|
2022 ─── Block-NeRF, Urban Radiance Fields
| NeRF scaled to city-level scenes
| Panoptic Neural Fields for scene understanding
|
2023 ─── 3D Gaussian Splatting (Kerbl et al.)
| Real-time neural rendering via explicit Gaussians
| UniSim (Waabi) for AD simulation
| MARS, EmerNeRF for dynamic driving scenes
|
2024 ─── NeuRAD (CVPR 2024), HUGSIM, AutoSplat
| GS-based AD simulation matures
| Street Gaussians, DrivingGaussian
| Applied Intuition Neural Sim launches
|
2025 ─── SplatAD, Industrial-Grade GS
Real-time camera+lidar from GS
Fleet-scale neural sim pipelines
Core Technologies
Neural Radiance Fields (NeRF)
The Core Idea
NeRF represents a 3D scene as a continuous volumetric function that maps a 5D coordinate (3D position + 2D viewing direction) to color and density:
F_theta: (x, y, z, theta, phi) --> (r, g, b, sigma)
where:
(x, y, z) = 3D position in space
(theta, phi) = viewing direction (azimuth, elevation)
(r, g, b) = emitted color at that point from that direction
sigma = volume density (opacity) at that point
The function F_theta is parameterized by a multi-layer perceptron (MLP).
Volume Rendering
To render a pixel, NeRF casts a ray from the camera through that pixel and accumulates color along the ray:
Camera ----ray----> * ---- * ---- * ---- * ---- * ----> (background)
| | | | |
s_1 s_2 s_3 s_4 s_5 (sample points)
| | | | |
query query query query query (MLP forward pass)
| | | | |
(c_1, (c_2, (c_3, (c_4, (c_5,
σ_1) σ_2) σ_3) σ_4) σ_5)
The final pixel color C(r) is computed via numerical quadrature of the rendering equation:
C(r) = sum_{i=1}^{N} T_i * alpha_i * c_i
where:
alpha_i = 1 - exp(-sigma_i * delta_i) (opacity of sample i)
T_i = prod_{j=1}^{i-1} (1 - alpha_j) (transmittance to sample i)
delta_i = t_{i+1} - t_i (distance between samples)
c_i = color at sample i
This is differentiable end-to-end, so we can optimize the MLP weights by minimizing the photometric loss between rendered and observed pixels.
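As a concrete sketch, the quadrature above maps directly onto a few lines of NumPy; the densities, colors, and sample depths below are made-up values for illustration:

```python
import numpy as np

def composite_ray(sigma, c, t):
    """Numerical quadrature of the rendering equation for one ray."""
    delta = np.diff(t, append=t[-1] + 1e10)    # delta_i = t_{i+1} - t_i (last is "infinite")
    alpha = 1.0 - np.exp(-sigma * delta)       # alpha_i = 1 - exp(-sigma_i * delta_i)
    # T_i = prod_{j<i} (1 - alpha_j): cumulative product shifted by one sample
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = T * alpha                        # per-sample contribution
    return (weights[:, None] * c).sum(axis=0)  # C(r) = sum_i T_i * alpha_i * c_i

# Made-up samples: empty space, haze, a dense surface, then background
sigma = np.array([0.0, 0.5, 3.0, 0.2])
c = np.array([[0.0, 0.0, 0.0],
              [0.2, 0.2, 0.2],
              [0.9, 0.1, 0.1],
              [1.0, 1.0, 1.0]])
t = np.array([1.0, 1.5, 2.0, 2.5])
pixel = composite_ray(sigma, c, t)             # dominated by the dense sample
```

Because the composite is built only from sums, products, and exponentials, gradients flow back to every sigma_i and c_i, which is exactly the differentiability the photometric loss relies on.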
Positional Encoding
Raw (x, y, z) coordinates fed to an MLP produce over-smooth results because MLPs are biased toward low-frequency functions. NeRF uses positional encoding to lift inputs into a higher-dimensional space:
gamma(p) = [sin(2^0 * pi * p), cos(2^0 * pi * p),
sin(2^1 * pi * p), cos(2^1 * pi * p),
...
sin(2^{L-1} * pi * p), cos(2^{L-1} * pi * p)]
For position (L=10): the 3D input becomes 60D. For direction (L=4): the viewing direction, expressed as a 3D unit vector, becomes 24D. This allows the MLP to learn high-frequency detail like sharp edges and fine textures.
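A minimal NumPy sketch of gamma(p), applied independently to each input coordinate:

```python
import numpy as np

def positional_encoding(p, L):
    """gamma(p): per-coordinate [sin(2^k * pi * p), cos(2^k * pi * p)] for k = 0..L-1."""
    freqs = (2.0 ** np.arange(L)) * np.pi      # 2^0 pi, 2^1 pi, ..., 2^(L-1) pi
    angles = p[:, None] * freqs[None, :]       # shape (dims, L)
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).ravel()

xyz = np.array([0.1, -0.4, 0.7])
enc = positional_encoding(xyz, L=10)           # 3 coords * 10 freqs * 2 = 60 values
```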
NeRF Limitations for AD Simulation
| Limitation | Description | Impact on AD Sim |
|---|---|---|
| Slow rendering | Hundreds of MLP queries per ray | Cannot achieve real-time for closed-loop |
| Slow training | Hours to days per scene | Cannot scale to fleet data |
| Static scenes | Original NeRF assumes fixed geometry | Driving scenes are dynamic |
| Bounded scenes | Works best in object-centric settings | Driving scenes are unbounded |
| No explicit geometry | Implicit density field | Hard to manipulate individual objects |
| Per-ray computation | Each pixel requires marching | No lidar simulation without ray marching |
These limitations motivated the development of 3D Gaussian Splatting.
3D Gaussian Splatting (3DGS)
The Core Idea
Instead of an implicit function queried along rays, 3DGS represents the scene as a collection of explicit 3D Gaussian primitives -- millions of small, colored, semi-transparent ellipsoids scattered through space:
Each Gaussian is defined by:
- Position (mean): mu in R^3
- Covariance: Sigma in R^{3x3} (stored as rotation q + scale s)
- Opacity: alpha in [0, 1]
- Color: c (via spherical harmonics for view-dependence)
Total parameters per Gaussian: 3 + 4 + 3 + 1 + 48 = 59 floats
(pos rot scale opacity SH coeffs)
A typical driving scene might use 1-5 million Gaussians.
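The covariance is not stored directly; it is rebuilt each step from the rotation quaternion and per-axis scale as Sigma = R S S^T R^T, which keeps it symmetric positive semi-definite during optimization. A minimal NumPy sketch, assuming the [w, x, y, z] quaternion convention:

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion [w, x, y, z] -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(q, s):
    """Sigma = R S S^T R^T -- symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T

# Identity rotation: covariance is just the squared scales on the diagonal
Sigma = covariance(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.5, 0.2, 0.01]))
```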
How Rendering Works: Differentiable Rasterization
Unlike NeRF's ray marching, 3DGS uses a splatting (forward rasterization) approach:
Step 1: Project each 3D Gaussian onto the 2D image plane
┌──────────────────────────────┐
│ 3D Gaussian (ellipsoid) │
│ ╱ ╲ │
│ ╱ ╲ project │
│ ╱ ╲ ─────────► 2D Gaussian (ellipse)
│ ╱ ╲ │
│ ╱─────────╲ │
└──────────────────────────────┘
Step 2: Sort Gaussians by depth (front to back)
Step 3: For each pixel, alpha-composite overlapping Gaussians:
C(pixel) = sum_{i in overlapping} c_i * alpha_i * T_i
T_i = prod_{j=1}^{i-1} (1 - alpha_j)
This is implemented as a tile-based rasterizer on the GPU:
┌─────┬─────┬─────┬─────┐
│Tile │Tile │Tile │Tile │ Image divided into 16x16 pixel tiles
│ 0,0 │ 0,1 │ 0,2 │ 0,3 │
├─────┼─────┼─────┼─────┤ Each tile processed by one GPU thread block
│Tile │Tile │Tile │Tile │
│ 1,0 │ 1,1 │ 1,2 │ 1,3 │ Gaussians assigned to tiles they overlap
├─────┼─────┼─────┼─────┤
│Tile │Tile │Tile │Tile │ Within each tile: sorted alpha compositing
│ 2,0 │ 2,1 │ 2,2 │ 2,3 │
└─────┴─────┴─────┴─────┘
Key advantage: the rasterizer processes all pixels in parallel, achieving 100+ FPS at HD resolution on a single GPU.
Adaptive Density Control
During training, 3DGS dynamically adjusts the number and distribution of Gaussians:
Densification (every N iterations):
1. CLONE: Small Gaussians with high positional gradient --> duplicate and shift (under-reconstructed regions)
2. SPLIT: Large Gaussians with high positional gradient --> split into two smaller Gaussians (over-reconstructed regions)
3. PRUNE: Gaussians with very low opacity --> remove
Before: ● ● ● ● ●
(sparse, gaps)
After: ●●●●●●●●●●●●●●●●●●●●●
(dense where needed, pruned where redundant)
This allows the representation to allocate capacity where scene detail is highest (e.g., object edges, fine textures) and remain sparse in empty or uniform regions.
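The bookkeeping can be sketched on flat per-Gaussian arrays as below. The thresholds are illustrative placeholders (the 1.6 shrink factor echoes the original paper's split heuristic), and a full implementation would replace a split parent with two children rather than appending one:

```python
import numpy as np

def densify_and_prune(pos, scale, opacity, grad,
                      grad_thresh=0.0002, size_thresh=0.01, min_opacity=0.005):
    """One round of adaptive density control on flat per-Gaussian arrays."""
    keep = opacity > min_opacity                        # PRUNE near-transparent Gaussians
    pos, scale, opacity, grad = (a[keep] for a in (pos, scale, opacity, grad))

    hi = grad > grad_thresh                             # high positional gradient
    small = scale.max(axis=1) < size_thresh
    clone, split = hi & small, hi & ~small

    # CLONE appends a copy of small Gaussians; SPLIT appends a shrunken copy
    pos = np.concatenate([pos, pos[clone], pos[split]])
    scale = np.concatenate([scale, scale[clone], scale[split] / 1.6])
    opacity = np.concatenate([opacity, opacity[clone], opacity[split]])
    return pos, scale, opacity
```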
Why 3DGS is Better for AD Simulation
- Real-time rendering: 100+ FPS vs. seconds per frame for NeRF
- Explicit geometry: Each Gaussian has a position -- can be moved, deleted, grouped
- Scene editing: Remove a car by removing its Gaussians; insert new actors
- Lidar-friendly: Gaussians have physical extent -- can trace lidar rays through them
- Fast training: Minutes to train (vs. hours for NeRF) with similar quality
- Memory-efficient rendering: Forward pass only, no per-ray MLP evaluation
Comparison Table: NeRF vs 3DGS for AD Simulation
| Aspect | NeRF | 3D Gaussian Splatting |
|---|---|---|
| Representation | Implicit (MLP weights) | Explicit (point cloud of Gaussians) |
| Rendering | Volume ray marching | Tile-based rasterization (splatting) |
| Render Speed | 0.1 - 5 FPS | 100 - 300 FPS |
| Training Time | Hours - days | Minutes - hours |
| Image Quality | Excellent (PSNR ~31 dB) | Excellent (PSNR ~32 dB) |
| Scene Editing | Very difficult | Natural (move/remove Gaussians) |
| Dynamic Scenes | Requires deformation fields | Group Gaussians per object |
| Lidar Sim | Ray march through density | Intersect rays with Gaussians |
| Memory (render) | Low (MLP weights only) | Higher (millions of Gaussians) |
| Memory (train) | High (per-ray samples) | Moderate (sorted splatting) |
| Unbounded Scenes | Needs contraction (mip-NeRF 360) | Natural with scale parameters |
| View Extrapolation | Poor (overfits training views) | Poor (same fundamental issue) |
| Closed-Loop Sim | Too slow for real-time | Viable at real-time rates |
| Industry Adoption | Declining for AD sim | Dominant and growing |
Verdict for AD Simulation: 3DGS is the clear winner. Its real-time performance, explicit representation, and natural compatibility with multi-sensor simulation make it the foundation of modern neural sim systems.
Key Papers for AD Simulation
NeuRAD (CVPR 2024)
Paper: "NeuRAD: Neural Rendering for Autonomous Driving" Authors: Tonderski et al. (Zenseact) Link: arxiv.org/abs/2311.15260
Key Contributions
- Unified multi-sensor rendering: Single neural scene representation that renders both camera images and lidar point clouds
- Sensor-specific modeling: Accounts for rolling shutter, beam divergence, ray dropping, and per-sensor exposure
- State-of-the-art on multiple benchmarks: Outperforms prior methods on nuScenes, PandaSet, and Argoverse2
- Practical design decisions: Extensive ablation study of what matters for AD-specific neural rendering
Architecture
Drive Log Input
┌───────────────────────┐
│ Camera Images (6 cams) │
│ Lidar Scans │
│ Ego Poses │
│ Actor Bounding Boxes │
└──────────┬────────────┘
│
┌──────────▼────────────┐
│ Scene Decomposition │
│ Static │ Dynamic │
│ (hash │ (per-actor │
│ grid) │ model) │
└──────────┬────────────┘
│
┌──────────▼────────────┐
│ Volume Rendering │
│ with sensor models │
│ - Rolling shutter │
│ - Beam divergence │
│ - Lidar intensity │
└──────────┬────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
Camera RGB Lidar Points Lidar Intensity
Key Technical Details
- Hash-grid backbone: Uses Instant-NGP style multi-resolution hash encoding for the static scene, giving fast training while maintaining detail
- Actor models: Each dynamic actor gets its own small NeRF, conditioned on a learned latent code per actor instance
- Rolling shutter: Models each image row at a different timestamp, interpolating ego pose accordingly -- critical for high-speed driving
- Lidar modeling: Simulates beam divergence (lidar rays are not infinitely thin), ray drop probability, and intensity based on material and incidence angle
- Losses: Photometric (L1 + LPIPS) for cameras, Chamfer distance + intensity loss for lidar
Results
| Dataset | Camera PSNR | Camera SSIM | Lidar Chamfer (m) |
|---|---|---|---|
| nuScenes | 28.5 dB | 0.87 | 0.041 |
| PandaSet | 29.1 dB | 0.89 | 0.038 |
| Argoverse2 | 27.8 dB | 0.85 | 0.044 |
Why It Matters
NeuRAD demonstrated that a single neural representation can serve both camera and lidar simulation with high fidelity. Its extensive ablation study became a practical guide for the field -- showing, for example, that rolling shutter modeling alone improves camera PSNR by 1.5 dB in highway scenes.
SplatAD (2025)
Paper: "SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving" Authors: Hultman et al. (Zenseact) Link: arxiv.org/abs/2411.16816
Key Contributions
- First 3DGS method for joint real-time camera AND lidar: Prior GS works for AD focused on camera only
- Lidar rendering via Gaussian ray tracing: Novel approach to generate lidar point clouds from Gaussian scenes
- 14x faster than NeuRAD: Real-time performance enabling closed-loop simulation
- State-of-the-art quality: Matches or exceeds NeuRAD on all benchmarks
How Lidar Rendering Works with Gaussians
The key innovation of SplatAD is extending 3DGS to lidar. Since Gaussians are explicit primitives with spatial extent, lidar rays can be intersected with them:
Lidar Ray Intersection with Gaussians:
Lidar
Sensor Ray direction
●─────────────────────────────────────►
│ │ │
│ ┌─────┼──┐ ┌────┼───┐
│ │ G1 │ │ │ G2 │ │
│ │ ● │ │ │ ● │ │ Gaussians along the ray
│ │ │ │ │ │ │
│ └─────┘──┘ └────┘───┘
│ t1 t2
│
▼ Compute intersection weights and accumulate:
depth = sum(w_i * t_i) where w_i = alpha_i * T_i
intensity = sum(w_i * I_i)
For each lidar ray:
- Find Gaussians whose influence region intersects the ray
- Evaluate each Gaussian's contribution along the ray (using the Gaussian's 3D shape)
- Alpha-composite depth and intensity (same as color compositing, but for range)
- Apply sensor-specific ray drop model
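A one-ray sketch of this compositing, using the w_i = alpha_i * T_i weights from the figure. For simplicity each Gaussian is treated as isotropic and evaluated at the ray's closest approach to its center; the actual method evaluates the full 3D covariance along the ray:

```python
import numpy as np

def ray_gaussian_depth(o, d, centers, radii, peak_alpha, intensities):
    """Composite depth and intensity for one lidar ray through a set of Gaussians."""
    d = d / np.linalg.norm(d)
    t = (centers - o) @ d                        # depth of closest approach per Gaussian
    nearest = o + t[:, None] * d                 # point on the ray nearest each center
    dist2 = ((centers - nearest) ** 2).sum(axis=1)
    alpha = peak_alpha * np.exp(-0.5 * dist2 / radii**2)

    order = np.argsort(t)                        # composite front to back
    t, alpha, intensities = t[order], alpha[order], intensities[order]
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    w = T * alpha                                # w_i = alpha_i * T_i
    wsum = w.sum() + 1e-9
    return (w * t).sum() / wsum, (w * intensities).sum() / wsum

depth, intensity = ray_gaussian_depth(
    o=np.zeros(3), d=np.array([1.0, 0.0, 0.0]),
    centers=np.array([[5.0, 0.0, 0.0], [9.0, 0.2, 0.0]]),  # second is mostly occluded
    radii=np.array([0.3, 0.3]), peak_alpha=np.array([0.9, 0.9]),
    intensities=np.array([0.4, 0.8]))
```

The occluded Gaussian contributes little weight, so the composited depth stays near the first surface -- the same transmittance logic that governs camera compositing.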
Architecture
┌─────────────────────────────────────────┐
│ SplatAD Pipeline │
├─────────────────────────────────────────┤
│ │
│ Input: Drive log (images, lidar, poses) │
│ │
│ ┌──────────────┐ ┌────────────────┐ │
│ │ Static Scene │ │ Dynamic Actors │ │
│ │ (3D Gaussians)│ │ (3D Gaussians │ │
│ │ │ │ per actor) │ │
│ └──────┬───────┘ └───────┬────────┘ │
│ │ │ │
│ └────────┬───────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ Composed Scene │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────────┼──────────────┐ │
│ ▼ ▼ ▼ │
│ Camera Lidar Lidar │
│ Splatting Ray Tracing Intensity │
│ (rasterize) (ray-Gaussian Estimation │
│ intersection) │
└─────────────────────────────────────────┘
Performance Comparison
| Method | Camera PSNR | Lidar Chamfer | Camera FPS | Lidar FPS |
|---|---|---|---|---|
| NeuRAD | 28.5 dB | 0.041 m | 1.5 | 0.8 |
| SplatAD | 28.8 dB | 0.039 m | 22 | 18 |
| Speedup | -- | -- | 14.7x | 22.5x |
Why It Matters
SplatAD proved that 3DGS can simultaneously handle camera and lidar simulation at real-time rates without sacrificing quality. This is the critical capability needed for closed-loop simulation where the ego vehicle's perception stack runs online.
HUGSIM (2024)
Paper: "HUGSIM: A Real-Time, Photo-Realistic Closed-Loop Simulator for Autonomous Driving" Authors: Chen et al. Link: arxiv.org/abs/2403.17712
Key Contributions
- End-to-end closed-loop simulation: First system to combine 3DGS reconstruction with a full closed-loop driving simulator
- Real-time performance: Achieves interactive frame rates for online perception-planning loops
- Scene editing: Supports inserting, removing, and repositioning actors in the neural scene
- Downstream evaluation: Tests actual AD planners in the simulator and measures driving quality
Closed-Loop Architecture
┌──────────────────────────────────────────────────────────┐
│ HUGSIM Closed Loop │
│ │
│ ┌───────────┐ ┌────────────┐ ┌──────────────┐ │
│ │ Neural │───►│ Perception │───►│ Planning │ │
│ │ Scene │ │ Stack │ │ Module │ │
│ │ (3DGS) │ │ │ │ │ │
│ └─────┬─────┘ └────────────┘ └──────┬───────┘ │
│ ▲ │ │
│ │ ┌──────────┐ │ │
│ │ │ Behavior │ │ │
│ │ │ Model │◄──────────┘ │
│ │ │ (agents) │ │
│ │ └────┬─────┘ │
│ │ │ │
│ │ ┌──────────────▼──────────────┐ │
│ └────┤ Scene State Update │ │
│ │ - Move ego to new pose │ │
│ │ - Update actor positions │ │
│ │ - Re-render from new view │ │
│ └─────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Scene Decomposition Strategy
HUGSIM separates the scene into three layers:
Layer 1: STATIC BACKGROUND
- Roads, buildings, trees, signs
- Reconstructed as a single 3D Gaussian field
- Stays fixed during simulation
Layer 2: DYNAMIC ACTORS (from original log)
- Each vehicle/pedestrian is a separate Gaussian model
- Can be moved, removed, or have trajectory edited
- Actor Gaussians transformed by rigid body pose
Layer 3: INSERTED ACTORS
- New vehicles/pedestrians not in original scene
- Uses pre-built Gaussian asset library
- Placed with correct scale, lighting, shadows
Why It Matters
HUGSIM was one of the first systems to demonstrate that neural rendering can be embedded in a full closed-loop simulation pipeline at interactive rates. It showed that 3DGS-based simulation can actually be used to evaluate and improve AD planners, not just produce pretty pictures.
AutoSplat (2024)
Paper: "AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction" Authors: Khan et al. Link: arxiv.org/abs/2407.02598
Key Contributions
- Geometry-aware Gaussian placement: Uses road surface priors and structural constraints to improve reconstruction quality, especially for roads and flat surfaces
- Road surface decomposition: Dedicated handling of road surfaces with planar constraints
- Appearance consistency: Cross-view appearance regularization for consistent rendering under viewpoint changes
- Vehicle reconstruction: Improved dynamic vehicle reconstruction via symmetric priors
Road Surface Innovation
Driving scenes are dominated by road surfaces, which are notoriously hard for unconstrained 3DGS (Gaussians tend to float above or below the road plane):
Problem with unconstrained GS:
        ●   ●  ●    ●   ●
     ●    ●●     ●      ●
    ──────────────────────── Road
       ●        ●     ●
    (Gaussians float randomly above and below the plane)

AutoSplat's approach:
    ●●●●●●●●●●●●●●●●●●●●●
    ──────────────────────── Road
    (Gaussians constrained to the road plane with normal alignment)
AutoSplat constrains road Gaussians to:
- Lie on the estimated road surface (from lidar ground segmentation)
- Have their shortest axis aligned with the surface normal (flat ellipsoids)
- Maintain consistent appearance across different viewing angles
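The first two constraints can be sketched as a projection step. This assumes each road Gaussian's local z-axis has already been rotated to the surface normal, and that the ground plane is given as n·x + d = 0 from lidar segmentation:

```python
import numpy as np

def constrain_to_plane(means, scales, n, d, flat_scale=0.01):
    """Project road Gaussians onto the plane n.x + d = 0 and flatten them."""
    n = n / np.linalg.norm(n)
    signed_dist = means @ n + d
    means = means - signed_dist[:, None] * n[None, :]   # snap onto the plane
    scales = scales.copy()
    scales[:, 2] = flat_scale                           # shortest axis along the normal
    return means, scales
```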
Why It Matters
AutoSplat addressed one of the most practical challenges in AD neural rendering: making road surfaces look correct. Roads occupy a huge portion of driving images, and artifacts on the road surface (floating blobs, inconsistent color) are immediately noticeable and can confuse lane detection and drivable area estimation.
Industrial-Grade Sensor Simulation via Gaussian Splatting (2025)
Paper: "Sensor Simulation via Gaussian Splatting: Industrial-Grade Driving Scene Reconstruction" Authors: Various (industry research)
Key Contributions
- Fleet-scale pipeline: Automated reconstruction pipeline processing thousands of drive logs
- Quality assurance: Automated metric-based quality gating for reconstructed scenes
- Multi-sensor fidelity: Camera, lidar, and radar simulation from a single reconstruction
- Production deployment: Designed for integration into commercial simulation platforms
Industrial Requirements vs Academic Methods
| Requirement | Academic Methods | Industrial Grade |
|---|---|---|
| Scale | 10-100 scenes | 10,000+ scenes |
| Automation | Manual tuning per scene | Fully automated pipeline |
| Quality | Average PSNR reported | Per-scene quality gating |
| Robustness | Fails on hard cases | Graceful degradation |
| Latency | Hours per scene | Minutes per scene |
| Integration | Standalone demo | API-driven, CI/CD compatible |
Why It Matters
This line of work bridges the gap between academic neural rendering research and production simulation systems. It addresses the "last mile" problems: how do you go from a research prototype that works on 10 curated scenes to a system that reliably reconstructs 10,000 diverse scenes from fleet data?
Applied Intuition's Neural Sim Architecture
Applied Intuition's Neural Sim product represents the current state of the art in commercial neural rendering for AD simulation. Based on public information, patents, and technical presentations, we can reconstruct its architecture.
High-Level Pipeline
┌──────────────────────────────────────────────────────────────────────┐
│ Applied Intuition Neural Sim Pipeline │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
│ │ Fleet Data │ │ Reconstruction│ │ Scenario Authoring │ │
│ │ (Drive Logs)│───►│ Pipeline │───►│ (Edit, Insert, Move) │ │
│ │ - Cameras │ │ │ │ │ │
│ │ - Lidar │ │ Automated ML │ │ ┌─────────────────┐ │ │
│ │ - Radar │ │ at scale │ │ │ Static Neural │ │ │
│ │ - Poses │ │ │ │ │ Scene (GS) │ │ │
│ │ - Labels │ │ │ │ ├─────────────────┤ │ │
│ └─────────────┘ └──────────────┘ │ │ Dynamic Actors │ │ │
│ │ │ (PBR models) │ │ │
│ │ └─────────────────┘ │ │
│ └───────────┬───────────┘ │
│ │ │
│ ┌───────────▼───────────┐ │
│ │ Closed-Loop Sim │ │
│ │ Runtime │ │
│ │ - Camera rendering │ │
│ │ - Lidar generation │ │
│ │ - Radar simulation │ │
│ └───────────┬───────────┘ │
│ │ │
│ ┌───────────▼───────────┐ │
│ │ Validation │ │
│ │ - Reconstruction │ │
│ │ quality metrics │ │
│ │ - Downstream percep │ │
│ │ performance │ │
│ └───────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Scene Reconstruction Pipeline
The reconstruction pipeline converts raw drive logs into re-renderable neural scenes:
Drive Log ──► Preprocessing ──► Neural Reconstruction ──► Quality Check ──► Scene DB
│ │ │
▼ ▼ ▼
- Pose refinement - Static background - PSNR > threshold?
- Lidar accumulation (Gaussian Splatting) - SSIM > threshold?
- Sky segmentation - Per-actor models - Lidar error < threshold?
- Dynamic masking - Sky dome model - Visual inspection (sampled)
- Ground plane est - Exposure compensation
Step 1: Preprocessing
- Pose refinement: SfM or lidar SLAM to get centimeter-accurate poses (even small pose errors cause blurry reconstructions)
- Dynamic object masking: Segment and mask moving objects in training images so the static model does not try to explain them
- Sky segmentation: Separate sky pixels for special handling (sky has no real 3D geometry)
- Lidar accumulation: Aggregate multiple lidar sweeps into a dense static point cloud for initialization
Step 2: Static Background Reconstruction
The static scene (everything except moving actors) is reconstructed as a 3D Gaussian field:
Initialization:
- Start from accumulated lidar point cloud
- Each lidar point becomes an initial Gaussian
- Scale and color initialized from nearest image patches
Training (per scene, ~10-30 minutes):
- Render training views via differentiable splatting
- Compare to ground-truth images (masked for dynamic objects)
- Backpropagate gradients to Gaussian parameters
- Adaptive density control: split/clone/prune
- Also supervise with lidar depth where available
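The supervision in this loop combines a masked photometric term with a lidar depth term. A hedged NumPy sketch; the 0.1 depth weight is an illustrative choice, not a published value:

```python
import numpy as np

def reconstruction_loss(rendered, target, dyn_mask, rendered_depth, lidar_depth,
                        depth_weight=0.1):
    """Masked photometric L1 plus lidar depth L1 for one training view."""
    static = ~dyn_mask                               # supervise only static pixels
    photo = np.abs(rendered - target)[static].mean()
    valid = lidar_depth > 0                          # pixels with a lidar return
    depth = np.abs(rendered_depth - lidar_depth)[valid].mean()
    return photo + depth_weight * depth
```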
Step 3: Dynamic Actor Models
Dynamic actors (vehicles, pedestrians, cyclists) are handled separately from the static scene:
Option A: Per-Actor Gaussian Models
- Reconstruct each actor across frames where visible
- Gaussian positions in actor-local coordinates
- At sim time: transform by desired actor pose
Option B: PBR Asset Insertion (Applied Intuition's approach)
- Detect and classify actors in log data
- Match to high-quality PBR 3D asset library
- Render PBR actors with matched lighting/material
- Composite into neural background
Advantage: PBR actors can have arbitrary new poses,
animations, and interactions -- not limited to
reconstructed appearances
Multi-Sensor Support
┌────────────────────────────────────────────────────┐
│ Multi-Sensor Rendering │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Camera │ │ Lidar │ │ Radar │ │
│ │ Rendering │ │ Rendering│ │ Rendering│ │
│ ├───────────┤ ├───────────┤ ├───────────┤ │
│ │ GS raster- │ │ Ray- │ │ Learned │ │
│ │ ization │ │ Gaussian │ │ radar │ │
│ │ + rolling │ │ intersect │ │ cross- │ │
│ │ shutter │ │ + beam │ │ section │ │
│ │ + exposure │ │ model │ │ model │ │
│ │ + lens │ │ + ray │ │ + multi- │ │
│ │ distortion │ │ drop │ │ path │ │
│ │ │ │ + intens. │ │ │ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ RGB Images Point Clouds Radar Returns │
│ (per camera) (range, intens) (range, vel) │
└────────────────────────────────────────────────────┘
Automated ML Pipelines for Fleet-Scale Reconstruction
At fleet scale, reconstruction must be fully automated:
Fleet Data Lake (petabytes)
│
▼
┌──────────────────────────┐
│ Scene Selection │
│ - Filter by scenario tag │
│ - Diversity sampling │
│ - Geographic coverage │
└──────────┬───────────────┘
│
▼
┌──────────────────────────┐
│ Distributed Training │
│ - GPU cluster (cloud) │
│ - One GPU per scene │
│ - Batch of 100s parallel │
│ - Automated hyperparams │
└──────────┬───────────────┘
│
▼
┌──────────────────────────┐
│ Quality Gating │
│ - Auto PSNR/SSIM check │
│ - Lidar fidelity check │
│ - Artifact detection │
│ - Human review (sampled) │
└──────────┬───────────────┘
│
▼
┌──────────────────────────┐
│ Scene Database │
│ - Versioned neural assets │
│ - Searchable metadata │
│ - Scenario annotations │
└──────────────────────────┘
Validation: Two Levels of Metrics
Neural Sim validation operates at two levels:
Level 1: Upstream Reconstruction Quality
How well does the neural scene match the original sensor data?
| Metric | What It Measures | Target |
|---|---|---|
| PSNR | Pixel-level accuracy (dB) | > 27 dB |
| SSIM | Structural similarity | > 0.85 |
| LPIPS | Perceptual similarity (learned) | < 0.15 |
| Lidar Chamfer | Point cloud geometric accuracy | < 0.05 m |
| Lidar Intensity MAE | Reflectance accuracy | < 0.1 |
| FID | Distribution-level realism | < 50 |
Level 2: Downstream Perception Performance
Does the perception stack perform equally well on neural-rendered vs. real data?
Real Data Neural-Rendered Data
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Perception │ │ Perception │
│ Stack │ │ Stack │
│ (same model)│ │ (same model)│
└──────┬───────┘ └──────┬───────┘
│ │
▼ ▼
Detection mAP: 72.3 Detection mAP: 71.8
Lane F1: 0.94 Lane F1: 0.93
Tracking MOTA: 68.1 Tracking MOTA: 67.5
Gap should be < 1-2% for the simulation to be trustworthy
This is the ultimate validation: if a perception model produces nearly identical outputs on neural-rendered data vs. real data, then the simulation is faithful enough for testing and development.
Technical Deep Dive
Scene Decomposition: Static vs Dynamic
Scene decomposition is the foundational step in AD neural rendering. The scene must be separated into static elements (which can be reconstructed once) and dynamic elements (which must be modeled separately to allow trajectory editing).
Decomposition Pipeline
Input Frame ────────────────────────────────────────────────────
│
├──► 2D Instance Segmentation (Mask R-CNN, SAM, etc.)
│ │
│ ▼
│ Dynamic Object Masks ──────────────────────┐
│ │
├──► 3D Bounding Box Detection/Tracking │
│ │ │
│ ▼ │
│ Per-Object Tracks ─────────────────────┐ │
│ (position, size, heading per frame) │ │
│ │ │
▼ ▼ ▼
Static Reconstruction Dynamic Object Models
(inpaint masked regions, (per-actor Gaussians
train on static pixels only) in local coordinates)
Challenges in Decomposition
- Shadow handling: A car's shadow is static-looking but moves with the car. Including shadows in the static model creates ghosting artifacts when the car is moved.
Original: Car casts shadow on road
┌───┐
│ C │ ░░░░ (shadow)
└───┘
══════════════ road
Naive move: Car moved, but shadow stays!
┌───┐
░░░░ │ C │
└───┘
══════════════ road
Correct: Shadow masked and inpainted in static scene,
re-rendered with car in new position
- Occlusion hallucination: When a dynamic object is removed, the static model must fill in the region behind it. This is typically handled by:
- Lidar-guided depth completion (we know the approximate geometry behind the car)
- Learned inpainting (neural network fills in plausible texture)
- Multi-frame aggregation (the occluded area may be visible in other frames)
- Semi-static objects: Parked cars, construction barriers -- static in the current log but potentially movable. These are often included in the static model for simplicity but can be separated if needed.
Novel View Synthesis for Ego Pose Changes
The primary use case for neural sim is closed-loop replay: the ego vehicle takes a different action, resulting in a different pose, and we need to render what the sensors would see from that new pose.
Original ego trajectory: ● ─── ● ─── ● ─── ● ─── ●
t=0 t=1 t=2 t=3 t=4
Novel ego trajectory: ● ─── ● ─── ●
t=0 t=1 t=2 ╲
╲
● ─── ●
t=3 t=4
(ego braked and
turned right)
At each new pose, render:
- 6 camera images (surround view)
- 1 lidar sweep (full 360 degrees)
- radar returns
View Extrapolation Limits
Neural rendering works well for interpolation (rendering from a viewpoint between training views) but poorly for extrapolation (rendering from a viewpoint far from any training view):
Training views: ▼ ▼ ▼ ▼ ▼
●───────●───────●───────●───────●
Good quality zone: ◄───────────────────────────────►
(within ~1-2m lateral, ~5m longitudinal of training trajectory)
Degraded quality: ╱
╱ (> 2m lateral
╱ deviation)
●
artifacts here
Practical systems limit the deviation of the simulated ego trajectory from the original logged trajectory. Typical limits:
- Lateral: +/- 1-3 meters
- Longitudinal: +/- 5-10 meters
- Heading: +/- 15-30 degrees
Beyond these limits, the neural scene produces artifacts (blurring, floaters, hallucinated geometry).
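In practice this becomes a simple envelope check before rendering a novel pose. The limits below take the mid-range of the values quoted above, and the frame convention (longitudinal x, lateral y, heading in degrees) is an assumption:

```python
def within_render_envelope(dx_lon_m, dy_lat_m, dheading_deg,
                           lon_max=7.5, lat_max=2.0, heading_max=20.0):
    """Is the novel ego pose close enough to the logged pose to render safely?"""
    return (abs(dx_lon_m) <= lon_max
            and abs(dy_lat_m) <= lat_max
            and abs(dheading_deg) <= heading_max)

within_render_envelope(3.0, 0.5, 5.0)    # small deviation: renderable
within_render_envelope(3.0, 3.5, 5.0)    # 3.5 m lateral: expect artifacts
```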
Lidar Point Cloud Generation from Neural Scenes
Generating realistic lidar point clouds from neural scenes requires modeling the full lidar sensing pipeline:
Step 1: Ray Generation
┌──────────────────────────────────┐
│ For each lidar beam (e.g., 128): │
│ For each azimuth angle: │
│ Compute ray origin and dir │
│ Account for sensor rotation │
│ during sweep (~100ms for │
│ full 360-degree rotation) │
└──────────────────────────────────┘
Step 2: Ray-Scene Intersection
┌──────────────────────────────────┐
│ NeRF: March along ray, accumulate│
│ density to find depth │
│ │
│ 3DGS: Intersect ray with nearby │
│ Gaussians, alpha-composite │
│ depth values │
└──────────────────────────────────┘
Step 3: Sensor Modeling
┌──────────────────────────────────┐
│ - Beam divergence: lidar beams │
│ have non-zero width (~3 mrad) │
│ - Intensity: depends on surface │
│ material and incidence angle │
│ - Ray dropping: some rays return │
│ no measurement (absorption, │
│ out of range, specular reflect)│
│ - Noise: range noise ~1-3 cm │
│ - Multi-return: some beams hit │
│ multiple surfaces (vegetation) │
└──────────────────────────────────┘
Step 4: Point Cloud Assembly
┌──────────────────────────────────┐
│ Assemble (x, y, z, intensity, │
│ ring_id, timestamp) per point │
│ Transform to ego frame │
│ Apply motion compensation │
└──────────────────────────────────┘
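Steps 3 and 4 can be sketched together as a small assembly routine; the noise level and field layout are illustrative:

```python
import numpy as np

def assemble_point_cloud(
    origins: np.ndarray,      # (N, 3) ray origins, world frame
    directions: np.ndarray,   # (N, 3) unit ray directions
    depths: np.ndarray,       # (N,) depth per ray
    intensities: np.ndarray,  # (N,)
    ring_ids: np.ndarray,     # (N,) beam index per ray
    timestamps: np.ndarray,   # (N,) firing time per ray
    hit_mask: np.ndarray,     # (N,) bool, True if ray returned a measurement
    range_noise_std: float = 0.02,  # ~1-3 cm typical range noise
    rng=None,
) -> np.ndarray:
    """Steps 3-4 in miniature: add Gaussian range noise, back-project each
    surviving ray, and pack (x, y, z, intensity, ring_id, timestamp)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = depths + rng.normal(0.0, range_noise_std, depths.shape)
    points = origins + noisy[:, None] * directions
    cloud = np.concatenate(
        [points, intensities[:, None], ring_ids[:, None], timestamps[:, None]],
        axis=-1,
    )
    return cloud[hit_mask]  # (N_hits, 6)
```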
Handling Sky, Road Surfaces, and Distant Geometry
Sky Modeling
Sky has no real 3D geometry -- it is infinitely far away. Naive reconstruction places Gaussians at arbitrary large distances, creating artifacts when the viewpoint changes:
Problem: Solution:
* * * (sky Gaussians at Dedicated sky model:
random depths cause - Segment sky pixels
parallax artifacts) - Render sky with environment
\ | / map (no parallax)
\ | / - Blend sky with scene at
\|/ boundary
[cam]
Common approaches:
- Environment map: Fit a learnable HDR environment map for the sky
- Sky segmentation + separate model: Train a view-direction-only sky network
- Infinite-distance Gaussians: Place sky Gaussians at a very large fixed distance and freeze their positions (no depth gradient) so they cannot drift and reintroduce parallax
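The view-direction-only approach can be sketched as a tiny MLP whose output depends solely on ray direction, so the sky shows no parallax as the ego pose changes. A hypothetical minimal model, not any specific paper's architecture:

```python
import torch
import torch.nn as nn

class SkyModel(nn.Module):
    """View-direction-only sky network: color depends solely on the ray
    direction, never on position, so the sky is effectively at infinity.
    Production systems often fit a learnable HDR environment map instead."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, view_dirs: torch.Tensor) -> torch.Tensor:
        # view_dirs: (N, 3); normalize so scale of input vectors is irrelevant
        return self.mlp(nn.functional.normalize(view_dirs, dim=-1))
```

At render time, this is queried only for pixels where the scene's accumulated alpha is low (the sky region) and blended in behind the scene.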
Road Surface Handling
Roads are problematic because:
- They are viewed at extreme grazing angles
- They cover a large portion of the image
- Lidar returns are sparse on flat surfaces at distance
- Road markings require high-frequency detail
Camera
│
│\
│ \ Extreme grazing angle
│ \
│ \
│ \
════════════════│═════\══════════════════ Road surface
│ \
│ \ Very few pixels per unit area
│ \ at distance
Solutions:
- Planar constraints (AutoSplat): Force road Gaussians onto the estimated ground plane
- Multi-resolution: Use smaller, denser Gaussians for nearby road, larger ones for distant
- Lidar supervision: Use lidar depth as hard constraint for road surface geometry
- Separate road model: Dedicated parameterization for the road surface
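The planar-constraint idea can be expressed as a regularization term that penalizes road Gaussians for drifting off the fitted ground plane. A sketch in the spirit of AutoSplat; the function name and the assumption that a plane (n, d) comes from lidar ground fitting (e.g., RANSAC) are illustrative:

```python
import torch

def road_planar_loss(
    means: torch.Tensor,         # (N, 3) Gaussian positions
    road_mask: torch.Tensor,     # (N,) bool, True for road Gaussians
    plane_normal: torch.Tensor,  # (3,) unit normal of the ground plane
    plane_offset: float,         # plane equation: n . x = d
) -> torch.Tensor:
    """Penalize road Gaussians that drift off the estimated ground plane.
    Added to the photometric loss with a small weight during training."""
    road_means = means[road_mask]
    if road_means.shape[0] == 0:
        return means.new_zeros(())
    # Signed distance of each road Gaussian to the plane
    signed_dist = road_means @ plane_normal - plane_offset  # (N_road,)
    return (signed_dist ** 2).mean()
```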
Distant Geometry
Objects at great distance (buildings on the horizon, mountains, far trees) pose challenges:
Near objects: Many views, good triangulation
●────────────●────────────●
(view 1) (view 2) (view 3)
\ | /
\ | / Good reconstruction
\|/
[bldg]
Far objects: All views see nearly the same angle
●────●────●
(v1) (v2) (v3)
|
| Poor triangulation
|
|
[far mountain]
Solutions:
- Multi-scale representation: Coarse Gaussians for distant geometry
- Depth regularization: Use lidar or monocular depth to constrain far geometry
- Level-of-detail: Render distant objects at lower resolution
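The multi-scale idea can be sketched as a distance-dependent floor on Gaussian scales, so poorly triangulated distant geometry is represented by coarser Gaussians. The per-meter constant is an illustrative tuning knob:

```python
import torch

def enforce_distance_lod(
    means: torch.Tensor,       # (N, 3) Gaussian positions
    log_scales: torch.Tensor,  # (N, 3) log-space scales
    camera_pos: torch.Tensor,  # (3,) reference camera position
    min_scale_per_meter: float = 0.005,  # illustrative constant
) -> torch.Tensor:
    """Clamp each Gaussian's log-scale to a distance-dependent floor,
    a crude level-of-detail: far geometry gets coarser Gaussians,
    which also suppresses thin 'floater' artifacts at the horizon."""
    dist = (means - camera_pos).norm(dim=-1, keepdim=True)  # (N, 1)
    floor = torch.log(min_scale_per_meter * dist.clamp(min=1.0))
    return torch.maximum(log_scales, floor)  # broadcast (N, 1) over (N, 3)
```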
Rolling Shutter and Sensor-Specific Artifacts
Rolling Shutter
Most automotive cameras use rolling shutter sensors, where each row of pixels is exposed at a slightly different time:
Time ──►
┌─────────────────┐
Row 0 ──────────►│ Exposed first │
Row 1 ─────────►│ │
Row 2 ────────►│ │
... │ Each row at a │
Row N-1 ────►│ different time │
└─────────────────┘
If the car is moving at 30 m/s and the full-frame readout takes 33ms:
Total ego motion during one frame: ~1 meter
Different rows "see" different ego poses
To model this correctly:
- Compute the ego pose at the timestamp of each image row
- Render that row from its specific pose
- Assemble the full image from per-row renders (or approximate with a few groups of rows)
NeuRAD showed this is critical: neglecting rolling shutter degrades PSNR by 1-2 dB in highway scenes.
Other Sensor Artifacts
| Artifact | Sensor | How to Model |
|---|---|---|
| Auto-exposure | Camera | Per-frame learnable exposure scaling |
| Lens flare | Camera | Learned post-processing or physics model |
| Chromatic aberration | Camera | Per-channel distortion model |
| Motion blur | Camera | Multi-sample temporal averaging |
| Beam divergence | Lidar | Integrate Gaussian over beam cross-section |
| Ray dropping | Lidar | Learned drop probability model |
| Multi-path reflections | Radar | Physics-based reflection model |
| Blooming | Lidar | Intensity-dependent range bias |
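Per-frame learnable exposure scaling, for example, can be sketched as one trainable log-exposure scalar per training image, applied to the rendered image before the photometric loss so auto-exposure changes are not baked into the scene. A minimal illustrative module:

```python
import torch
import torch.nn as nn

class PerFrameExposure(nn.Module):
    """One learnable log-exposure scalar per training frame.

    Multiplying the rendered image by exp(log_exposure) before computing
    the loss lets the optimizer absorb the camera's auto-exposure drift
    instead of corrupting the Gaussians' colors. Zero-initialized, so the
    starting point is an identity (exposure factor 1.0)."""

    def __init__(self, num_frames: int):
        super().__init__()
        self.log_exposure = nn.Parameter(torch.zeros(num_frames))

    def forward(self, rendered: torch.Tensor, frame_idx: int) -> torch.Tensor:
        # rendered: (..., 3) linear RGB
        return rendered * torch.exp(self.log_exposure[frame_idx])
```

A per-channel variant of the same idea can additionally absorb white-balance changes.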
Code Examples
Basic 3DGS Training Loop
import torch
import torch.nn as nn
from dataclasses import dataclass
from typing import Tuple
@dataclass
class GaussianParams:
"""Parameters for a set of 3D Gaussians."""
means: torch.Tensor # (N, 3) - positions
scales: torch.Tensor # (N, 3) - scale in each axis (log space)
rotations: torch.Tensor # (N, 4) - quaternions
opacities: torch.Tensor # (N, 1) - sigmoid space
sh_coeffs: torch.Tensor # (N, K, 3) - spherical harmonics for color
def num_gaussians(self) -> int:
return self.means.shape[0]
def init_gaussians_from_pointcloud(
points: torch.Tensor, # (N, 3) from lidar
colors: torch.Tensor, # (N, 3) initial RGB
device: str = "cuda"
) -> GaussianParams:
"""Initialize Gaussians from a lidar point cloud."""
N = points.shape[0]
# Position: directly from point cloud
means = points.clone().to(device).requires_grad_(True)
# Scale: small initial size, in log space
# Use RMS nearest-neighbor distance as the initial scale
from pytorch3d.ops import knn_points  # note: knn_points returns squared distances
dists, _, _ = knn_points(points.unsqueeze(0), points.unsqueeze(0), K=4)
avg_dist = dists[0, :, 1:].mean(dim=-1).sqrt()  # skip self-match at index 0
scales = torch.log(avg_dist.unsqueeze(-1).repeat(1, 3)).to(device)
scales.requires_grad_(True)
# Rotation: identity quaternion
rotations = torch.zeros(N, 4, device=device)
rotations[:, 0] = 1.0 # w=1, x=y=z=0
rotations.requires_grad_(True)
# Opacity: moderate initial value (in logit space)
opacities = torch.full((N, 1), 0.5, device=device) # sigmoid(0.5) ~ 0.62
opacities.requires_grad_(True)
# Spherical harmonics: degree 0 initialized from colors
# SH degree 0 coefficient = color * C0, where C0 = 0.28209479...
C0 = 0.28209479177387814
sh_dc = (colors / C0).unsqueeze(1).to(device) # (N, 1, 3)
num_sh_extra = 15 # degrees 1-3: 15 additional coefficients
sh_rest = torch.zeros(N, num_sh_extra, 3, device=device)
sh_coeffs = torch.cat([sh_dc, sh_rest], dim=1).requires_grad_(True)
return GaussianParams(
means=means,
scales=scales,
rotations=rotations,
opacities=opacities,
sh_coeffs=sh_coeffs,
)
def build_covariance_3d(scales: torch.Tensor, rotations: torch.Tensor) -> torch.Tensor:
"""Build 3D covariance matrices from scale and rotation.
Covariance = R @ S @ S^T @ R^T
where S = diag(exp(scales)), R = quaternion_to_matrix(rotations)
"""
# Convert log-scale to actual scale
S = torch.diag_embed(torch.exp(scales)) # (N, 3, 3)
# Quaternion to rotation matrix
R = quaternion_to_rotation_matrix(rotations) # (N, 3, 3)
# Covariance = R S S^T R^T
RS = torch.bmm(R, S) # (N, 3, 3)
cov = torch.bmm(RS, RS.transpose(1, 2)) # (N, 3, 3)
return cov
def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
"""Convert quaternion (w, x, y, z) to 3x3 rotation matrix."""
q = nn.functional.normalize(q, dim=-1)
w, x, y, z = q.unbind(-1)
R = torch.stack([
1 - 2*(y*y + z*z), 2*(x*y - w*z), 2*(x*z + w*y),
2*(x*y + w*z), 1 - 2*(x*x + z*z), 2*(y*z - w*x),
2*(x*z - w*y), 2*(y*z + w*x), 1 - 2*(x*x + y*y),
], dim=-1).reshape(-1, 3, 3)
return R
class GaussianRasterizer:
"""Simplified tile-based Gaussian rasterizer (pseudocode).
In practice, use the CUDA implementation from the original 3DGS paper
or libraries like gsplat, nerfstudio, or diff-gaussian-rasterization.
"""
def __init__(self, image_width: int, image_height: int, tile_size: int = 16):
self.W = image_width
self.H = image_height
self.tile_size = tile_size
def forward(
self,
gaussians: GaussianParams,
camera_intrinsics: torch.Tensor, # (3, 3)
camera_extrinsics: torch.Tensor, # (4, 4) world-to-camera
camera_direction: torch.Tensor, # (3,) view direction for SH
) -> torch.Tensor:
"""Render an image from the Gaussians.
Returns: (H, W, 3) rendered RGB image.
NOTE: This is pseudocode. The actual implementation requires
a custom CUDA kernel for the tile-based rasterization.
"""
# Step 1: Transform Gaussians to camera space
means_cam = transform_points(gaussians.means, camera_extrinsics)
# Step 2: Project 3D covariances to 2D
cov_3d = build_covariance_3d(gaussians.scales, gaussians.rotations)
means_2d, cov_2d = project_gaussians(
means_cam, cov_3d, camera_intrinsics
)
# Step 3: Evaluate SH to get view-dependent colors
colors = eval_spherical_harmonics(
gaussians.sh_coeffs, camera_direction
)
# Step 4: Tile-based rasterization
# (In practice, this is a fused CUDA kernel)
image = tile_based_rasterize(
means_2d, cov_2d, colors,
torch.sigmoid(gaussians.opacities),
self.H, self.W, self.tile_size
)
return image
def training_loop(
train_dataset, # provides (image, camera_params) pairs
lidar_points, # initial point cloud
lidar_colors, # initial colors from nearest images
num_iterations: int = 30_000,
lr_means: float = 1.6e-4,
lr_scales: float = 5e-3,
lr_rotations: float = 1e-3,
lr_opacities: float = 5e-2,
lr_sh: float = 2.5e-3,
densify_interval: int = 100,
densify_start: int = 500,
densify_stop: int = 15_000,
):
"""Main 3DGS training loop for a driving scene."""
# Initialize Gaussians from lidar
gaussians = init_gaussians_from_pointcloud(lidar_points, lidar_colors)
rasterizer = GaussianRasterizer(
image_width=1920, image_height=1080
)
# Separate optimizers for different parameter groups
optimizer = torch.optim.Adam([
{"params": [gaussians.means], "lr": lr_means},
{"params": [gaussians.scales], "lr": lr_scales},
{"params": [gaussians.rotations], "lr": lr_rotations},
{"params": [gaussians.opacities], "lr": lr_opacities},
{"params": [gaussians.sh_coeffs], "lr": lr_sh},
])
# Learning rate scheduler: exponential decay (applied to all groups here;
# the original 3DGS decays only the position learning rate)
scheduler = torch.optim.lr_scheduler.ExponentialLR(
optimizer, gamma=0.01 ** (1.0 / num_iterations)
)
for iteration in range(num_iterations):
# Sample a random training view
image_gt, camera_params = train_dataset.random_sample()
# Render
image_pred = rasterizer.forward(
gaussians,
camera_params.intrinsics,
camera_params.extrinsics,
camera_params.view_direction,
)
# Loss: L1 + D-SSIM (as in original 3DGS paper)
l1_loss = torch.abs(image_pred - image_gt).mean()
ssim_loss = 1.0 - compute_ssim(image_pred, image_gt)
loss = 0.8 * l1_loss + 0.2 * ssim_loss
# Backprop. Zero grads *before* backward so the densify step
# below can still read this iteration's position gradients.
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
# Adaptive density control
if densify_start < iteration < densify_stop:
if iteration % densify_interval == 0:
gaussians = adaptive_density_control(
gaussians,
grad_threshold=0.0002,
min_opacity=0.005,
max_screen_size=20,
)
if iteration % 1000 == 0:
print(f"Iter {iteration}: L1={l1_loss:.4f}, "
f"SSIM={1-ssim_loss:.4f}, "
f"N_gaussians={gaussians.num_gaussians()}")
def adaptive_density_control(
gaussians: GaussianParams,
grad_threshold: float,
min_opacity: float,
max_screen_size: float,
) -> GaussianParams:
"""Split, clone, and prune Gaussians based on gradients and opacity.
Pseudocode -- actual implementation tracks accumulated gradients
over multiple iterations.
"""
# Accumulated position gradients (tracked externally in practice)
grads = gaussians.means.grad.norm(dim=-1) # (N,)
scales = torch.exp(gaussians.scales) # (N, 3)
max_scale = scales.max(dim=-1).values # (N,)
opacities = torch.sigmoid(gaussians.opacities.squeeze()) # (N,)
# CLONE: small Gaussians with high gradient (under-reconstruction)
clone_mask = (grads > grad_threshold) & (max_scale < 0.01)
# SPLIT: large Gaussians with high gradient (over-reconstruction)
split_mask = (grads > grad_threshold) & (max_scale >= 0.01)
# PRUNE: transparent Gaussians
prune_mask = opacities < min_opacity
# Apply operations (pseudocode)
# gaussians = clone(gaussians, clone_mask)
# gaussians = split(gaussians, split_mask)
# gaussians = prune(gaussians, prune_mask)
return gaussians
Novel View Rendering
def render_novel_view(
gaussians: GaussianParams,
novel_pose: torch.Tensor, # (4, 4) new ego pose
camera_calibration: dict, # intrinsics, distortion, etc.
original_pose: torch.Tensor, # (4, 4) original ego pose
dynamic_actors: list, # list of (actor_gaussians, actor_pose)
sky_model: nn.Module, # environment map for sky
) -> torch.Tensor:
"""Render a camera image from a novel ego pose.
This combines the static neural scene, dynamic actors,
and sky model into a final rendered image.
"""
rasterizer = GaussianRasterizer(
image_width=camera_calibration["width"],
image_height=camera_calibration["height"],
)
# Camera pose in world = ego pose composed with fixed camera-to-ego calibration
camera_to_ego = camera_calibration["extrinsics"]  # fixed calibration
world_to_camera = torch.inverse(novel_pose @ camera_to_ego)
# 1. Render static background
static_rgb, static_depth, static_alpha = rasterizer.forward_with_depth(
gaussians,
camera_calibration["intrinsics"],
world_to_camera,
compute_view_direction(novel_pose, camera_calibration),
)
# 2. Render each dynamic actor
composed_rgb = static_rgb.clone()
composed_depth = static_depth.clone()
for actor_gs, actor_pose in dynamic_actors:
# Transform actor Gaussians to world space
actor_world_gs = transform_gaussians(actor_gs, actor_pose)
actor_rgb, actor_depth, actor_alpha = rasterizer.forward_with_depth(
actor_world_gs,
camera_calibration["intrinsics"],
world_to_camera,
compute_view_direction(novel_pose, camera_calibration),
)
# Composite: actor in front where closer
actor_closer = actor_depth < composed_depth
mask = actor_closer & (actor_alpha > 0.5)
composed_rgb[mask] = actor_rgb[mask]
composed_depth[mask] = actor_depth[mask]
# 3. Fill sky regions
sky_mask = static_alpha < 0.1 # transparent = sky
if sky_mask.any():
view_dirs = compute_pixel_directions(
camera_calibration, novel_pose
)
sky_color = sky_model(view_dirs[sky_mask])
composed_rgb[sky_mask] = sky_color
# 4. Apply camera effects
composed_rgb = apply_camera_effects(
composed_rgb,
camera_calibration,
effects=["lens_distortion", "vignetting", "auto_exposure"],
)
return composed_rgb
def render_with_rolling_shutter(
gaussians: GaussianParams,
ego_poses_interp: callable, # function: timestamp -> (4, 4) pose
camera_calibration: dict,
frame_timestamp: float,
readout_time: float = 0.033, # 33ms typical rolling shutter
num_row_groups: int = 8, # approximate with 8 sub-renders
) -> torch.Tensor:
"""Render with rolling shutter simulation.
Each group of rows is rendered from a slightly different ego pose,
corresponding to the pose at that row's exposure timestamp.
"""
H = camera_calibration["height"]
W = camera_calibration["width"]
rows_per_group = H // num_row_groups
full_image = torch.zeros(H, W, 3, device="cuda")
for g in range(num_row_groups):
row_start = g * rows_per_group
row_end = min((g + 1) * rows_per_group, H)
# Timestamp for this row group
row_fraction = (row_start + row_end) / (2 * H)
row_timestamp = frame_timestamp + row_fraction * readout_time
# Ego pose at this timestamp
ego_pose = ego_poses_interp(row_timestamp)
# Render full image from this pose
rendered = render_novel_view(
gaussians, ego_pose, camera_calibration, ...
)
# Take only the relevant rows
full_image[row_start:row_end] = rendered[row_start:row_end]
return full_image
Lidar Ray Casting Through a Neural Scene
import torch
import numpy as np
from typing import Tuple, Optional
def generate_lidar_rays(
lidar_pose: torch.Tensor, # (4, 4) lidar pose in world frame
lidar_config: dict, # sensor configuration
sweep_duration: float = 0.1, # 100ms for full rotation
ego_poses_interp: Optional[callable] = None, # for motion compensation
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Generate lidar ray origins and directions for one full sweep.
Returns:
origins: (N_rays, 3) ray origins in world frame
directions: (N_rays, 3) ray directions in world frame
"""
num_beams = lidar_config["num_beams"] # e.g., 128
beam_angles = lidar_config["beam_angles"] # vertical angles per beam
azimuth_resolution = lidar_config["azimuth_resolution"] # e.g., 0.1 deg
fov_azimuth = lidar_config.get("fov_azimuth", 360.0)
num_azimuths = int(fov_azimuth / azimuth_resolution)
# Build ray directions in lidar frame
azimuths = torch.arange(num_azimuths, dtype=torch.float32) * azimuth_resolution * np.pi / 180  # endpoint-exclusive: avoids a duplicate ray at 360 deg
elevations = torch.tensor(beam_angles) * np.pi / 180
# Create grid of (azimuth, elevation) pairs
az_grid, el_grid = torch.meshgrid(azimuths, elevations, indexing="ij")
az_flat = az_grid.flatten()
el_flat = el_grid.flatten()
# Spherical to Cartesian
dirs_lidar = torch.stack([
torch.cos(el_flat) * torch.cos(az_flat),
torch.cos(el_flat) * torch.sin(az_flat),
torch.sin(el_flat),
], dim=-1) # (N_rays, 3)
N_rays = dirs_lidar.shape[0]
if ego_poses_interp is not None:
# Motion-compensated: each azimuth has a different pose
origins = torch.zeros(N_rays, 3, device="cuda")
directions = torch.zeros(N_rays, 3, device="cuda")
for az_idx in range(num_azimuths):
t_fraction = az_idx / num_azimuths
timestamp = t_fraction * sweep_duration
pose_at_t = ego_poses_interp(timestamp)
lidar_world_at_t = pose_at_t @ lidar_config["lidar_to_ego"]
beam_slice = slice(az_idx * num_beams, (az_idx + 1) * num_beams)
origins[beam_slice] = lidar_world_at_t[:3, 3].unsqueeze(0)
directions[beam_slice] = (
lidar_world_at_t[:3, :3] @ dirs_lidar[beam_slice].T
).T
else:
# Simple: all rays from a single pose
origins = lidar_pose[:3, 3].unsqueeze(0).expand(N_rays, -1)
directions = (lidar_pose[:3, :3] @ dirs_lidar.T).T
directions = directions / directions.norm(dim=-1, keepdim=True)
return origins.cuda(), directions.cuda()
def lidar_ray_cast_gaussians(
origins: torch.Tensor, # (N_rays, 3)
directions: torch.Tensor, # (N_rays, 3)
gaussians: GaussianParams,
max_range: float = 120.0, # max lidar range in meters
beam_divergence: float = 3e-3, # ~3 mrad typical
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""Cast lidar rays through a Gaussian scene.
Returns:
depths: (N_rays,) depth per ray (0 if no hit)
intensities: (N_rays,) intensity per ray
hit_mask: (N_rays,) bool, True if ray returned a measurement
"""
N_rays = origins.shape[0]
N_gaussians = gaussians.num_gaussians()
# For each ray, find nearby Gaussians (spatial hashing or BVH in practice)
# Here we use a simplified brute-force approach for clarity
means = gaussians.means # (N_gs, 3)
scales = torch.exp(gaussians.scales) # (N_gs, 3)
max_extent = scales.max(dim=-1).values * 3 # 3-sigma cutoff
depths = torch.zeros(N_rays, device="cuda")
intensities = torch.zeros(N_rays, device="cuda")
hit_mask = torch.zeros(N_rays, dtype=torch.bool, device="cuda")
# Process in chunks to manage memory
chunk_size = 4096
for start in range(0, N_rays, chunk_size):
end = min(start + chunk_size, N_rays)
ray_o = origins[start:end] # (C, 3)
ray_d = directions[start:end] # (C, 3)
C = ray_o.shape[0]
# Compute ray-Gaussian distances
# For each ray and Gaussian, find closest point on ray to Gaussian mean
diff = means.unsqueeze(0) - ray_o.unsqueeze(1) # (C, N_gs, 3)
t_closest = (diff * ray_d.unsqueeze(1)).sum(-1) # (C, N_gs)
t_closest = t_closest.clamp(min=0, max=max_range)
closest_point = ray_o.unsqueeze(1) + t_closest.unsqueeze(-1) * ray_d.unsqueeze(1)
dist_to_mean = (closest_point - means.unsqueeze(0)).norm(dim=-1) # (C, N_gs)
# Filter: only consider Gaussians within their extent
within_range = dist_to_mean < max_extent.unsqueeze(0)
# For qualifying Gaussians, compute alpha-weighted depth.
# Simplified isotropic falloff using each Gaussian's largest scale;
# a full implementation evaluates the Mahalanobis distance under the
# anisotropic covariance from build_covariance_3d.
opacities = torch.sigmoid(gaussians.opacities.squeeze())
gaussian_weight = torch.exp(-0.5 * (dist_to_mean ** 2) /
(scales.max(dim=-1).values.unsqueeze(0) ** 2))
gaussian_weight = gaussian_weight * opacities.unsqueeze(0)
gaussian_weight[~within_range] = 0
# Alpha compositing along depth-sorted Gaussians
# Sort by t_closest for each ray
sorted_t, sort_idx = t_closest.sort(dim=-1)
sorted_weights = gaussian_weight.gather(1, sort_idx)
# Compute transmittance and alpha
alpha = sorted_weights.clamp(0, 0.99)
transmittance = torch.cumprod(1 - alpha + 1e-10, dim=-1)
transmittance = torch.cat([
torch.ones(C, 1, device="cuda"),
transmittance[:, :-1]
], dim=-1)
weights = alpha * transmittance # (C, N_gs)
# Expected depth: opacity-weighted sum over sorted Gaussian depths
ray_depth = (weights * sorted_t).sum(dim=-1) # (C,)
total_weight = weights.sum(dim=-1)
# Hit detection
ray_hit = total_weight > 0.5 # sufficient accumulated opacity
depths[start:end] = ray_depth
hit_mask[start:end] = ray_hit & (ray_depth > 0.5) & (ray_depth < max_range)
# Intensity (simplified: based on normal and distance)
intensities[start:end] = estimate_lidar_intensity(
ray_depth, ray_d, gaussians, sort_idx, weights
)
return depths, intensities, hit_mask
def simulate_ray_dropping(
depths: torch.Tensor,
intensities: torch.Tensor,
hit_mask: torch.Tensor,
drop_model: Optional[nn.Module] = None,
) -> torch.Tensor:
"""Simulate realistic lidar ray dropping.
Real lidars drop rays due to:
- Specular reflections (puddles, glass)
- Out-of-range returns
- Dark surfaces (low reflectance)
- Atmospheric effects (rain, fog, dust)
"""
if drop_model is not None:
# Learned ray drop model
features = torch.stack([depths, intensities], dim=-1)
drop_prob = drop_model(features).squeeze(-1)
keep = torch.bernoulli(1 - drop_prob).bool()
else:
# Simple heuristic model
# Higher drop probability at long range and low intensity
range_factor = (depths / 120.0).clamp(0, 1)
intensity_factor = (1 - intensities).clamp(0, 1)
drop_prob = 0.05 + 0.1 * range_factor + 0.1 * intensity_factor
keep = torch.bernoulli(1 - drop_prob).bool()
final_mask = hit_mask & keep
return final_mask
Quality Metric Computation
import torch
import torch.nn.functional as F
from torchmetrics.image import (
PeakSignalNoiseRatio,
StructuralSimilarityIndexMeasure,
)
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
class NeuralSimMetrics:
"""Compute reconstruction quality metrics for neural sim validation.
Two levels of metrics:
1. Upstream: how well does the reconstruction match ground truth?
2. Downstream: does the perception stack perform the same?
"""
def __init__(self, device: str = "cuda"):
self.device = device
# Image quality metrics
self.psnr = PeakSignalNoiseRatio(data_range=1.0).to(device)
self.ssim = StructuralSimilarityIndexMeasure(data_range=1.0).to(device)
self.lpips = LearnedPerceptualImagePatchSimilarity(
net_type="alex", # AlexNet backbone
normalize=True,
).to(device)
def compute_image_metrics(
self,
rendered: torch.Tensor, # (B, 3, H, W) predicted images [0, 1]
ground_truth: torch.Tensor, # (B, 3, H, W) real images [0, 1]
) -> dict:
"""Compute upstream image quality metrics."""
metrics = {}
# PSNR: Peak Signal-to-Noise Ratio (higher is better)
# Measures pixel-level accuracy. PSNR > 27 dB is typically good.
metrics["psnr"] = self.psnr(rendered, ground_truth).item()
# SSIM: Structural Similarity Index (higher is better, max 1.0)
# Measures structural patterns. SSIM > 0.85 is typically good.
metrics["ssim"] = self.ssim(rendered, ground_truth).item()
# LPIPS: Learned Perceptual Image Patch Similarity (lower is better)
# Uses deep features to measure perceptual quality. LPIPS < 0.15 is good.
metrics["lpips"] = self.lpips(rendered, ground_truth).item()
return metrics
def compute_lidar_metrics(
self,
rendered_points: torch.Tensor, # (N, 3) rendered point cloud
ground_truth_points: torch.Tensor, # (M, 3) real point cloud
rendered_intensity: torch.Tensor, # (N,) rendered intensity
gt_intensity: torch.Tensor, # (M,) real intensity
) -> dict:
"""Compute upstream lidar quality metrics."""
metrics = {}
# Chamfer Distance: average bidirectional nearest-neighbor distance
# Lower is better. < 0.05m is typically good.
from pytorch3d.loss import chamfer_distance
cd, _ = chamfer_distance(
rendered_points.unsqueeze(0),
ground_truth_points.unsqueeze(0),
)
metrics["chamfer_distance_m"] = cd.item()
# Median Absolute Depth Error
# Match rendered and GT points by lidar beam ID, then compare depth
# (simplified here as overall statistics)
metrics["median_depth_error_m"] = torch.median(
torch.abs(rendered_points[:, :3].norm(dim=-1) -
ground_truth_points[:rendered_points.shape[0], :3].norm(dim=-1))
).item()
# Intensity MAE
min_len = min(len(rendered_intensity), len(gt_intensity))
metrics["intensity_mae"] = torch.abs(
rendered_intensity[:min_len] - gt_intensity[:min_len]
).mean().item()
return metrics
def compute_downstream_metrics(
self,
perception_model: torch.nn.Module,
rendered_data: dict, # sensor data from neural sim
real_data: dict, # real sensor data
ground_truth_labels: dict, # 3D bounding boxes, lanes, etc.
) -> dict:
"""Compute downstream perception metrics.
The key question: does the perception stack produce the same
outputs on neural-rendered data vs. real data?
"""
metrics = {}
# Run perception on real data
with torch.no_grad():
real_detections = perception_model(real_data)
rendered_detections = perception_model(rendered_data)
# Detection mAP on real vs rendered
real_map = compute_detection_map(
real_detections, ground_truth_labels
)
rendered_map = compute_detection_map(
rendered_detections, ground_truth_labels
)
metrics["real_detection_map"] = real_map
metrics["rendered_detection_map"] = rendered_map
metrics["detection_map_gap"] = abs(real_map - rendered_map)
# The gap should be small (< 1-2%) for trustworthy simulation
metrics["sim_trustworthy"] = metrics["detection_map_gap"] < 0.02
return metrics
def compute_psnr_manual(
rendered: torch.Tensor,
ground_truth: torch.Tensor,
max_val: float = 1.0,
) -> float:
"""Manual PSNR computation for understanding.
PSNR = 10 * log10(MAX^2 / MSE)
= 20 * log10(MAX / RMSE)
Higher PSNR = less error = better reconstruction.
Typical values for neural rendering:
- 25-28 dB: decent quality
- 28-32 dB: good quality
- 32+ dB: excellent quality
"""
mse = F.mse_loss(rendered, ground_truth)
if mse == 0:
return float("inf")
psnr = 20 * torch.log10(torch.tensor(max_val)) - 10 * torch.log10(mse)
return psnr.item()
def compute_ssim(
img1: torch.Tensor, # (B, C, H, W)
img2: torch.Tensor, # (B, C, H, W)
window_size: int = 11,
C1: float = 0.01 ** 2,
C2: float = 0.03 ** 2,
) -> torch.Tensor:
"""Structural Similarity Index (simplified).
SSIM compares luminance, contrast, and structure:
SSIM(x, y) = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) /
(mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2)
Values range [0, 1]. Higher = more similar.
"""
# Create Gaussian window
coords = torch.arange(window_size, dtype=torch.float32) - window_size // 2
gauss = torch.exp(-coords ** 2 / (2 * 1.5 ** 2))
gauss = gauss / gauss.sum()
window = gauss.unsqueeze(0) * gauss.unsqueeze(1)
window = window.unsqueeze(0).unsqueeze(0) # (1, 1, K, K)
window = window.expand(img1.shape[1], -1, -1, -1).to(img1.device)
mu1 = F.conv2d(img1, window, groups=img1.shape[1], padding=window_size // 2)
mu2 = F.conv2d(img2, window, groups=img2.shape[1], padding=window_size // 2)
mu1_sq = mu1 ** 2
mu2_sq = mu2 ** 2
mu1_mu2 = mu1 * mu2
sigma1_sq = F.conv2d(img1 * img1, window, groups=img1.shape[1],
padding=window_size // 2) - mu1_sq
sigma2_sq = F.conv2d(img2 * img2, window, groups=img2.shape[1],
padding=window_size // 2) - mu2_sq
sigma12 = F.conv2d(img1 * img2, window, groups=img1.shape[1],
padding=window_size // 2) - mu1_mu2
ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) / \
((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))
return ssim_map.mean()
Mental Models & Diagrams
Neural Sim Pipeline (End-to-End)
┌─────────────────────────────────────────────────────────────────────────┐
│ NEURAL SIM: END-TO-END PIPELINE │
│ │
│ PHASE 1: DATA COLLECTION │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Fleet Vehicle │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │Cam 0 │ │Cam 1 │ │Cam 2 │ │LiDAR │ │ IMU/ │ │ │
│ │ │Front │ │Left │ │Right │ │360 │ │ GPS │ │ │
│ │ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │ │
│ │ └────────┴────────┴────────┴────────┘ │ │
│ │ │ │ │
│ │ Drive Log │ │
│ │ (images, points, poses, timestamps) │ │
│ └────────────────────────┬───────────────────────────────────────────┘ │
│ │ │
│ PHASE 2: RECONSTRUCTION ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │ │
│ │ │Pose Refine │ │Dynamic Mask │ │Sky Segmentation │ │ │
│ │ │(SfM/SLAM) │ │(Track+Seg) │ │(Semantic Seg) │ │ │
│ │ └─────┬──────┘ └──────┬───────┘ └────────────┬────────────┘ │ │
│ │ │ │ │ │ │
│ │ └────────────────┼────────────────────────┘ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Gaussian Splatting Training │ │ │
│ │ │ (static background) │ │ │
│ │ │ + Actor Model Training │ │ │
│ │ │ + Sky Model Fitting │ │ │
│ │ └──────────────┬──────────────┘ │ │
│ │ │ │ │
│ │ Neural Scene │ │
│ └──────────────────────────┬─────────────────────────────────────────┘ │
│ │ │
│ PHASE 3: SIMULATION ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Scenario Definition: │ │
│ │ "Ego brakes 0.5s later; lead vehicle cuts in from left" │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Update │───►│ Render │───►│ Run │──┐ │ │
│ │ │ Scene │ │ Sensors │ │ Percep. │ │ │ │
│ │ │ State │ │ (cam+lid)│ │ Stack │ │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │ │
│ │ ▲ │ │ │
│ │ │ ┌──────────┐ ┌──────────┐ │ │ │
│ │ └──────────┤ Update │◄───┤ Run │◄─┘ │ │
│ │ │ Actors │ │ Planner │ │ │
│ │ └──────────┘ └──────────┘ │ │
│ │ │ │
│ │ Output: Planner decisions, safety metrics, perception KPIs │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ PHASE 4: VALIDATION │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Upstream: PSNR=29.1 dB | SSIM=0.88 | LPIPS=0.11 | Chamfer=0.03m│ │
│ │ Downstream: det mAP gap=0.8% | lane F1 gap=0.5% | PASS │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
NeRF vs Gaussian Splatting Rendering Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ RENDERING APPROACH COMPARISON │
│ │
│ NeRF (Backward / Ray Marching): │
│ │
│ For each PIXEL: │
│ Cast a ray through the scene │
│ Sample N points along the ray (e.g., 64 + 128 = 192) │
│ For each sample: query MLP --> (color, density) │
│ Accumulate via volume rendering │
│ │
│ Camera Scene (implicit MLP) │
│ │ │
│ │ ray ●──●──●──●──●──●──● (sample points) │
│ ├────────►│ │ │ │ │ │ │ Each ● = MLP forward pass │
│ │ ●──●──●──●──●──●──● │
│ │ ray │ │ │ │ │ │ │ │
│ ├────────►●──●──●──●──●──●──● │
│ │ │
│ Cost: H x W x N_samples x MLP_cost │
│ For 1920x1080: ~400M MLP evaluations per frame │
│ Result: 0.1 - 5 FPS │
│ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ │
│ 3DGS (Forward / Splatting): │
│ │
│ For each GAUSSIAN: │
│ Project onto image plane (one matrix multiply) │
│ Determine which tiles it overlaps │
│ Sort all Gaussians by depth │
│ For each TILE (16x16 pixels): │
│ Alpha-composite sorted Gaussians │
│ │
│ Gaussians Camera / Image │
│ ● ┌─────┬─────┐ │
│ ● project │░░░░░│ │ ░ = contributions │
│ ● ──────────► │░░░░░│░░ │ from projected │
│ ● ├─────┼─────┤ Gaussians │
│ ● │ │░░░░░│ │
│ └─────┴─────┘ │
│ (tile-based compositing) │
│ │
│ Cost: N_gaussians x (project + sort + composite) │
│ Highly parallelizable on GPU (one thread block per tile) │
│ Result: 100 - 300 FPS │
│ │
└─────────────────────────────────────────────────────────────────────────┘
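The ~400M figure for NeRF follows directly from the pixel count and the number of samples per ray:

```python
# Per-frame MLP evaluations for NeRF ray marching at 1920x1080,
# with 64 coarse + 128 fine samples per ray (one MLP query per sample).
width, height = 1920, 1080
nerf_samples_per_ray = 64 + 128
nerf_evals = width * height * nerf_samples_per_ray
print(f"NeRF: {nerf_evals / 1e6:.0f}M MLP evaluations per frame")  # ~398M
```

3DGS avoids this entirely: its per-frame cost scales with the number of Gaussians (typically a few million projections plus one depth sort), not with pixels times samples.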
Scene Decomposition Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ SCENE DECOMPOSITION FOR AD │
│ │
│ Input: Single Drive Log Sequence (10-20 seconds) │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Frame t=0 Frame t=1 Frame t=2 │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ road │ │ road │ │ road │ │ │
│ │ │ ┌──┐ │ │ ┌──┐ │ │ ┌──┐ │ │ │
│ │ │ │A │ │ │ │A │ │ │ │A │ │ (A = car) │ │
│ │ │ └──┘ │ │ └──┘ │ │ └──┘ │ │ │
│ │ │ bldg │ │ bldg │ │ bldg │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────┐ │
│ │ DETECTION + TRACKING │ │
│ │ 3D bounding boxes per actor per frame │ │
│ │ Actor A: [(x0,y0,z0,w,h,l,yaw), ...] │ │
│ └────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌─────────────┴──────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌───────────────────────┐ │
│ │ STATIC LAYER │ │ DYNAMIC LAYER │ │
│ │ │ │ │ │
│ │ All pixels NOT │ │ Per-actor cropped │ │
│ │ belonging to │ │ observations: │ │
│ │ tracked actors │ │ │ │
│ │ │ │ ┌──┐ ┌──┐ ┌──┐ │ │
│ │ ┌──────────┐ │ │ │A │ │A │ │A │ │ │
│ │ │ road │ │ │ │t0│ │t1│ │t2│ │ │
│ │ │ ████ │ │ │ └──┘ └──┘ └──┘ │ │
│ │ │ (hole │ │ │ │ │
│ │ │ filled │ │ │ Train per-actor model │ │
│ │ │ by │ │ │ in actor-local coords │ │
│ │ │ inpaint│ │ │ │ │
│ │ │ + lidar│ │ │ OR match to PBR asset │ │
│ │ │ depth) │ │ │ from library │ │
│ │ │ bldg │ │ │ │ │
│ │ └──────────┘ │ └───────────────────────┘ │
│ │ │ │
│ │ Train 3DGS on │ │
│ │ masked frames │ │
│ └──────────────────┘ │
│ │
│ ┌─────────────┴──────────────┐ │
│ ▼ ▼ │
│ AT SIMULATION TIME: │
│ ┌──────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Static BG (fixed) + Actor A at new_pose_A │ │
│ │ + Actor B at new_pose_B │ │
│ │ + New Actor C (from library) │ │
│ │ + Sky dome │ │
│ │ = Composed Scene Render │ │
│ │ │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
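The "AT SIMULATION TIME" composition step above can be sketched as follows, assuming each actor's Gaussians are stored as centers in actor-local coordinates (a simplification: a real system would also transform covariances and spherical-harmonics coefficients, and handle opacity/appearance):

```python
import numpy as np

def compose_scene(static_means, actors):
    """Compose a render-ready Gaussian set from a static background plus posed actors.

    Args:
        static_means: (N, 3) world-frame centers of background Gaussians
        actors: list of (actor_means, pose) pairs, where actor_means is (M, 3)
            in actor-local coordinates and pose is a (4, 4) local-to-world matrix
    Returns:
        (N + sum(M), 3) world-frame Gaussian centers for the composed scene
    """
    parts = [static_means]
    for actor_means, pose in actors:
        R, t = pose[:3, :3], pose[:3, 3]
        parts.append(actor_means @ R.T + t)  # rigid transform into world frame
    return np.concatenate(parts, axis=0)
```

Because 3DGS is explicit, composition is literally concatenation of transformed primitive sets; this is what makes per-actor editing cheap compared to an implicit MLP scene.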
Hands-On Exercises
Exercise 1: Implement a Minimal 2D Gaussian Splatting Renderer
Goal: Build intuition for how Gaussian splatting works by implementing a 2D version from scratch.
Task:
- Create a set of 2D Gaussians (position, scale, rotation, color, opacity)
- Implement the forward rendering pass: project, sort by depth, alpha-composite
- Implement a training loop that optimizes Gaussian parameters to match a target image
- Visualize the optimization process
# Starter code
import torch
import matplotlib.pyplot as plt
class Gaussian2D:
    def __init__(self, n_gaussians=1000, image_size=256):
        # Centers scattered around the image center
        self.means = torch.randn(n_gaussians, 2) * image_size / 4 + image_size / 2
        self.means.requires_grad_(True)
        # Per-axis standard deviation in pixels
        self.scales = torch.full((n_gaussians, 2), 3.0).requires_grad_(True)
        # Rotation angle in radians (combines with scales to form the 2x2 covariance)
        self.rotations = torch.zeros(n_gaussians).requires_grad_(True)
        self.colors = torch.rand(n_gaussians, 3).requires_grad_(True)
        # Opacity logits: pass through sigmoid in render() to keep them in (0, 1)
        self.opacities = torch.zeros(n_gaussians, 1).requires_grad_(True)
# TODO: implement render() and training loop
# Expected outcome: reproduce a target image using ~5000 Gaussians
# in under 1000 optimization steps
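If the render() TODO is hard to get started on, here is one minimal, standalone sketch that splats axis-aligned Gaussians (it deliberately ignores rotation and depth sorting; the function name and simplifications are ours):

```python
import torch

def render_gaussians_2d(means, scales, colors, opacity_logits, image_size=64):
    """Splat axis-aligned 2D Gaussians onto an image with additive blending.

    Args:
        means: (N, 2) centers in pixel coordinates
        scales: (N, 2) per-axis standard deviations in pixels
        colors: (N, 3) RGB colors
        opacity_logits: (N, 1) opacity logits (sigmoid applied here)
    Returns:
        (image_size, image_size, 3) rendered image in [0, 1]
    """
    ys, xs = torch.meshgrid(
        torch.arange(image_size, dtype=torch.float32),
        torch.arange(image_size, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)       # (P, 2)
    d = pix[:, None, :] - means[None, :, :]                  # (P, N, 2)
    # Axis-aligned Gaussian falloff: exp(-0.5 * sum((d / s)^2))
    falloff = torch.exp(-0.5 * ((d / scales[None]) ** 2).sum(-1))  # (P, N)
    alpha = torch.sigmoid(opacity_logits).squeeze(-1)[None] * falloff
    img = (alpha[..., None] * colors[None]).sum(1)           # (P, 3)
    return img.clamp(0, 1).reshape(image_size, image_size, 3)
```

This is differentiable end-to-end, so it drops straight into an Adam loop against an L1 or L2 loss on the target image; adding rotation and proper front-to-back compositing is the rest of the exercise.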
What you will learn: The core splatting algorithm, alpha compositing, gradient-based optimization of explicit primitives.
Exercise 2: Compare NeRF vs 3DGS on a Driving Scene
Goal: Understand the practical trade-offs by training both representations on the same data.
Task:
- Use the nerfstudio framework (supports both NeRF and 3DGS)
- Download a driving scene from the nuScenes mini dataset
- Train both nerfacto (NeRF variant) and splatfacto (3DGS variant)
- Compare: training time, render speed (FPS), PSNR, SSIM, LPIPS
- Try rendering from novel viewpoints 1m, 2m, 5m off the original trajectory
# Setup
pip install nerfstudio
ns-install-cli
# Process nuScenes data
ns-process-data nuscenes --data /path/to/nuscenes-mini --output-dir data/nuscenes
# Train NeRF
ns-train nerfacto --data data/nuscenes --experiment-name nerf_driving
# Train 3DGS
ns-train splatfacto --data data/nuscenes --experiment-name gs_driving
# Compare metrics
ns-eval --load-config outputs/nerf_driving/config.yml
ns-eval --load-config outputs/gs_driving/config.yml
What you will learn: First-hand experience with training and rendering speed differences, quality trade-offs, and extrapolation behavior.
Exercise 3: Implement Scene Decomposition
Goal: Separate a driving sequence into static background and dynamic actors.
Task:
- Take a sequence of driving images with 2D bounding box labels
- Create binary masks for dynamic objects in each frame
- Inpaint the masked regions using OpenCV or a diffusion model
- Train 3DGS on the masked (static-only) images
- Compare the static reconstruction with and without masking
import cv2
import numpy as np
def create_dynamic_mask(image, bounding_boxes, expansion_px=10):
"""Create a binary mask of dynamic objects.
Args:
image: (H, W, 3) input image
bounding_boxes: list of (x1, y1, x2, y2) for each dynamic object
expansion_px: expand each box by this many pixels to cover shadows
"""
mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for (x1, y1, x2, y2) in bounding_boxes:
        # Cast to int so the (possibly float) coordinates can index the array
        x1 = max(0, int(x1) - expansion_px)
        y1 = max(0, int(y1) - expansion_px)
        x2 = min(image.shape[1], int(x2) + expansion_px)
        y2 = min(image.shape[0], int(y2) + expansion_px)
mask[y1:y2, x1:x2] = 255
return mask
def inpaint_static(image, mask):
"""Inpaint dynamic object regions for static scene training."""
return cv2.inpaint(image, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
# TODO: process entire sequence, then train 3DGS on inpainted images
What you will learn: Why decomposition matters, how masking quality affects reconstruction, the challenge of filling in occluded regions.
Exercise 4: Build a Lidar Simulator from a Gaussian Scene
Goal: Generate synthetic lidar point clouds from a trained 3DGS model.
Task:
- Train a 3DGS model on a scene (can reuse from Exercise 2)
- Implement lidar ray generation for a 128-beam spinning lidar (e.g., Velodyne VLS-128 / Alpha Prime)
- Intersect rays with the Gaussian scene to produce depth and intensity
- Compare generated point cloud with real lidar ground truth
- Implement a basic ray drop model
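The ray-generation step can be sketched as below; the elevation layout is a generic placeholder, not any specific sensor's real channel table:

```python
import numpy as np

def lidar_rays(n_beams=128, n_azimuth=1800, elev_range_deg=(-25.0, 15.0)):
    """Generate unit ray directions for one sweep of a spinning lidar.

    Returns:
        dirs: (n_beams * n_azimuth, 3) unit direction vectors in the sensor frame
        azimuth_frac: matching (n_beams * n_azimuth,) fraction of the sweep at
            which each ray fires -- needed later for motion compensation,
            since the ego vehicle moves during the ~100 ms rotation
    """
    elevs = np.deg2rad(np.linspace(*elev_range_deg, n_beams))
    azims = np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False)
    el, az = np.meshgrid(elevs, azims, indexing="ij")
    dirs = np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)], axis=-1).reshape(-1, 3)
    azimuth_frac = np.broadcast_to(azims / (2 * np.pi), el.shape).reshape(-1)
    return dirs, azimuth_frac
```

Each ray is then intersected with the Gaussian scene (alpha-compositing depth along the ray), after which the sensor model (beam divergence, intensity, ray drop, noise) is applied per return.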
Expected output: Side-by-side visualization of real vs. neural-simulated lidar point clouds, with Chamfer distance < 0.1m.
What you will learn: How lidar simulation works with Gaussians, the importance of sensor modeling (beam divergence, ray dropping), and where the current quality limits are.
Exercise 5: Closed-Loop Replay with Neural Rendering
Goal: Implement a minimal closed-loop replay system where the ego trajectory is modified and sensor data is re-rendered.
Task:
- Take a trained neural scene from Exercise 2
- Define an alternative ego trajectory (e.g., lateral offset of 1m)
- Render camera images from the new trajectory
- Run a pretrained object detector on both original and re-rendered images
- Compare detection outputs (are the same objects detected?)
import torch

def modify_ego_trajectory(
    original_poses: list,     # list of (4, 4) torch pose matrices (ego-to-world)
    lateral_offset_m: float,  # how far to shift laterally
) -> list:
    """Create a modified ego trajectory with a lateral offset."""
modified = []
for pose in original_poses:
new_pose = pose.clone()
# Shift in the vehicle's lateral direction (y-axis in ego frame)
lateral_dir = pose[:3, 1] # second column = y-axis
new_pose[:3, 3] += lateral_offset_m * lateral_dir
modified.append(new_pose)
return modified
# TODO: render from modified trajectory, run detector, compare
What you will learn: The complete closed-loop neural sim pipeline, how rendering quality degrades with trajectory deviation, and the practical limits of novel view synthesis.
Exercise 6: Implement and Compare Quality Metrics
Goal: Build a comprehensive evaluation pipeline for neural sim quality.
Task:
- Implement PSNR, SSIM, LPIPS, and FID computation
- Evaluate your 3DGS model from Exercise 2 on held-out test views
- Compute per-image metrics and visualize the distribution
- Identify which image regions have the worst quality (hint: sky boundaries, thin structures, distant objects)
- Compute downstream metrics: run a detection model and compare mAP on real vs. rendered
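PSNR is simple enough to implement directly (SSIM and LPIPS are usually taken from the scikit-image and lpips packages rather than hand-rolled); a minimal sketch:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val].

    PSNR = 10 * log10(max_val^2 / MSE), in decibels; higher is better.
    """
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

Computing this per-image (and per-region, e.g. masked to sky boundaries or distant objects) is what reveals where the reconstruction actually fails, which a single scene-level average hides.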
What you will learn: How to evaluate neural rendering quality at both the pixel level (upstream metrics) and the perception level (downstream metrics). Understanding which metrics matter most for AD simulation.
Interview Questions
1. Why is 3D Gaussian Splatting preferred over NeRF for autonomous driving simulation?
Answer hint: Real-time rendering (100+ FPS vs. <5 FPS), explicit representation (each Gaussian has a position/shape that can be manipulated for scene editing), natural scene decomposition (group Gaussians per object), lidar compatibility (rays can intersect Gaussian primitives), and faster training (minutes vs. hours). For closed-loop simulation, the ego vehicle's perception stack needs to run at real-time rates, which NeRF cannot support.
2. Explain the difference between splatting (forward rendering) and ray marching (backward rendering).
Answer hint: Ray marching (NeRF) starts from each pixel, casts a ray into the scene, and samples the volumetric representation along the ray -- cost is proportional to pixels x samples. Splatting (3DGS) starts from each primitive, projects it onto the image plane, and accumulates contributions -- cost is proportional to primitives. Splatting is more GPU-friendly because it avoids per-pixel ray marching and enables tile-based parallelism. The alpha compositing formula is mathematically equivalent in both cases.
3. How does scene decomposition work in neural rendering for AD, and why is it necessary?
Answer hint: The scene is split into static (roads, buildings, vegetation) and dynamic (vehicles, pedestrians) components. Static elements are reconstructed as a single background model, while dynamic actors get individual models in local coordinates. This is necessary because: (1) dynamic objects must be independently movable for scenario editing, (2) training the static model requires masking out dynamic objects to avoid ghosting, and (3) actors may need to be replaced with PBR assets for flexibility. Shadow handling is particularly tricky -- car shadows must be masked from the static model and re-rendered appropriately.
4. What are the key metrics for validating a neural sim system, and which ones matter most?
Answer hint: Two levels: upstream (PSNR, SSIM, LPIPS, Chamfer distance) measure reconstruction fidelity, and downstream (perception mAP gap, tracking MOTA gap) measure whether the perception stack produces the same output on rendered vs. real data. Downstream metrics matter more -- a rendering could have mediocre PSNR but still produce identical perception outputs if the differences are in regions the detector ignores. Conversely, a high-PSNR rendering could have artifacts exactly in critical regions. The ultimate metric is: "Does the planner make the same decision on neural-rendered data as it would on real data?"
5. How does Applied Intuition's Neural Sim handle dynamic actors differently from purely neural approaches like NeuRAD?
Answer hint: Applied Intuition uses a hybrid approach: Gaussian Splatting for the static background but physics-based rendering (PBR) for dynamic actors. This means dynamic actors are high-quality 3D assets with physically-based materials, not neural reconstructions. Advantage: PBR actors can be placed in arbitrary new poses, animated, and lit correctly -- they are not limited to appearances seen in the training data. The trade-off is that you need a library of matched PBR assets, but this provides much greater flexibility for scenario editing. Purely neural approaches (NeuRAD, SplatAD) reconstruct actors as neural primitives, which are faithful to the training data but limited in how much they can be manipulated.
6. What causes the "novel view synthesis quality cliff" when deviating from the training trajectory, and how can it be mitigated?
Answer hint: Neural rendering works by interpolating between training views. When the novel viewpoint deviates significantly, the system must extrapolate, revealing: (1) regions never observed in training (e.g., behind parked cars), (2) under-constrained geometry (floaters, collapsed surfaces), and (3) view-dependent appearance not captured by limited SH coefficients. Mitigation strategies include: multi-traversal data (driving the same route multiple times from slightly different lanes), lidar depth supervision (constrains geometry even without visual coverage), diffusion-based inpainting (fills hallucinated regions), and conservative trajectory deviation limits (stay within 1-3m laterally). Some systems also use learned priors about common scene structures.
7. How do you generate realistic lidar point clouds from a 3D Gaussian scene?
Answer hint: For each lidar beam, generate a ray from the sensor origin in the beam's direction (accounting for the sensor's rotation during the sweep). Intersect the ray with nearby Gaussians by evaluating each Gaussian's contribution along the ray (using the Mahalanobis distance from the ray to the Gaussian center). Alpha-composite depth values to get the final range measurement. Then apply sensor modeling: beam divergence (the ray has non-zero width), intensity estimation (based on surface normal and material), ray dropping (learned or heuristic model for missing returns), and range noise. Motion compensation is critical -- the lidar rotates over ~100ms, so each azimuth angle corresponds to a slightly different ego pose.
8. Why is rolling shutter modeling important for neural rendering in driving scenes?
Answer hint: Automotive cameras use rolling shutters where each row is exposed at a slightly different time. At highway speeds (30 m/s), the ego vehicle moves ~1m during a single frame's readout time. If the neural renderer assumes a single pose per frame (global shutter), it produces a blurred, misaligned rendering. NeuRAD showed that modeling rolling shutter improves PSNR by 1-2 dB. The solution is to interpolate the ego pose for each row (or group of rows) and render each from its correct pose, then assemble the final image from these sub-renders.
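The per-row pose interpolation described in the hint can be sketched as follows, interpolating translation only between the poses at the first and last row (a full implementation would also interpolate rotation, e.g. with slerp; names are ours):

```python
import numpy as np

def rowwise_poses(pose_start, pose_end, image_height, n_groups=16):
    """Assign an interpolated ego pose to each rolling-shutter row group.

    Args:
        pose_start: (4, 4) ego pose when the first row is exposed
        pose_end: (4, 4) ego pose when the last row is exposed
        image_height: number of image rows
        n_groups: how many row groups to render separately
    Returns:
        list of (row_slice, (4, 4) pose), one entry per row group
    """
    out = []
    rows_per_group = int(np.ceil(image_height / n_groups))
    for g in range(n_groups):
        r0 = g * rows_per_group
        r1 = min(image_height, r0 + rows_per_group)
        if r0 >= image_height:
            break
        frac = (r0 + r1) / (2.0 * image_height)  # readout-time fraction at group center
        pose = pose_start.copy()
        pose[:3, 3] = (1 - frac) * pose_start[:3, 3] + frac * pose_end[:3, 3]
        out.append((slice(r0, r1), pose))
    return out
```

Each row group is rendered from its own pose and the sub-renders are stacked into the final image; more groups means less residual blur at higher rendering cost.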
9. What are the main challenges in scaling neural sim to fleet-level data (thousands of scenes)?
Answer hint: (1) Automation: every step (pose refinement, segmentation, masking, training, quality checking) must be fully automated with no manual tuning. (2) Robustness: some scenes have degenerate geometry, poor lighting, or unusual configurations that cause training to diverge. (3) Quality gating: automated metrics must identify failed reconstructions without human review of every scene. (4) Compute cost: training thousands of scenes requires efficient GPU scheduling and resource management. (5) Versioning: as the reconstruction pipeline improves, all scenes must be re-trained and re-validated. (6) Data diversity: the pipeline must handle highways, intersections, parking lots, construction zones, adverse weather, and night scenes.
10. Compare the trade-offs between using neural-reconstructed actors vs. PBR asset-matched actors for dynamic objects in simulation.
Answer hint:
| Aspect | Neural Actors | PBR Asset Actors |
|---|---|---|
| Fidelity to original | Very high | Approximate (matched) |
| Novel poses | Limited to training data | Arbitrary |
| Animation | Difficult | Standard 3D animation |
| Lighting consistency | Baked into reconstruction | Physically correct |
| Asset creation cost | Automated (from data) | Requires library + matching |
| Scenario editing | Limited | Full flexibility |
| Scalability | Easy (reconstruct from data) | Needs large asset library |
The industry trend is toward hybrid approaches: use neural backgrounds (which are hard to build by hand) and PBR actors (which need to be fully controllable). Applied Intuition exemplifies this with GS backgrounds + PBR actors.
References
Foundational Methods
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Mildenhall et al., ECCV 2020. arxiv.org/abs/2003.08934
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. Mueller et al., SIGGRAPH 2022. arxiv.org/abs/2201.05989
- 3D Gaussian Splatting for Real-Time Radiance Field Rendering. Kerbl et al., SIGGRAPH 2023. arxiv.org/abs/2308.04079
- Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. Barron et al., CVPR 2022. arxiv.org/abs/2111.12077
AD-Specific Neural Rendering
- NeuRAD: Neural Rendering for Autonomous Driving. Tonderski et al., CVPR 2024. arxiv.org/abs/2311.15260
- SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving. Hultman et al., 2025. arxiv.org/abs/2411.16816
- HUGSIM: A Real-Time, Photo-Realistic Closed-Loop Simulator for Autonomous Driving. Chen et al., 2024. arxiv.org/abs/2403.17712
- AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction. Khan et al., 2024. arxiv.org/abs/2407.02598
Earlier AD Neural Rendering
- Block-NeRF: Scalable Large Scene Neural View Synthesis. Tancik et al., CVPR 2022. arxiv.org/abs/2202.05263
- UniSim: A Neural Closed-Loop Sensor Simulator. Yang et al. (Waabi), CVPR 2023. arxiv.org/abs/2308.01898
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision. Yang et al., ICLR 2024. arxiv.org/abs/2311.02077
- MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving. Wu et al., CICAI 2023. arxiv.org/abs/2307.15058
- Street Gaussians for Modeling Dynamic Urban Scenes. Yan et al., 2024. arxiv.org/abs/2401.01339
- DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes. Zhou et al., CVPR 2024. arxiv.org/abs/2312.07920
Lidar-Specific Neural Rendering
- LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields. Tao et al., 2024. arxiv.org/abs/2304.10406
- Neural LiDAR Fields for Novel LiDAR View Synthesis. Huang et al., ICCV 2023. arxiv.org/abs/2305.01643
Surveys and Overviews
- Neural Rendering for Autonomous Driving: A Survey. Various authors, 2024. (Multiple survey papers cover this rapidly evolving landscape.)
- A Survey on 3D Gaussian Splatting. Chen et al., 2024. arxiv.org/abs/2401.03890
This deep dive covers the core technologies, key papers, and practical considerations for neural rendering in autonomous driving simulation. The field is evolving rapidly -- the transition from NeRF to 3DGS happened in under two years, and production systems like Applied Intuition's Neural Sim are already deploying these techniques at fleet scale. For engineers entering this space, hands-on experience with 3DGS training and rendering is the most valuable starting point.