Deep Dive #10 · 45 min read

Synthetic Data for Perception

Data generation pipelines, domain randomization, auto-labeling, and domain gap mitigation for perception training.

Synthetic Data for Autonomous Driving Perception Training: A Deep Dive

Focus: End-to-end synthetic data generation, domain adaptation, and ML training strategies for AD perception
Key Topics: Domain Randomization, FDA, CyCADA, Applied Intuition Synthetic Datasets, Mixed Training
Read Time: 60 min


Table of Contents

  1. Executive Summary
  2. Background and Motivation
  3. Synthetic Data Generation Pipeline
  4. Sensor Simulation for Data Generation
  5. Domain Gap and Mitigation
  6. Applied Intuition's Approach
  7. Best Practices for ML Training with Synthetic Data
  8. Code Examples
  9. Mental Models and Diagrams
  10. Hands-On Exercises
  11. Interview Questions
  12. References

Executive Summary

The Core Idea

Training perception models for autonomous driving requires massive labeled datasets. Real-world data collection costs $5-10 per labeled frame, rare objects like cyclists appear in less than 2% of frames, and edge cases (construction zones, adverse weather, unusual pedestrian behavior) are both dangerous and expensive to capture. Synthetic data -- generated entirely in simulation -- offers a path to unlimited, perfectly-labeled, arbitrarily-diverse training data at a fraction of the cost.

    THE SYNTHETIC DATA VALUE PROPOSITION
    =====================================

    Real Data Pipeline:                    Synthetic Data Pipeline:
    ┌──────────┐                           ┌──────────────┐
    │ Drive Car │ $500/hr                  │ Define Scene  │ ~$0
    └────┬─────┘                           └──────┬───────┘
         │                                        │
    ┌────▼─────┐                           ┌──────▼───────┐
    │ Transfer  │ $50/TB                   │  Render in   │ ~$0.01/frame
    │   Data    │                          │  Simulation  │ (GPU cost)
    └────┬─────┘                           └──────┬───────┘
         │                                        │
    ┌────▼─────┐                           ┌──────▼───────┐
    │  Manual   │ $5-10/frame              │ Auto-Label   │ ~$0
    │  Label    │ 3-6 week turnaround      │  (instant)   │ (free, perfect)
    └────┬─────┘                           └──────┬───────┘
         │                                        │
    ┌────▼─────┐                           ┌──────▼───────┐
    │  QA Pass  │ $2/frame                 │   Export to  │ ~$0
    │           │                          │   ML Format  │
    └────┬─────┘                           └──────┬───────┘
         │                                        │
    Cost: $6-12/frame                      Cost: ~$0.01/frame
    Time: weeks                            Time: hours
    Diversity: limited                     Diversity: unlimited
    Labels: ~95% accurate                  Labels: 100% accurate

Key Takeaways

  • Synthetic data can reduce real data requirements by up to 90% when combined with domain adaptation techniques.
  • The domain gap (visual and statistical differences between synthetic and real data) is the central challenge, but modern techniques (FDA, CyCADA, domain randomization) have made it manageable.
  • A practical strategy is mixed training: pre-train on synthetic data, fine-tune on a small amount of real data.
  • Auto-labeling in simulation provides perfect ground truth for 2D/3D bounding boxes, semantic segmentation, instance segmentation, depth maps, optical flow, and surface normals -- all for free.
  • Synthetic data excels at rare class upsampling: need more cyclists? Generate 100,000 cyclist scenarios in an afternoon.

Background and Motivation

The Cost of Real-World Data Collection

Building a production perception stack for autonomous driving requires training data at a scale that is difficult to appreciate until you see the numbers:

| Cost Component | Typical Cost | Notes |
| --- | --- | --- |
| Fleet operation (vehicles, drivers, fuel, insurance) | $500-1,000/hr per vehicle | Safety driver + operator |
| Data transfer and storage | $50-100/TB | Raw sensor data: 1-4 TB/hr |
| 2D bounding box annotation | $0.10-0.50/box | ~50-200 boxes per frame |
| 3D bounding box annotation | $1-5/box | Requires LiDAR point cloud tooling |
| Semantic segmentation (pixel-level) | $5-15/frame | Most expensive label type |
| Quality assurance | $1-3/frame | Multi-pass review |
| Total cost per fully-labeled frame | $6-12 | Camera + LiDAR labels |

For a dataset like nuScenes (390k frames) or Waymo Open (1.15M frames), the labeling cost alone runs into the millions. And these are relatively small compared to what production systems need.

Industry example: Cruise reportedly spent over $100M annually on data collection and labeling before pausing operations. Waymo's dataset investments span over a decade.

The Class Imbalance Problem

Real-world driving is dominated by common scenarios -- highway driving, following traffic, waiting at red lights. Rare but safety-critical objects are dramatically underrepresented:

    CLASS DISTRIBUTION IN TYPICAL DRIVING DATA
    ============================================

    Cars:          ████████████████████████████████████████  78%
    Trucks:        ████████                                  12%
    Pedestrians:   ████                                       5%
    Cyclists:      █                                          1.5%
    Motorcycles:   █                                          1.2%
    Animals:       ░                                          0.3%
    Construction:  ░                                          0.5%
    Wheelchairs:   ░                                          0.1%
    Scooters:      ░                                          0.4%

    PROBLEM: Missing a cyclist is catastrophic, but the model sees
    50x more cars than cyclists during training.

This creates a vicious cycle:

  1. The model rarely sees cyclists during training.
  2. It learns weak features for cyclist detection.
  3. It misses cyclists at inference time.
  4. Engineers try to collect more cyclist data, but cyclists are rare in most geographies.
  5. Even targeted collection campaigns yield limited diversity (same time of day, same location).

Synthetic data breaks this cycle entirely: you can generate exactly the distribution you need.

Why Synthetic Data Is Transformative

Synthetic data offers five fundamental advantages over real data:

1. Perfect Labels (Zero Label Noise)

In simulation, the ground truth is known exactly. Every pixel's class, every object's 3D bounding box, every surface normal, every depth value -- all computed analytically from the scene graph. No annotator disagreements, no missed objects behind occlusion, no mislabeled classes.

2. Unlimited Diversity on Demand

Want 10,000 frames of cyclists in rain at night on a four-lane road? Specify the parameters and render. Want to sweep across 100 lighting conditions? Parameterize the sun angle and cloud cover. Real data collection cannot achieve this level of controlled variation.

3. Perfect Reproducibility

Every synthetic frame can be regenerated with identical or systematically varied parameters. This enables controlled experiments: "How does detection performance change as we vary fog density from 0 to 1?"

4. Safety

Generating data for dangerous scenarios (near-collisions, pedestrians darting into traffic, vehicle rollovers) requires no actual danger. You can generate millions of safety-critical frames without risk.

5. Cost at Scale

After the initial investment in a simulation platform, the marginal cost per frame is dominated by GPU rendering time -- typically $0.005-0.02 per frame, orders of magnitude cheaper than real data.


Synthetic Data Generation Pipeline

The pipeline from "I need training data" to "here are labeled frames ready for ML training" involves several stages. Each stage involves design decisions that affect the quality, diversity, and downstream utility of the synthetic data.

    END-TO-END SYNTHETIC DATA GENERATION PIPELINE
    ===============================================

    ┌─────────────────────────────────────────────────────────────────┐
    │                     1. SCENE DEFINITION                        │
    │  "What scenarios do we want to generate?"                      │
    │                                                                │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
    │  │  Natural      │  │ Distribution │  │  Log-Based           │ │
    │  │  Language     │  │  Sampling    │  │  Extraction          │ │
    │  │  "Cyclist in  │  │  P(rain)=0.3 │  │  Replay real logs    │ │
    │  │   the rain"   │  │  P(night)=.2 │  │  with modifications  │ │
    │  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘ │
    │         └─────────────────┼──────────────────────┘             │
    │                           ▼                                    │
    │              Scene Configuration (JSON/Protobuf)               │
    └───────────────────────────┬─────────────────────────────────────┘
                                │
    ┌───────────────────────────▼─────────────────────────────────────┐
    │                  2. WORLD GENERATION                            │
    │  "Build the 3D environment"                                    │
    │                                                                │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
    │  │  Procedural   │  │  HD Map      │  │  Asset               │ │
    │  │  Roads/Cities │  │  Import      │  │  Placement           │ │
    │  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘ │
    │         └─────────────────┼──────────────────────┘             │
    │                           ▼                                    │
    │                   3D Scene Graph                                │
    └───────────────────────────┬─────────────────────────────────────┘
                                │
    ┌───────────────────────────▼─────────────────────────────────────┐
    │               3. DOMAIN RANDOMIZATION                          │
    │  "Add controlled variation"                                    │
    │                                                                │
    │  Lighting ─── Weather ─── Textures ─── Colors ─── Poses       │
    └───────────────────────────┬─────────────────────────────────────┘
                                │
    ┌───────────────────────────▼─────────────────────────────────────┐
    │              4. SENSOR SIMULATION                               │
    │  "Render what sensors would see"                               │
    │                                                                │
    │  Camera ──── LiDAR ──── Radar ──── Ultrasonic                  │
    └───────────────────────────┬─────────────────────────────────────┘
                                │
    ┌───────────────────────────▼─────────────────────────────────────┐
    │              5. AUTO-LABELING                                   │
    │  "Extract ground truth from scene graph"                       │
    │                                                                │
    │  2D Boxes ─ 3D Boxes ─ Segmentation ─ Depth ─ Optical Flow    │
    └───────────────────────────┬─────────────────────────────────────┘
                                │
    ┌───────────────────────────▼─────────────────────────────────────┐
    │              6. EXPORT                                          │
    │  "Package in ML-ready format"                                  │
    │                                                                │
    │  nuScenes ──── KITTI ──── COCO ──── Custom Protobuf            │
    └─────────────────────────────────────────────────────────────────┘

Stage 1: Scene Definition

Scene definition is the process of specifying what to generate. There are three primary approaches:

Natural Language Specification

Modern platforms (including Applied Intuition) support natural language scene descriptions that are parsed into structured scenario specifications:

Input:  "A busy urban intersection at dusk with moderate rain.
         Two cyclists crossing from the left, one pedestrian
         with an umbrella on the right sidewalk. Three parked
         cars, one delivery truck double-parked."

Parsed Scene Config:
{
  "environment": {
    "time_of_day": "dusk",
    "weather": {"type": "rain", "intensity": 0.5},
    "location": "urban_intersection"
  },
  "actors": [
    {"type": "cyclist", "count": 2, "spawn": "left_crosswalk", "behavior": "crossing"},
    {"type": "pedestrian", "count": 1, "spawn": "right_sidewalk", "props": ["umbrella"]},
    {"type": "car", "count": 3, "state": "parked", "spawn": "parallel_parking"},
    {"type": "truck", "subtype": "delivery", "count": 1, "state": "double_parked"}
  ]
}

This approach lowers the barrier to entry -- scenario designers do not need to write JSON or code. Under the hood, an LLM or rule-based parser converts the description into a structured format.

Distribution-Based Sampling

For large-scale dataset generation, you define probability distributions over scene parameters and sample from them:

scene_distribution = {
    "time_of_day": Uniform(0, 24),              # hours
    "weather": Categorical({
        "clear": 0.4, "cloudy": 0.25,
        "rain": 0.2, "fog": 0.1, "snow": 0.05
    }),
    "num_vehicles": Poisson(lam=8),
    "num_pedestrians": Poisson(lam=3),
    "num_cyclists": Poisson(lam=1.5),            # Upsampled!
    "road_type": Categorical({
        "highway": 0.2, "urban": 0.4,
        "suburban": 0.3, "rural": 0.1
    }),
    "ego_speed_kph": TruncatedNormal(mu=40, sigma=20, low=0, high=130)
}

Notice how num_cyclists uses Poisson(lam=1.5) -- this is intentionally higher than the real-world distribution to oversample this rare class.
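
A minimal runnable version of this sampler, using plain numpy in place of the illustrative Uniform/Categorical/Poisson wrappers above (truncation of the speed distribution is approximated by clipping), might look like:

import numpy as np

rng = np.random.default_rng(seed=42)

WEATHER_TYPES, WEATHER_PROBS = ["clear", "cloudy", "rain", "fog", "snow"], [0.4, 0.25, 0.2, 0.1, 0.05]
ROAD_TYPES, ROAD_PROBS = ["highway", "urban", "suburban", "rural"], [0.2, 0.4, 0.3, 0.1]

def sample_scene_config():
    """Draw one scene configuration from the distributions above."""
    return {
        "time_of_day": float(rng.uniform(0, 24)),
        "weather": str(rng.choice(WEATHER_TYPES, p=WEATHER_PROBS)),
        "num_vehicles": int(rng.poisson(8)),
        "num_pedestrians": int(rng.poisson(3)),
        "num_cyclists": int(rng.poisson(1.5)),                        # intentionally oversampled
        "road_type": str(rng.choice(ROAD_TYPES, p=ROAD_PROBS)),
        "ego_speed_kph": float(np.clip(rng.normal(40, 20), 0, 130)),  # clipped, not a true truncated normal
    }

configs = [sample_scene_config() for _ in range(10_000)]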

Log-Based Extraction

The most realistic approach: replay real-world sensor logs and modify them. Extract the scene structure from a real driving log (vehicle positions, road layout, timing), then:

  • Swap vehicle models with different ones
  • Change weather and lighting
  • Add or remove actors
  • Modify actor trajectories

This preserves the realistic spatial relationships and traffic patterns from real driving while enabling controlled variation.
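
As a toy sketch of log-based variation, assuming a real log has already been converted into the JSON-style scene config shown earlier (the specific keys and spawn/behavior values are illustrative):

import copy
import random

def mutate_logged_scene(scene_config, rng=None):
    """Create a controlled variation of a scene config extracted from a real log."""
    rng = rng or random.Random(0)
    variant = copy.deepcopy(scene_config)

    # Change weather and lighting while keeping actor placement and timing.
    variant["environment"]["weather"] = {
        "type": rng.choice(["rain", "fog", "clear"]),
        "intensity": round(rng.uniform(0.2, 0.9), 2),
    }
    variant["environment"]["time_of_day"] = rng.choice(["dawn", "dusk", "night"])

    # Add a rare-class actor without touching the original trajectories.
    variant["actors"].append(
        {"type": "cyclist", "count": 1, "spawn": "right_lane", "behavior": "riding"}
    )
    return variant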

Stage 2: Procedural World Generation

Procedural generation creates the 3D environment programmatically rather than by manual authoring.

Road Networks

Road networks are typically generated from:

  • OpenDRIVE files: Industry-standard road description format
  • HD Maps: Production-grade maps from mapping providers
  • Procedural grammars: L-systems or graph-based generation for arbitrary road topologies
    PROCEDURAL ROAD GENERATION EXAMPLE
    ====================================

    Grammar Rules:
      CITY      -> BLOCK+ INTERSECTION+
      BLOCK     -> ROAD BUILDINGS SIDEWALK
      ROAD      -> LANES MARKINGS CURBS
      LANES     -> LANE+ MEDIAN?
      INTERSECTION -> ROADS[4] SIGNAL CROSSWALKS

    Generated Layout:
    ┌──────────┬──────────┬──────────┐
    │ ████████ │ ████████ │ ████████ │
    │ █ Bldg █ │ █ Bldg █ │ █ Park █ │
    │ ████████ │ ████████ │ ████████ │
    ├══════════╬══════════╬══════════┤  ═══ Road
    │ ████████ │ ████████ │ ████████ │  ███ Building
    │ █ Bldg █ │ █ Mall █ │ █ Bldg █ │  ╬   Intersection
    │ ████████ │ ████████ │ ████████ │
    ├══════════╬══════════╬══════════┤
    │ ████████ │ ████████ │ ████████ │
    │ █ Bldg █ │ █ Bldg █ │ █ Bldg █ │
    │ ████████ │ ████████ │ ████████ │
    └──────────┴──────────┴──────────┘

Environment Details

Beyond roads, procedural generation handles:

  • Vegetation: Trees, bushes, grass with seasonal variation
  • Urban furniture: Street lights, fire hydrants, mailboxes, benches
  • Signage: Road signs, billboards, storefront signs
  • Ground surfaces: Asphalt textures, potholes, manhole covers, painted markings

Stage 3: Domain Randomization

Domain randomization is the deliberate introduction of visual and geometric variation into synthetic scenes so that a model trained on this data generalizes to the real world. The core insight: if the model sees enough variation in training, the real world becomes "just another variation."

Structured Domain Randomization (SDR)

SDR constrains randomization to physically plausible ranges:

| Parameter | Range | Rationale |
| --- | --- | --- |
| Sun elevation | 5-85 degrees | Realistic solar angles |
| Sun azimuth | 0-360 degrees | Full compass range |
| Cloud cover | 0-100% | Controls ambient lighting |
| Road wetness | 0-1 | Affects reflections and LiDAR returns |
| Vehicle color | From real-world distribution | Blue cars more common than pink |
| Pedestrian clothing | Season-appropriate palettes | Winter coats in December |
| Camera exposure | +/- 1 stop from nominal | Realistic exposure variation |

Unstructured (Full) Domain Randomization

Full domain randomization ignores physical plausibility and randomizes everything:

  • Random textures on all surfaces (including checkerboard patterns on roads)
  • Random colors for vehicles and pedestrians
  • Random lighting from arbitrary directions
  • Random camera noise profiles

This sounds counterproductive, but Tobin et al. (2017) demonstrated that training on wildly randomized synthetic data can produce models that transfer to the real world: the randomization forces the model to learn shape-based features rather than texture-based shortcuts.

    STRUCTURED vs. UNSTRUCTURED DOMAIN RANDOMIZATION
    ==================================================

    Structured DR:                    Unstructured DR:
    ┌─────────────────┐              ┌─────────────────┐
    │  Realistic       │              │  Random textures │
    │  lighting        │              │  on everything   │
    │  Real car colors │              │  Random colors   │
    │  Proper shadows  │              │  Bizarre lighting│
    │  Weather models  │              │  No physics      │
    └─────────────────┘              └─────────────────┘
           │                                │
           ▼                                ▼
    Learns texture+shape             Learns shape only
    features                         features
           │                                │
           ▼                                ▼
    Better on similar domains        Better generalization
    Worse on novel domains           across all domains

    IN PRACTICE: Use Structured DR with occasional Unstructured elements

Stage 4: Asset Libraries

High-quality 3D assets are the building blocks of synthetic scenes. A production asset library includes:

Vehicles (typically 50-200 unique models):

  • Sedans, SUVs, trucks, vans, buses, motorcycles, bicycles
  • Each with multiple color/texture variants
  • Articulated parts: wheels, doors, turn signals
  • Damage variants for crash scenarios

Pedestrians (typically 100-500 unique models):

  • Diverse body types, ages, ethnicities, clothing
  • Animated walk cycles, running, standing, sitting
  • Props: umbrellas, backpacks, strollers, shopping bags, wheelchairs
  • Seasonal clothing variants

Environmental Props (thousands):

  • Traffic signs (region-specific: US, EU, Asia)
  • Traffic lights, cones, barriers
  • Street furniture, vegetation, buildings

The quality of assets directly impacts the domain gap. Modern asset pipelines use photogrammetry (scanning real objects) and PBR (Physically Based Rendering) materials to achieve high realism.


Sensor Simulation for Data Generation

The sensor simulation layer converts the 3D scene into sensor-specific outputs that mimic what real sensors would produce. This is where much of the domain gap originates, so fidelity matters enormously.

Camera Rendering

Camera simulation must produce images that are statistically similar to real camera images. Two primary rendering approaches are used:

Ray Tracing

Ray tracing traces light paths from the camera through each pixel into the scene, computing physically-accurate reflections, refractions, shadows, and global illumination.

    RAY TRACING FOR CAMERA SIMULATION
    ===================================

    Camera                    Scene
      │
      │    Primary Ray
      ├──────────────────────► Hit surface A
      │                          │
      │                          ├── Shadow ray ──► Light (visible? shadow?)
      │                          │
      │                          ├── Reflection ray ──► Hit surface B
      │                          │                        │
      │                          │                        └── Shadow ray ──► Light
      │                          │
      │                          └── Refraction ray ──► Hit surface C (glass)
      │
      │    Monte Carlo Integration:
      │    Pixel color = Integral of (BRDF * Incoming Light * cos(theta)) dw
      │
      │    Typical: 64-256 samples per pixel for production quality
      │    Real-time: 1-4 samples per pixel with denoising

Advantages: Physically accurate lighting, reflections, caustics, global illumination. Disadvantages: Computationally expensive -- 0.5-5 seconds per frame at high quality.

Modern GPU ray tracers (NVIDIA OptiX, Vulkan RT) enable real-time ray tracing with hardware acceleration, making this practical for large-scale data generation.

Rasterization

Rasterization projects 3D triangles onto the 2D image plane and fills pixels using shader programs. This is the traditional game-engine approach (Unreal Engine, Unity).

Advantages: Very fast -- 60+ FPS at high resolution. Mature tooling. Disadvantages: Approximates lighting rather than simulating it physically. Screen-space reflections, baked shadows, and other tricks introduce systematic artifacts.

Camera Artifacts Modeling

Beyond basic rendering, production camera simulation models real sensor artifacts:

| Artifact | How It Is Simulated | Why It Matters |
| --- | --- | --- |
| Rolling shutter | Per-scanline temporal offset | Fast-moving objects appear sheared |
| Motion blur | Temporal integration over exposure time | Moving objects are blurred |
| Lens distortion | Barrel/pincushion/fisheye warping | Wide-angle cameras have significant distortion |
| Chromatic aberration | Per-channel focal length offset | Color fringing at image edges |
| Bloom/flare | Post-processing convolution | Bright lights create halos |
| Noise | Poisson-Gaussian noise model | Low-light images are noisy |
| Auto-exposure | Metering + response curve | Exposure varies with scene brightness |
| Vignetting | Radial brightness falloff | Corners are darker |
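
A minimal post-processing sketch for two of these artifacts, shot/read noise and vignetting, applied to a rendered (H, W, 3) float image in [0, 1] (parameter values are illustrative):

import numpy as np

def add_poisson_gaussian_noise(img, photon_scale=1000.0, read_sigma=0.005, rng=None):
    """Shot noise (Poisson) plus read noise (Gaussian) on a [0, 1] float image."""
    rng = rng or np.random.default_rng(0)
    shot = rng.poisson(img * photon_scale) / photon_scale   # signal-dependent shot noise
    read = rng.normal(0.0, read_sigma, size=img.shape)      # signal-independent read noise
    return np.clip(shot + read, 0.0, 1.0)

def apply_vignetting(img, strength=0.4):
    """Radial brightness falloff toward the image corners (img is (H, W, 3))."""
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized distance from the image center (0 at center, ~1 at the corners)
    r = np.sqrt(((yy - h / 2) / (h / 2)) ** 2 + ((xx - w / 2) / (w / 2)) ** 2) / np.sqrt(2)
    return np.clip(img * (1.0 - strength * r ** 2)[..., None], 0.0, 1.0)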

LiDAR Simulation

LiDAR (Light Detection and Ranging) simulation must model the physics of laser pulses bouncing off surfaces and returning to the sensor.

Beam Physics

    LIDAR BEAM SIMULATION
    ======================

    Transmitter ──────── Laser Pulse ──────────► Surface
         │                                         │
         │              Time of Flight              │
         │◄──────────── Reflected Pulse ◄──────────┘
         │
         │  Distance = (speed_of_light * time_of_flight) / 2
         │
         │  For each beam:
         │    - Cast ray from sensor origin at (azimuth, elevation) angle
         │    - Find intersection with scene geometry
         │    - Compute range, intensity, and return count
         │
    Typical LiDAR:
         │    - 64-128 beams (vertical channels)
         │    - 360 degree horizontal sweep
         │    - 10-20 Hz rotation rate
         │    - 100,000-300,000 points per sweep
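
A small sketch of the beam-casting setup described above, generating unit ray directions for one sweep of a hypothetical 64-beam spinning LiDAR (the vertical field of view and angular resolution are illustrative):

import numpy as np

def lidar_ray_directions(num_beams=64, horiz_steps=1800, vert_fov_deg=(-25.0, 15.0)):
    """Unit ray directions (sensor frame) for one 360-degree sweep."""
    elev = np.radians(np.linspace(vert_fov_deg[0], vert_fov_deg[1], num_beams))
    azim = np.radians(np.linspace(0.0, 360.0, horiz_steps, endpoint=False))
    az, el = np.meshgrid(azim, elev)                  # each (num_beams, horiz_steps)
    dirs = np.stack([np.cos(el) * np.cos(az),         # x (forward)
                     np.cos(el) * np.sin(az),         # y (left)
                     np.sin(el)], axis=-1)            # z (up)
    return dirs.reshape(-1, 3)                        # 64 * 1800 = 115,200 rays per sweep

rays = lidar_ray_directions()
# Each ray would then be intersected with scene geometry to produce (range, intensity).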

Intensity Modeling

LiDAR intensity depends on:

  • Surface material: Retroreflective signs return very high intensity; dark asphalt returns low intensity.
  • Angle of incidence: Surfaces hit at steep angles return less energy.
  • Range: Intensity drops with the square of distance (1/r^2 law).
  • Surface roughness: Rough surfaces scatter light; smooth surfaces have specular reflection.

import numpy as np

def compute_lidar_intensity(material, distance, incidence_angle):
    """Simplified LiDAR intensity model."""
    # Material reflectivity (0-1)
    rho = material.reflectivity

    # Lambertian falloff with angle of incidence
    cos_factor = max(0.0, np.cos(incidence_angle))

    # Range-squared falloff (1/r^2 law)
    range_factor = 1.0 / (distance ** 2 + 1e-6)

    # Atmospheric attenuation (Beer-Lambert law)
    atm_factor = np.exp(-material.extinction_coeff * distance)

    intensity = rho * cos_factor * range_factor * atm_factor
    return np.clip(intensity, 0, 255)

Realistic Effects

Production LiDAR simulation also models:

  • Multi-return: A single beam can return multiple echoes (e.g., hitting a tree canopy and then the ground).
  • Rain/fog dropouts: Water droplets in the air cause false returns and attenuate the beam.
  • Beam divergence: The laser beam is not infinitely thin; it has a cone angle that causes range smearing at distance.
  • Motion compensation: The sensor rotates while the vehicle moves, causing per-point motion distortion.

Radar Simulation

Radar simulation is particularly challenging due to the complexity of electromagnetic wave propagation at millimeter wavelengths.

Radar Cross Section (RCS)

RCS quantifies how much radar energy an object reflects back toward the sensor. It depends strongly on object shape, material, and angle:

| Object | Typical RCS (dBsm) | Notes |
| --- | --- | --- |
| Pedestrian | -5 to 5 | Highly variable with pose |
| Bicycle | -5 to 0 | Small metal frame |
| Car (broadside) | 10 to 20 | Large flat surfaces |
| Car (head-on) | 0 to 10 | Smaller cross-section |
| Truck | 15 to 25 | Largest road users |
| Guardrail | 5 to 15 per meter | Extended target |
| Traffic sign | 10 to 30 | Retroreflective corner reflectors |
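
Simulators typically convert the table's dBsm values to linear square meters and feed them into the monostatic radar range equation; a small sketch with illustrative transmit power and antenna gain:

import numpy as np

def dbsm_to_m2(rcs_dbsm):
    """Convert radar cross section from dBsm to square meters."""
    return 10.0 ** (rcs_dbsm / 10.0)

def received_power(rcs_dbsm, range_m, p_tx_w=10.0, gain_db=25.0, f_carrier_hz=77e9):
    """Monostatic radar range equation: P_r = P_t G^2 lambda^2 sigma / ((4 pi)^3 R^4)."""
    gain = 10.0 ** (gain_db / 10.0)
    wavelength = 3e8 / f_carrier_hz
    sigma = dbsm_to_m2(rcs_dbsm)
    return p_tx_w * gain ** 2 * wavelength ** 2 * sigma / ((4 * np.pi) ** 3 * range_m ** 4)

# A broadside car (~15 dBsm) returns roughly 30x more power than a pedestrian
# (~0 dBsm) at the same range:
print(received_power(15.0, 50.0) / received_power(0.0, 50.0))   # ~31.6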

Multipath and Clutter

Radar signals bounce off multiple surfaces before returning:

    RADAR MULTIPATH EXAMPLE
    ========================

                   Direct Path
    Radar ─────────────────────────────── Target
      │                                      │
      │        Reflected Path                │
      │───────────► Ground ──────────────────┘
      │               │
      │               │  Ghost Detection
      │               └──────── Appears at wrong range/angle
      │
      │        Multi-bounce
      │───► Wall ──► Target ──► Wall ──► Radar
      │
      │  Results: ghost targets, extended targets,
      │  range/angle errors, clutter

Doppler Simulation

Radar uniquely measures radial velocity through the Doppler effect:

    f_doppler = 2 * v_radial * f_carrier / c

    For a 77 GHz radar and a target approaching at 30 m/s:
    f_doppler = 2 * 30 * 77e9 / 3e8 = 15,400 Hz = 15.4 kHz

This provides velocity information that cameras and LiDAR cannot directly measure, making radar simulation valuable for tracking and prediction tasks.
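
The same calculation as a helper function, e.g. for attaching per-detection Doppler values to simulated radar returns:

SPEED_OF_LIGHT = 3e8  # m/s

def doppler_shift_hz(v_radial_mps, f_carrier_hz=77e9):
    """Doppler frequency shift for a target with the given radial (closing) velocity."""
    return 2.0 * v_radial_mps * f_carrier_hz / SPEED_OF_LIGHT

print(doppler_shift_hz(30.0))  # 15400.0 Hz, matching the worked example above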

Auto-Labeling

The most significant advantage of synthetic data: perfect labels are free. Because the simulator knows the exact state of every object, labels are computed analytically:

Label Types and How They Are Generated

| Label Type | Generation Method | Typical Use |
| --- | --- | --- |
| 2D bounding boxes | Project 3D box corners to image plane, compute enclosing rectangle | Object detection (YOLO, Faster R-CNN) |
| 3D bounding boxes | Read directly from scene graph (position, size, orientation) | 3D detection (PointPillars, CenterPoint) |
| Semantic segmentation | Render with per-object material IDs, map to class labels | Per-pixel classification (DeepLab) |
| Instance segmentation | Render with unique per-instance IDs | Instance-level understanding (Mask R-CNN) |
| Panoptic segmentation | Combine semantic + instance maps | Unified scene understanding |
| Depth maps | Z-buffer from rendering or ray-traced distance | Monocular depth estimation |
| Optical flow | Compute per-pixel displacement between consecutive frames | Motion estimation |
| Surface normals | Read from geometry during rendering | Scene geometry understanding |
| Occlusion maps | Multi-pass rendering with/without target objects | Handling partial visibility |

# Auto-labeling pseudocode
from collections import defaultdict

def generate_labels(scene_graph, camera):
    labels = defaultdict(list)   # list-valued keys are created on first use

    for obj in scene_graph.objects:
        # 3D bounding box (directly from scene graph)
        labels["3d_boxes"].append({
            "class": obj.semantic_class,
            "center": obj.position,          # (x, y, z) in world frame
            "size": obj.bounding_box_size,    # (length, width, height)
            "rotation": obj.orientation,      # quaternion
            "velocity": obj.velocity,         # (vx, vy, vz)
            "instance_id": obj.id
        })

        # 2D bounding box (project to image)
        corners_3d = obj.get_3d_corners()                # 8 corners
        corners_2d = camera.project(corners_3d)           # project to image
        if any_visible(corners_2d, camera.image_size):
            x_min, y_min = corners_2d.min(axis=0)
            x_max, y_max = corners_2d.max(axis=0)
            labels["2d_boxes"].append({
                "class": obj.semantic_class,
                "bbox": [x_min, y_min, x_max, y_max],
                "occlusion": compute_occlusion(obj, scene_graph, camera),
                "truncation": compute_truncation(corners_2d, camera.image_size)
            })

    # Segmentation and dense labels (from dedicated render passes)
    labels["semantic_seg"] = render_semantic_map(scene_graph, camera)
    labels["instance_seg"] = render_instance_map(scene_graph, camera)
    labels["depth"] = render_depth_map(scene_graph, camera)
    labels["optical_flow"] = compute_optical_flow(scene_graph, camera, dt=0.1)

    return labels

Domain Gap and Mitigation

The domain gap is the central challenge of synthetic data. Even high-fidelity simulation produces data that differs from real-world sensor data in systematic ways. Models trained purely on synthetic data typically suffer a 10-30% performance drop when evaluated on real data compared to models trained on real data.

Types of Domain Gap

Data Domain Gap (Visual/Statistical Differences)

The data domain gap refers to differences in the visual appearance and statistical properties of synthetic vs. real images:

    DATA DOMAIN GAP
    ================

    Synthetic Image:                    Real Image:
    ┌─────────────────────┐            ┌─────────────────────┐
    │                     │            │                     │
    │  - Clean textures   │            │  - Weathered, dirty │
    │  - Perfect lighting │            │  - Complex lighting │
    │  - Sharp edges      │    GAP     │  - Sensor noise     │
    │  - No artifacts     │ ◄────────► │  - Motion blur      │
    │  - Uniform surfaces │            │  - Lens artifacts   │
    │  - Limited assets   │            │  - Infinite variety │
    │                     │            │                     │
    └─────────────────────┘            └─────────────────────┘

    Statistical Differences:
    - Pixel intensity distributions differ
    - Texture frequency spectra differ
    - Color palette biases
    - Edge sharpness distributions
    - Noise characteristics

Label Domain Gap (Annotation Style Differences)

Even with perfect synthetic labels, there is a subtler gap: the style of labels may differ from human annotations:

  • Bounding box tightness: Synthetic boxes are pixel-perfect; human annotators leave variable margins.
  • Occlusion handling: Simulation provides exact occlusion percentages; human annotators estimate.
  • Class boundary ambiguity: Is a delivery scooter a "motorcycle" or a "bicycle"? Simulation assigns labels based on asset metadata; human annotators interpret guidelines differently.
  • Minimum size thresholds: Simulation labels all objects regardless of pixel size; real datasets have minimum size thresholds (e.g., "do not label if smaller than 20px").

Fourier Domain Adaptation (FDA)

FDA (Yang and Soatto, 2020) is an elegant, training-free approach to reducing the domain gap. The key insight: the low-frequency components of a Fourier-transformed image encode style (colors, lighting, textures), while high-frequency components encode structure (edges, shapes).

How FDA Works

  1. Take a synthetic image and a real image.
  2. Compute the 2D FFT (Fast Fourier Transform) of both.
  3. Replace the low-frequency amplitude spectrum of the synthetic image with that of the real image.
  4. Inverse FFT to get a "style-transferred" synthetic image that has real-world colors and lighting but synthetic structure and labels.
    FOURIER DOMAIN ADAPTATION (FDA)
    =================================

    Step 1: FFT of both images
    ┌──────────────┐         ┌──────────────┐
    │   Synthetic   │         │     Real     │
    │    Image      │         │    Image     │
    └──────┬───────┘         └──────┬───────┘
           │ FFT                    │ FFT
           ▼                        ▼
    ┌──────────────┐         ┌──────────────┐
    │  Amp_s  Ph_s │         │  Amp_r  Ph_r │
    │  (style)(struct)│      │  (style)(struct)│
    └──────┬───────┘         └──────┬───────┘
           │                        │
    Step 2: Swap low-frequency amplitudes
           │                        │
           │    ┌────────────┐      │
           └───►│ Replace    │◄─────┘
                │ low-freq   │
                │ Amp_s with │
                │ Amp_r      │
                └─────┬──────┘
                      │
    Step 3: Inverse FFT
                      ▼
               ┌──────────────┐
               │   Adapted    │
               │   Synthetic  │
               │   Image      │
               │              │
               │ Real style + │
               │ Synth content│
               └──────────────┘

FDA Implementation

import numpy as np

def fda_transfer(source_img, target_img, beta=0.01):
    """
    Fourier Domain Adaptation: transfer low-frequency style
    from target (real) to source (synthetic) image.

    Args:
        source_img: synthetic image, shape (H, W, 3), float32 [0, 1]
        target_img: real image, shape (H, W, 3), float32 [0, 1]
        beta: fraction of low-frequency spectrum to replace (0.01 - 0.1)

    Returns:
        Adapted image with real-world style, synthetic content
    """
    # Work in frequency domain per channel
    result = np.zeros_like(source_img)

    for c in range(3):  # RGB channels
        # Step 1: 2D FFT
        fft_source = np.fft.fft2(source_img[:, :, c])
        fft_target = np.fft.fft2(target_img[:, :, c])

        # Shift zero frequency to center
        fft_source_shifted = np.fft.fftshift(fft_source)
        fft_target_shifted = np.fft.fftshift(fft_target)

        # Get amplitude and phase
        amp_source = np.abs(fft_source_shifted)
        phase_source = np.angle(fft_source_shifted)
        amp_target = np.abs(fft_target_shifted)

        # Step 2: Create low-frequency mask
        h, w = source_img.shape[:2]
        cy, cx = h // 2, w // 2
        # beta controls the size of the low-frequency region
        rh, rw = int(beta * h), int(beta * w)

        # Replace low-frequency amplitudes
        amp_adapted = amp_source.copy()
        amp_adapted[cy-rh:cy+rh, cx-rw:cx+rw] = \
            amp_target[cy-rh:cy+rh, cx-rw:cx+rw]

        # Step 3: Recombine and inverse FFT
        fft_adapted = amp_adapted * np.exp(1j * phase_source)
        fft_adapted = np.fft.ifftshift(fft_adapted)
        result[:, :, c] = np.real(np.fft.ifft2(fft_adapted))

    # Clip to valid range
    result = np.clip(result, 0, 1)
    return result

Key parameter: beta controls how much of the frequency spectrum to transfer.

  • beta=0.01: Subtle color/brightness transfer. Safe, minimal artifacts.
  • beta=0.05: Moderate style transfer. Good balance.
  • beta=0.1: Aggressive transfer. May introduce artifacts.

Advantages of FDA:

  • No training required (no GAN, no network to train)
  • Fast (milliseconds per image)
  • Can be applied as a data augmentation step during training
  • Preserves spatial structure and labels perfectly

Limitations:

  • Only transfers global style, not local texture details
  • Requires access to a pool of real images for style reference
  • Cannot fix structural domain gaps (e.g., unrealistic object shapes)
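
Because it is training-free and fast, FDA is typically applied on the fly as a data augmentation. A minimal sketch, assuming the fda_transfer function above and a pool of real reference images already resized to the synthetic image resolution:

import random
import numpy as np

def fda_augment(synthetic_img, real_image_pool, beta_range=(0.01, 0.05), rng=None):
    """Restyle a synthetic image with a randomly chosen real image's low frequencies."""
    rng = rng or random.Random(0)
    target = rng.choice(real_image_pool)          # real style reference, same (H, W, 3)
    beta = rng.uniform(*beta_range)               # vary transfer strength per sample
    return fda_transfer(synthetic_img, target, beta=beta).astype(np.float32)

# Inside a dataset or dataloader transform, apply with some probability so the
# model also sees unmodified synthetic frames:
# img = fda_augment(img, real_pool) if random.random() < 0.5 else img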

CyCADA: Cycle-Consistent Adversarial Domain Adaptation

CyCADA (Hoffman et al., 2018) uses a learned image-to-image translation network to transform synthetic images to look like real images while preserving semantic content.

Architecture

    CyCADA ARCHITECTURE
    =====================

    Synthetic Domain (S)                      Real Domain (R)
    ┌──────────┐                              ┌──────────┐
    │ Synthetic │                              │   Real   │
    │  Images   │                              │  Images  │
    └────┬─────┘                              └────┬─────┘
         │                                         │
         ▼                                         ▼
    ┌────────────┐    Cycle Consistency      ┌────────────┐
    │ Generator  │ ──────────────────────── │ Generator  │
    │  G_S->R    │    G_R(G_S(x_s)) ~ x_s  │  G_R->S    │
    │            │ ◄──────────────────────── │            │
    └────┬───────┘                          └────┬───────┘
         │ Fake real                              │ Fake synthetic
         ▼                                        ▼
    ┌────────────┐                          ┌────────────┐
    │Discriminator│                          │Discriminator│
    │   D_R      │ "Is this real or fake?"  │   D_S      │
    └────────────┘                          └────────────┘

    Additional Losses:
    ┌─────────────────────────────────────────────────────┐
    │ 1. Adversarial Loss: Fool discriminators            │
    │ 2. Cycle Consistency: Reconstruct original image    │
    │ 3. Semantic Consistency: Preserve class labels      │
    │ 4. Feature Matching: Match intermediate features    │
    └─────────────────────────────────────────────────────┘

Loss Functions

CyCADA combines four loss terms:

L_total = L_adversarial + lambda_cyc * L_cycle + lambda_sem * L_semantic + lambda_feat * L_feature

Where:
  L_adversarial = E[log D_R(x_r)] + E[log(1 - D_R(G_S->R(x_s)))]   (standard GAN loss)
  L_cycle       = E[||G_R->S(G_S->R(x_s)) - x_s||_1]                (cycle consistency)
  L_semantic    = E[CE(f(G_S->R(x_s)), y_s)]                         (preserve labels)
  L_feature     = E[||feat(G_S->R(x_s)) - feat(x_r)||_2]            (perceptual similarity)

The semantic consistency loss is critical: it ensures that when a synthetic image of a "car" is translated to look real, the translated image is still classified as a "car" by a pre-trained classifier. Without this, the GAN might change semantic content (e.g., turning a car into a truck).
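
A schematic PyTorch sketch of the generator-side loss for the S->R direction (the reverse direction and the feature-matching term are omitted for brevity; g_s2r, g_r2s, d_r, and a task_net with frozen weights are assumed to be defined elsewhere -- this is not the reference implementation):

import torch
import torch.nn.functional as F

def cycada_s2r_generator_loss(x_s, y_s, g_s2r, g_r2s, d_r, task_net,
                              lam_cyc=10.0, lam_sem=1.0):
    """Generator-side loss for the synthetic->real direction (schematic sketch)."""
    fake_r = g_s2r(x_s)                      # synthetic image translated to "real" style
    logits = d_r(fake_r)

    # 1. Adversarial term: fool the real-domain discriminator
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # 2. Cycle-consistency term: S -> R -> S should reconstruct the input
    cyc = F.l1_loss(g_r2s(fake_r), x_s)

    # 3. Semantic-consistency term: a task network with frozen weights should
    #    still predict the synthetic labels on the translated image
    sem = F.cross_entropy(task_net(fake_r), y_s)

    return adv + lam_cyc * cyc + lam_sem * sem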

CyCADA vs. FDA Comparison

| Aspect | FDA | CyCADA |
| --- | --- | --- |
| Training required | No | Yes (GAN training) |
| Speed (inference) | ~5ms/image | ~50ms/image |
| Style transfer quality | Global only | Local + global |
| Label preservation | Perfect | Near-perfect (with semantic loss) |
| Implementation complexity | ~20 lines | ~1000+ lines |
| Failure modes | Minimal | Mode collapse, artifacts |
| Data requirements | Pool of real images | Paired or unpaired domains |

Mixed Training and Fine-Tuning Strategies

In practice, the most effective approach combines synthetic and real data rather than using synthetic data alone.

Strategy 1: Pre-train Synthetic, Fine-tune Real

    PRE-TRAIN + FINE-TUNE STRATEGY
    ================================

    Phase 1: Pre-train on synthetic data (large scale)
    ┌─────────────────────────────────────────────┐
    │  100,000 synthetic frames                   │
    │  Learn general features: edges, shapes,     │
    │  spatial relationships, object categories    │
    │  Epochs: 50-100                             │
    └─────────────────────┬───────────────────────┘
                          │
                          ▼
    Phase 2: Fine-tune on real data (small scale)
    ┌─────────────────────────────────────────────┐
    │  5,000-10,000 real frames                   │
    │  Adapt to real sensor characteristics,      │
    │  close the domain gap, calibrate confidence │
    │  Epochs: 10-30, lower learning rate         │
    └─────────────────────────────────────────────┘

This is the most common strategy and typically yields 80-95% of the performance of a model trained on 10x more real data.

Strategy 2: Mixed Training (Joint)

Train on a mixture of synthetic and real data simultaneously:

# Mixed training data loader
real_loader = DataLoader(real_dataset, batch_size=16, shuffle=True)
synth_loader = DataLoader(synth_dataset, batch_size=48, shuffle=True)

for epoch in range(num_epochs):
    for real_batch, synth_batch in zip(real_loader, synth_loader):
        # Combine batches (1:3 real-to-synthetic ratio)
        images = torch.cat([real_batch.images, synth_batch.images])
        labels = torch.cat([real_batch.labels, synth_batch.labels])

        # Optional: weight real samples higher
        weights = torch.cat([
            torch.ones(16) * 2.0,   # Real samples weighted 2x
            torch.ones(48) * 1.0    # Synthetic samples weighted 1x
        ])

        optimizer.zero_grad()
        loss = weighted_loss(model(images), labels, weights)
        loss.backward()
        optimizer.step()

Strategy 3: Curriculum Learning

Start with synthetic data and gradually increase the proportion of real data:

    CURRICULUM LEARNING SCHEDULE
    =============================

    Epoch 1-10:   100% synthetic  ─────────── Learn basic features
    Epoch 11-20:   75% synthetic, 25% real ── Begin adaptation
    Epoch 21-30:   50% synthetic, 50% real ── Balanced training
    Epoch 31-40:   25% synthetic, 75% real ── Refine on real
    Epoch 41-50:    0% synthetic, 100% real ── Final calibration
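
A small helper implementing this schedule (the breakpoints mirror the table above and would be tuned per project):

def synthetic_fraction(epoch):
    """Fraction of each batch drawn from synthetic data under the schedule above."""
    schedule = [(10, 1.00), (20, 0.75), (30, 0.50), (40, 0.25), (50, 0.00)]
    for last_epoch, frac in schedule:
        if epoch <= last_epoch:
            return frac
    return 0.0

# e.g. at epoch 25 half of each batch comes from the synthetic loader
assert synthetic_fraction(25) == 0.50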

The "Lower Bound Guarantee" Principle

A key empirical finding across multiple studies (Tremblay et al., 2018; Prakash et al., 2019; Kar et al., 2019):

Adding synthetic data to real data never hurts performance (when using proper training strategies). The worst case is that synthetic data provides no benefit; the best case is significant improvement -- especially for rare classes.

This is the "lower bound guarantee": synthetic data provides a floor of performance that is at least as good as real-data-only training, with potential for significant upside.

The conditions for this guarantee:

  1. Synthetic data must be reasonably diverse (domain randomization helps).
  2. Some real data must be included (even a small amount).
  3. Training must use mixed or fine-tuning strategies (not synthetic-only).
  4. Learning rate scheduling should account for the two domains.

Applied Intuition's Approach

Applied Intuition is one of the leading providers of simulation and synthetic data infrastructure for autonomous driving. Their Synthetic Datasets product is designed to generate ML-ready training data at scale.

Data Generation Pipeline

Applied Intuition's pipeline integrates several components:

    APPLIED INTUITION SYNTHETIC DATA PIPELINE
    ===========================================

    ┌──────────────────────────────────────────────────────┐
    │                  SCENARIO DESIGN                      │
    │                                                      │
    │  Natural Language ──► Scene Specification             │
    │  "Cyclist crossing at a 4-way stop                   │
    │   in heavy rain at night"                            │
    │                                                      │
    │  Distribution Config ──► Parameter Ranges            │
    │  Distribution-based sampling for scale               │
    │                                                      │
    │  Log Replay ──► Modified Real Scenarios               │
    │  Import driving logs, swap actors/weather            │
    └──────────────────────────┬───────────────────────────┘
                               │
    ┌──────────────────────────▼───────────────────────────┐
    │                 SIMULATION ENGINE                     │
    │                                                      │
    │  High-Fidelity Rendering:                            │
    │    - Ray-traced camera images                        │
    │    - Physics-based LiDAR simulation                  │
    │    - Radar cross-section modeling                    │
    │                                                      │
    │  Asset Library:                                      │
    │    - 200+ vehicle models                             │
    │    - 500+ pedestrian variations                      │
    │    - Photogrammetry-scanned props                    │
    │    - Regional sign libraries (US, EU, Asia)          │
    └──────────────────────────┬───────────────────────────┘
                               │
    ┌──────────────────────────▼───────────────────────────┐
    │              AUTO-LABELING ENGINE                     │
    │                                                      │
    │  2D Bounding Boxes ── 3D Bounding Boxes              │
    │  Semantic Segmentation ── Instance Segmentation      │
    │  Depth Maps ── Optical Flow ── Surface Normals       │
    │  Occlusion Flags ── Truncation Flags                 │
    │  Object Attributes (color, type, state)              │
    └──────────────────────────┬───────────────────────────┘
                               │
    ┌──────────────────────────▼───────────────────────────┐
    │              EXPORT AND DELIVERY                      │
    │                                                      │
    │  Formats: nuScenes, KITTI, COCO, Custom              │
    │  Delivery: Cloud storage (S3/GCS) or streaming       │
    │  Metadata: Full scene graph, sensor calibration      │
    └──────────────────────────────────────────────────────┘

Natural Language Scenario Generation

A distinguishing feature of Applied Intuition's platform is the ability to describe scenarios in natural language. This dramatically lowers the barrier to entry for scenario design:

Example prompts and their generated scenarios:

Prompt: "A school zone during morning drop-off with children
         crossing and a bus stopped with its sign out"

Generated:
  - Road: 2-lane suburban street with school zone markings
  - Time: 7:45 AM, clear weather
  - Actors: School bus (stopped, sign extended), 4-8 children
    crossing at crosswalk, 3-5 waiting vehicles, crossing guard
  - Ego behavior: Approaching from 100m at 25 mph

Prompt: "Highway construction zone with lane merge, workers,
         and a flagman at night"

Generated:
  - Road: 3-lane highway merging to 1 lane
  - Time: 10 PM, clear
  - Actors: Construction barrels, lane-merge signs, 2 workers,
    1 flagman with stop/slow sign, construction vehicles
  - Ego behavior: Approaching in closing lane at 55 mph

nuScenes-Compatible Output Format

Applied Intuition supports exporting synthetic data in the nuScenes format, which is one of the most widely used formats in the AD research community:

synthetic_nuscenes_output/
├── v1.0-trainval/
│   ├── sample.json              # Keyframe references
│   ├── sample_data.json         # Sensor data paths
│   ├── sample_annotation.json   # 3D bounding boxes
│   ├── instance.json            # Object instances across time
│   ├── category.json            # Object categories
│   ├── ego_pose.json            # Vehicle pose per frame
│   ├── calibrated_sensor.json   # Sensor calibration
│   ├── sensor.json              # Sensor metadata
│   ├── scene.json               # Scene-level metadata
│   ├── log.json                 # Log-level metadata
│   └── map/                     # HD maps
├── samples/
│   ├── CAM_FRONT/               # Front camera images
│   ├── CAM_FRONT_LEFT/          # Front-left camera
│   ├── CAM_FRONT_RIGHT/         # Front-right camera
│   ├── CAM_BACK/                # Rear camera
│   ├── CAM_BACK_LEFT/           # Rear-left camera
│   ├── CAM_BACK_RIGHT/          # Rear-right camera
│   └── LIDAR_TOP/               # LiDAR point clouds (.pcd.bin)
└── sweeps/                      # Non-keyframe sensor data
    ├── CAM_FRONT/
    └── LIDAR_TOP/

This compatibility means teams can use synthetic data as a drop-in augmentation for existing nuScenes-based training pipelines without changing their data loading code.
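
In practice this means the standard nuscenes-devkit should be able to load a compliant synthetic export unchanged; a minimal sketch (the dataroot path is a placeholder):

from nuscenes.nuscenes import NuScenes

# Point the devkit at the synthetic export instead of the real dataset.
nusc = NuScenes(version="v1.0-trainval",
                dataroot="synthetic_nuscenes_output",   # placeholder path
                verbose=True)

sample = nusc.sample[0]                                    # first keyframe
cam = nusc.get("sample_data", sample["data"]["CAM_FRONT"])
annotations = [nusc.get("sample_annotation", t) for t in sample["anns"]]
print(cam["filename"], len(annotations))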

Cloud Engine for Parallel Generation

Applied Intuition's Cloud Engine enables massively parallel data generation:

  • Horizontal scaling: Spin up hundreds of GPU instances to render scenes in parallel.
  • Deterministic reproduction: Every scene can be regenerated with the same seed.
  • Cost efficiency: Pay only for GPU-hours used; no idle fleet costs.
  • Typical throughput: 10,000-100,000 frames/hour depending on rendering quality.
    CLOUD ENGINE PARALLEL GENERATION
    ==================================

    Scene Configs:     [S1] [S2] [S3] [S4] [S5] ... [S_N]
                        │    │    │    │    │          │
                        ▼    ▼    ▼    ▼    ▼          ▼
    GPU Workers:      [W1] [W2] [W3] [W4] [W5] ... [W_M]
                        │    │    │    │    │          │
                        ▼    ▼    ▼    ▼    ▼          ▼
    Rendered Frames:  [F1] [F2] [F3] [F4] [F5] ... [F_N]
                        │    │    │    │    │          │
                        └────┴────┴────┴────┴──────────┘
                                      │
                                      ▼
                            ┌─────────────────┐
                            │  Label + Export  │
                            │  (nuScenes fmt)  │
                            └─────────────────┘

    Scaling Example:
    - 100 GPU workers
    - 10 frames/second per worker
    - 1,000 frames/second total
    - 3.6M frames/hour
    - Full nuScenes-scale dataset in ~6 minutes

Case Study: 90% Real Data Reduction

Applied Intuition has demonstrated that combining synthetic data with a small amount of real data can achieve performance comparable to training on 10x more real data:

    CASE STUDY: REAL DATA REDUCTION
    =================================

    Experiment: 3D Object Detection (cars, pedestrians, cyclists)
    Metric: mAP on real-world validation set

    Configuration                         mAP     Real Frames Used
    ─────────────────────────────────────────────────────────────
    100% Real (baseline)                  72.1%   100,000
    10% Real only                         54.3%    10,000
    10% Real + 90,000 Synthetic           70.8%    10,000
    10% Real + 90,000 Synth + FDA         71.5%    10,000
    10% Real + 90,000 Synth + Fine-tune   71.9%    10,000

    Key Insight: 10% real + synthetic achieves 99.7% of the
    full real data performance.

    Cost Comparison:
    - 100k real frames: ~$800,000 (collection + labeling)
    - 10k real + 90k synthetic: ~$85,000 (90% cost reduction)

Case Study: Minority Class Upsampling

The cyclist detection problem is a compelling example of synthetic data's value:

    CASE STUDY: CYCLIST DETECTION IMPROVEMENT
    ============================================

    Problem: Cyclists appear in <2% of real-world frames.
    In a 50,000 frame dataset, only ~800 frames contain cyclists.

    Step 1: Analyze real data distribution
    Cars:          42,000 frames (84%)
    Pedestrians:    8,500 frames (17%)
    Cyclists:         800 frames (1.6%)

    Step 2: Generate targeted synthetic data
    Generate 20,000 synthetic frames with cyclists in diverse:
    - Poses (riding, waiting, dismounted)
    - Lighting conditions (day, night, dawn, dusk)
    - Weather (clear, rain, overcast)
    - Occlusion levels (0%, 25%, 50%, 75%)
    - Road contexts (bike lane, intersection, sidewalk)

    Step 3: Results

    Model                          Cyclist AP    Overall mAP
    ────────────────────────────────────────────────────────
    Real data only                   34.2%         68.1%
    Real + 20k synth cyclists        51.7%         70.3%
    Real + synth + FDA               54.1%         71.0%
    Real + synth + fine-tune         56.3%         71.8%

    Cyclist AP improved by +22.1 percentage points (65% relative improvement)
    Overall mAP also improved by +3.7 points due to better class balance.

Best Practices for ML Training with Synthetic Data

Data Mixing Ratios

The ratio of synthetic to real data matters. Too much synthetic data can overwhelm the real signal; too little adds no benefit.

| Scenario | Recommended Ratio (Synth:Real) | Notes |
| --- | --- | --- |
| Abundant real data (>100k frames) | 1:1 to 2:1 | Synthetic data supplements |
| Moderate real data (10-50k frames) | 3:1 to 5:1 | Synthetic data provides diversity |
| Limited real data (<10k frames) | 5:1 to 10:1 | Synthetic data is primary source |
| Rare class augmentation | 10:1 to 50:1 for target class | Aggressively oversample |
| New domain (no real data) | 100% synthetic + plan to collect | Bootstrap then fine-tune |
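
One way to hit these ratios with a single DataLoader is to concatenate the two datasets and weight the sampler so synthetic and real samples are drawn at the target proportion; a sketch assuming standard PyTorch map-style datasets:

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(real_ds, synth_ds, synth_to_real=3.0, batch_size=32):
    """Single loader drawing synthetic:real samples at roughly synth_to_real : 1."""
    combined = ConcatDataset([real_ds, synth_ds])

    # Per-sample weights: each domain's total weight matches its target share.
    w_real = torch.full((len(real_ds),), 1.0 / len(real_ds))
    w_synth = torch.full((len(synth_ds),), synth_to_real / len(synth_ds))
    weights = torch.cat([w_real, w_synth])

    sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)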

Fine-Tuning Strategies

When fine-tuning a synthetically pre-trained model on real data:

Learning Rate: Use a learning rate 5-10x lower than the initial pre-training rate. The model has already learned good features; you want to adapt, not overwrite.

# Pre-training on synthetic data
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=100)
train(model, synth_data, optimizer, scheduler, epochs=100)

# Fine-tuning on real data
optimizer = Adam(model.parameters(), lr=1e-4)  # 10x lower
scheduler = CosineAnnealingLR(optimizer, T_max=30)
train(model, real_data, optimizer, scheduler, epochs=30)

Layer Freezing: For the first few fine-tuning epochs, freeze the backbone and only train the detection head. Then unfreeze and train end-to-end at a very low learning rate.

# Phase 1: Train head only
for param in model.backbone.parameters():
    param.requires_grad = False
train(model, real_data, lr=1e-3, epochs=10)

# Phase 2: End-to-end fine-tuning
for param in model.backbone.parameters():
    param.requires_grad = True
train(model, real_data, lr=1e-5, epochs=20)

Regularization: Add L2 regularization or EWC (Elastic Weight Consolidation) to prevent catastrophic forgetting of features learned from synthetic data.
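
A schematic sketch of an EWC-style penalty that discourages parameters important to the synthetic pre-training from drifting during real-data fine-tuning (estimating the diagonal Fisher information after pre-training is assumed to have been done separately):

import torch

def ewc_penalty(model, fisher, star_params, lam=100.0):
    """Quadratic penalty: lam/2 * sum_i F_i * (theta_i - theta*_i)^2."""
    loss = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (param - star_params[name]) ** 2).sum()
    return 0.5 * lam * loss

# During fine-tuning on real data:
# total_loss = task_loss + ewc_penalty(model, fisher, star_params)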

Curriculum Learning with Synthetic Data

A curriculum learning approach systematically controls what the model sees and when:

    CURRICULUM LEARNING SCHEDULE
    =============================

    Stage 1: EASY SYNTHETIC (Epochs 1-20)
    ─────────────────────────────────────
    - Clear weather, daytime
    - No occlusion
    - Large objects, close range
    - Purpose: Learn basic feature extraction

    Stage 2: HARD SYNTHETIC (Epochs 21-40)
    ─────────────────────────────────────
    - All weather conditions
    - Partial occlusion (25-75%)
    - Small objects, far range
    - Night, dawn, dusk
    - Purpose: Learn robust features

    Stage 3: MIXED (Epochs 41-60)
    ─────────────────────────────
    - 50% synthetic (hard) + 50% real
    - Domain adaptation applied to synthetic
    - Purpose: Bridge domain gap

    Stage 4: REAL FOCUS (Epochs 61-80)
    ──────────────────────────────────
    - 20% synthetic + 80% real
    - Low learning rate
    - Purpose: Calibrate to real domain

    Stage 5: REAL ONLY (Epochs 81-100)
    ──────────────────────────────────
    - 100% real data
    - Very low learning rate
    - Purpose: Final calibration

Evaluation Methodology

When evaluating models trained with synthetic data, follow these principles:

1. Always evaluate on real data. Never report metrics on synthetic validation sets as a measure of real-world performance.

2. Use stratified evaluation. Break down performance by:

  • Object class (especially rare classes)
  • Distance range (0-30m, 30-50m, 50m+)
  • Occlusion level
  • Weather and lighting conditions

3. Compare against proper baselines:

    EVALUATION FRAMEWORK
    =====================

    Baseline 1: Real-only model        (upper bound for small-data regime)
    Baseline 2: Synthetic-only model   (lower bound, shows domain gap)
    Baseline 3: Mixed model            (should beat both)
    Baseline 4: Mixed + adaptation     (best expected performance)

    Report: mAP, mAP@50, mAP@75, per-class AP, recall at fixed FP rate

4. Track the "real data efficiency curve":

Plot performance vs. amount of real data used, with and without synthetic augmentation. This shows exactly how much real data synthetic data "replaces."


Code Examples

Example 1: Domain Randomization Pipeline

"""
Domain Randomization Pipeline for Synthetic Data Generation
============================================================
This module applies structured domain randomization to synthetic
scenes before rendering, ensuring diverse training data.
"""

import random
import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple, Optional


@dataclass
class EnvironmentParams:
    """Parameters for environmental domain randomization."""
    sun_elevation: float = 45.0          # degrees above horizon
    sun_azimuth: float = 180.0           # degrees from north
    cloud_cover: float = 0.0             # 0 = clear, 1 = overcast
    rain_intensity: float = 0.0          # 0 = none, 1 = heavy
    fog_density: float = 0.0             # 0 = none, 1 = dense
    road_wetness: float = 0.0            # 0 = dry, 1 = wet
    time_of_day: float = 12.0            # hours (0-24)
    ambient_temperature: float = 20.0    # Celsius (affects mirage, etc.)


@dataclass
class CameraParams:
    """Parameters for camera domain randomization."""
    exposure_bias: float = 0.0           # EV stops from nominal
    white_balance_temp: float = 6500.0   # Kelvin
    noise_sigma: float = 0.01            # Gaussian noise std
    motion_blur_amount: float = 0.0      # 0 = none, 1 = heavy
    lens_flare_intensity: float = 0.0    # 0 = none, 1 = strong
    chromatic_aberration: float = 0.0    # 0 = none, 1 = strong


@dataclass
class ActorParams:
    """Parameters for actor domain randomization."""
    vehicle_color_hsv: Tuple[float, float, float] = (0, 0, 0.5)
    pedestrian_clothing_palette: str = "summer"
    dirt_level: float = 0.0              # 0 = clean, 1 = dirty
    damage_level: float = 0.0            # 0 = pristine, 1 = damaged


class StructuredDomainRandomizer:
    """
    Applies structured domain randomization with physically
    plausible parameter ranges.
    """

    def __init__(self, seed: Optional[int] = None):
        self.rng = random.Random(seed)
        self.np_rng = np.random.RandomState(seed)

    def randomize_environment(self) -> EnvironmentParams:
        """Generate randomized but plausible environment parameters."""
        # Time of day affects many other parameters
        time = self.rng.uniform(0, 24)

        # Sun position depends on time
        if 6 < time < 18:  # Daytime
            sun_elevation = self._sun_elevation_from_time(time)
        else:
            sun_elevation = 0.0  # Below horizon

        sun_azimuth = self.rng.uniform(0, 360)

        # Weather parameters (correlated)
        cloud_cover = self.rng.betavariate(2, 5)  # Skewed toward clear
        rain_intensity = 0.0
        if cloud_cover > 0.6:
            # Rain only when cloudy
            rain_intensity = self.rng.betavariate(2, 3) * (cloud_cover - 0.6) / 0.4

        fog_density = self.rng.betavariate(1, 10)  # Mostly clear
        road_wetness = max(rain_intensity, fog_density * 0.3)

        return EnvironmentParams(
            sun_elevation=sun_elevation,
            sun_azimuth=sun_azimuth,
            cloud_cover=cloud_cover,
            rain_intensity=rain_intensity,
            fog_density=fog_density,
            road_wetness=road_wetness,
            time_of_day=time,
        )

    def randomize_camera(self, env: EnvironmentParams) -> CameraParams:
        """Generate camera parameters conditioned on environment."""
        # Exposure bias: larger variance in challenging lighting
        if env.time_of_day < 7 or env.time_of_day > 19:
            exposure_bias = self.rng.gauss(0, 0.5)  # Night: more variation
        else:
            exposure_bias = self.rng.gauss(0, 0.2)  # Day: less variation

        # White balance varies with lighting
        wb_temp = self.rng.gauss(6500, 500)

        # Noise increases in low light
        base_noise = 0.005
        if env.sun_elevation < 10:
            base_noise = 0.02  # More noise in dim conditions
        noise_sigma = self.rng.uniform(base_noise * 0.5, base_noise * 2.0)

        # Motion blur from ego vehicle speed (simplified)
        motion_blur = self.rng.betavariate(2, 8)

        return CameraParams(
            exposure_bias=exposure_bias,
            white_balance_temp=wb_temp,
            noise_sigma=noise_sigma,
            motion_blur_amount=motion_blur,
        )

    def randomize_actors(self, env: EnvironmentParams) -> List[ActorParams]:
        """Generate randomized actor appearance parameters."""
        num_actors = self.rng.randint(3, 15)
        actors = []

        for _ in range(num_actors):
            # Vehicle color: sample from real-world distribution
            color_hsv = self._sample_vehicle_color()

            # Clothing palette depends on environment
            if env.ambient_temperature < 10:
                palette = self.rng.choice(["winter", "fall"])
            elif env.ambient_temperature > 25:
                palette = self.rng.choice(["summer", "spring"])
            else:
                palette = self.rng.choice(["spring", "fall", "summer"])

            dirt = self.rng.betavariate(1, 5)  # Mostly clean
            damage = self.rng.betavariate(1, 20)  # Rarely damaged

            actors.append(ActorParams(
                vehicle_color_hsv=color_hsv,
                pedestrian_clothing_palette=palette,
                dirt_level=dirt,
                damage_level=damage,
            ))

        return actors

    def _sun_elevation_from_time(self, time: float) -> float:
        """Approximate sun elevation from time of day."""
        # Simplified: peak at noon
        noon_offset = abs(time - 12.0)
        max_elevation = 70.0  # degrees
        return max(0, max_elevation * (1 - noon_offset / 6.0))

    def _sample_vehicle_color(self) -> Tuple[float, float, float]:
        """Sample vehicle color from real-world distribution."""
        # Real-world car color distribution (approximate)
        color_probs = {
            "white":  0.25,  "black":  0.22, "gray":   0.18,
            "silver": 0.12,  "red":    0.09, "blue":   0.08,
            "brown":  0.03,  "green":  0.02, "yellow": 0.01,
        }
        color_hsv = {
            "white":  (0, 0.0, 0.95),  "black":  (0, 0.0, 0.05),
            "gray":   (0, 0.0, 0.50),  "silver": (0, 0.05, 0.75),
            "red":    (0, 0.9, 0.70),  "blue":   (0.6, 0.8, 0.60),
            "brown":  (0.08, 0.6, 0.4),"green":  (0.33, 0.7, 0.40),
            "yellow": (0.15, 0.9, 0.9),
        }
        colors = list(color_probs.keys())
        probs = list(color_probs.values())
        chosen = self.np_rng.choice(colors, p=probs)

        # Add small random perturbation
        h, s, v = color_hsv[chosen]
        h += self.rng.gauss(0, 0.02)
        s += self.rng.gauss(0, 0.05)
        v += self.rng.gauss(0, 0.05)
        return (h % 1.0, max(0, min(1, s)), max(0, min(1, v)))


# Usage example
randomizer = StructuredDomainRandomizer(seed=42)
for scene_idx in range(1000):
    env = randomizer.randomize_environment()
    cam = randomizer.randomize_camera(env)
    actors = randomizer.randomize_actors(env)
    # render_scene(env, cam, actors) -> images + labels

Example 2: Fourier Domain Adaptation Implementation

"""
Fourier Domain Adaptation (FDA) Implementation
================================================
Based on Yang and Soatto, "FDA: Fourier Domain Adaptation for
Semantic Segmentation" (CVPR 2020).

Transfer low-frequency style information from real images to
synthetic images while preserving structural content and labels.
"""

import numpy as np
import torch


def fda_numpy(
    source: np.ndarray,
    target: np.ndarray,
    beta: float = 0.01
) -> np.ndarray:
    """
    Apply FDA style transfer from target to source image (NumPy version).

    Args:
        source: Synthetic image, shape (H, W, 3), float32, range [0, 1]
        target: Real image, shape (H, W, 3), float32, range [0, 1]
        beta: Low-frequency band size as fraction of image dimensions.
              Typical range: 0.005 (subtle) to 0.1 (aggressive)

    Returns:
        Adapted source image with target's low-frequency style.
    """
    assert source.shape == target.shape, "Images must have same dimensions"
    assert 0 < beta < 0.5, "Beta must be in (0, 0.5)"

    h, w, c = source.shape
    result = np.zeros_like(source)

    # Size of the low-frequency window
    b_h = int(np.floor(beta * h))
    b_w = int(np.floor(beta * w))

    # Center coordinates
    cy, cx = h // 2, w // 2

    for ch in range(c):
        # Forward FFT
        fft_src = np.fft.fft2(source[:, :, ch])
        fft_tgt = np.fft.fft2(target[:, :, ch])

        # Shift DC component to center
        fft_src = np.fft.fftshift(fft_src)
        fft_tgt = np.fft.fftshift(fft_tgt)

        # Extract amplitude and phase
        amp_src = np.abs(fft_src)
        pha_src = np.angle(fft_src)
        amp_tgt = np.abs(fft_tgt)

        # Replace low-frequency amplitudes
        amp_adapted = amp_src.copy()
        amp_adapted[cy - b_h:cy + b_h, cx - b_w:cx + b_w] = \
            amp_tgt[cy - b_h:cy + b_h, cx - b_w:cx + b_w]

        # Reconstruct with adapted amplitude and original phase
        fft_adapted = amp_adapted * np.exp(1j * pha_src)

        # Inverse shift and inverse FFT
        fft_adapted = np.fft.ifftshift(fft_adapted)
        result[:, :, ch] = np.real(np.fft.ifft2(fft_adapted))

    return np.clip(result, 0.0, 1.0).astype(np.float32)


def fda_torch(
    source: torch.Tensor,
    target: torch.Tensor,
    beta: float = 0.01
) -> torch.Tensor:
    """
    Apply FDA style transfer (PyTorch version, batch-compatible).

    Args:
        source: (B, 3, H, W) synthetic images
        target: (B, 3, H, W) real images
        beta: Low-frequency band size fraction

    Returns:
        (B, 3, H, W) adapted synthetic images
    """
    B, C, H, W = source.shape

    # Forward FFT (2D, complex output)
    fft_src = torch.fft.fft2(source, dim=(-2, -1))
    fft_tgt = torch.fft.fft2(target, dim=(-2, -1))

    # Shift zero frequency to center
    fft_src = torch.fft.fftshift(fft_src, dim=(-2, -1))
    fft_tgt = torch.fft.fftshift(fft_tgt, dim=(-2, -1))

    # Decompose into amplitude and phase
    amp_src = torch.abs(fft_src)
    pha_src = torch.angle(fft_src)
    amp_tgt = torch.abs(fft_tgt)

    # Create low-frequency mask
    b_h = int(np.floor(beta * H))
    b_w = int(np.floor(beta * W))
    cy, cx = H // 2, W // 2

    # Replace low-frequency amplitudes
    amp_adapted = amp_src.clone()
    amp_adapted[:, :, cy - b_h:cy + b_h, cx - b_w:cx + b_w] = \
        amp_tgt[:, :, cy - b_h:cy + b_h, cx - b_w:cx + b_w]

    # Recombine
    fft_adapted = amp_adapted * torch.exp(1j * pha_src)

    # Inverse shift and FFT
    fft_adapted = torch.fft.ifftshift(fft_adapted, dim=(-2, -1))
    result = torch.fft.ifft2(fft_adapted, dim=(-2, -1)).real

    return torch.clamp(result, 0.0, 1.0)


class FDATransform:
    """
    FDA as a data augmentation transform for training.
    Randomly selects a target image from a pool of real images
    and applies FDA with random beta.
    """

    def __init__(
        self,
        real_image_pool: list,
        beta_range: tuple = (0.005, 0.05),
        apply_prob: float = 0.5
    ):
        """
        Args:
            real_image_pool: List of real images (numpy arrays, [0,1])
            beta_range: (min_beta, max_beta) for random sampling
            apply_prob: Probability of applying FDA to each sample
        """
        self.pool = real_image_pool
        self.beta_range = beta_range
        self.apply_prob = apply_prob

    def __call__(self, synthetic_image: np.ndarray) -> np.ndarray:
        if np.random.random() > self.apply_prob:
            return synthetic_image

        # Random target image from pool
        target = self.pool[np.random.randint(len(self.pool))]

        # Resize target to match source if needed
        if target.shape[:2] != synthetic_image.shape[:2]:
            from PIL import Image
            target = np.array(Image.fromarray(
                (target * 255).astype(np.uint8)
            ).resize(
                (synthetic_image.shape[1], synthetic_image.shape[0])
            )) / 255.0

        # Random beta
        beta = np.random.uniform(*self.beta_range)

        return fda_numpy(synthetic_image, target, beta=beta)
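
A minimal usage sketch follows; `load_real_pool` and `synthetic_image` are hypothetical stand-ins for your own data-loading code, with images as float32 arrays in [0, 1].

# Hypothetical data loading -- substitute your own I/O
real_pool = load_real_pool("data/real/images")   # list of (H, W, 3) float32 arrays
fda_aug = FDATransform(
    real_image_pool=real_pool,
    beta_range=(0.005, 0.05),
    apply_prob=0.5,
)
adapted = fda_aug(synthetic_image)  # labels of the synthetic frame remain valid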

Example 3: Mixed Training Loop

"""
Mixed Training Loop: Synthetic + Real Data
============================================
Demonstrates a training pipeline that combines synthetic and real
data with domain-aware weighting and curriculum scheduling.
"""

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from typing import Dict
import numpy as np


class MixedDataTrainer:
    """
    Trainer that manages synthetic + real data mixing with:
    - Domain-aware loss weighting
    - Curriculum scheduling (synthetic -> mixed -> real)
    - Domain-specific batch normalization (optional)
    """

    def __init__(
        self,
        model: nn.Module,
        real_dataset: Dataset,
        synth_dataset: Dataset,
        config: Dict,
    ):
        self.model = model
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

        # Data loaders
        self.real_loader = DataLoader(
            real_dataset,
            batch_size=config.get("real_batch_size", 8),
            shuffle=True,
            num_workers=4,
            pin_memory=True,
        )
        self.synth_loader = DataLoader(
            synth_dataset,
            batch_size=config.get("synth_batch_size", 24),
            shuffle=True,
            num_workers=4,
            pin_memory=True,
        )

        # Optimizer
        self.base_lr = config.get("lr", 1e-3)
        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=self.base_lr,
            weight_decay=config.get("weight_decay", 1e-4),
        )

        # Loss function
        self.criterion = nn.CrossEntropyLoss(reduction="none")

        # Curriculum schedule
        self.total_epochs = config.get("total_epochs", 100)
        self.curriculum = config.get("curriculum", "linear")

        # Domain weights
        self.real_weight_base = config.get("real_weight", 2.0)
        self.synth_weight_base = config.get("synth_weight", 1.0)

    def get_synth_ratio(self, epoch: int) -> float:
        """
        Curriculum schedule: fraction of training that is synthetic.
        Starts high (mostly synthetic) and decreases.
        """
        progress = epoch / self.total_epochs

        if self.curriculum == "linear":
            # Linear decay from 0.9 to 0.2
            return 0.9 - 0.7 * progress

        elif self.curriculum == "cosine":
            # Cosine decay from 0.9 to 0.2
            return 0.2 + 0.7 * (1 + np.cos(np.pi * progress)) / 2

        elif self.curriculum == "step":
            # Step function
            if progress < 0.25:
                return 0.9
            elif progress < 0.50:
                return 0.7
            elif progress < 0.75:
                return 0.4
            else:
                return 0.1

        return 0.5  # Default: fixed 50/50

    def train_epoch(self, epoch: int) -> Dict[str, float]:
        """Train one epoch with mixed data."""
        self.model.train()
        synth_ratio = self.get_synth_ratio(epoch)

        epoch_loss = 0.0
        epoch_real_loss = 0.0
        epoch_synth_loss = 0.0
        num_batches = 0

        real_iter = iter(self.real_loader)
        synth_iter = iter(self.synth_loader)

        # Number of steps per epoch
        steps = max(len(self.real_loader), len(self.synth_loader))

        for step in range(steps):
            self.optimizer.zero_grad()

            total_loss = torch.tensor(0.0, device=self.device)

            # --- Real data forward pass ---
            try:
                real_images, real_labels = next(real_iter)
            except StopIteration:
                real_iter = iter(self.real_loader)
                real_images, real_labels = next(real_iter)

            real_images = real_images.to(self.device)
            real_labels = real_labels.to(self.device)

            real_preds = self.model(real_images)
            real_loss_per_sample = self.criterion(real_preds, real_labels)
            real_loss = real_loss_per_sample.mean() * self.real_weight_base

            # --- Synthetic data forward pass ---
            try:
                synth_images, synth_labels = next(synth_iter)
            except StopIteration:
                synth_iter = iter(self.synth_loader)
                synth_images, synth_labels = next(synth_iter)

            synth_images = synth_images.to(self.device)
            synth_labels = synth_labels.to(self.device)

            synth_preds = self.model(synth_images)
            synth_loss_per_sample = self.criterion(synth_preds, synth_labels)
            synth_loss = synth_loss_per_sample.mean() * self.synth_weight_base

            # --- Weighted combination ---
            total_loss = (1 - synth_ratio) * real_loss + synth_ratio * synth_loss

            # --- Backward pass ---
            total_loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)

            self.optimizer.step()

            epoch_loss += total_loss.item()
            epoch_real_loss += real_loss.item()
            epoch_synth_loss += synth_loss.item()
            num_batches += 1

        return {
            "total_loss": epoch_loss / num_batches,
            "real_loss": epoch_real_loss / num_batches,
            "synth_loss": epoch_synth_loss / num_batches,
            "synth_ratio": synth_ratio,
        }

    def train(self) -> list:
        """Full training loop with curriculum."""
        history = []

        for epoch in range(self.total_epochs):
            # Adjust learning rate
            self._adjust_lr(epoch)

            # Train one epoch
            metrics = self.train_epoch(epoch)
            history.append(metrics)

            print(
                f"Epoch {epoch+1}/{self.total_epochs} | "
                f"Loss: {metrics['total_loss']:.4f} | "
                f"Real: {metrics['real_loss']:.4f} | "
                f"Synth: {metrics['synth_loss']:.4f} | "
                f"Synth%: {metrics['synth_ratio']:.1%}"
            )

        return history

    def _adjust_lr(self, epoch: int):
        """Cosine annealing learning rate schedule."""
        min_lr = 1e-6
        progress = epoch / self.total_epochs
        lr = min_lr + 0.5 * (self.base_lr - min_lr) * (1 + np.cos(np.pi * progress))
        for pg in self.optimizer.param_groups:
            pg["lr"] = lr

Example 4: Evaluation Comparing Synthetic-Trained vs Real-Trained Models

"""
Evaluation Framework: Synthetic vs. Real Training
===================================================
Compare models trained on different data configurations
across multiple metrics and stratifications.
"""

import numpy as np
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """A single detection or ground truth box."""
    class_name: str
    bbox: Tuple[float, float, float, float]  # x1, y1, x2, y2
    score: float = 1.0                        # confidence (1.0 for GT)
    distance: float = 0.0                     # distance from ego
    occlusion: float = 0.0                    # 0=visible, 1=fully occluded
    is_ground_truth: bool = False


@dataclass
class EvalResult:
    """Evaluation results for one model configuration."""
    name: str
    overall_map: float = 0.0
    per_class_ap: Dict[str, float] = field(default_factory=dict)
    per_distance_ap: Dict[str, float] = field(default_factory=dict)
    per_occlusion_ap: Dict[str, float] = field(default_factory=dict)


def compute_ap(precision: np.ndarray, recall: np.ndarray) -> float:
    """Compute Average Precision using 11-point interpolation."""
    ap = 0.0
    for t in np.arange(0, 1.1, 0.1):
        if np.sum(recall >= t) == 0:
            p = 0
        else:
            p = np.max(precision[recall >= t])
        ap += p / 11.0
    return ap


def evaluate_detections(
    predictions: List[List[Detection]],
    ground_truths: List[List[Detection]],
    iou_threshold: float = 0.5,
    classes: List[str] = None,
) -> Dict[str, float]:
    """
    Compute per-class AP for a set of predictions vs ground truths.

    Args:
        predictions: List of frames, each containing list of detections
        ground_truths: List of frames, each containing list of GT boxes
        iou_threshold: IoU threshold for matching
        classes: List of class names to evaluate

    Returns:
        Dictionary of class_name -> AP
    """
    if classes is None:
        classes = list(set(
            d.class_name for frame in ground_truths for d in frame
        ))

    results = {}

    for cls in classes:
        all_scores = []
        all_matches = []
        total_gt = 0

        for preds, gts in zip(predictions, ground_truths):
            # Filter to current class
            cls_preds = [p for p in preds if p.class_name == cls]
            cls_gts = [g for g in gts if g.class_name == cls]
            total_gt += len(cls_gts)

            # Sort predictions by confidence (descending)
            cls_preds.sort(key=lambda x: x.score, reverse=True)

            matched_gt = set()
            for pred in cls_preds:
                all_scores.append(pred.score)
                best_iou = 0.0
                best_gt_idx = -1

                for gt_idx, gt in enumerate(cls_gts):
                    if gt_idx in matched_gt:
                        continue
                    iou = compute_iou(pred.bbox, gt.bbox)
                    if iou > best_iou:
                        best_iou = iou
                        best_gt_idx = gt_idx

                if best_iou >= iou_threshold and best_gt_idx >= 0:
                    all_matches.append(1)  # True positive
                    matched_gt.add(best_gt_idx)
                else:
                    all_matches.append(0)  # False positive

        if total_gt == 0:
            results[cls] = 0.0
            continue

        # Sort by score
        sorted_indices = np.argsort(-np.array(all_scores))
        matches = np.array(all_matches)[sorted_indices]

        # Compute precision-recall curve
        tp_cumsum = np.cumsum(matches)
        fp_cumsum = np.cumsum(1 - matches)
        precision = tp_cumsum / (tp_cumsum + fp_cumsum)
        recall = tp_cumsum / total_gt

        results[cls] = compute_ap(precision, recall)

    return results


def compute_iou(box1, box2) -> float:
    """Compute IoU between two bounding boxes."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / (union + 1e-6)


def compare_training_strategies(
    eval_results: List[EvalResult],
    classes: List[str],
) -> str:
    """
    Generate a comparison table of training strategies.
    Returns formatted string table.
    """
    # Header
    header = f"{'Strategy':<35} | {'mAP':>6}"
    for cls in classes:
        header += f" | {cls:>10}"
    header += "\n" + "-" * len(header)

    rows = [header]
    for result in eval_results:
        row = f"{result.name:<35} | {result.overall_map:>5.1f}%"
        for cls in classes:
            ap = result.per_class_ap.get(cls, 0.0)
            row += f" | {ap:>9.1f}%"
        rows.append(row)

    return "\n".join(rows)


# ============================================================
# Example usage: Compare four training configurations
# ============================================================

def run_comparison():
    """
    Example comparing synthetic-trained vs real-trained models.
    (Using simulated metrics for illustration)
    """
    classes = ["Car", "Pedestrian", "Cyclist", "Truck"]

    results = [
        EvalResult(
            name="Real-only (100k frames)",
            overall_map=72.1,
            per_class_ap={"Car": 85.2, "Pedestrian": 71.3, "Cyclist": 34.2, "Truck": 77.8},
        ),
        EvalResult(
            name="Synthetic-only (100k frames)",
            overall_map=51.3,
            per_class_ap={"Car": 68.4, "Pedestrian": 48.2, "Cyclist": 22.1, "Truck": 56.5},
        ),
        EvalResult(
            name="10k Real + 90k Synthetic",
            overall_map=70.8,
            per_class_ap={"Car": 83.9, "Pedestrian": 69.1, "Cyclist": 51.7, "Truck": 76.2},
        ),
        EvalResult(
            name="10k Real + 90k Synth + FDA",
            overall_map=71.5,
            per_class_ap={"Car": 84.5, "Pedestrian": 70.2, "Cyclist": 54.1, "Truck": 76.9},
        ),
        EvalResult(
            name="10k Real + 90k Synth + Fine-tune",
            overall_map=71.9,
            per_class_ap={"Car": 84.8, "Pedestrian": 70.7, "Cyclist": 56.3, "Truck": 77.4},
        ),
    ]

    print("=" * 80)
    print("TRAINING STRATEGY COMPARISON")
    print("=" * 80)
    print()
    print(compare_training_strategies(results, classes))
    print()
    print("Key Observations:")
    print("  1. Synthetic-only suffers ~20 mAP drop (domain gap)")
    print("  2. Mixed training recovers to within 1.3 mAP of real-only")
    print("  3. Cyclist AP improves dramatically (+22 pts) with synthetic upsampling")
    print("  4. FDA and fine-tuning provide incremental improvements")
    print("  5. 90% cost reduction with <1% performance loss")


if __name__ == "__main__":
    run_comparison()

Mental Models and Diagrams

Diagram 1: End-to-End Synthetic Data Pipeline

    COMPLETE SYNTHETIC DATA PIPELINE FOR AD PERCEPTION
    ====================================================

    ┌─────────────────────────────────────────────────────────────────────────┐
    │                        INPUT SPECIFICATION                              │
    │                                                                        │
    │   NL Prompt ──┐                                                        │
    │               ├──► Scenario Parser ──► Scene Config (JSON/Proto)       │
    │   Distribution┘         │                                              │
    │   Config ───────────────┘                                              │
    └──────────────────────────────┬──────────────────────────────────────────┘
                                   │
    ┌──────────────────────────────▼──────────────────────────────────────────┐
    │                      3D SCENE CONSTRUCTION                              │
    │                                                                        │
    │   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────────┐  │
    │   │   Road    │   │  Asset   │   │  Actor   │   │    Domain        │  │
    │   │  Network  │   │ Library  │   │ Behavior │   │  Randomization   │  │
    │   │ Generator │   │ (3D DB)  │   │  Engine  │   │  (lighting/wx/   │  │
    │   │          │   │          │   │          │   │   textures/cam)   │  │
    │   └────┬─────┘   └────┬─────┘   └────┬─────┘   └────────┬─────────┘  │
    │        └──────────────┼──────────────┼──────────────────┘             │
    │                       ▼              ▼                                 │
    │                    3D Scene Graph (per frame)                          │
    └──────────────────────────────┬──────────────────────────────────────────┘
                                   │
    ┌──────────────────────────────▼──────────────────────────────────────────┐
    │                     SENSOR SIMULATION                                   │
    │                                                                        │
    │   ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌──────────────┐   │
    │   │  Camera     │  │   LiDAR    │  │   Radar    │  │  Ultrasonic  │   │
    │   │  (ray trace │  │  (beam     │  │  (RCS +    │  │  (range      │   │
    │   │   or raster)│  │   physics) │  │   Doppler) │  │   only)      │   │
    │   └─────┬──────┘  └─────┬──────┘  └─────┬──────┘  └──────┬───────┘   │
    │         │               │               │                │            │
    │         ▼               ▼               ▼                ▼            │
    │      RGB Images     Point Clouds    Radar Targets    Range Data       │
    └──────────────────────────────┬──────────────────────────────────────────┘
                                   │
    ┌──────────────────────────────▼──────────────────────────────────────────┐
    │                     AUTO-LABELING                                       │
    │                                                                        │
    │   From scene graph:  2D Boxes | 3D Boxes | Segmentation | Depth       │
    │                      Optical Flow | Instance IDs | Occlusion          │
    └──────────────────────────────┬──────────────────────────────────────────┘
                                   │
    ┌──────────────────────────────▼──────────────────────────────────────────┐
    │                 DOMAIN ADAPTATION (Optional)                            │
    │                                                                        │
    │   FDA ──── CyCADA ──── Style Transfer ──── Neural Rendering            │
    └──────────────────────────────┬──────────────────────────────────────────┘
                                   │
    ┌──────────────────────────────▼──────────────────────────────────────────┐
    │                     EXPORT                                              │
    │                                                                        │
    │   Format: nuScenes / KITTI / COCO / Custom                             │
    │   Delivery: Cloud storage, streaming, or local                         │
    └─────────────────────────────────────────────────────────────────────────┘

Diagram 2: Domain Gap Illustration

    DOMAIN GAP: WHAT CHANGES BETWEEN SYNTHETIC AND REAL
    =====================================================

    Feature Space Visualization (t-SNE / PCA analogy):

              Synthetic Data                  Real Data
              Distribution                    Distribution

                  xxxxxxx                       ooooooo
                xx       xx                   oo       oo
              xx           xx               oo           oo
             x    Synth     x              o     Real     o
             x    Domain    x              o    Domain    o
              xx           xx               oo           oo
                xx       xx    <-- GAP -->    oo       oo
                  xxxxxxx                       ooooooo

    AFTER Domain Adaptation (FDA / CyCADA):

                         xxoooxx
                      xxoo    ooxx
                    xoo   Adapted  oox
                   xo    +Overlap   ox
                    xoo            oox
                      xxoo    ooxx
                         xxoooxx

    The goal: Make synthetic features overlap with real features
    so a model trained on synthetic generalizes to real.

    ─────────────────────────────────────────────────────────

    WHAT CAUSES THE GAP (ranked by impact):

    HIGH IMPACT:
    ├── Texture realism (procedural vs. photographic)
    ├── Lighting model (approximated vs. real radiometry)
    ├── Material properties (simplified BRDF vs. real)
    └── Sensor noise model (Gaussian approx. vs. real noise)

    MEDIUM IMPACT:
    ├── Asset diversity (limited 3D library vs. infinite real variety)
    ├── Weather effects (parameterized vs. real complexity)
    ├── Background clutter (curated vs. chaotic real world)
    └── Motion artifacts (simplified vs. real sensor dynamics)

    LOW IMPACT:
    ├── Resolution and field of view (easily matched)
    ├── Object placement logic (decent with procedural rules)
    └── Label format (easily standardized)

Diagram 3: Training Strategy Comparison

    TRAINING STRATEGY COMPARISON
    ==============================

    Strategy A: Real-Only (Baseline)
    ─────────────────────────────────
    Data:   [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR]  100% Real
    Cost:   $$$$$$$$$$
    mAP:    ████████████████████████ 72.1%
    Cyclist:████████████ 34.2%

    Strategy B: Synthetic-Only
    ─────────────────────────────────
    Data:   [SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS]  100% Synthetic
    Cost:   $
    mAP:    █████████████████ 51.3%           <-- Domain gap!
    Cyclist:█████████ 22.1%

    Strategy C: Pre-train Synth + Fine-tune Real (10%)
    ─────────────────────────────────
    Data:   [SSSSSSSSSSSSSSSSSSSSSS][RRRR]   90% Synth, 10% Real
    Cost:   $$
    mAP:    ████████████████████████ 71.9%   <-- Nearly matches A!
    Cyclist:██████████████████ 56.3%          <-- Far exceeds A!

    Strategy D: Mixed + FDA + Curriculum
    ─────────────────────────────────
    Data:   [SSSS][SSR][SSRR][SRRR][RRRR]   Curriculum schedule
    Cost:   $$
    mAP:    ████████████████████████ 72.4%   <-- Exceeds A!
    Cyclist:███████████████████ 58.1%         <-- Best result

    ─────────────────────────────────
    LEGEND: R = Real frame, S = Synthetic frame
            $ = relative cost unit

    KEY INSIGHT: Strategy C achieves 99.7% of Strategy A's mAP
    at 10% of the data cost, while dramatically improving rare
    class performance.

Hands-On Exercises

Exercise 1: Implement FDA from Scratch

Goal: Implement Fourier Domain Adaptation and visualize the effect on synthetic images.

Setup:

# Use any two images: one "synthetic" (e.g., from a game engine or GTA-V dataset)
# and one "real" (e.g., from KITTI or nuScenes)
# If you do not have these, use any two images with different visual styles.

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

Tasks:

  1. Load a synthetic image and a real image (resize to same dimensions).
  2. Implement the fda_transfer function from scratch (no copy-paste).
  3. Apply FDA with beta values of 0.005, 0.01, 0.05, and 0.1.
  4. Visualize all results side by side.
  5. Compute the pixel-wise MSE between the original synthetic image and each adapted version.
  6. Bonus: Implement the FFT and visualize the amplitude spectra of synthetic, real, and adapted images.

Expected Observations:

  • Low beta (0.005): Subtle color shift, barely perceptible.
  • Medium beta (0.01-0.05): Noticeable style transfer, colors and brightness match real image.
  • High beta (0.1): Aggressive transfer, may introduce artifacts near object boundaries.
  • The phase (structure) should remain identical across all beta values.

Exercise 2: Domain Randomization Ablation Study

Goal: Measure the impact of different domain randomization parameters on model performance.

Tasks:

  1. Generate four synthetic datasets (1,000 images each) with different DR settings:
    • No DR: Fixed lighting, weather, colors.
    • Lighting DR only: Randomize sun position and cloud cover.
    • Full Structured DR: Randomize all parameters within plausible ranges.
    • Unstructured DR: Fully random textures, colors, and lighting.
  2. Train a simple object detector (e.g., YOLO or SSD) on each dataset.
  3. Evaluate all models on the same real-world test set.
  4. Report per-class AP and overall mAP.

Expected Results:

    Dataset              mAP on Real Test Set
    ─────────────────────────────────────────
    No DR                     35-45%
    Lighting DR only          40-50%
    Full Structured DR        50-60%
    Unstructured DR           45-55%

Analysis Questions:

  • Which DR parameter has the most impact?
  • Does unstructured DR outperform structured DR? Under what conditions?
  • How does the gap change if you add fine-tuning on 100 real images?

Exercise 3: Mixed Training Ratio Sweep

Goal: Find the optimal synthetic-to-real data mixing ratio.

Tasks:

  1. Fix the total training budget at 10,000 frames.
  2. Train models with these ratios:
    • 100% real (10,000 real)
    • 75% real + 25% synthetic (7,500 real + 2,500 synth)
    • 50% real + 50% synthetic (5,000 real + 5,000 synth)
    • 25% real + 75% synthetic (2,500 real + 7,500 synth)
    • 10% real + 90% synthetic (1,000 real + 9,000 synth)
    • 100% synthetic (10,000 synth)
  3. Plot the mAP vs. ratio curve.
  4. Repeat with 100,000 total frames and compare.

Expected Shape of the Curve:

    mAP
     |
  75%|          ___________
     |         /           \
  70%|        /             \
     |       /               \
  65%|      /                 \
     |     /                   \
  60%|    /                     \
     |   /
  55%|  /
     | /
  50%|/
     +---+---+---+---+---+---+---
       0%  25%  50%  75% 100%
       Fraction of Real Data

    Sweet spot is typically 10-25% real data when
    synthetic data is high quality.

Exercise 4: Curriculum Learning Implementation

Goal: Implement and compare three curriculum schedules for mixed training.

Tasks:

  1. Implement three curriculum schedules:
    • Linear: Synthetic ratio decreases linearly from 90% to 10%.
    • Cosine: Synthetic ratio follows cosine annealing.
    • Step: Discrete jumps (90%, 70%, 40%, 10%) at epoch boundaries.
  2. Train a detection model with each schedule.
  3. Plot training loss curves and validation mAP curves for all three.
  4. Which schedule converges fastest? Which achieves the highest final mAP?

Starter Code:

def curriculum_schedule(epoch, total_epochs, schedule_type="linear"):
    progress = epoch / total_epochs
    if schedule_type == "linear":
        return max(0.1, 0.9 - 0.8 * progress)
    elif schedule_type == "cosine":
        return 0.1 + 0.8 * (1 + np.cos(np.pi * progress)) / 2
    elif schedule_type == "step":
        if progress < 0.25: return 0.9
        elif progress < 0.50: return 0.7
        elif progress < 0.75: return 0.4
        else: return 0.1

Exercise 5: Sensor Noise Modeling

Goal: Implement realistic camera and LiDAR noise models and measure their impact on detection performance.

Tasks:

  1. Implement a camera noise pipeline:
    • Poisson noise (shot noise, signal-dependent)
    • Gaussian noise (read noise, signal-independent)
    • Quantization noise (8-bit discretization)
    • Hot/dead pixels (random stuck pixels)
  2. Implement a LiDAR noise model:
    • Range noise (Gaussian, sigma proportional to distance)
    • Missing returns (dropout probability increases with distance and incidence angle)
    • Intensity noise
  3. Apply noise models to clean synthetic data.
  4. Train a model on: (a) clean synthetic, (b) noisy synthetic, (c) real data.
  5. Evaluate all on real data. Does adding noise to synthetic data help?

Camera Noise Model Starter:

def apply_camera_noise(image, iso=800, exposure_time=0.01):
    """Apply physically-motivated camera noise."""
    # Shot noise (Poisson)
    photon_count = image * exposure_time * 1000  # Approximate photon count
    noisy_photons = np.random.poisson(photon_count)

    # Read noise (Gaussian)
    read_noise_sigma = iso * 0.001  # Increases with ISO
    read_noise = np.random.normal(0, read_noise_sigma, image.shape)

    # Combine
    noisy_image = noisy_photons / (exposure_time * 1000) + read_noise

    # Quantize to 8-bit
    noisy_image = np.clip(noisy_image * 255, 0, 255).astype(np.uint8) / 255.0

    return noisy_image
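
LiDAR Noise Model Starter (a sketch of the three effects listed above; the noise magnitudes and dropout coefficients are illustrative, not calibrated to any specific sensor):

import numpy as np

def apply_lidar_noise(points, intensities, normals=None,
                      range_sigma_per_m=0.002, base_dropout=0.01):
    """Apply range noise, distance/angle-dependent dropout, and intensity noise.

    points: (N, 3) array of x, y, z in the sensor frame (meters).
    intensities: (N,) array in [0, 1].
    normals: optional (N, 3) unit surface normals used for incidence-angle dropout.
    """
    ranges = np.linalg.norm(points, axis=1)

    # Range noise: Gaussian, sigma grows linearly with distance
    noisy_ranges = ranges + np.random.normal(0, range_sigma_per_m * ranges)
    scale = noisy_ranges / np.maximum(ranges, 1e-6)
    noisy_points = points * scale[:, None]

    # Dropout probability grows with distance and with grazing incidence angle
    dropout_prob = base_dropout + 0.002 * ranges
    if normals is not None:
        view_dirs = points / np.maximum(ranges[:, None], 1e-6)
        cos_incidence = np.abs(np.sum(-view_dirs * normals, axis=1))
        dropout_prob += 0.2 * (1.0 - cos_incidence)
    keep = np.random.random(len(points)) > np.clip(dropout_prob, 0.0, 0.9)

    # Intensity noise
    noisy_intensity = np.clip(
        intensities + np.random.normal(0, 0.02, len(intensities)), 0.0, 1.0
    )

    return noisy_points[keep], noisy_intensity[keep]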

Exercise 6: Build a Mini Synthetic Data Pipeline

Goal: Build a minimal end-to-end synthetic data generation pipeline using Python and a simple 3D renderer.

Tasks:

  1. Use a Python 3D library (Open3D, PyRender, or trimesh) to create simple scenes:
    • Ground plane (textured)
    • 3-5 box primitives representing vehicles (different colors/sizes)
    • 1-2 cylinder primitives representing pedestrians
  2. Implement structured domain randomization:
    • Random object positions (on the ground plane)
    • Random camera position and orientation
    • Random lighting (direction and color)
    • Random object colors
  3. Render RGB images and depth maps.
  4. Auto-label: compute 2D bounding boxes from the known 3D object positions.
  5. Export in a simplified KITTI-like format:
    • images/000000.png
    • labels/000000.txt (class x1 y1 x2 y2)
    • depth/000000.png
  6. Generate 1,000 frames and train a simple object detector.

This exercise demonstrates that even a primitive synthetic pipeline with box-shaped "cars" can produce useful training signal when combined with domain randomization.


Interview Questions

Question 1: Why not just collect more real data?

Answer Hints: Cost ($6-12/frame), time (weeks of turnaround for labeling), class imbalance (cyclists appear in <2% of frames), safety (cannot safely capture near-collision scenarios), diversity (cannot control weather, time of day, rare configurations), label quality (human annotations have 3-5% error rates), and reproducibility (cannot regenerate identical conditions). Synthetic data addresses all of these limitations simultaneously. The key insight is that synthetic data is not a replacement for real data but a complement -- the optimal strategy uses both.

Question 2: What is the domain gap and why does it matter?

Answer Hints: The domain gap is the statistical difference between synthetic and real data distributions. It manifests in two forms: (1) data domain gap -- visual differences in textures, lighting, noise, and artifacts; (2) label domain gap -- differences in annotation style, bounding box tightness, and class definitions. It matters because a model trained on synthetic data that has a large domain gap will perform poorly on real data -- typically 10-30% mAP degradation. Mitigation techniques include domain randomization (make synthetic data diverse enough that real is just another variation), style transfer (FDA, CyCADA), and mixed training (fine-tune on real data).

Question 3: Explain Fourier Domain Adaptation. Why does swapping low-frequency amplitudes work?

Answer Hints: The 2D Fourier transform decomposes an image into frequency components. Low frequencies encode global patterns -- overall brightness, color palette, large-scale textures (the "style"). High frequencies encode local patterns -- edges, fine textures, object boundaries (the "content/structure"). By replacing the low-frequency amplitude of a synthetic image with that of a real image while keeping the phase intact, FDA transfers the real-world style (color distribution, brightness, global texture feel) without changing the spatial structure (object locations, edges, shapes). This works because: (1) labels depend on structure (high frequency), not style; (2) the domain gap is primarily a style difference; (3) phase carries more structural information than amplitude. The key parameter beta controls how much of the spectrum to swap -- typically 0.01-0.05.

Question 4: Compare structured vs. unstructured domain randomization.

Answer Hints: Structured DR constrains randomization to physically plausible ranges (sun angles within 0-85 degrees, car colors from real-world distributions, rain only when cloudy). Unstructured DR randomizes everything without constraints (random textures on roads, purple cars, lighting from below). Structured DR produces more realistic images and works better when the target domain is similar to the training distribution. Unstructured DR forces the model to rely on geometric/shape features rather than texture, which can generalize better to truly novel domains (Tobin et al., 2017). In practice, the best approach combines both: structured DR for most parameters with occasional unstructured elements to prevent overfitting to specific textures.

Question 5: How would you design a synthetic data generation pipeline for a new geographic region?

Answer Hints: (1) Collect a small reference dataset from the region (100-1000 frames) for style reference and validation. (2) Obtain or create region-specific assets: road markings, sign libraries (language, style), vehicle models (common makes/models in that region), pedestrian appearance models. (3) Import or generate road networks from local HD maps or OpenStreetMap. (4) Apply FDA or neural style transfer using the reference dataset for style adaptation. (5) Validate on the reference dataset -- measure domain gap metrics (FID, KID) and downstream task performance. (6) Iterate: identify failure modes, add missing asset types, adjust randomization ranges. (7) Generate large-scale data and fine-tune on the small real dataset.

Question 6: What metrics would you use to measure the quality of synthetic data?

Answer Hints: (1) Downstream task performance (most important): mAP, mIoU, NDS on real validation set after training on synthetic data. (2) Distribution metrics: FID (Frechet Inception Distance), KID (Kernel Inception Distance) between synthetic and real image sets -- lower is better. (3) Domain gap metrics: Maximum Mean Discrepancy (MMD) between feature distributions. (4) Per-class analysis: AP per class, especially for rare classes that synthetic data targets. (5) Ablation metrics: Performance gain from adding synthetic data vs. real-only baseline. (6) Label quality: Compare auto-labels to manual annotations on rendered real-scene reconstructions. (7) Diversity metrics: coverage of the parameter space (lighting, weather, actor configurations).
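
A minimal sketch of a (biased) RBF-kernel MMD estimate between synthetic and real feature sets; the features would typically come from a frozen backbone, and `bandwidth` is a free parameter, not a recommended value.

import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> float:
    """Squared MMD between feature sets x (n, d) and y (m, d) with a Gaussian kernel."""
    def rbf(a, b):
        # Pairwise squared distances, then Gaussian kernel values
        d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-d2 / (2.0 * bandwidth**2))

    return float(rbf(x, x).mean() + rbf(y, y).mean() - 2.0 * rbf(x, y).mean())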

Question 7: Explain the "lower bound guarantee" for synthetic data.

Answer Hints: The lower bound guarantee is the empirical finding that adding synthetic data to real data never hurts performance when using proper training strategies (mixed training or fine-tuning). The worst case is that synthetic data provides zero benefit (the model ignores it); the best case is significant improvement, especially for rare classes. Conditions for this guarantee: (1) the synthetic data must have reasonable diversity (domain randomization); (2) some real data must be included in training; (3) training must use mixed or fine-tuning strategies, not synthetic-only; (4) learning rate scheduling should account for the domain difference. This guarantee makes synthetic data a low-risk investment -- it can only help.

Question 8: How does auto-labeling in simulation work? What label types can be generated?

Answer Hints: Auto-labeling exploits the fact that the simulator has complete knowledge of the scene state. 2D bounding boxes: project the 8 corners of each object's 3D bounding box onto the image plane and compute the enclosing axis-aligned rectangle. 3D bounding boxes: read directly from the scene graph (position, dimensions, orientation). Semantic segmentation: render a special pass where each material/object class is assigned a unique color ID. Instance segmentation: similar to semantic but with unique IDs per object instance. Depth maps: read from the Z-buffer (rasterization) or compute ray intersection distances (ray tracing). Optical flow: compute per-pixel displacement between consecutive frames using object motion and camera motion. Surface normals: read from the geometry buffer during rendering. All labels are perfect (zero noise) and free (zero marginal cost).
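
A sketch of the 2D-box step, assuming a pinhole camera with intrinsic matrix K and the object's eight box corners already transformed into the camera frame (function and argument names are illustrative):

import numpy as np

def project_box_to_2d(corners_cam, K, image_size):
    """Project 8 3D box corners (camera frame) to an axis-aligned 2D box.

    corners_cam: (8, 3) array with +z pointing forward (in front of the camera).
    K: (3, 3) intrinsic matrix.
    image_size: (width, height) in pixels.
    Returns (x1, y1, x2, y2), or None if the box is entirely behind the camera.
    """
    in_front = corners_cam[:, 2] > 0.1
    if not np.any(in_front):
        return None
    pts = corners_cam[in_front]

    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy
    uv = (K @ pts.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    w, h = image_size
    x1, y1 = np.clip(uv.min(axis=0), [0, 0], [w - 1, h - 1])
    x2, y2 = np.clip(uv.max(axis=0), [0, 0], [w - 1, h - 1])
    return float(x1), float(y1), float(x2), float(y2)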

Question 9: You have a dataset with 50,000 real frames but only 200 contain cyclists. How would you use synthetic data to improve cyclist detection?

Answer Hints: (1) Analyze the gap: Determine what cyclist variations are missing (night riding, rain, different bike types, varying occlusion). (2) Generate targeted synthetic data: Create 20,000-50,000 synthetic frames with cyclists, systematically varying pose, lighting, weather, occlusion, distance, and context. (3) Apply domain adaptation: Use FDA with real images from the dataset as style references. (4) Mixed training: Train with all 50,000 real frames + synthetic cyclist frames. Apply higher loss weight to cyclist detections. (5) Evaluate carefully: Report cyclist AP separately from overall mAP. Evaluate at different distance ranges and occlusion levels. (6) Expected result: Cyclist AP should improve by 15-25+ percentage points while overall mAP remains stable or improves slightly. (7) Iterate: Analyze remaining failure modes and generate targeted synthetic data to address them.
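
The loss-weighting part of step (4) can be as simple as per-class weights in the criterion; a sketch, assuming the class-index order below matches your label mapping:

import torch
import torch.nn as nn

# Hypothetical class order: 0=Car, 1=Pedestrian, 2=Cyclist, 3=Truck
class_weights = torch.tensor([1.0, 1.5, 4.0, 1.0])  # upweight the rare cyclist class
criterion = nn.CrossEntropyLoss(weight=class_weights)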

Question 10: What are the limitations of synthetic data for AD perception?

Answer Hints: (1) Domain gap: Despite mitigation techniques, a gap remains. Purely synthetic training underperforms real training by 10-30%. (2) Asset quality ceiling: The realism of synthetic data is bounded by asset quality. Creating photorealistic 3D models is expensive. (3) Long-tail coverage: While synthetic data helps with known rare cases, it cannot generate truly unknown unknowns (scenarios you have never imagined). (4) Sensor model fidelity: Imperfect sensor models (especially for radar and LiDAR) introduce systematic biases. (5) Behavioral realism: NPC behavior in simulation may not match real-world human behavior (e.g., jaywalking patterns, aggressive driving). (6) Diminishing returns: Beyond a certain volume, adding more synthetic data provides diminishing benefit. (7) Validation requirement: You always need real data to validate -- synthetic data cannot validate itself. (8) Generative artifacts: GAN-based adaptation can introduce artifacts that fool detectors.


References

Foundational Papers

  1. Tobin, J. et al. "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IROS 2017. -- Introduced domain randomization as a technique to bridge the sim-to-real gap.

  2. Tremblay, J. et al. "Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization." CVPR Workshop 2018. -- Structured domain randomization for object detection.

  3. Prakash, A. et al. "Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data." ICRA 2019. -- Demonstrated that structured randomization outperforms unstructured for AD.

Domain Adaptation

  1. Yang, Y. and Soatto, S. "FDA: Fourier Domain Adaptation for Semantic Segmentation." CVPR 2020. -- The FDA method: simple, effective, training-free domain adaptation via frequency-space style transfer.

  2. Hoffman, J. et al. "CyCADA: Cycle-Consistent Adversarial Domain Adaptation." ICML 2018. -- Cycle-consistent adversarial training for domain adaptation with semantic consistency.

  3. Tsai, Y.-H. et al. "Learning to Adapt Structured Output Space for Semantic Segmentation." CVPR 2018. -- Output space adaptation for segmentation.

  4. Vu, T.-H. et al. "ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation." CVPR 2019. -- Entropy-based adversarial adaptation.

Synthetic Data for AD

  1. Ros, G. et al. "The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes." CVPR 2016. -- One of the first large-scale synthetic datasets for AD.

  2. Richter, S. et al. "Playing for Data: Ground Truth from Computer Games." ECCV 2016. -- Extracting training data from GTA-V.

  3. Dosovitskiy, A. et al. "CARLA: An Open Urban Driving Simulator." CoRL 2017. -- Open-source simulator widely used for synthetic data generation.

  4. Shah, S. et al. "AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles." FSR 2017. -- Microsoft's AirSim platform.

  5. Kar, A. et al. "Meta-Sim: Learning to Generate Synthetic Datasets." ICCV 2019. -- Learning to optimize synthetic data generation for downstream task performance.

Benchmarks and Datasets

  1. Caesar, H. et al. "nuScenes: A Multimodal Dataset for Autonomous Driving." CVPR 2020. -- The nuScenes benchmark, widely used for evaluating synthetic data approaches.

  2. Sun, P. et al. "Scalability in Perception for Autonomous Driving: Waymo Open Dataset." CVPR 2020. -- Large-scale real-world benchmark.

  3. Geiger, A. et al. "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite." CVPR 2012. -- Foundational AD benchmark.

Recent Advances (2023-2025)

  1. Gao, Y. et al. "MagicDrive: Street View Generation with Diverse 3D Geometry Control." ICLR 2024. -- Diffusion-based controllable street view generation for training data.

  2. Swerdlow, A. et al. "Street-View Image Generation from a Bird's-Eye View Layout." 2024. -- BEV-conditioned image generation for synthetic training data.

  3. Li, Y. et al. "GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation." ICLR 2024. -- Text-controlled generation of detection training data.

  4. Yang, Z. et al. "UniSim: A Neural Closed-Loop Sensor Simulator." CVPR 2023. -- Neural rendering for photorealistic sensor simulation at Waabi.

  5. Hu, A. et al. "GAIA-1: A Generative World Model for Autonomous Driving." 2023. -- Wayve's generative world model for simulation.

Industry Resources

  1. Applied Intuition. "Synthetic Datasets Product Documentation." -- Commercial synthetic data generation platform for AD.

  2. NVIDIA. "DRIVE Sim / Omniverse Replicator." -- NVIDIA's synthetic data generation platform.

  3. Parallel Domain. "Synthetic Data for Perception." -- Cloud-based synthetic data generation service.


This deep dive was written for software engineers preparing to work on synthetic data generation for autonomous driving perception. The techniques covered here -- domain randomization, FDA, CyCADA, mixed training strategies, and systematic evaluation -- form the practical toolkit for making synthetic data work in production ML pipelines. The key takeaway: synthetic data is not a replacement for real data, but a powerful multiplier. With 10% of the real data and proper synthetic augmentation, you can match or exceed models trained on 10x more real data alone -- while dramatically improving performance on rare, safety-critical classes.