Minority Class Augmentation

Use targeted synthetic data generation to rebalance long-tail class distributions in driving datasets and measurably improve detection of rare but safety-critical objects such as cyclists, construction workers, and animals.

Track: B -- Synthetic Data & Sensor Sim | Level: Beginner | Total Time: ~12-15 hours

Overview

Object detectors are only as good as the data they train on. In autonomous driving, the vast majority of labeled frames contain cars -- they make up 70-80% of annotated instances in most public datasets. Meanwhile, safety-critical categories like cyclists, construction workers, scooter riders, and animals together account for less than 5% of all labels. This is the long-tail distribution problem, and it has direct consequences: detectors trained on imbalanced data systematically underperform on rare classes, precisely the categories where a miss could be fatal.

The imbalance is not a labeling failure -- it reflects reality. On a typical highway drive you encounter hundreds of cars for every cyclist. But an autonomous vehicle must detect every cyclist with the same reliability it detects every car, because the consequences of a miss are equally severe. This creates a fundamental tension: real-world data collection naturally reproduces the long-tail distribution, so collecting more data does not fix the problem. You need a targeted intervention.

Synthetic data augmentation provides that intervention. Instead of waiting to encounter rare objects in the wild, you generate them programmatically and inject them into training images. The simplest version -- copy-paste augmentation -- extracts object instances from a small bank of source images and composites them onto training backgrounds. More sophisticated versions use placement heuristics (ground plane consistency, scale by depth, occlusion awareness) to make the composites more realistic and therefore more effective as training data.

In this project you will build a complete minority-class augmentation pipeline:

Analyze a driving dataset to quantify class imbalance and identify underrepresented categories.
Generate synthetic training samples by copy-pasting minority-class instances with context-aware placement.
Train an object detector on the original, augmented, and mixed datasets, then evaluate per-class AP to measure the improvement on minority classes.

Everything runs on CPU or a modest GPU. The dataset is COCO-format (you will create a synthetic driving subset for reproducibility), the detector is a torchvision Faster R-CNN with a ResNet-50 backbone, and all code is standard Python with PyTorch. No exotic dependencies, no cloud infrastructure, no multi-day training runs.

By the end you will have empirical evidence that even simple synthetic augmentation can lift minority-class AP by 10-25 percentage points, along with a reusable pipeline you can apply to any detection dataset with class imbalance.

Learning Objectives

After completing this project, you will be able to:

Quantify class imbalance -- compute instance counts, class frequencies, and imbalance ratios for an object detection dataset in COCO format.
Identify the long-tail -- visualize the class distribution, distinguish head/torso/tail classes, and explain why standard training degrades on rare categories.
Implement copy-paste augmentation -- extract object instances from source images using segmentation masks, composite them onto target backgrounds with alpha blending.
Apply placement heuristics -- enforce ground-plane consistency, depth-aware scaling, and occlusion-aware positioning so that pasted objects look spatially plausible.
Control appearance variation -- randomize brightness, contrast, horizontal flip, and color jitter on pasted instances to increase diversity.
Train and evaluate a detector -- fine-tune a Faster R-CNN on original and augmented datasets, compute overall mAP and per-class AP using COCO evaluation.
Analyze augmentation trade-offs -- plot performance vs. augmentation ratio, identify diminishing returns, and determine the optimal mixing strategy for a given dataset.
Draw safety-relevant conclusions -- relate per-class detection performance to real-world safety implications for autonomous driving.

Prerequisites

Required

Python basics -- comfortable with loops, functions, dictionaries, file I/O, and pip package management.
Object detection concepts -- understand what bounding boxes are, what AP (average precision) measures, and the general idea of anchor-based detectors (Faster R-CNN).
Dataset handling -- ability to load images, read JSON annotation files, and work with NumPy arrays and PIL images.

PyTorch basics -- familiarity with tensors, DataLoader, and the training loop pattern (forward pass, loss, backward pass, optimizer step).
Matplotlib -- ability to create plots, subplots, and annotate images.
COCO format -- having seen a COCO-style annotations.json file before (images, annotations, categories).

Deep Dive Reading

Before starting, read the companion deep dives for theoretical background:

Synthetic Data for AD Perception Training -- covers domain randomization, mixed training strategies, and the cost-benefit analysis of synthetic data in autonomous driving perception.

Key Concepts

The Long-Tail Distribution in Driving Datasets

Real driving datasets exhibit a power-law distribution of object categories. A small number of classes (cars, trucks) dominate, while most classes (cyclists, construction workers, animals, wheelchairs, scooters) appear rarely. This is not unique to driving -- it appears in virtually every real-world visual recognition dataset -- but in driving it has safety implications.

Typical Class Distribution in a Driving Dataset
================================================

Category            | Instances  | % of Total  | Frequency Tier
--------------------|------------|-------------|---------------
Car                 |    85,000  |    68.0%    | HEAD
Truck               |    12,000  |     9.6%    | HEAD
Pedestrian          |    10,500  |     8.4%    | TORSO
Bus                 |     5,200  |     4.2%    | TORSO
Motorcycle          |     3,800  |     3.0%    | TORSO
Traffic Cone        |     2,500  |     2.0%    | TORSO
Cyclist             |     1,800  |     1.4%    | TAIL   <-- safety-critical
Construction Worker |       900  |     0.7%    | TAIL   <-- safety-critical
Barrier             |     1,500  |     1.2%    | TAIL
Animal              |       300  |     0.2%    | TAIL   <-- safety-critical
Scooter             |       500  |     0.4%    | TAIL
Wheelchair          |       100  |     0.1%    | TAIL
--------------------|------------|-------------|
Total               |   125,000  |   100.0%    |

The imbalance ratio is the count of the most frequent class divided by the count of the least frequent class. In the table above: 85,000 / 100 = 850x. This means the model sees 850 cars for every wheelchair during training.

Standard cross-entropy loss treats every sample equally, so the gradient signal is dominated by the abundant classes. The model learns excellent car features and poor wheelchair features. At evaluation time, per-class AP for rare categories can be 20-40 points lower than for common categories.

Why More Data Does Not Fix It

A naive response is "collect more data." But collecting 850x more wheelchair data to match the car count is impractical. Even if you drove 850x more miles, you would collect 850x more car instances too, preserving the ratio. Targeted data collection (e.g., deliberately driving near construction sites) helps but is expensive and still cannot cover all rare categories.

Copy-Paste Augmentation

Copy-paste augmentation is a simple, effective technique introduced by Ghiasi et al. (2021) and Dwibedi et al. (2017). The idea:

Build an instance bank: Collect cropped instances of rare objects from the existing dataset (or external sources). Each instance consists of an RGB crop and a binary mask.
Select a target image: Pick a training image to augment.
Paste instances: Composite one or more rare-object instances onto the target image at valid locations.
Update annotations: Add bounding box annotations for the pasted instances.

Copy-Paste Pipeline
====================

Instance Bank              Target Image               Augmented Image
+------------------+       +------------------+       +------------------+
| [cyclist_01.png] |       |                  |       |                  |
| [cyclist_02.png] | ----> |  Original scene  | ----> | Scene + pasted   |
| [worker_01.png]  |       |  with cars only  |       | cyclist on road  |
| [worker_02.png]  |       |                  |       |                  |
+------------------+       +------------------+       +------------------+

Steps:
  1. Random select instance from bank
  2. Random transform (flip, scale, brightness)
  3. Choose placement location (heuristic)
  4. Alpha-blend onto target
  5. Add bbox annotation

The key advantage is simplicity: you need no 3D rendering engine, no domain randomization pipeline, no generative model. Just image compositing. Research shows this approach can improve rare-class AP by 5-25 points depending on the quality of the instance bank and the placement strategy.

Placement Heuristics

Naive random placement (paste the cyclist at a random pixel location) creates unrealistic images that confuse the detector. The cyclist might float in the sky, appear tiny in the foreground, or be pasted on top of a building. Better placement uses these heuristics:

Ground-plane consistency: Objects should appear to stand on the road or sidewalk surface. In a simple approximation, the bottom of the bounding box should align with a plausible ground line. For a driving scene, the ground line typically follows a linear or parabolic curve from the bottom of the image toward the horizon.

Depth-aware scaling: Objects farther from the camera appear smaller. If you paste a cyclist at a location near the horizon, it should be scaled down; near the bottom of the image, it should be larger. A simple model: scale = reference_height / (y_position_from_horizon).

Occlusion awareness: The pasted object should not overlap excessively with existing objects. A simple check: compute IoU between the proposed paste location and all existing bounding boxes; reject if IoU > threshold (e.g., 0.3).

def compute_scale_from_position(y_bottom, image_height,
                                 reference_height=80, horizon_y=0.35):
    """Estimate appropriate object scale based on vertical position.

    Objects near the bottom of the image (close to camera) should be larger.
    Objects near the horizon (far from camera) should be smaller.

    Args:
        y_bottom: Y coordinate of the object's bottom edge (pixels)
        image_height: Total image height (pixels)
        reference_height: Expected object height in pixels at y_bottom = image_height
        horizon_y: Horizon position as fraction of image height (0 = top)

    Returns:
        scale: Multiplicative scale factor for the instance
    """
    horizon_px = horizon_y * image_height
    # Distance from horizon (in pixels) -- larger means closer to camera
    dist_from_horizon = max(y_bottom - horizon_px, 1.0)
    max_dist = image_height - horizon_px

    # Linear scale model
    scale = reference_height * (dist_from_horizon / max_dist)
    return max(scale / reference_height, 0.1)  # clamp minimum

Evaluation Methodology

Standard object detection evaluation uses mean Average Precision (mAP) averaged across all classes. But mAP can mask poor performance on rare classes because the dominant classes (cars) pull the average up.

For minority-class augmentation, the right evaluation strategy is:

Overall mAP: sanity check that augmentation does not degrade general performance.
Per-class AP: the primary metric. Compare AP for each class between baseline and augmented models.
Tail-class mAP: average AP only over the tail classes (the ones you augmented).
Precision-recall curves: for tail classes, show how the trade-off shifts after augmentation.

The evaluation should use a held-out test set that was not augmented -- you want to measure performance on real data, not on synthetic composites.

Diminishing Returns and Optimal Mixing

Adding more synthetic instances helps up to a point, then performance plateaus or even decreases:

Per-Class AP vs. Augmentation Ratio (conceptual)
==================================================

AP (cyclist)
  |
  |                    .........plateau.........
  |               .....
  |           ....
  |        ...
  |      ..
  |    ..
  |  ..    <-- steep improvement zone
  | .
  |.
  +------------------------------------------- Augmentation ratio
  0x   1x   2x   5x   10x   20x   50x

Typical finding:
  - 0x -> 2x:  Rapid improvement (+10-20 AP points)
  - 2x -> 10x: Moderate improvement (+5-10 AP points)
  - >10x:      Diminishing returns, possible degradation

The optimal ratio depends on:

Instance bank diversity: If you only have 5 cyclist images, pasting them 100 times each causes overfitting. With 50+ diverse instances, higher ratios work better.
Placement quality: Realistic placement allows higher ratios before the model starts learning paste artifacts.
Domain gap: If the instance bank comes from a different domain (different camera, different country), the gap limits how much you can paste before the artifacts outweigh the benefit.

A practical strategy is to sweep ratios (1x, 2x, 5x, 10x) and pick the one that maximizes tail-class mAP without degrading head-class mAP by more than 1-2 points.

Step-by-Step Implementation Guide

Step 1: Environment Setup (30 min)

Goal: Install dependencies, create the project structure, and verify everything works.

1.1 Create the project

cd projects/track-b-synthetic-data
mkdir -p minority-class-augmentation/notebooks
cd minority-class-augmentation

python -m venv .venv
source .venv/bin/activate

1.2 Install dependencies

pip install torch torchvision numpy matplotlib Pillow tqdm ipykernel jupyter pandas seaborn scikit-learn opencv-python pycocotools

Package	Purpose
`torch`, `torchvision`	Detector model (Faster R-CNN), transforms, COCO evaluation
`numpy`	Array operations
`matplotlib`, `seaborn`	Visualization and plotting
`Pillow`	Image loading and manipulation
`opencv-python`	Image processing (resize, blend, contours)
`pandas`	Tabular data analysis
`scikit-learn`	Stratified splits, utility metrics
`pycocotools`	COCO evaluation (mAP, per-class AP)
`tqdm`	Progress bars
`ipykernel`, `jupyter`	Notebook environment

1.3 Register Jupyter kernel

python -m ipykernel install --user --name minority-aug --display-name "Minority Class Augmentation"

1.4 Verify setup

import torch
import torchvision
import numpy as np
import matplotlib
import cv2
from PIL import Image
from pycocotools.coco import COCO

print(f"PyTorch:      {torch.__version__}")
print(f"TorchVision:  {torchvision.__version__}")
print(f"NumPy:        {np.__version__}")
print(f"OpenCV:       {cv2.__version__}")
print(f"CUDA:         {torch.cuda.is_available()}")

# Quick model load test
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=10
)
print(f"Faster R-CNN: {sum(p.numel() for p in model.parameters()):,} parameters")
print("\nAll checks passed.")

1.5 Project structure

minority-class-augmentation/
  notebooks/
    01_dataset_analysis.ipynb        # Analyze class distribution
    02_synthetic_generation.ipynb    # Build copy-paste augmentation pipeline
    03_training_and_evaluation.ipynb # Train detector, measure improvement
  requirements.txt
  README.md

Step 2: Dataset Analysis and Class Distribution (Notebook 01, ~45 min)

Goal: Load a driving detection dataset, compute class statistics, visualize the long-tail distribution, and identify which classes need augmentation.

2.1 Create a synthetic driving dataset

For reproducibility and to avoid large downloads, we create a small synthetic dataset programmatically. The dataset mimics the structure of a real driving detection dataset with intentional class imbalance:

import numpy as np
from PIL import Image, ImageDraw
import json
import os

def create_synthetic_driving_dataset(
    output_dir: str,
    num_images: int = 200,
    image_size: tuple = (640, 480),
    seed: int = 42,
):
    """Generate a toy driving detection dataset with realistic class imbalance.

    Creates images with colored rectangles representing different object classes,
    with frequency distribution mimicking real driving data.
    """
    rng = np.random.RandomState(seed)

    # Class definitions with target frequencies (mimicking real distribution)
    classes = {
        1: {"name": "car",                 "color": (70, 130, 180),  "freq": 0.55, "size_range": (40, 80)},
        2: {"name": "truck",               "color": (255, 165, 0),   "freq": 0.12, "size_range": (60, 100)},
        3: {"name": "pedestrian",          "color": (220, 20, 60),   "freq": 0.10, "size_range": (20, 50)},
        4: {"name": "bus",                 "color": (255, 215, 0),   "freq": 0.06, "size_range": (80, 120)},
        5: {"name": "motorcycle",          "color": (148, 103, 189), "freq": 0.05, "size_range": (25, 45)},
        6: {"name": "cyclist",             "color": (44, 160, 44),   "freq": 0.04, "size_range": (20, 45)},
        7: {"name": "construction_worker", "color": (255, 127, 14),  "freq": 0.03, "size_range": (20, 50)},
        8: {"name": "traffic_cone",        "color": (214, 39, 40),   "freq": 0.03, "size_range": (10, 25)},
        9: {"name": "animal",              "color": (140, 86, 75),   "freq": 0.02, "size_range": (15, 40)},
    }
    # ... generate images and annotations in COCO format

The key design choice: realistic imbalance. Cars appear in almost every image (55% of all instances), while animals appear in only 2%. This mirrors the distribution in real datasets like nuScenes, Waymo, and KITTI.

2.2 Load and inspect the dataset

from pycocotools.coco import COCO

coco = COCO("data/annotations.json")

print(f"Images:      {len(coco.imgs)}")
print(f"Annotations: {len(coco.anns)}")
print(f"Categories:  {len(coco.cats)}")

# Per-class instance counts
for cat_id, cat_info in coco.cats.items():
    ann_ids = coco.getAnnIds(catIds=[cat_id])
    print(f"  {cat_info['name']:25s} {len(ann_ids):6d} instances")

2.3 Visualize the long-tail distribution

Create a bar chart sorted by frequency (descending) with clear visual separation between head, torso, and tail classes:

import matplotlib.pyplot as plt
import numpy as np

# Sort categories by count
categories = sorted(class_counts.items(), key=lambda x: x[1], reverse=True)
names = [c[0] for c in categories]
counts = [c[1] for c in categories]

# Color by tier
colors = []
for i, count in enumerate(counts):
    if count > np.percentile(counts, 75):
        colors.append("#2ecc71")   # head (green)
    elif count > np.percentile(counts, 25):
        colors.append("#f39c12")   # torso (orange)
    else:
        colors.append("#e74c3c")   # tail (red)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
axes[0].barh(names[::-1], counts[::-1], color=colors[::-1], edgecolor="white")
axes[0].set_xlabel("Instance Count")
axes[0].set_title("Class Distribution (Long-Tail)")

# Log-scale version
axes[1].barh(names[::-1], counts[::-1], color=colors[::-1], edgecolor="white")
axes[1].set_xscale("log")
axes[1].set_xlabel("Instance Count (log scale)")
axes[1].set_title("Class Distribution (Log Scale)")

plt.tight_layout()
plt.show()

2.4 Compute imbalance metrics

max_count = max(counts)
min_count = min(counts)
imbalance_ratio = max_count / min_count

print(f"Most frequent class:  {names[0]} ({max_count} instances)")
print(f"Least frequent class: {names[-1]} ({min_count} instances)")
print(f"Imbalance ratio:      {imbalance_ratio:.0f}x")
print(f"Gini coefficient:     {gini(counts):.3f}")

# Identify tail classes (bottom quartile)
threshold = np.percentile(counts, 25)
tail_classes = [n for n, c in zip(names, counts) if c <= threshold]
print(f"\nTail classes (need augmentation): {tail_classes}")

2.5 Visualize sample images with annotations

def visualize_annotated_image(coco, image_id, ax=None):
    """Draw an image with its bounding box annotations."""
    img_info = coco.imgs[image_id]
    img = Image.open(os.path.join("data/images", img_info["file_name"]))

    ann_ids = coco.getAnnIds(imgIds=[image_id])
    anns = coco.loadAnns(ann_ids)

    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(10, 7))

    ax.imshow(img)
    for ann in anns:
        x, y, w, h = ann["bbox"]
        cat_name = coco.cats[ann["category_id"]]["name"]
        rect = plt.Rectangle((x, y), w, h, linewidth=2,
                              edgecolor=CLASS_COLORS[cat_name],
                              facecolor="none")
        ax.add_patch(rect)
        ax.text(x, y - 3, cat_name, fontsize=8, color="white",
                bbox=dict(boxstyle="round,pad=0.2", facecolor=CLASS_COLORS[cat_name]))

    ax.axis("off")
    return ax

2.6 Per-image class co-occurrence analysis

Understand which classes tend to appear together. This informs placement decisions later:

# Build co-occurrence matrix
n_cats = len(coco.cats)
co_occurrence = np.zeros((n_cats, n_cats), dtype=int)

for img_id in coco.imgs:
    ann_ids = coco.getAnnIds(imgIds=[img_id])
    anns = coco.loadAnns(ann_ids)
    cats_in_image = list(set(a["category_id"] for a in anns))
    for i in cats_in_image:
        for j in cats_in_image:
            co_occurrence[i - 1][j - 1] += 1

# Plot heatmap
import seaborn as sns
cat_names = [coco.cats[i+1]["name"] for i in range(n_cats)]
sns.heatmap(co_occurrence, xticklabels=cat_names, yticklabels=cat_names,
            annot=True, fmt="d", cmap="YlOrRd")
plt.title("Class Co-occurrence (images containing both classes)")
plt.tight_layout()
plt.show()

Exercise (Notebook 01)

Compute the effective number of samples for each class using the formula from Cui et al. (2019): E_n = (1 - beta^n) / (1 - beta) where beta = (N - 1) / N and n is the instance count. How does this differ from raw counts?
Calculate how many synthetic instances of each tail class you would need to bring the imbalance ratio below 10x. Create a table showing the current count, target count, and number of synthetic instances needed.
Plot the cumulative distribution function (CDF) of instance counts. What percentage of total instances belong to the bottom 50% of classes?

Step 3: Synthetic Minority Generation (Notebook 02, ~60 min)

Goal: Build a copy-paste augmentation pipeline that extracts minority-class instances, applies appearance transforms, places them in target images using spatial heuristics, and produces a COCO-format augmented dataset.

3.1 Instance extraction

Extract individual object instances from the dataset. Each instance is stored as an RGB crop plus a binary mask:

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np
from PIL import Image

@dataclass
class ObjectInstance:
    """A cropped object instance with mask."""
    image: np.ndarray       # (H, W, 3) RGB crop (tight bbox)
    mask: np.ndarray        # (H, W) binary mask (1 = object, 0 = background)
    category_id: int
    category_name: str
    original_bbox: List[float]  # [x, y, w, h] in original image
    source_image_id: int

def extract_instances(coco, image_dir: str,
                      category_ids: List[int],
                      min_area: int = 200) -> List[ObjectInstance]:
    """Extract all instances of specified categories from the dataset.

    Args:
        coco: COCO API object
        image_dir: Path to image directory
        category_ids: Category IDs to extract
        min_area: Minimum bbox area to include (filters tiny objects)

    Returns:
        List of ObjectInstance objects
    """
    instances = []
    for cat_id in category_ids:
        ann_ids = coco.getAnnIds(catIds=[cat_id])
        anns = coco.loadAnns(ann_ids)

        for ann in anns:
            if ann["area"] < min_area:
                continue

            img_info = coco.imgs[ann["image_id"]]
            img = np.array(Image.open(
                os.path.join(image_dir, img_info["file_name"])
            ))

            x, y, w, h = [int(v) for v in ann["bbox"]]
            # Crop with small padding
            pad = 5
            x1 = max(0, x - pad)
            y1 = max(0, y - pad)
            x2 = min(img.shape[1], x + w + pad)
            y2 = min(img.shape[0], y + h + pad)

            crop = img[y1:y2, x1:x2].copy()

            # Create mask (rectangle approximation if no segmentation)
            mask = np.zeros((y2 - y1, x2 - x1), dtype=np.uint8)
            # Fill the object area within the crop
            obj_x1 = x - x1
            obj_y1 = y - y1
            mask[obj_y1:obj_y1+h, obj_x1:obj_x1+w] = 1

            instances.append(ObjectInstance(
                image=crop,
                mask=mask,
                category_id=cat_id,
                category_name=coco.cats[cat_id]["name"],
                original_bbox=ann["bbox"],
                source_image_id=ann["image_id"],
            ))

    return instances

3.2 Appearance augmentation

Apply random transforms to increase instance diversity:

import cv2

def augment_instance(instance: ObjectInstance, rng: np.random.RandomState
                     ) -> ObjectInstance:
    """Apply random appearance transforms to an object instance.

    Transforms applied:
      - Horizontal flip (50% chance)
      - Brightness adjustment (+/- 30%)
      - Contrast adjustment (+/- 20%)
      - Slight color jitter
    """
    img = instance.image.copy().astype(np.float32)
    mask = instance.mask.copy()

    # Horizontal flip
    if rng.random() > 0.5:
        img = img[:, ::-1, :].copy()
        mask = mask[:, ::-1].copy()

    # Brightness
    brightness = rng.uniform(0.7, 1.3)
    img = np.clip(img * brightness, 0, 255)

    # Contrast
    contrast = rng.uniform(0.8, 1.2)
    mean = img.mean()
    img = np.clip((img - mean) * contrast + mean, 0, 255)

    # Color jitter (per-channel)
    for c in range(3):
        jitter = rng.uniform(-15, 15)
        img[:, :, c] = np.clip(img[:, :, c] + jitter, 0, 255)

    return ObjectInstance(
        image=img.astype(np.uint8),
        mask=mask,
        category_id=instance.category_id,
        category_name=instance.category_name,
        original_bbox=instance.original_bbox,
        source_image_id=instance.source_image_id,
    )

3.3 Placement heuristics

Choose where to paste instances so they look spatially plausible:

@dataclass
class PlacementConfig:
    """Configuration for instance placement."""
    min_y_fraction: float = 0.3    # Don't paste above this (too close to sky)
    max_y_fraction: float = 0.95   # Don't paste below this (too close to bottom edge)
    min_x_fraction: float = 0.05   # Left margin
    max_x_fraction: float = 0.95   # Right margin
    max_overlap_iou: float = 0.2   # Max IoU with existing boxes
    horizon_y_fraction: float = 0.35  # Approximate horizon location
    reference_height_px: int = 80     # Expected height at bottom of image


def find_valid_placement(
    instance: ObjectInstance,
    target_size: Tuple[int, int],  # (width, height)
    existing_boxes: List[List[float]],  # [[x, y, w, h], ...]
    config: PlacementConfig,
    rng: np.random.RandomState,
    max_attempts: int = 50,
) -> Tuple[int, int, float]:
    """Find a valid (x, y, scale) placement for the instance.

    Returns:
        (paste_x, paste_y, scale) or None if no valid placement found.
    """
    W, H = target_size
    inst_h, inst_w = instance.image.shape[:2]

    for _ in range(max_attempts):
        # Sample a y position (bottom of the pasted object)
        y_bottom = rng.randint(
            int(H * config.min_y_fraction),
            int(H * config.max_y_fraction),
        )

        # Compute depth-aware scale
        horizon_px = config.horizon_y_fraction * H
        dist_from_horizon = max(y_bottom - horizon_px, 1.0)
        max_dist = H - horizon_px
        scale = (dist_from_horizon / max_dist)
        scale = max(scale, 0.15)  # minimum scale

        # Compute scaled instance dimensions
        scaled_h = int(inst_h * scale)
        scaled_w = int(inst_w * scale)

        if scaled_h < 10 or scaled_w < 5:
            continue

        # Sample x position
        x_min = int(W * config.min_x_fraction)
        x_max = int(W * config.max_x_fraction) - scaled_w
        if x_max <= x_min:
            continue
        paste_x = rng.randint(x_min, x_max)
        paste_y = y_bottom - scaled_h

        if paste_y < 0:
            continue

        # Check overlap with existing boxes
        proposed_box = [paste_x, paste_y, scaled_w, scaled_h]
        max_iou = max(
            [compute_iou_xywh(proposed_box, box) for box in existing_boxes],
            default=0.0,
        )

        if max_iou <= config.max_overlap_iou:
            return paste_x, paste_y, scale

    return None


def compute_iou_xywh(box1, box2):
    """Compute IoU between two [x, y, w, h] boxes."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[0] + box1[2], box2[0] + box2[2])
    y2 = min(box1[1] + box1[3], box2[1] + box2[3])

    if x2 <= x1 or y2 <= y1:
        return 0.0

    intersection = (x2 - x1) * (y2 - y1)
    area1 = box1[2] * box1[3]
    area2 = box2[2] * box2[3]
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0.0

3.4 Copy-paste compositing

Blend the instance onto the target image:

def paste_instance(
    target_image: np.ndarray,
    instance: ObjectInstance,
    paste_x: int,
    paste_y: int,
    scale: float,
    blend_edge: bool = True,
) -> Tuple[np.ndarray, List[float]]:
    """Paste an instance onto a target image with alpha blending.

    Args:
        target_image: (H, W, 3) target image (will be modified in-place)
        instance: The object instance to paste
        paste_x, paste_y: Top-left corner of paste location
        scale: Scale factor for the instance
        blend_edge: If True, feather the edges for smoother blending

    Returns:
        (modified_image, bbox) where bbox is [x, y, w, h]
    """
    # Resize instance
    h_orig, w_orig = instance.image.shape[:2]
    new_h = int(h_orig * scale)
    new_w = int(w_orig * scale)

    resized_img = cv2.resize(instance.image, (new_w, new_h),
                              interpolation=cv2.INTER_LINEAR)
    resized_mask = cv2.resize(instance.mask.astype(np.float32), (new_w, new_h),
                               interpolation=cv2.INTER_LINEAR)

    # Feather edges for smoother blending
    if blend_edge:
        kernel_size = max(3, int(min(new_h, new_w) * 0.05)) | 1  # ensure odd
        resized_mask = cv2.GaussianBlur(resized_mask, (kernel_size, kernel_size), 0)

    # Clip to image bounds
    H, W = target_image.shape[:2]
    src_y1 = max(0, -paste_y)
    src_x1 = max(0, -paste_x)
    src_y2 = min(new_h, H - paste_y)
    src_x2 = min(new_w, W - paste_x)

    dst_y1 = max(0, paste_y)
    dst_x1 = max(0, paste_x)
    dst_y2 = dst_y1 + (src_y2 - src_y1)
    dst_x2 = dst_x1 + (src_x2 - src_x1)

    # Alpha blend
    alpha = resized_mask[src_y1:src_y2, src_x1:src_x2, np.newaxis]
    target_image[dst_y1:dst_y2, dst_x1:dst_x2] = (
        alpha * resized_img[src_y1:src_y2, src_x1:src_x2] +
        (1 - alpha) * target_image[dst_y1:dst_y2, dst_x1:dst_x2]
    ).astype(np.uint8)

    # Return bounding box [x, y, w, h]
    bbox = [float(dst_x1), float(dst_y1),
            float(dst_x2 - dst_x1), float(dst_y2 - dst_y1)]

    return target_image, bbox

3.5 Full augmentation pipeline

Combine extraction, augmentation, placement, and compositing into a single function:

def augment_dataset(
    coco,
    image_dir: str,
    output_dir: str,
    instance_bank: List[ObjectInstance],
    target_category_ids: List[int],
    instances_per_image: Tuple[int, int] = (1, 3),
    augmentation_ratio: float = 1.0,
    seed: int = 42,
) -> dict:
    """Generate an augmented version of the dataset.

    Args:
        coco: COCO API object for the original dataset
        image_dir: Directory containing original images
        output_dir: Directory for augmented images and annotations
        instance_bank: List of extracted minority-class instances
        target_category_ids: Category IDs to augment
        instances_per_image: (min, max) instances to paste per image
        augmentation_ratio: Fraction of images to augment (1.0 = all images)
        seed: Random seed for reproducibility

    Returns:
        COCO-format annotation dict for the augmented dataset
    """
    rng = np.random.RandomState(seed)
    config = PlacementConfig()

    os.makedirs(os.path.join(output_dir, "images"), exist_ok=True)

    augmented_annotations = {
        "images": [],
        "annotations": [],
        "categories": list(coco.dataset["categories"]),
    }

    ann_id_counter = max(a["id"] for a in coco.dataset["annotations"]) + 1

    image_ids = list(coco.imgs.keys())
    n_to_augment = int(len(image_ids) * augmentation_ratio)
    augment_ids = set(rng.choice(image_ids, n_to_augment, replace=False))

    for img_id in tqdm(image_ids, desc="Augmenting"):
        img_info = coco.imgs[img_id]
        img = np.array(Image.open(os.path.join(image_dir, img_info["file_name"])))

        # Copy existing annotations
        ann_ids = coco.getAnnIds(imgIds=[img_id])
        existing_anns = coco.loadAnns(ann_ids)
        existing_boxes = [a["bbox"] for a in existing_anns]

        new_anns = []
        for ann in existing_anns:
            new_ann = dict(ann)
            new_anns.append(new_ann)

        # Paste minority instances
        if img_id in augment_ids:
            n_paste = rng.randint(*instances_per_image)
            for _ in range(n_paste):
                # Pick a random instance from the bank
                inst = rng.choice(instance_bank)
                inst = augment_instance(inst, rng)

                # Find valid placement
                placement = find_valid_placement(
                    inst, (img.shape[1], img.shape[0]),
                    existing_boxes, config, rng,
                )
                if placement is None:
                    continue

                paste_x, paste_y, scale = placement
                img, bbox = paste_instance(img, inst, paste_x, paste_y, scale)

                # Add annotation
                new_anns.append({
                    "id": ann_id_counter,
                    "image_id": img_id,
                    "category_id": inst.category_id,
                    "bbox": bbox,
                    "area": bbox[2] * bbox[3],
                    "iscrowd": 0,
                })
                existing_boxes.append(bbox)
                ann_id_counter += 1

        # Save augmented image
        out_path = os.path.join(output_dir, "images", img_info["file_name"])
        Image.fromarray(img).save(out_path)

        augmented_annotations["images"].append(dict(img_info))
        augmented_annotations["annotations"].extend(new_anns)

    return augmented_annotations

3.6 Visualize augmented samples

Show before/after comparisons to verify the augmentation looks reasonable:

fig, axes = plt.subplots(3, 2, figsize=(14, 15))

for i, img_id in enumerate(sample_image_ids[:3]):
    # Original
    draw_annotated_image(coco_original, img_id, axes[i, 0])
    axes[i, 0].set_title(f"Original (Image {img_id})")

    # Augmented
    draw_annotated_image(coco_augmented, img_id, axes[i, 1])
    axes[i, 1].set_title(f"Augmented (Image {img_id})")

plt.suptitle("Before / After Copy-Paste Augmentation", fontsize=14)
plt.tight_layout()
plt.show()

3.7 Verify augmented class distribution

# Compare distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original
axes[0].barh(cat_names, original_counts, color=original_colors)
axes[0].set_title("Original Distribution")
axes[0].set_xlabel("Instance Count")

# Augmented
axes[1].barh(cat_names, augmented_counts, color=augmented_colors)
axes[1].set_title("Augmented Distribution")
axes[1].set_xlabel("Instance Count")

plt.tight_layout()
plt.show()

# Print improvement
print("\nClass Distribution Comparison:")
print(f"{'Category':25s} {'Original':>10s} {'Augmented':>10s} {'Change':>10s}")
print("-" * 60)
for name, orig, aug in zip(cat_names, original_counts, augmented_counts):
    change = aug - orig
    print(f"{name:25s} {orig:10d} {aug:10d} {'+' + str(change) if change > 0 else str(change):>10s}")

Exercise (Notebook 02)

Context-aware placement scoring: Implement a function that scores candidate placement locations based on multiple criteria: (a) proximity to road area, (b) scale consistency with nearby objects, (c) lighting direction consistency. Use a weighted sum of these scores and pick the best placement instead of the first valid one.
Multi-instance pasting: Modify the pipeline to sometimes paste groups of instances (e.g., 2-3 cyclists riding together) with correlated placement. Group members should be at similar depth with small lateral offsets.
Augmentation visualization gallery: Create a 4x4 grid showing the same base image augmented 16 different ways (different instance selections, placements, and transforms). Assess visual quality by eye.

Step 4: Training and Evaluation (Notebook 03, ~60 min)

Goal: Train a Faster R-CNN detector on original and augmented datasets, measure overall mAP and per-class AP, and quantify the improvement on minority classes.

4.1 Dataset and DataLoader setup

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms as T
from pycocotools.coco import COCO
from PIL import Image

class CocoDetectionDataset(Dataset):
    """PyTorch Dataset for COCO-format object detection."""

    def __init__(self, image_dir: str, annotation_file: str, transforms=None):
        self.image_dir = image_dir
        self.coco = COCO(annotation_file)
        self.image_ids = list(self.coco.imgs.keys())
        self.transforms = transforms

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        img_id = self.image_ids[idx]
        img_info = self.coco.imgs[img_id]
        img = Image.open(
            os.path.join(self.image_dir, img_info["file_name"])
        ).convert("RGB")

        ann_ids = self.coco.getAnnIds(imgIds=[img_id])
        anns = self.coco.loadAnns(ann_ids)

        boxes = []
        labels = []
        for ann in anns:
            x, y, w, h = ann["bbox"]
            if w > 0 and h > 0:
                boxes.append([x, y, x + w, y + h])  # xyxy format
                labels.append(ann["category_id"])

        if len(boxes) == 0:
            boxes = torch.zeros((0, 4), dtype=torch.float32)
            labels = torch.zeros((0,), dtype=torch.int64)
        else:
            boxes = torch.as_tensor(boxes, dtype=torch.float32)
            labels = torch.as_tensor(labels, dtype=torch.int64)

        target = {
            "boxes": boxes,
            "labels": labels,
            "image_id": torch.tensor([img_id]),
        }

        if self.transforms:
            img = self.transforms(img)
        else:
            img = T.ToTensor()(img)

        return img, target


def collate_fn(batch):
    return tuple(zip(*batch))

4.2 Model setup

import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def create_model(num_classes: int, pretrained_backbone: bool = True):
    """Create a Faster R-CNN with custom number of classes.

    Args:
        num_classes: Number of classes (including background as class 0)
        pretrained_backbone: Use ImageNet-pretrained ResNet-50 backbone
    """
    # Load model with pretrained backbone
    model = fasterrcnn_resnet50_fpn(
        weights="DEFAULT" if pretrained_backbone else None,
    )

    # Replace the classification head
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

# Create model: 9 object classes + 1 background = 10
model = create_model(num_classes=10, pretrained_backbone=True)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

4.3 Training loop

def train_one_epoch(model, data_loader, optimizer, device, epoch):
    """Train for one epoch."""
    model.train()
    total_loss = 0.0

    pbar = tqdm(data_loader, desc=f"Epoch {epoch}")
    for images, targets in pbar:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        total_loss += losses.item()
        pbar.set_postfix({"loss": f"{losses.item():.4f}"})

    return total_loss / len(data_loader)


def train_detector(
    model,
    train_loader: DataLoader,
    num_epochs: int = 10,
    lr: float = 0.005,
    device: str = "cpu",
):
    """Full training procedure."""
    model.to(device)

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

    history = []
    for epoch in range(1, num_epochs + 1):
        loss = train_one_epoch(model, train_loader, optimizer, device, epoch)
        lr_scheduler.step()
        history.append({"epoch": epoch, "loss": loss})
        print(f"  Epoch {epoch}: loss = {loss:.4f}")

    return history

4.4 COCO evaluation

from pycocotools.cocoeval import COCOeval

def evaluate_model(model, data_loader, coco_gt, device="cpu"):
    """Evaluate model using COCO metrics.

    Returns:
        results: dict with overall mAP and per-class AP
    """
    model.eval()
    model.to(device)

    all_predictions = []

    with torch.no_grad():
        for images, targets in tqdm(data_loader, desc="Evaluating"):
            images = [img.to(device) for img in images]
            outputs = model(images)

            for target, output in zip(targets, outputs):
                image_id = target["image_id"].item()
                boxes = output["boxes"].cpu().numpy()
                scores = output["scores"].cpu().numpy()
                labels = output["labels"].cpu().numpy()

                for box, score, label in zip(boxes, scores, labels):
                    x1, y1, x2, y2 = box
                    all_predictions.append({
                        "image_id": image_id,
                        "category_id": int(label),
                        "bbox": [float(x1), float(y1),
                                 float(x2 - x1), float(y2 - y1)],
                        "score": float(score),
                    })

    if not all_predictions:
        print("No predictions generated!")
        return None

    # Run COCO evaluation
    coco_dt = coco_gt.loadRes(all_predictions)
    coco_eval = COCOeval(coco_gt, coco_dt, "bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()

    # Extract per-class AP
    per_class_ap = {}
    for cat_id in coco_gt.getCatIds():
        coco_eval_cls = COCOeval(coco_gt, coco_dt, "bbox")
        coco_eval_cls.params.catIds = [cat_id]
        coco_eval_cls.evaluate()
        coco_eval_cls.accumulate()
        coco_eval_cls.summarize()
        cat_name = coco_gt.cats[cat_id]["name"]
        per_class_ap[cat_name] = coco_eval_cls.stats[0]  # AP @ IoU=0.50:0.95

    return {
        "mAP": coco_eval.stats[0],
        "mAP_50": coco_eval.stats[1],
        "mAP_75": coco_eval.stats[2],
        "per_class_ap": per_class_ap,
    }

4.5 Three-experiment comparison

Run the core experiment: train on original data, augmented data, and mixed data at different ratios:

experiments = {
    "baseline": {
        "train_images": "data/train/images",
        "train_annotations": "data/train/annotations.json",
        "description": "Original dataset (no augmentation)",
    },
    "augmented_2x": {
        "train_images": "data/augmented_2x/images",
        "train_annotations": "data/augmented_2x/annotations.json",
        "description": "2x minority class augmentation",
    },
    "augmented_5x": {
        "train_images": "data/augmented_5x/images",
        "train_annotations": "data/augmented_5x/annotations.json",
        "description": "5x minority class augmentation",
    },
    "augmented_10x": {
        "train_images": "data/augmented_10x/images",
        "train_annotations": "data/augmented_10x/annotations.json",
        "description": "10x minority class augmentation",
    },
}

results = {}
for exp_name, exp_config in experiments.items():
    print(f"\n{'='*60}")
    print(f"Experiment: {exp_name}")
    print(f"{'='*60}")

    # Create dataset and loader
    train_dataset = CocoDetectionDataset(
        exp_config["train_images"],
        exp_config["train_annotations"],
    )
    train_loader = DataLoader(
        train_dataset, batch_size=4, shuffle=True,
        collate_fn=collate_fn, num_workers=0,
    )

    # Train
    model = create_model(num_classes=10)
    history = train_detector(model, train_loader, num_epochs=8, device=device)

    # Evaluate on original (non-augmented) test set
    eval_results = evaluate_model(model, test_loader, coco_test, device=device)
    results[exp_name] = eval_results

    print(f"\n{exp_name} Results:")
    print(f"  Overall mAP:  {eval_results['mAP']:.4f}")
    print(f"  mAP@50:       {eval_results['mAP_50']:.4f}")
    for cls_name, ap in eval_results["per_class_ap"].items():
        print(f"  AP({cls_name:20s}): {ap:.4f}")

4.6 Results visualization

# Per-class AP comparison bar chart
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

cat_names = list(results["baseline"]["per_class_ap"].keys())
x = np.arange(len(cat_names))
width = 0.2

for i, (exp_name, exp_results) in enumerate(results.items()):
    aps = [exp_results["per_class_ap"].get(c, 0) for c in cat_names]
    axes[0].bar(x + i * width, aps, width, label=exp_name, alpha=0.85)

axes[0].set_xticks(x + width * 1.5)
axes[0].set_xticklabels(cat_names, rotation=30, ha="right")
axes[0].set_ylabel("AP @ IoU=0.50:0.95")
axes[0].set_title("Per-Class AP Comparison Across Experiments")
axes[0].legend()
axes[0].grid(axis="y", alpha=0.3)

# Improvement over baseline (tail classes only)
tail_classes = ["cyclist", "construction_worker", "animal"]
baseline_aps = [results["baseline"]["per_class_ap"][c] for c in tail_classes]

for exp_name in ["augmented_2x", "augmented_5x", "augmented_10x"]:
    improvements = [
        results[exp_name]["per_class_ap"][c] - results["baseline"]["per_class_ap"][c]
        for c in tail_classes
    ]
    axes[1].bar(
        [f"{c}\n({exp_name})" for c in tail_classes],
        improvements,
        alpha=0.85,
    )

axes[1].axhline(y=0, color="black", linewidth=0.5)
axes[1].set_ylabel("AP Improvement over Baseline")
axes[1].set_title("Minority Class AP Improvement by Augmentation Ratio")
axes[1].grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

4.7 Augmentation ratio sweep analysis

# Plot AP vs augmentation ratio for each tail class
ratios = [0, 2, 5, 10]
ratio_labels = ["baseline", "augmented_2x", "augmented_5x", "augmented_10x"]

fig, ax = plt.subplots(1, 1, figsize=(10, 6))

for cls_name in tail_classes:
    aps = [results[r]["per_class_ap"][cls_name] for r in ratio_labels]
    ax.plot(ratios, aps, marker="o", linewidth=2, markersize=8, label=cls_name)

ax.set_xlabel("Augmentation Ratio (x)")
ax.set_ylabel("AP @ IoU=0.50:0.95")
ax.set_title("Minority Class AP vs. Augmentation Ratio")
ax.legend()
ax.grid(alpha=0.3)
ax.set_xticks(ratios)
ax.set_xticklabels(["0x\n(baseline)", "2x", "5x", "10x"])

plt.tight_layout()
plt.show()

4.8 Statistical summary

# Summary table
print("\n" + "=" * 80)
print("RESULTS SUMMARY")
print("=" * 80)

header = f"{'Experiment':20s} | {'mAP':>8s} | {'mAP@50':>8s} | "
header += " | ".join(f"{c:>12s}" for c in cat_names)
print(header)
print("-" * len(header))

for exp_name, exp_results in results.items():
    row = f"{exp_name:20s} | {exp_results['mAP']:8.4f} | {exp_results['mAP_50']:8.4f} | "
    row += " | ".join(
        f"{exp_results['per_class_ap'][c]:12.4f}" for c in cat_names
    )
    print(row)

# Tail-class summary
print("\n\nTail-Class Summary:")
print(f"{'Class':25s} {'Baseline AP':>12s} {'Best Aug AP':>12s} {'Improvement':>12s} {'Best Ratio':>12s}")
print("-" * 75)

for cls_name in tail_classes:
    baseline_ap = results["baseline"]["per_class_ap"][cls_name]
    best_ap = baseline_ap
    best_ratio = "baseline"
    for exp_name in ["augmented_2x", "augmented_5x", "augmented_10x"]:
        ap = results[exp_name]["per_class_ap"][cls_name]
        if ap > best_ap:
            best_ap = ap
            best_ratio = exp_name
    improvement = best_ap - baseline_ap
    print(f"{cls_name:25s} {baseline_ap:12.4f} {best_ap:12.4f} {improvement:+12.4f} {best_ratio:>12s}")

4.9 Qualitative examples

Show detection results on sample images, highlighting where the augmented model succeeds and the baseline model fails:

def visualize_detections(model, image, coco_gt, image_id, device,
                          score_threshold=0.5, ax=None):
    """Show model predictions on an image."""
    model.eval()
    img_tensor = T.ToTensor()(image).unsqueeze(0).to(device)

    with torch.no_grad():
        predictions = model(img_tensor)[0]

    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(10, 7))

    ax.imshow(image)

    for box, score, label in zip(
        predictions["boxes"].cpu(),
        predictions["scores"].cpu(),
        predictions["labels"].cpu(),
    ):
        if score < score_threshold:
            continue
        x1, y1, x2, y2 = box.numpy()
        cat_name = coco_gt.cats[label.item()]["name"]

        rect = plt.Rectangle(
            (x1, y1), x2 - x1, y2 - y1,
            linewidth=2, edgecolor=CLASS_COLORS.get(cat_name, "white"),
            facecolor="none",
        )
        ax.add_patch(rect)
        ax.text(
            x1, y1 - 3, f"{cat_name} {score:.2f}",
            fontsize=8, color="white",
            bbox=dict(boxstyle="round,pad=0.2",
                      facecolor=CLASS_COLORS.get(cat_name, "gray")),
        )

    ax.axis("off")
    return ax


# Compare baseline vs best augmented model on test images containing tail classes
fig, axes = plt.subplots(3, 2, figsize=(16, 18))
for i, img_id in enumerate(tail_class_test_images[:3]):
    img = load_test_image(img_id)
    visualize_detections(baseline_model, img, coco_test, img_id, device, ax=axes[i, 0])
    axes[i, 0].set_title(f"Baseline Model (Image {img_id})")

    visualize_detections(best_augmented_model, img, coco_test, img_id, device, ax=axes[i, 1])
    axes[i, 1].set_title(f"Augmented Model (Image {img_id})")

plt.suptitle("Detection Comparison: Baseline vs. Augmented", fontsize=14)
plt.tight_layout()
plt.show()

Exercise (Notebook 03)

Class-weighted loss: Instead of augmenting the data, try using a class-weighted loss function where rare classes have higher loss weights. Compare the results with the copy-paste approach. Which works better for which classes?
Combined approach: Use both data augmentation AND class-weighted loss together. Does the combination outperform either technique alone?
Head-class impact analysis: Check whether augmenting minority classes degrades performance on head classes (cars, trucks). Plot head-class AP across experiments to verify that the augmentation is not simply trading off performance.

Extensions and Next Steps

Intermediate Extensions

Segmentation-aware copy-paste: Instead of rectangular crops, use instance segmentation masks for more accurate compositing. This eliminates the rectangular artifact around pasted objects.
Style transfer blending: Apply neural style transfer between the pasted instance and the target image so that lighting and color palette match the target scene.
Hard negative mining: Identify the specific images/scenarios where the baseline model fails on minority classes, then preferentially augment those scenarios.

Advanced Extensions

Generative augmentation: Use a generative model (e.g., stable diffusion with ControlNet) to inpaint minority-class objects into scenes instead of copy-pasting. This avoids compositing artifacts entirely.
3D-aware augmentation: Use estimated depth maps and camera geometry to place synthetic objects with physically correct perspective and occlusion.
Active learning loop: Train a detector, find its worst-performing minority-class scenarios, generate augmented data targeting those scenarios, retrain, and iterate.

Synthetic Data Pipeline -- build a full 3D rendering pipeline that can generate photorealistic minority-class data.
Domain Adaptation Benchmark -- bridge the gap between synthetic and real data using domain adaptation techniques.

References

Ghiasi, G., Cui, Y., Srinivas, A., et al. (2021). "Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation." CVPR 2021.
Dwibedi, D., Misra, I., Hebert, M. (2017). "Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection." ICCV 2017.
Cui, Y., Jia, M., Lin, T.Y., et al. (2019). "Class-Balanced Loss Based on Effective Number of Samples." CVPR 2019.
Gupta, A., Dollar, P., Girshick, R. (2019). "LVIS: A Dataset for Large Vocabulary Instance Segmentation." CVPR 2019.
Caesar, H., Bankiti, V., Lang, A.H., et al. (2020). "nuScenes: A Multimodal Dataset for Autonomous Driving." CVPR 2020.
Sun, P., Kretzschmar, H., Dotiwalla, X., et al. (2020). "Scalability in Perception for Autonomous Driving: Waymo Open Dataset." CVPR 2020.

Appendix: COCO Annotation Format Reference

For this project, all datasets use the COCO detection format:

{
    "images": [
        {"id": 1, "file_name": "000001.png", "width": 640, "height": 480}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100.0, 200.0, 50.0, 80.0],
            "area": 4000.0,
            "iscrowd": 0
        }
    ],
    "categories": [
        {"id": 1, "name": "car", "supercategory": "vehicle"},
        {"id": 2, "name": "truck", "supercategory": "vehicle"},
        {"id": 3, "name": "pedestrian", "supercategory": "person"},
        {"id": 4, "name": "bus", "supercategory": "vehicle"},
        {"id": 5, "name": "motorcycle", "supercategory": "vehicle"},
        {"id": 6, "name": "cyclist", "supercategory": "person"},
        {"id": 7, "name": "construction_worker", "supercategory": "person"},
        {"id": 8, "name": "traffic_cone", "supercategory": "object"},
        {"id": 9, "name": "animal", "supercategory": "animal"}
    ]
}

Key fields:

bbox is in [x, y, width, height] format (top-left corner + dimensions)
area is the bounding box area in pixels
category_id references the id field in the categories list
iscrowd is 0 for individual objects (we do not use crowd annotations in this project)