Welcome back to our computer vision series! In Part 1, we explored image classification, learning to categorize entire images. Now we'll tackle object detection - a more complex task that requires not only identifying what objects are in an image, but also precisely locating where they are.

Object detection is fundamental to many real-world applications: autonomous driving, surveillance systems, medical imaging, and robotics. Unlike classification, detection models must handle variable numbers of objects per image and predict both their class labels and spatial locations.

From Classification to Detection

The key differences between classification and detection:

  • Classification: "What is in this image?" → Single label per image
  • Detection: "What objects are in this image and where are they?" → Multiple objects with bounding boxes

Understanding Bounding Boxes

Bounding boxes are typically represented as [x_min, y_min, x_max, y_max] or [x_center, y_center, width, height].

Python
import torch
import torchvision
import torchvision.transforms as transforms
from torchvision import datasets
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import numpy as np

def visualize_bbox(image, bbox, label, color='red'):
    """
    Visualize bounding box on image
    bbox format: [x_min, y_min, x_max, y_max]
    """
    fig, ax = plt.subplots(1, figsize=(8, 8))
    ax.imshow(image)

    # Create rectangle patch
    x_min, y_min, x_max, y_max = bbox
    width = x_max - x_min
    height = y_max - y_min

    rect = patches.Rectangle((x_min, y_min), width, height, 
                           linewidth=2, edgecolor=color, facecolor='none')
    ax.add_patch(rect)

    # Add label
    ax.text(x_min, y_min - 5, label, fontsize=12, color=color, 
            bbox=dict(boxstyle="round,pad=0.3", facecolor=color, alpha=0.7))

    ax.axis('off')
    plt.show()

# Example usage
# image = Image.open('example.jpg')
# bbox = [50, 30, 200, 180]  # [x_min, y_min, x_max, y_max]
# visualize_bbox(image, bbox, 'Cat')
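
Converting between the two box conventions mentioned above comes up constantly when mixing datasets and models. A small sketch using torchvision's built-in helper (box_convert is available in recent torchvision releases):

Python
from torchvision.ops import box_convert
import torch

boxes_xyxy = torch.tensor([[50., 30., 200., 180.]])        # [x_min, y_min, x_max, y_max]
boxes_cxcywh = box_convert(boxes_xyxy, 'xyxy', 'cxcywh')    # [x_center, y_center, width, height]
print(boxes_cxcywh)  # tensor([[125., 105., 150., 150.]])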

Dataset Preparation

We'll work with the COCO dataset format, which is the standard for object detection. For this tutorial, we'll create a simplified example using a subset of data.

Python
import json
from torch.utils.data import Dataset, DataLoader
import cv2

class COCODataset(Dataset):
    def __init__(self, root_dir, annotation_file, transforms=None):
        self.root_dir = root_dir
        self.transforms = transforms

        # Load COCO annotations
        with open(annotation_file, 'r') as f:
            self.coco = json.load(f)

        # Create mappings
        self.image_id_to_filename = {
            img['id']: img['file_name'] 
            for img in self.coco['images']
        }

        self.category_id_to_name = {
            cat['id']: cat['name'] 
            for cat in self.coco['categories']
        }

        # Group annotations by image
        self.image_annotations = {}
        for ann in self.coco['annotations']:
            image_id = ann['image_id']
            if image_id not in self.image_annotations:
                self.image_annotations[image_id] = []
            self.image_annotations[image_id].append(ann)

        self.image_ids = list(self.image_annotations.keys())

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]

        # Load image
        filename = self.image_id_to_filename[image_id]
        image_path = f"{self.root_dir}/{filename}"
        image = Image.open(image_path).convert('RGB')

        # Get annotations for this image
        annotations = self.image_annotations[image_id]

        boxes = []
        labels = []

        for ann in annotations:
            # COCO bbox format: [x, y, width, height]
            x, y, w, h = ann['bbox']
            # Convert to [x_min, y_min, x_max, y_max]
            boxes.append([x, y, x + w, y + h])
            labels.append(ann['category_id'])

        # Convert to tensors
        boxes = torch.FloatTensor(boxes)
        labels = torch.LongTensor(labels)

        target = {
            'boxes': boxes,
            'labels': labels,
            'image_id': torch.tensor([image_id])
        }

        if self.transforms:
            image, target = self.transforms(image, target)

        return image, target

# Custom transforms for detection
class DetectionTransforms:
    def __init__(self, train=True):
        self.train = train

    def __call__(self, image, target):
        # Convert PIL to tensor
        image = transforms.ToTensor()(image)

        if self.train:
            # Add data augmentation here
            # For now, just normalize
            image = transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )(image)

        return image, target

# Example dataset setup (you would use your actual data paths)
# train_dataset = COCODataset(
#     root_dir='path/to/images',
#     annotation_file='path/to/annotations.json',
#     transforms=DetectionTransforms(train=True)
# )
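
Detection batches can't be stacked into a single tensor because each image has a different number of boxes, so the DataLoader needs a custom collate function that keeps images and targets as parallel lists. A minimal sketch (the collate_fn name is just a local helper, not part of the dataset above):

Python
def collate_fn(batch):
    """Keep images and targets as parallel lists instead of stacking them."""
    return tuple(zip(*batch))

# train_loader = DataLoader(
#     train_dataset,
#     batch_size=4,
#     shuffle=True,
#     collate_fn=collate_fn
# )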

Detection Architectures

YOLO (You Only Look Once) - One-Stage Detector

YOLO divides the image into a grid and predicts bounding boxes and class probabilities directly.

Python
import torch.nn as nn

class YOLOv1(nn.Module):
    def __init__(self, num_classes=20, num_boxes=2):
        super(YOLOv1, self).__init__()
        self.num_classes = num_classes
        self.num_boxes = num_boxes

        # Backbone (simplified version of Darknet)
        self.backbone = nn.Sequential(
            # Conv Layer 1
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),

            # Conv Layer 2
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),

            # Conv Layer 3-4
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1),
            nn.Conv2d(256, 256, kernel_size=1, stride=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),

            # Conv Layer 5-8
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.Conv2d(512, 256, kernel_size=1, stride=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1),
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
        )

        # Detection head
        self.detection_head = nn.Sequential(
            nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
            nn.Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
        )

        # Final prediction layer
        # Each grid cell predicts: num_boxes * 5 + num_classes
        # 5 = (x, y, w, h, confidence)
        output_size = num_boxes * 5 + num_classes
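        # Note: the 7x7 grid below assumes a 224x224 input (the backbone above downsamples by 32x)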
        self.final_layer = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * 7 * 7, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, 7 * 7 * output_size)
        )

    def forward(self, x):
        x = self.backbone(x)
        x = self.detection_head(x)
        x = self.final_layer(x)

        # Reshape to (batch_size, 7, 7, num_boxes * 5 + num_classes)
        batch_size = x.size(0)
        output_size = self.num_boxes * 5 + self.num_classes
        x = x.view(batch_size, 7, 7, output_size)

        return x

# Initialize model
yolo_model = YOLOv1(num_classes=20, num_boxes=2)
print(f"YOLO model parameters: {sum(p.numel() for p in yolo_model.parameters()):,}")

Using Pre-trained Faster R-CNN

For practical applications, it's often better to use pre-trained models from torchvision:

Python
import torchvision.models as models
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

def create_faster_rcnn_model(num_classes):
    # Load a Faster R-CNN model pre-trained on COCO
    # (newer torchvision versions use weights="DEFAULT" instead of the deprecated pretrained=True)
    model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # Replace the classifier head for our number of classes
    # (background is class 0, so we need num_classes + 1)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = models.detection.faster_rcnn.FastRCNNPredictor(
        in_features, num_classes + 1
    )

    return model

# Create model for COCO dataset (80 classes + background)
faster_rcnn = create_faster_rcnn_model(num_classes=80)

def create_mobilenet_detector(num_classes):
    # Use MobileNetV3 backbone for faster inference
    backbone = models.mobilenet_v3_large(pretrained=True).features
    backbone.out_channels = 960

    # Define anchor generator
    anchor_generator = AnchorGenerator(
        sizes=((32, 64, 128, 256, 512),),
        aspect_ratios=((0.5, 1.0, 2.0),)
    )

    # Define ROI pooler
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(
        featmap_names=['0'],
        output_size=7,
        sampling_ratio=2
    )

    # Create the model
    model = FasterRCNN(
        backbone,
        num_classes=num_classes + 1,
        rpn_anchor_generator=anchor_generator,
        box_roi_pool=roi_pooler
    )

    return model

# Lightweight model for mobile deployment
mobile_detector = create_mobilenet_detector(num_classes=80)
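
It's worth knowing that torchvision's detection models behave differently depending on mode: in train mode they take images and targets and return a dict of losses, while in eval mode they take images alone and return per-image detections. A quick sketch with dummy data (random tensors, purely to show the two call signatures):

Python
images = [torch.rand(3, 300, 400)]
targets = [{'boxes': torch.tensor([[10., 20., 100., 150.]]),
            'labels': torch.tensor([1])}]

faster_rcnn.train()
loss_dict = faster_rcnn(images, targets)   # dict: loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
print(loss_dict.keys())

faster_rcnn.eval()
with torch.no_grad():
    detections = faster_rcnn(images)       # list of dicts with 'boxes', 'labels', 'scores'
print(detections[0]['boxes'].shape)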

Training Object Detection Models

YOLO Loss Function

The YOLO loss combines three terms: coordinate regression (weighted by lambda_coord), objectness confidence (with no-object cells down-weighted by lambda_noobj), and classification. The simplified implementation below treats every predicted box in an object cell as responsible for the object; the original paper assigns responsibility only to the box with the highest IoU against the ground truth.

Python
class YOLOLoss(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super(YOLOLoss, self).__init__()
        self.S = S  # Grid size
        self.B = B  # Number of bounding boxes per cell
        self.C = C  # Number of classes
        self.lambda_coord = 5
        self.lambda_noobj = 0.5

    def forward(self, predictions, targets):
        """
        predictions: (batch_size, S, S, B*5 + C)
        targets: (batch_size, S, S, B*5 + C)
        """
        batch_size = predictions.size(0)

        # Split predictions
        # Class probabilities
        class_pred = predictions[:, :, :, :self.C]

        # Bounding box predictions
        bbox_pred = predictions[:, :, :, self.C:].contiguous().view(
            batch_size, self.S, self.S, self.B, 5
        )

        # Split targets similarly
        class_target = targets[:, :, :, :self.C]
        bbox_target = targets[:, :, :, self.C:].contiguous().view(
            batch_size, self.S, self.S, self.B, 5
        )

        # Object mask (where objects exist)
        obj_mask = bbox_target[:, :, :, :, 4] > 0  # confidence > 0
        noobj_mask = bbox_target[:, :, :, :, 4] == 0

        # Coordinate loss (only for cells with objects)
        coord_loss = 0
        if obj_mask.sum() > 0:
            coord_pred = bbox_pred[obj_mask]  # [N, 5]
            coord_target = bbox_target[obj_mask]  # [N, 5]

            # xy loss
            xy_loss = torch.sum((coord_pred[:, :2] - coord_target[:, :2]) ** 2)

            # wh loss on square roots (de-emphasizes errors on large boxes);
            # clamp predictions to avoid NaNs from sqrt of negative values
            wh_loss = torch.sum((torch.sqrt(torch.clamp(coord_pred[:, 2:4], min=1e-6)) -
                                 torch.sqrt(coord_target[:, 2:4])) ** 2)

            coord_loss = self.lambda_coord * (xy_loss + wh_loss)

        # Confidence loss
        # Object confidence loss
        obj_conf_loss = 0
        if obj_mask.sum() > 0:
            obj_conf_pred = bbox_pred[obj_mask][:, 4]
            obj_conf_target = bbox_target[obj_mask][:, 4]
            obj_conf_loss = torch.sum((obj_conf_pred - obj_conf_target) ** 2)

        # No object confidence loss
        noobj_conf_loss = 0
        if noobj_mask.sum() > 0:
            noobj_conf_pred = bbox_pred[noobj_mask][:, 4]
            noobj_conf_loss = torch.sum(noobj_conf_pred ** 2)

        conf_loss = obj_conf_loss + self.lambda_noobj * noobj_conf_loss

        # Classification loss (only for cells with objects)
        class_loss = 0
        if torch.sum(bbox_target[:, :, :, 0, 4] > 0) > 0:
            obj_cells = bbox_target[:, :, :, 0, 4] > 0
            class_pred_obj = class_pred[obj_cells]
            class_target_obj = class_target[obj_cells]
            class_loss = torch.sum((class_pred_obj - class_target_obj) ** 2)

        total_loss = coord_loss + conf_loss + class_loss
        return total_loss / batch_size

# Initialize loss function
criterion = YOLOLoss(S=7, B=2, C=20)
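
A quick sanity check of the expected tensor layout (class scores first, then B boxes of [x, y, w, h, confidence] per cell), using random dummy tensors:

Python
dummy_pred = torch.rand(2, 7, 7, 2 * 5 + 20)
dummy_target = torch.zeros(2, 7, 7, 2 * 5 + 20)
# Put one object in cell (3, 3) of the first image: box [x, y, w, h] with confidence 1
dummy_target[0, 3, 3, 20:25] = torch.tensor([0.5, 0.5, 0.2, 0.3, 1.0])
print(criterion(dummy_pred, dummy_target))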

Training Loop for Detection

Python
def train_detection_model(model, train_loader, val_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    # For pre-trained models like Faster R-CNN, SGD with momentum works well;
    # only pass the parameters that actually require gradients
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(
        params,
        lr=0.005,
        momentum=0.9,
        weight_decay=0.0005
    )

    # Learning rate scheduler
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=3, gamma=0.1
    )

    for epoch in range(num_epochs):
        print(f'Epoch {epoch+1}/{num_epochs}')
        print('-' * 10)

        # Training phase
        model.train()
        running_loss = 0.0

        for images, targets in train_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            optimizer.zero_grad()

            # Forward pass
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())

            # Backward pass
            losses.backward()
            optimizer.step()

            running_loss += losses.item()

        scheduler.step()

        epoch_loss = running_loss / len(train_loader)
        print(f'Training Loss: {epoch_loss:.4f}')

        # Validation
        if val_loader:
            val_loss = evaluate_detection_model(model, val_loader, device)
            print(f'Validation Loss: {val_loss:.4f}')

    return model

def evaluate_detection_model(model, val_loader, device):
    # torchvision detection models only return the loss dict in train mode,
    # so we keep train mode here and simply disable gradients below
    model.train()
    running_loss = 0.0

    with torch.no_grad():
        for images, targets in val_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            running_loss += losses.item()

    return running_loss / len(val_loader)

# Train the model
# trained_model = train_detection_model(faster_rcnn, train_loader, val_loader, num_epochs=10)

Evaluation and Inference

Non-Maximum Suppression

Python
def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """
    Apply Non-Maximum Suppression to remove overlapping boxes
    """
    # Filter out low-confidence boxes
    mask = scores > score_threshold
    boxes = boxes[mask]
    scores = scores[mask]

    if len(boxes) == 0:
        return torch.empty((0, 4)), torch.empty((0,))

    # Sort by scores in descending order
    sorted_indices = torch.argsort(scores, descending=True)

    keep = []
    while len(sorted_indices) > 0:
        # Keep the box with highest score
        current = sorted_indices[0]
        keep.append(current.item())  # store a plain int index

        if len(sorted_indices) == 1:
            break

        # Calculate IoU with remaining boxes
        current_box = boxes[current].unsqueeze(0)
        remaining_boxes = boxes[sorted_indices[1:]]

        ious = calculate_iou(current_box, remaining_boxes)

        # Keep only boxes with IoU < threshold
        mask = ious < iou_threshold
        sorted_indices = sorted_indices[1:][mask]

    return boxes[keep], scores[keep]

def calculate_iou(boxes1, boxes2):
    """Calculate Intersection over Union (IoU)"""
    # Calculate intersection coordinates
    x1 = torch.max(boxes1[:, 0].unsqueeze(1), boxes2[:, 0].unsqueeze(0))
    y1 = torch.max(boxes1[:, 1].unsqueeze(1), boxes2[:, 1].unsqueeze(0))
    x2 = torch.min(boxes1[:, 2].unsqueeze(1), boxes2[:, 2].unsqueeze(0))
    y2 = torch.min(boxes1[:, 3].unsqueeze(1), boxes2[:, 3].unsqueeze(0))

    # Calculate intersection area
    intersection = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)

    # Calculate areas of both boxes
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])

    # Calculate union area
    union = area1.unsqueeze(1) + area2.unsqueeze(0) - intersection

    # Calculate IoU
    iou = intersection / union
    return iou
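
In practice, torchvision ships an optimized implementation of the same operation; a minimal sketch of the equivalent call:

Python
import torch
import torchvision

boxes = torch.tensor([[50., 30., 200., 180.],
                      [55., 35., 205., 185.],
                      [300., 300., 400., 400.]])
scores = torch.tensor([0.9, 0.75, 0.6])

keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)  # indices of kept boxes, sorted by score
print(boxes[keep], scores[keep])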

Inference Pipeline

Python
def detect_objects(model, image, device, confidence_threshold=0.7):
    """
    Perform object detection on a single image
    """
    model.eval()

    # Preprocess image
    transform = transforms.Compose([
        transforms.ToTensor(),
    ])

    if isinstance(image, Image.Image):
        image_tensor = transform(image).unsqueeze(0).to(device)
    else:
        image_tensor = image.unsqueeze(0).to(device)

    with torch.no_grad():
        predictions = model(image_tensor)

    # Extract predictions
    pred = predictions[0]  # First (and only) image in batch

    boxes = pred['boxes'].cpu()
    scores = pred['scores'].cpu()
    labels = pred['labels'].cpu()

    # Filter by confidence
    mask = scores > confidence_threshold
    boxes = boxes[mask]
    scores = scores[mask]
    labels = labels[mask]

    return boxes, scores, labels

def visualize_detections(image, boxes, scores, labels, class_names, 
                        confidence_threshold=0.5):
    """
    Visualize detection results on image
    """
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image)

    colors = plt.cm.Set3(np.linspace(0, 1, len(class_names)))

    for box, score, label in zip(boxes, scores, labels):
        if score < confidence_threshold:
            continue

        x_min, y_min, x_max, y_max = box
        width = x_max - x_min
        height = y_max - y_min

        # Draw bounding box
        color = colors[label % len(colors)]
        rect = patches.Rectangle((x_min, y_min), width, height,
                               linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)

        # Add label and score
        label_text = f'{class_names[label-1]}: {score:.2f}'
        ax.text(x_min, y_min - 5, label_text, fontsize=10, color=color,
                bbox=dict(boxstyle="round,pad=0.3", facecolor='white', alpha=0.7))

    ax.axis('off')
    plt.tight_layout()
    plt.show()

# COCO class names (80 contiguous classes; note that the off-the-shelf torchvision COCO
# checkpoints emit labels in the original 91-ID space with gaps, so this list lines up
# with a model fine-tuned on contiguous labels like the one created above)
COCO_CLASSES = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck',
    'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench',
    'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra',
    'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
    'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
    'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
    'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
    'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
    'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
    'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
    'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier',
    'toothbrush'
]

# Example usage
# image = Image.open('test_image.jpg')
# boxes, scores, labels = detect_objects(faster_rcnn, image, device)
# visualize_detections(image, boxes, scores, labels, COCO_CLASSES)

Mean Average Precision (mAP) Evaluation

mAP is the standard detection metric: for each class, detections are ranked by confidence, matched to ground truth at a chosen IoU threshold, and the area under the resulting precision-recall curve gives the Average Precision (AP); mAP is the mean of the per-class APs. The implementation below uses the classic 11-point interpolation from PASCAL VOC.

Python
def calculate_ap(recalls, precisions):
    """Calculate Average Precision using the 11-point method"""
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        if np.sum(recalls >= t) == 0:
            p = 0
        else:
            p = np.max(precisions[recalls >= t])
        ap = ap + p / 11.0
    return ap

def evaluate_detection_map(model, test_loader, device, iou_threshold=0.5):
    """
    Calculate mean Average Precision (mAP) for object detection
    """
    model.eval()
    all_detections = []
    all_ground_truths = []

    with torch.no_grad():
        for images, targets in test_loader:
            images = [img.to(device) for img in images]
            predictions = model(images)

            for i, (pred, target) in enumerate(zip(predictions, targets)):
                # Store predictions
                boxes = pred['boxes'].cpu().numpy()
                scores = pred['scores'].cpu().numpy()
                labels = pred['labels'].cpu().numpy()

                all_detections.append({
                    'boxes': boxes,
                    'scores': scores,
                    'labels': labels
                })

                # Store ground truth
                gt_boxes = target['boxes'].cpu().numpy()
                gt_labels = target['labels'].cpu().numpy()

                all_ground_truths.append({
                    'boxes': gt_boxes,
                    'labels': gt_labels
                })

    # Calculate mAP for each class
    num_classes = 80  # COCO classes
    aps = []

    for class_id in range(1, num_classes + 1):
        # Collect all detections and ground truths for this class
        class_detections = []
        class_ground_truths = []

        for i, (det, gt) in enumerate(zip(all_detections, all_ground_truths)):
            # Filter detections for this class
            mask = det['labels'] == class_id
            if np.sum(mask) > 0:
                class_detections.extend([{
                    'image_id': i,
                    'box': box,
                    'score': score
                } for box, score in zip(det['boxes'][mask], det['scores'][mask])])

            # Filter ground truths for this class
            gt_mask = gt['labels'] == class_id
            if np.sum(gt_mask) > 0:
                class_ground_truths.extend([{
                    'image_id': i,
                    'box': box
                } for box in gt['boxes'][gt_mask]])

        if len(class_detections) == 0:
            aps.append(0.0)
            continue

        # Sort detections by score
        class_detections.sort(key=lambda x: x['score'], reverse=True)

        # Calculate precision and recall
        tp = np.zeros(len(class_detections))
        fp = np.zeros(len(class_detections))

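        # Note: this simplified matcher never marks a ground-truth box as already matched,
        # so duplicate detections of the same object are not penalized as false positives
        # the way the full PASCAL VOC / COCO protocols do.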
        for i, detection in enumerate(class_detections):
            # Find matching ground truths in the same image
            gt_boxes = [gt['box'] for gt in class_ground_truths 
                       if gt['image_id'] == detection['image_id']]

            if len(gt_boxes) == 0:
                fp[i] = 1
                continue

            # Calculate IoU with all ground truth boxes
            ious = []
            for gt_box in gt_boxes:
                iou = calculate_iou_single(detection['box'], gt_box)
                ious.append(iou)

            max_iou = max(ious)
            if max_iou >= iou_threshold:
                tp[i] = 1
            else:
                fp[i] = 1

        # Calculate cumulative precision and recall
        tp_cumsum = np.cumsum(tp)
        fp_cumsum = np.cumsum(fp)

        recalls = tp_cumsum / len(class_ground_truths)
        precisions = tp_cumsum / (tp_cumsum + fp_cumsum)

        # Calculate AP
        ap = calculate_ap(recalls, precisions)
        aps.append(ap)

    return np.mean(aps), aps

def calculate_iou_single(box1, box2):
    """Calculate IoU between two boxes"""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    if x2 <= x1 or y2 <= y1:
        return 0.0

    intersection = (x2 - x1) * (y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union

# Evaluate model
# map_score, class_aps = evaluate_detection_map(faster_rcnn, test_loader, device)
# print(f'mAP: {map_score:.4f}')
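
For real projects it's usually better to lean on a maintained metric implementation (pycocotools or torchmetrics) rather than a hand-rolled mAP. A minimal sketch with torchmetrics, assuming a recent release (the exact import path has moved between versions):

Python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision()

preds = [{'boxes': torch.tensor([[50., 30., 200., 180.]]),
          'scores': torch.tensor([0.9]),
          'labels': torch.tensor([1])}]
targets = [{'boxes': torch.tensor([[48., 28., 198., 178.]]),
            'labels': torch.tensor([1])}]

metric.update(preds, targets)
print(metric.compute())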

Real-time Detection Pipeline

Python
import cv2

class RealTimeDetector:
    def __init__(self, model, device, class_names, confidence_threshold=0.7):
        self.model = model
        self.device = device
        self.class_names = class_names
        self.confidence_threshold = confidence_threshold
        self.model.eval()

    def detect_frame(self, frame):
        """Detect objects in a single frame"""
        # Convert BGR to RGB
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pil_image = Image.fromarray(rgb_frame)

        # Detect objects
        boxes, scores, labels = detect_objects(
            self.model, pil_image, self.device, self.confidence_threshold
        )

        return boxes, scores, labels

    def draw_detections(self, frame, boxes, scores, labels):
        """Draw detection results on frame"""
        colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), 
                 (255, 0, 255), (0, 255, 255)]

        for box, score, label in zip(boxes, scores, labels):
            x1, y1, x2, y2 = map(int, box)
            color = colors[label % len(colors)]

            # Draw bounding box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)

            # Draw label
            label_text = f'{self.class_names[label-1]}: {score:.2f}'
            label_size = cv2.getTextSize(label_text, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 2)[0]

            # Background for text
            cv2.rectangle(frame, (x1, y1 - label_size[1] - 10), 
                         (x1 + label_size[0], y1), color, -1)

            # Text
            cv2.putText(frame, label_text, (x1, y1 - 5), 
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

        return frame

    def run_webcam(self):
        """Run detection on webcam feed"""
        cap = cv2.VideoCapture(0)

        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Detect objects
            boxes, scores, labels = self.detect_frame(frame)

            # Draw results
            frame = self.draw_detections(frame, boxes, scores, labels)

            # Display
            cv2.imshow('Object Detection', frame)

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        cv2.destroyAllWindows()

# Example usage
# detector = RealTimeDetector(faster_rcnn, device, COCO_CLASSES)
# detector.run_webcam()  # Press 'q' to quit

Key Takeaways

In this second part of our computer vision series, we covered:

  1. Object Detection Fundamentals: Understanding bounding boxes and multi-object scenarios
  2. Detection Architectures: One-stage (YOLO) vs two-stage (Faster R-CNN) detectors
  3. Training Process: Custom loss functions and evaluation metrics
  4. Evaluation: Non-maximum suppression and mAP calculation
  5. Real-time Applications: Webcam detection pipeline

Best Practices for Object Detection:

  • Data Augmentation: Use detection-aware augmentations that transform both images and annotations together (see the sketch after this list)
  • Anchor Design: Choose appropriate anchor sizes and aspect ratios for your dataset
  • Multi-scale Training: Train and test at multiple image scales
  • Hard Negative Mining: Focus training on difficult examples
  • Model Ensemble: Combine multiple models for better performance
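
A small sketch of such a detection-aware augmentation pipeline using Albumentations (not used elsewhere in this tutorial, so treat it as an illustrative assumption; its bbox_params option transforms the boxes in lockstep with the image):

Python
import albumentations as A
import numpy as np

bbox_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
    ],
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']),
)

# augmented = bbox_transform(image=np.array(image),
#                            bboxes=[[50, 30, 200, 180]],
#                            labels=[1])
# augmented['image'] and augmented['bboxes'] stay consistent with each other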

Performance Optimization Tips:

  • Model Pruning: Remove redundant parameters for faster inference
  • Quantization: Use lower precision arithmetic (FP16, INT8)
  • TensorRT/ONNX: Convert models for optimized deployment
  • Batch Processing: Process multiple images simultaneously

In Part 3, we'll explore semantic segmentation, where we'll learn to classify every pixel in an image, providing even more detailed understanding of visual scenes.

Next Steps

  • Experiment with different detection architectures (RetinaNet, EfficientDet)
  • Try instance segmentation with Mask R-CNN
  • Implement custom datasets with your own object classes
  • Deploy models on mobile devices or edge hardware

Stay tuned for the final part of our series, where we'll dive into semantic segmentation!