Welcome back to our computer vision series! In Part 1, we explored image classification, learning to categorize entire images. Now we'll tackle object detection - a more complex task that requires not only identifying what objects are in an image, but also precisely locating where they are.
Object detection is fundamental to many real-world applications: autonomous driving, surveillance systems, medical imaging, and robotics. Unlike classification, detection models must handle variable numbers of objects per image and predict both their class labels and spatial locations.
From Classification to Detection
The key differences between classification and detection:
- Classification: "What is in this image?" → Single label per image
- Detection: "What objects are in this image and where are they?" → Multiple objects with bounding boxes
Understanding Bounding Boxes
Bounding boxes are typically represented either as corner coordinates [x_min, y_min, x_max, y_max] or as center coordinates [x_center, y_center, width, height].
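The two formats are easy to convert between. Here is a small helper sketch (the function names are my own, not from any library) that converts a corner-format box to center format and back:

def xyxy_to_cxcywh(box):
    """Convert [x_min, y_min, x_max, y_max] to [x_center, y_center, width, height]."""
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    return [x_min + w / 2, y_min + h / 2, w, h]

def cxcywh_to_xyxy(box):
    """Convert [x_center, y_center, width, height] to [x_min, y_min, x_max, y_max]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

# Example: a 150x150 box whose top-left corner is at (50, 30)
# xyxy_to_cxcywh([50, 30, 200, 180]) -> [125.0, 105.0, 150, 150]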
import torch
import torchvision
import torchvision.transforms as transforms
from torchvision import datasets
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import numpy as np
def visualize_bbox(image, bbox, label, color='red'):
    """
    Visualize a bounding box on an image.
    bbox format: [x_min, y_min, x_max, y_max]
    """
    fig, ax = plt.subplots(1, figsize=(8, 8))
    ax.imshow(image)

    # Create rectangle patch
    x_min, y_min, x_max, y_max = bbox
    width = x_max - x_min
    height = y_max - y_min
    rect = patches.Rectangle((x_min, y_min), width, height,
                             linewidth=2, edgecolor=color, facecolor='none')
    ax.add_patch(rect)

    # Add label
    ax.text(x_min, y_min - 5, label, fontsize=12, color=color,
            bbox=dict(boxstyle="round,pad=0.3", facecolor=color, alpha=0.7))

    ax.axis('off')
    plt.show()

# Example usage
# image = Image.open('example.jpg')
# bbox = [50, 30, 200, 180]  # [x_min, y_min, x_max, y_max]
# visualize_bbox(image, bbox, 'Cat')
Dataset Preparation
We'll work with the COCO dataset format, which is the standard for object detection. For this tutorial, we'll create a simplified example using a subset of data.
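For reference, a COCO annotation file is a single JSON document with three lists the loader below relies on: images, annotations, and categories. A minimal, hypothetical example of that structure (values are illustrative only):

# Minimal, hypothetical sketch of the COCO annotation structure expected below
coco_example = {
    "images": [
        {"id": 1, "file_name": "000000001.jpg", "width": 640, "height": 480}
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixel coordinates
        {"id": 10, "image_id": 1, "category_id": 18, "bbox": [50, 30, 150, 150]}
    ],
    "categories": [
        {"id": 18, "name": "dog"}
    ]
}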
import json
from torch.utils.data import Dataset, DataLoader
import cv2

class COCODataset(Dataset):
    def __init__(self, root_dir, annotation_file, transforms=None):
        self.root_dir = root_dir
        self.transforms = transforms

        # Load COCO annotations
        with open(annotation_file, 'r') as f:
            self.coco = json.load(f)

        # Create mappings
        self.image_id_to_filename = {
            img['id']: img['file_name']
            for img in self.coco['images']
        }
        self.category_id_to_name = {
            cat['id']: cat['name']
            for cat in self.coco['categories']
        }

        # Group annotations by image
        self.image_annotations = {}
        for ann in self.coco['annotations']:
            image_id = ann['image_id']
            if image_id not in self.image_annotations:
                self.image_annotations[image_id] = []
            self.image_annotations[image_id].append(ann)

        self.image_ids = list(self.image_annotations.keys())

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]

        # Load image
        filename = self.image_id_to_filename[image_id]
        image_path = f"{self.root_dir}/{filename}"
        image = Image.open(image_path).convert('RGB')

        # Get annotations for this image
        annotations = self.image_annotations[image_id]
        boxes = []
        labels = []
        for ann in annotations:
            # COCO bbox format: [x, y, width, height]
            x, y, w, h = ann['bbox']
            # Convert to [x_min, y_min, x_max, y_max]
            boxes.append([x, y, x + w, y + h])
            labels.append(ann['category_id'])

        # Convert to tensors
        boxes = torch.FloatTensor(boxes)
        labels = torch.LongTensor(labels)

        target = {
            'boxes': boxes,
            'labels': labels,
            'image_id': torch.tensor([image_id])
        }

        if self.transforms:
            image, target = self.transforms(image, target)

        return image, target
# Custom transforms for detection
class DetectionTransforms:
    def __init__(self, train=True):
        self.train = train

    def __call__(self, image, target):
        # Convert PIL to tensor
        image = transforms.ToTensor()(image)

        # Normalize consistently in both training and validation
        image = transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )(image)

        if self.train:
            # Add detection-aware data augmentation here
            # (remember to transform the boxes in `target` as well)
            pass

        return image, target
# Example dataset setup (you would use your actual data paths)
# train_dataset = COCODataset(
# root_dir='path/to/images',
# annotation_file='path/to/annotations.json',
# transforms=DetectionTransforms(train=True)
# )
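Because each image can contain a different number of boxes, the default DataLoader collation (which stacks tensors of equal shape) will fail. Detection pipelines typically use a collate function that simply zips the batch into tuples of images and targets; a minimal sketch (the function name is my own):

def detection_collate_fn(batch):
    """Keep images and targets as tuples instead of stacking them,
    since each image can have a different number of boxes."""
    return tuple(zip(*batch))

# train_loader = DataLoader(
#     train_dataset,
#     batch_size=4,
#     shuffle=True,
#     collate_fn=detection_collate_fn
# )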
Detection Architectures
YOLO (You Only Look Once) - One-Stage Detector
YOLO divides the image into an S x S grid; each grid cell predicts B bounding boxes with confidence scores plus a set of class probabilities, all in a single forward pass.
import torch.nn as nn

class YOLOv1(nn.Module):
    def __init__(self, num_classes=20, num_boxes=2):
        super(YOLOv1, self).__init__()
        self.num_classes = num_classes
        self.num_boxes = num_boxes

        # Backbone (simplified version of Darknet)
        self.backbone = nn.Sequential(
            # Conv Layer 1
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            # Conv Layer 2
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            # Conv Layers 3-4
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1),
            nn.Conv2d(256, 256, kernel_size=1, stride=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            # Conv Layers 5-8
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.Conv2d(512, 256, kernel_size=1, stride=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1),
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
        )

        # Detection head
        self.detection_head = nn.Sequential(
            nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
            nn.Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
        )

        # Final prediction layer
        # Each grid cell predicts: num_boxes * 5 + num_classes values
        # 5 = (x, y, w, h, confidence)
        output_size = num_boxes * 5 + num_classes
        self.final_layer = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * 7 * 7, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, 7 * 7 * output_size)
        )

    def forward(self, x):
        x = self.backbone(x)
        x = self.detection_head(x)
        x = self.final_layer(x)

        # Reshape to (batch_size, 7, 7, num_boxes * 5 + num_classes)
        batch_size = x.size(0)
        output_size = self.num_boxes * 5 + self.num_classes
        x = x.view(batch_size, 7, 7, output_size)
        return x
# Initialize model
yolo_model = YOLOv1(num_classes=20, num_boxes=2)
print(f"YOLO model parameters: {sum(p.numel() for p in yolo_model.parameters()):,}")
Using Pre-trained Faster R-CNN
For practical applications, it's often better to use pre-trained models from torchvision:
import torchvision.models as models
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

def create_faster_rcnn_model(num_classes):
    # Load pre-trained Faster R-CNN model
    model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # Replace the classifier head for our number of classes
    # (background is class 0, so we need num_classes + 1)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = models.detection.faster_rcnn.FastRCNNPredictor(
        in_features, num_classes + 1
    )
    return model

# Create model for COCO dataset (80 classes + background)
faster_rcnn = create_faster_rcnn_model(num_classes=80)
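In evaluation mode, torchvision detection models take a list of image tensors (values in [0, 1]) and return one dictionary per image containing 'boxes', 'labels', and 'scores'; this is the format the inference code later in this post relies on. In training mode, with targets supplied, the same call instead returns a dictionary of losses, which is what the training loop below sums. A quick sketch with a random dummy image:

# In eval mode, torchvision detectors return one dict per input image
faster_rcnn.eval()
with torch.no_grad():
    dummy_images = [torch.rand(3, 480, 640)]  # list of CHW tensors in [0, 1]
    outputs = faster_rcnn(dummy_images)
print(sorted(outputs[0].keys()))  # ['boxes', 'labels', 'scores']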
def create_mobilenet_detector(num_classes):
    # Use a MobileNetV3 backbone for faster inference
    backbone = models.mobilenet_v3_large(pretrained=True).features
    backbone.out_channels = 960

    # Define anchor generator
    anchor_generator = AnchorGenerator(
        sizes=((32, 64, 128, 256, 512),),
        aspect_ratios=((0.5, 1.0, 2.0),)
    )

    # Define RoI pooler
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(
        featmap_names=['0'],
        output_size=7,
        sampling_ratio=2
    )

    # Create the model
    model = FasterRCNN(
        backbone,
        num_classes=num_classes + 1,
        rpn_anchor_generator=anchor_generator,
        box_roi_pool=roi_pooler
    )
    return model

# Lightweight model for mobile deployment
mobile_detector = create_mobilenet_detector(num_classes=80)
Training Object Detection Models
YOLO Loss Function
class YOLOLoss(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super(YOLOLoss, self).__init__()
        self.S = S  # Grid size
        self.B = B  # Number of bounding boxes per cell
        self.C = C  # Number of classes
        self.lambda_coord = 5
        self.lambda_noobj = 0.5

    def forward(self, predictions, targets):
        """
        predictions: (batch_size, S, S, B*5 + C)
        targets: (batch_size, S, S, B*5 + C)
        """
        batch_size = predictions.size(0)

        # Split predictions
        # Class probabilities
        class_pred = predictions[:, :, :, :self.C]
        # Bounding box predictions
        bbox_pred = predictions[:, :, :, self.C:].contiguous().view(
            batch_size, self.S, self.S, self.B, 5
        )

        # Split targets similarly
        class_target = targets[:, :, :, :self.C]
        bbox_target = targets[:, :, :, self.C:].contiguous().view(
            batch_size, self.S, self.S, self.B, 5
        )

        # Object mask (where objects exist)
        obj_mask = bbox_target[:, :, :, :, 4] > 0    # confidence > 0
        noobj_mask = bbox_target[:, :, :, :, 4] == 0

        # Coordinate loss (only for cells with objects)
        coord_loss = 0
        if obj_mask.sum() > 0:
            coord_pred = bbox_pred[obj_mask]      # [N, 5]
            coord_target = bbox_target[obj_mask]  # [N, 5]
            # xy loss
            xy_loss = torch.sum((coord_pred[:, :2] - coord_target[:, :2]) ** 2)
            # wh loss (square root); clamp predictions so negative
            # width/height values do not produce NaNs
            wh_loss = torch.sum((torch.sqrt(torch.clamp(coord_pred[:, 2:4], min=0)) -
                                 torch.sqrt(coord_target[:, 2:4])) ** 2)
            coord_loss = self.lambda_coord * (xy_loss + wh_loss)

        # Confidence loss
        # Object confidence loss
        obj_conf_loss = 0
        if obj_mask.sum() > 0:
            obj_conf_pred = bbox_pred[obj_mask][:, 4]
            obj_conf_target = bbox_target[obj_mask][:, 4]
            obj_conf_loss = torch.sum((obj_conf_pred - obj_conf_target) ** 2)

        # No-object confidence loss
        noobj_conf_loss = 0
        if noobj_mask.sum() > 0:
            noobj_conf_pred = bbox_pred[noobj_mask][:, 4]
            noobj_conf_loss = torch.sum(noobj_conf_pred ** 2)

        conf_loss = obj_conf_loss + self.lambda_noobj * noobj_conf_loss

        # Classification loss (only for cells with objects)
        class_loss = 0
        if torch.sum(bbox_target[:, :, :, 0, 4] > 0) > 0:
            obj_cells = bbox_target[:, :, :, 0, 4] > 0
            class_pred_obj = class_pred[obj_cells]
            class_target_obj = class_target[obj_cells]
            class_loss = torch.sum((class_pred_obj - class_target_obj) ** 2)

        total_loss = coord_loss + conf_loss + class_loss
        return total_loss / batch_size
# Initialize loss function
criterion = YOLOLoss(S=7, B=2, C=20)
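This loss assumes the ground truth has already been encoded on the S x S grid with the same layout as the predictions (C class scores first, then B copies of x, y, w, h, confidence). The post does not define that encoder, so here is a rough sketch of one, assuming boxes are given as normalized [x_min, y_min, x_max, y_max] coordinates in [0, 1] and labels are 0-indexed class ids (the function name is my own):

def encode_yolo_targets(boxes, labels, S=7, B=2, C=20):
    """Rough sketch: place each normalized box into the grid cell containing
    its center, matching the layout expected by YOLOLoss above."""
    target = torch.zeros(S, S, B * 5 + C)
    for box, label in zip(boxes, labels):
        x_min, y_min, x_max, y_max = box
        w, h = x_max - x_min, y_max - y_min
        cx, cy = x_min + w / 2, y_min + h / 2
        # Grid cell that contains the box center
        col = min(int(cx * S), S - 1)
        row = min(int(cy * S), S - 1)
        # One-hot class vector
        target[row, col, label] = 1.0
        # x, y become offsets within the cell; w, h stay relative to the image
        cell_x = cx * S - col
        cell_y = cy * S - row
        for b in range(B):
            target[row, col, C + b * 5: C + b * 5 + 5] = torch.tensor(
                [cell_x, cell_y, w, h, 1.0]
            )
    return target

Stack one such tensor per image to obtain the (batch_size, S, S, B*5 + C) targets the loss expects.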
Training Loop for Detection
def train_detection_model(model, train_loader, val_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    # For pre-trained models like Faster R-CNN, we can use SGD
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.005,
        momentum=0.9,
        weight_decay=0.0005
    )

    # Learning rate scheduler
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=3, gamma=0.1
    )

    for epoch in range(num_epochs):
        print(f'Epoch {epoch+1}/{num_epochs}')
        print('-' * 10)

        # Training phase
        model.train()
        running_loss = 0.0

        for images, targets in train_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            optimizer.zero_grad()

            # Forward pass
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())

            # Backward pass
            losses.backward()
            optimizer.step()

            running_loss += losses.item()

        scheduler.step()

        epoch_loss = running_loss / len(train_loader)
        print(f'Training Loss: {epoch_loss:.4f}')

        # Validation
        if val_loader:
            val_loss = evaluate_detection_model(model, val_loader, device)
            print(f'Validation Loss: {val_loss:.4f}')

    return model
def evaluate_detection_model(model, val_loader, device):
    model.train()  # Keep in train mode for loss calculation
    running_loss = 0.0

    with torch.no_grad():
        for images, targets in val_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            running_loss += losses.item()

    return running_loss / len(val_loader)
# Train the model
# trained_model = train_detection_model(faster_rcnn, train_loader, val_loader, num_epochs=10)
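After training, you will usually want to persist the fine-tuned weights. A minimal sketch (the file name is arbitrary):

# Save the fine-tuned weights
# torch.save(trained_model.state_dict(), 'faster_rcnn_finetuned.pth')

# Reload them later into a freshly constructed model
# model = create_faster_rcnn_model(num_classes=80)
# model.load_state_dict(torch.load('faster_rcnn_finetuned.pth'))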
Evaluation and Inference
Non-Maximum Suppression
def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """
    Apply Non-Maximum Suppression to remove overlapping boxes
    """
    # Filter out low-confidence boxes
    mask = scores > score_threshold
    boxes = boxes[mask]
    scores = scores[mask]

    if len(boxes) == 0:
        return torch.empty((0, 4)), torch.empty((0,))

    # Sort by scores in descending order
    sorted_indices = torch.argsort(scores, descending=True)

    keep = []
    while len(sorted_indices) > 0:
        # Keep the box with the highest score
        current = sorted_indices[0].item()
        keep.append(current)

        if len(sorted_indices) == 1:
            break

        # Calculate IoU with the remaining boxes
        current_box = boxes[current].unsqueeze(0)
        remaining_boxes = boxes[sorted_indices[1:]]
        ious = calculate_iou(current_box, remaining_boxes).squeeze(0)

        # Keep only boxes with IoU < threshold
        mask = ious < iou_threshold
        sorted_indices = sorted_indices[1:][mask]

    return boxes[keep], scores[keep]
def calculate_iou(boxes1, boxes2):
    """Calculate Intersection over Union (IoU) between two sets of boxes"""
    # Calculate intersection coordinates
    x1 = torch.max(boxes1[:, 0].unsqueeze(1), boxes2[:, 0].unsqueeze(0))
    y1 = torch.max(boxes1[:, 1].unsqueeze(1), boxes2[:, 1].unsqueeze(0))
    x2 = torch.min(boxes1[:, 2].unsqueeze(1), boxes2[:, 2].unsqueeze(0))
    y2 = torch.min(boxes1[:, 3].unsqueeze(1), boxes2[:, 3].unsqueeze(0))

    # Calculate intersection area
    intersection = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)

    # Calculate areas of both boxes
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])

    # Calculate union area
    union = area1.unsqueeze(1) + area2.unsqueeze(0) - intersection

    # Calculate IoU
    iou = intersection / union
    return iou
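The implementation above is for understanding; in practice you can use torchvision's built-in NMS, which returns the indices of the boxes to keep, sorted by decreasing score:

# Practical alternative: torchvision's built-in NMS
boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.75])
keep_indices = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
print(keep_indices)  # tensor([0, 2]) - the second box overlaps the first and is suppressed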
Inference Pipeline
def detect_objects(model, image, device, confidence_threshold=0.7):
    """
    Perform object detection on a single image
    """
    model.eval()

    # Preprocess image
    transform = transforms.Compose([
        transforms.ToTensor(),
    ])

    if isinstance(image, Image.Image):
        image_tensor = transform(image).unsqueeze(0).to(device)
    else:
        image_tensor = image.unsqueeze(0).to(device)

    with torch.no_grad():
        predictions = model(image_tensor)

    # Extract predictions
    pred = predictions[0]  # First (and only) image in the batch
    boxes = pred['boxes'].cpu()
    scores = pred['scores'].cpu()
    labels = pred['labels'].cpu()

    # Filter by confidence
    mask = scores > confidence_threshold
    boxes = boxes[mask]
    scores = scores[mask]
    labels = labels[mask]

    return boxes, scores, labels
def visualize_detections(image, boxes, scores, labels, class_names,
                         confidence_threshold=0.5):
    """
    Visualize detection results on an image
    """
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image)

    colors = plt.cm.Set3(np.linspace(0, 1, len(class_names)))

    for box, score, label in zip(boxes, scores, labels):
        if score < confidence_threshold:
            continue

        x_min, y_min, x_max, y_max = box
        width = x_max - x_min
        height = y_max - y_min

        # Draw bounding box
        color = colors[label % len(colors)]
        rect = patches.Rectangle((x_min, y_min), width, height,
                                 linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)

        # Add label and score
        label_text = f'{class_names[label-1]}: {score:.2f}'
        ax.text(x_min, y_min - 5, label_text, fontsize=10, color=color,
                bbox=dict(boxstyle="round,pad=0.3", facecolor='white', alpha=0.7))

    ax.axis('off')
    plt.tight_layout()
    plt.show()
# COCO class names
COCO_CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck',
'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench',
'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra',
'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier',
'toothbrush'
]
# Example usage
# image = Image.open('test_image.jpg')
# boxes, scores, labels = detect_objects(faster_rcnn, image, device)
# visualize_detections(image, boxes, scores, labels, COCO_CLASSES)
Mean Average Precision (mAP) Evaluation
def calculate_ap(recalls, precisions):
    """Calculate Average Precision using the 11-point interpolation method"""
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        if np.sum(recalls >= t) == 0:
            p = 0
        else:
            p = np.max(precisions[recalls >= t])
        ap = ap + p / 11.0
    return ap
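For intuition, here is a tiny worked example: with recall points [0.5, 1.0] and precisions [1.0, 0.5], the interpolated precision is 1.0 for the six recall thresholds up to 0.5 and 0.5 for the remaining five, giving AP = (6 * 1.0 + 5 * 0.5) / 11, roughly 0.77:

# Tiny worked example of 11-point AP
recalls = np.array([0.5, 1.0])
precisions = np.array([1.0, 0.5])
print(f'{calculate_ap(recalls, precisions):.2f}')  # ~0.77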
def evaluate_detection_map(model, test_loader, device, iou_threshold=0.5):
    """
    Calculate mean Average Precision (mAP) for object detection
    """
    model.eval()

    all_detections = []
    all_ground_truths = []

    with torch.no_grad():
        for images, targets in test_loader:
            images = [img.to(device) for img in images]
            predictions = model(images)

            for pred, target in zip(predictions, targets):
                # Store predictions
                boxes = pred['boxes'].cpu().numpy()
                scores = pred['scores'].cpu().numpy()
                labels = pred['labels'].cpu().numpy()
                all_detections.append({
                    'boxes': boxes,
                    'scores': scores,
                    'labels': labels
                })

                # Store ground truth
                gt_boxes = target['boxes'].cpu().numpy()
                gt_labels = target['labels'].cpu().numpy()
                all_ground_truths.append({
                    'boxes': gt_boxes,
                    'labels': gt_labels
                })

    # Calculate AP for each class
    num_classes = 80  # COCO classes
    aps = []

    for class_id in range(1, num_classes + 1):
        # Collect all detections and ground truths for this class
        class_detections = []
        class_ground_truths = []

        for i, (det, gt) in enumerate(zip(all_detections, all_ground_truths)):
            # Filter detections for this class
            mask = det['labels'] == class_id
            if np.sum(mask) > 0:
                class_detections.extend([{
                    'image_id': i,
                    'box': box,
                    'score': score
                } for box, score in zip(det['boxes'][mask], det['scores'][mask])])

            # Filter ground truths for this class
            gt_mask = gt['labels'] == class_id
            if np.sum(gt_mask) > 0:
                class_ground_truths.extend([{
                    'image_id': i,
                    'box': box
                } for box in gt['boxes'][gt_mask]])

        if len(class_detections) == 0:
            aps.append(0.0)
            continue

        # Sort detections by score
        class_detections.sort(key=lambda x: x['score'], reverse=True)

        # Calculate precision and recall
        # (simplified: a ground-truth box may be matched by more than one detection)
        tp = np.zeros(len(class_detections))
        fp = np.zeros(len(class_detections))

        for i, detection in enumerate(class_detections):
            # Find ground truths in the same image
            gt_boxes = [gt['box'] for gt in class_ground_truths
                        if gt['image_id'] == detection['image_id']]

            if len(gt_boxes) == 0:
                fp[i] = 1
                continue

            # Calculate IoU with all ground truth boxes
            ious = []
            for gt_box in gt_boxes:
                iou = calculate_iou_single(detection['box'], gt_box)
                ious.append(iou)

            max_iou = max(ious)
            if max_iou >= iou_threshold:
                tp[i] = 1
            else:
                fp[i] = 1

        # Calculate cumulative precision and recall
        # (guard against division by zero when a class has no ground truths)
        tp_cumsum = np.cumsum(tp)
        fp_cumsum = np.cumsum(fp)
        recalls = tp_cumsum / max(len(class_ground_truths), 1)
        precisions = tp_cumsum / (tp_cumsum + fp_cumsum)

        # Calculate AP
        ap = calculate_ap(recalls, precisions)
        aps.append(ap)

    return np.mean(aps), aps
def calculate_iou_single(box1, box2):
    """Calculate IoU between two boxes in [x_min, y_min, x_max, y_max] format"""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    if x2 <= x1 or y2 <= y1:
        return 0.0

    intersection = (x2 - x1) * (y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union
# Evaluate model
# map_score, class_aps = evaluate_detection_map(faster_rcnn, test_loader, device)
# print(f'mAP: {map_score:.4f}')
Real-time Detection Pipeline
import cv2
class RealTimeDetector:
    def __init__(self, model, device, class_names, confidence_threshold=0.7):
        self.model = model
        self.device = device
        self.class_names = class_names
        self.confidence_threshold = confidence_threshold
        self.model.eval()

    def detect_frame(self, frame):
        """Detect objects in a single frame"""
        # Convert BGR to RGB
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pil_image = Image.fromarray(rgb_frame)

        # Detect objects
        boxes, scores, labels = detect_objects(
            self.model, pil_image, self.device, self.confidence_threshold
        )
        return boxes, scores, labels

    def draw_detections(self, frame, boxes, scores, labels):
        """Draw detection results on a frame"""
        colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
                  (255, 0, 255), (0, 255, 255)]

        for box, score, label in zip(boxes, scores, labels):
            x1, y1, x2, y2 = map(int, box)
            color = colors[label % len(colors)]

            # Draw bounding box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)

            # Draw label
            label_text = f'{self.class_names[label-1]}: {score:.2f}'
            label_size = cv2.getTextSize(label_text, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 2)[0]

            # Background for text
            cv2.rectangle(frame, (x1, y1 - label_size[1] - 10),
                          (x1 + label_size[0], y1), color, -1)
            # Text
            cv2.putText(frame, label_text, (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

        return frame

    def run_webcam(self):
        """Run detection on a webcam feed"""
        cap = cv2.VideoCapture(0)

        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Detect objects
            boxes, scores, labels = self.detect_frame(frame)

            # Draw results
            frame = self.draw_detections(frame, boxes, scores, labels)

            # Display
            cv2.imshow('Object Detection', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        cv2.destroyAllWindows()
# Example usage
# detector = RealTimeDetector(faster_rcnn, device, COCO_CLASSES)
# detector.run_webcam() # Press 'q' to quit
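The same detector can also be run on a video file instead of a live webcam feed; a minimal sketch (the file paths are placeholders):

# Run the detector on a video file and save the annotated output
# cap = cv2.VideoCapture('input_video.mp4')
# fps = cap.get(cv2.CAP_PROP_FPS)
# width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
# height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
# writer = cv2.VideoWriter('output_video.mp4',
#                          cv2.VideoWriter_fourcc(*'mp4v'), fps, (width, height))
# while True:
#     ret, frame = cap.read()
#     if not ret:
#         break
#     boxes, scores, labels = detector.detect_frame(frame)
#     writer.write(detector.draw_detections(frame, boxes, scores, labels))
# cap.release()
# writer.release()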
Key Takeaways
In this second part of our computer vision series, we covered:
- Object Detection Fundamentals: Understanding bounding boxes and multi-object scenarios
- Detection Architectures: One-stage (YOLO) vs two-stage (Faster R-CNN) detectors
- Training Process: Custom loss functions and evaluation metrics
- Evaluation: Non-maximum suppression and mAP calculation
- Real-time Applications: Webcam detection pipeline
Best Practices for Object Detection:
- Data Augmentation: Use detection-aware augmentations that transform both images and annotations
- Anchor Design: Choose appropriate anchor sizes and aspect ratios for your dataset
- Multi-scale Training: Train and test at multiple image scales
- Hard Negative Mining: Focus training on difficult examples
- Model Ensemble: Combine multiple models for better performance
Performance Optimization Tips:
- Model Pruning: Remove redundant parameters for faster inference
- Quantization: Use lower-precision arithmetic (FP16, INT8); see the mixed-precision sketch after this list
- TensorRT/ONNX: Convert models for optimized deployment
- Batch Processing: Process multiple images simultaneously
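As one concrete example of the quantization point above, on a CUDA GPU you can often run inference under automatic mixed precision to get FP16 speedups without changing the model; a minimal sketch, assuming a CUDA device is available (results should be validated against full-precision output):

# Mixed-precision (FP16) inference sketch - assumes a CUDA device is available
model = faster_rcnn.to('cuda').eval()
images = [torch.rand(3, 480, 640, device='cuda')]
with torch.no_grad(), torch.autocast(device_type='cuda', dtype=torch.float16):
    outputs = model(images)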
In Part 3, we'll explore semantic segmentation, where we'll learn to classify every pixel in an image, providing even more detailed understanding of visual scenes.
Next Steps
- Experiment with different detection architectures (RetinaNet, EfficientDet)
- Try instance segmentation with Mask R-CNN
- Implement custom datasets with your own object classes
- Deploy models on mobile devices or edge hardware
Stay tuned for the final part of our series, where we'll dive into semantic segmentation!