Vehicle Detection and Classification with YOLO

Michał Jurzak, Albert Dańko

An object-detection study built for the Fundamentals of Artificial Intelligence course at AGH. The task is to locate and classify vehicles in street scenes, framed in the YOLO spirit as a single regression from image pixels to bounding-box coordinates and class probabilities.

Data

The Road Vehicle Images dataset provides 3,004 annotated images across 21 classes, each label giving a class and bounding-box coordinates. Exploratory analysis surfaced two practical problems:

Heavy class imbalance: cars dominate, followed by rickshaws, buses, and tricycles; the long tail of classes is sparse, and the original validation split was missing many classes entirely.
Mixed resolutions: a handful of sizes (640\times360, 360\times640, 480\times640) account for most of the data.

Colour distributions were close to normal (confirmed by Kolmogorov-Smirnov and Shapiro-Wilk tests), so no normalisation was applied. The data was re-split into 70% train / 10% validation / 20% test to repair the broken class coverage.
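One way to rebuild a 70/10/20 split that keeps every class represented is to stratify images by their rarest class. The sketch below is an illustration under stated assumptions, not the procedure used in the project: the `labels` mapping, the function names `split_by_rarest_class` and `missing_classes`, and the non-negative integer class ids are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def split_by_rarest_class(labels, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split image ids into train/val/test, stratified by rarest class.

    labels: dict mapping image id -> list of class ids present in the image
    (hypothetical structure; class ids are assumed to be non-negative ints).
    """
    # Count in how many images each class appears.
    freq = Counter(c for classes in labels.values() for c in set(classes))

    # Group every image under its rarest class, so sparse classes are
    # distributed across the splits deliberately instead of by chance.
    groups = defaultdict(list)
    for image_id, classes in labels.items():
        if classes:
            key = min(set(classes), key=lambda c: (freq[c], c))
        else:
            key = -1  # images without annotations form their own group
        groups[key].append(image_id)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(groups):
        ids = sorted(groups[key])
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


def missing_classes(split_ids, labels):
    """Return the classes that occur in the dataset but not in this split."""
    all_classes = {c for classes in labels.values() for c in classes}
    seen = {c for image_id in split_ids for c in labels[image_id]}
    return sorted(all_classes - seen)
```

Calling `missing_classes` on each resulting split gives a quick check for the coverage problem described above. Classes with only a handful of images can still end up absent from the validation split because of rounding, so the check is worth running after any re-split.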

Models

Detection quality is measured by the intersection-over-union between a predicted box B_p and the ground-truth box B_{gt},

\mathrm{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|},

and summarised as mean average precision at a fixed IoU threshold of 0.5 (\mathrm{mAP}_{50}) and averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (\mathrm{mAP}_{50\text{-}95}).
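The IoU formula can be sketched in a few lines of Python. This is a minimal illustration, assuming axis-aligned boxes given as corner coordinates (x_min, y_min, x_max, y_max); YOLO label files typically store normalised centre, width, and height instead, so a conversion would be needed first. The names `iou`, `IOU_THRESHOLDS`, and `thresholds_passed` are illustrative only.

```python
def iou(box_p, box_gt):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_gt

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h

    # |A union B| = |A| + |B| - |A intersect B|
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds 0.50, 0.55, ..., 0.95 used when averaging mAP.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(box_p, box_gt):
    """IoU thresholds at which this prediction counts as a match."""
    score = iou(box_p, box_gt)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

For example, a prediction (0, 0, 10, 10) against a ground truth (2, 0, 12, 10) overlaps on an area of 80 out of a union of 120, giving an IoU of about 0.67. It counts as a correct detection at the 0.5 threshold but fails the stricter thresholds from 0.7 upward, which is why the threshold-averaged metric is always the lower of the two.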

Several variants of YOLOv8 and YOLOv5 were trained, both pretrained and from scratch. YOLOv8’s architecture follows the usual backbone / neck / head split: a CNN backbone, a neck combining a Feature Pyramid Network and a Path Aggregation Network for multi-scale features, and a single anchor-free head that predicts object centres directly.

Results

Under a budget of 35 to 40 epochs per model (hardware-limited, so all models were undertrained but comparable):

Pretrained YOLOv8n reached \mathrm{mAP}_{50} \approx 0.35 and \mathrm{mAP}_{50\text{-}95} \approx 0.25, with both metrics still climbing when training stopped.
Training from scratch was markedly worse (\mathrm{mAP}_{50} \approx 0.15), as expected without pretrained weights.
Pretrained YOLOv5n was competitive with v8 and learned box localisation slightly faster under the same budget.

Source

Notebooks and the full results are in the source repository.