RT-DETRv2 Object Detection Format¶

Overview¶

RT-DETRv2 is an enhanced version of the Real-Time DEtection TRansformer (RT-DETR), introduced in the paper "RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer". It builds on the end-to-end object detection framework of the original RT-DETR, which needs no Non-Maximum Suppression (NMS) post-processing, and adds improvements in accuracy and efficiency for real-time object detection.

Info: RT-DETRv2 was introduced in the technical report "RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer", published in 2024.
- For the full paper, see: arXiv:2407.17140
- For the RT-DETR foundation, see: RT-DETR Paper (arXiv:2304.08069)
- For implementation details and code, see: GitHub Repository: lyuwenyu/RT-DETR

Availability: RT-DETRv2 is now available in multiple frameworks:
- Hugging Face Transformers
- Ultralytics

Key RT-DETRv2 Model Features¶

RT-DETRv2 maintains compatibility with the standard COCO annotation format while introducing specific technical improvements over RT-DETR:

Distinct Sampling Points for Different Scales: Introduces flexible multi-scale feature extraction by setting different numbers of sampling points for features at different scales in the deformable attention module, rather than using the same number across all scales.
Discrete Sampling Operator: Provides an optional discrete sampling operator to replace the grid_sample operator, removing deployment constraints typically associated with DETRs and improving practical applicability across different deployment platforms.
Dynamic Data Augmentation: Implements an adaptive data augmentation strategy that applies stronger augmentation early in training and reduces it in later stages, improving model robustness and adaptation to the target domain.
Scale-Adaptive Hyperparameters: Customizes optimizer hyperparameters based on model scale, using higher learning rates for lighter models (e.g., ResNet18) and lower rates for larger models (e.g., ResNet101) to achieve optimal performance.
Bag-of-Freebies Approach: Incorporates multiple training improvements that enhance performance without increasing inference cost or model complexity.
Consistent Performance Gains: Achieves improved accuracy across all model scales (S: +1.4 mAP, M: +1.0 mAP, L: +0.3 mAP) while maintaining the same inference speed as RT-DETR.
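
The dynamic data augmentation idea can be illustrated with a small schedule function. This is a minimal sketch under stated assumptions, not the RT-DETRv2 training code: the transform names, the `final_plain_epochs` parameter, and its default value are all hypothetical and chosen only to show the shape of the policy.

```python
from typing import List

# Hypothetical transform names, used only for illustration.
STRONG_TRANSFORMS = ["photometric_distort", "zoom_out", "iou_crop"]
BASE_TRANSFORMS = ["horizontal_flip", "resize"]


def transforms_for_epoch(
    epoch: int, total_epochs: int, final_plain_epochs: int = 2
) -> List[str]:
    """Return the transform names to apply in a given (0-indexed) epoch.

    Strong augmentation is used for most of training and dropped for the
    last `final_plain_epochs` epochs (an assumed value, not from the paper).
    """
    if epoch < total_epochs - final_plain_epochs:
        return STRONG_TRANSFORMS + BASE_TRANSFORMS
    return list(BASE_TRANSFORMS)


for epoch in (0, 9):
    print(epoch, transforms_for_epoch(epoch, total_epochs=10))
```

The annotation files are the same in every epoch; only the transforms applied at load time change.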

These enhancements are handled internally by the model design and training pipeline, requiring no changes to the standard COCO annotation format described below.

Specification of RT-DETRv2 Detection Format¶

RT-DETRv2 uses the standard COCO format for annotations, ensuring complete compatibility with existing COCO datasets and tools. The format specification is identical to the original COCO format:

`images`¶

Defines metadata for each image in the dataset:

{
  "id": 0,                    // Unique image ID
  "file_name": "image1.jpg",  // Image filename
  "width": 640,               // Image width in pixels
  "height": 416               // Image height in pixels
}

`categories`¶

Defines the object classes:

{
  "id": 0,                    // Unique category ID
  "name": "cat"               // Category name
}

`annotations`¶

Defines object instances:

{
  "image_id": 0,              // Reference to image
  "category_id": 2,           // Reference to category
  "bbox": [540.0, 295.0, 23.0, 18.0]  // [x, y, width, height] in absolute pixels
}
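
Because `bbox` stores the top-left corner plus a width and height, code that expects corner coordinates has to convert it first. The helper below is a minimal sketch of that arithmetic; the function name is an assumption for illustration and is not part of any library.

```python
from typing import List, Tuple


def coco_bbox_to_corners(bbox: List[float]) -> Tuple[float, float, float, float]:
    """Convert a COCO [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, width, height = bbox
    return (x, y, x + width, y + height)


# The box from the annotation example above.
print(coco_bbox_to_corners([540.0, 295.0, 23.0, 18.0]))
# (540.0, 295.0, 563.0, 313.0)
```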

Directory Structure of RT-DETRv2 Dataset¶

dataset/
├── images/                   # Image files
│   ├── image1.jpg
│   └── image2.jpg
└── annotations.json         # Single JSON file containing all annotations
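
With this layout, one JSON file describes every image, so a common first step is to group annotations by image file. The following is a minimal sketch using only the standard library; the function names are illustrative assumptions, and the path matches the layout shown above.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_annotations(path: Path) -> Dict[str, Any]:
    """Read a COCO-style annotation file into a dictionary."""
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def group_by_file_name(coco: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Map each image file name to the list of annotations that reference it."""
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    # Start with an empty list per image so unannotated images are kept.
    grouped: Dict[str, List[Dict[str, Any]]] = {name: [] for name in id_to_file.values()}
    for ann in coco["annotations"]:
        grouped[id_to_file[ann["image_id"]]].append(ann)
    return grouped


annotations_path = Path("dataset/annotations.json")
if annotations_path.exists():  # only runs when the dataset is present
    for file_name, anns in group_by_file_name(load_annotations(annotations_path)).items():
        print(file_name, len(anns))
```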

Benefits of RT-DETRv2 Format¶

Standard Compatibility: Uses the widely-adopted COCO format, ensuring compatibility with existing tools and frameworks.
End-to-End Processing: Maintains the NMS-free architecture for stable and predictable inference performance.
Enhanced Performance: Improved accuracy and efficiency compared to the original RT-DETR.

Converting Annotations to RT-DETRv2 Format with Labelformat¶

Since RT-DETRv2 uses the standard COCO format, converting annotations to RT-DETRv2 format is equivalent to converting to COCO format.

Installation¶

First, ensure that Labelformat is installed:

pip install labelformat

Conversion Example: YOLOv8 to RT-DETRv2¶

Step 1: Prepare Your Dataset

Ensure your dataset follows the standard YOLOv8 structure with data.yaml and label files.

Step 2: Run the Conversion Command

Use the Labelformat CLI to convert YOLOv8 annotations to RT-DETRv2 (COCO format):

labelformat convert \
    --task object-detection \
    --input-format yolov8 \
    --input-file dataset/data.yaml \
    --input-split train \
    --output-format rtdetrv2 \
    --output-file dataset/rtdetrv2_annotations.json
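
Conceptually, the conversion maps each YOLO label, which stores a normalized box center and size, to a COCO box in absolute pixels. The function below is a sketch of that arithmetic only, assuming the usual YOLO convention of `cx cy w h` values in the range 0 to 1; it is not Labelformat's implementation, and the function name is hypothetical.

```python
from typing import List


def yolo_to_coco_bbox(
    cx: float, cy: float, w: float, h: float, image_width: int, image_height: int
) -> List[float]:
    """Convert a normalized YOLO box (center, size) to COCO [x, y, width, height]."""
    box_width = w * image_width
    box_height = h * image_height
    x_min = cx * image_width - box_width / 2
    y_min = cy * image_height - box_height / 2
    return [x_min, y_min, box_width, box_height]


# A centered box covering a quarter of the width and half of the height
# of a 640x416 image.
print(yolo_to_coco_bbox(0.5, 0.5, 0.25, 0.5, 640, 416))
# [240.0, 104.0, 160.0, 208.0]
```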

Step 3: Verify the Converted Annotations

After conversion, your dataset structure will be:

dataset/
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── rtdetrv2_annotations.json    # COCO format annotations for RT-DETRv2
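
One quick way to verify the result is to load the output file and count its entries. This is a minimal sketch using only the standard library; the `summarize` helper is an illustrative assumption, and the path is the `--output-file` value from Step 2.

```python
import json
from pathlib import Path
from typing import Any, Dict

REQUIRED_KEYS = ("images", "categories", "annotations")


def summarize(coco: Dict[str, Any]) -> Dict[str, int]:
    """Return the number of entries per top-level COCO section.

    Raises ValueError if one of the expected sections is missing.
    """
    missing = [key for key in REQUIRED_KEYS if key not in coco]
    if missing:
        raise ValueError(f"missing top-level keys: {missing}")
    return {key: len(coco[key]) for key in REQUIRED_KEYS}


output_file = Path("dataset/rtdetrv2_annotations.json")
if output_file.exists():  # only runs after the conversion has been done
    with output_file.open("r", encoding="utf-8") as f:
        print(summarize(json.load(f)))
```

The image count should match the number of images in the chosen YOLOv8 split.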

Python API Example¶

from pathlib import Path
from labelformat.formats import YOLOv8ObjectDetectionInput, RTDETRv2ObjectDetectionOutput

# Load YOLOv8 format
label_input = YOLOv8ObjectDetectionInput(
    input_file=Path("dataset/data.yaml"),
    input_split="train"
)

# Convert to RT-DETRv2 format
RTDETRv2ObjectDetectionOutput(
    output_file=Path("dataset/rtdetrv2_annotations.json")
).save(label_input=label_input)

RT-DETRv2 vs RT-DETR¶

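RT-DETRv2 builds upon the foundation of RT-DETR with several key improvements:

Refined Architecture: Distinct numbers of sampling points per feature scale in the deformable attention module, plus an optional discrete sampling operator that can replace grid_sample for easier deployment.
Improved Training: Dynamic data augmentation and optimizer hyperparameters tuned to the model scale.
Better Accuracy: Higher mAP for the S, M, and L models (+1.4, +1.0, and +0.3 respectively) at the same inference speed as RT-DETR.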

Error Handling in Labelformat¶

Since RT-DETRv2 uses the COCO format, the same validation and error handling apply:

Invalid JSON Structure: Proper error reporting for malformed JSON files.
Missing Required Fields: Validation ensures all required COCO fields are present.
Reference Integrity: Checks that image_id and category_id references are valid.
Bounding Box Validation: Ensures bounding boxes are within image boundaries.

Example of an annotation file that satisfies these checks:

{
  "images": [{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{"image_id": 0, "category_id": 1, "bbox": [100, 120, 50, 80]}]
}
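
The reference-integrity and bounding-box checks can be reproduced in a few lines when auditing a file by hand. The sketch below is an independent illustration, not Labelformat's validation code; the `find_problems` name and the message wording are assumptions.

```python
from typing import Any, Dict, List


def find_problems(coco: Dict[str, Any]) -> List[str]:
    """Return human-readable descriptions of reference and bounding box problems."""
    problems: List[str] = []
    images = {img["id"]: img for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}

    for index, ann in enumerate(coco["annotations"]):
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {index}: unknown category_id {ann['category_id']}")
        image = images.get(ann["image_id"])
        if image is None:
            problems.append(f"annotation {index}: unknown image_id {ann['image_id']}")
            continue  # cannot check bounds without the image size
        x, y, w, h = ann["bbox"]
        if x < 0 or y < 0 or x + w > image["width"] or y + h > image["height"]:
            problems.append(f"annotation {index}: bbox {ann['bbox']} exceeds image bounds")
    return problems


example = {
    "images": [{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [{"image_id": 0, "category_id": 1, "bbox": [100, 120, 50, 80]}],
}
print(find_problems(example))  # []
```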