RT-DETRv2 Object Detection Format

Overview

RT-DETRv2 is an enhanced version of the Real-Time DEtection TRansformer (RT-DETR), introduced in the paper RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. Like the original RT-DETR, it is an end-to-end object detector that needs no Non-Maximum Suppression (NMS) post-processing. On top of that design, RT-DETRv2 improves accuracy and ease of deployment for real-time object detection.

Info: RT-DETRv2 was introduced in the technical report "RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer", published in 2024.

  • Full paper: arXiv:2407.17140
  • RT-DETR foundation: RT-DETR Paper (arXiv:2304.08069)
  • Implementation details and code: GitHub Repository: lyuwenyu/RT-DETR

Availability: RT-DETRv2 is available in multiple frameworks:

  • Hugging Face Transformers
  • Ultralytics

Key RT-DETRv2 Model Features

RT-DETRv2 maintains compatibility with the standard COCO annotation format while introducing specific technical improvements over RT-DETR:

  • Distinct Sampling Points for Different Scales: Introduces flexible multi-scale feature extraction by setting different numbers of sampling points for features at different scales in the deformable attention module, rather than using the same number across all scales.
  • Discrete Sampling Operator: Provides an optional discrete sampling operator to replace the grid_sample operator, removing deployment constraints typically associated with DETRs and improving practical applicability across different deployment platforms.
  • Dynamic Data Augmentation: Implements an adaptive data augmentation strategy that applies stronger augmentation early in training and reduces it in later stages to improve model robustness and target domain adaptation.
  • Scale-Adaptive Hyperparameters: Customizes optimizer hyperparameters based on model scale, using higher learning rates for lighter models (e.g., ResNet18) and lower rates for larger models (e.g., ResNet101) to achieve optimal performance.
  • Bag-of-Freebies Approach: Incorporates multiple training improvements that enhance performance without increasing inference cost or model complexity.
  • Consistent Performance Gains: Achieves improved accuracy across all model scales (S: +1.4 mAP, M: +1.0 mAP, L: +0.3 mAP) while maintaining the same inference speed as RT-DETR.

These enhancements are handled internally by the model design and training pipeline, requiring no changes to the standard COCO annotation format described below.

Specification of RT-DETRv2 Detection Format

RT-DETRv2 uses the standard COCO format for annotations, ensuring complete compatibility with existing COCO datasets and tools. The format specification is identical to the original COCO format:

images

Defines metadata for each image in the dataset:

{
  "id": 0,                    // Unique image ID
  "file_name": "image1.jpg",  // Image filename
  "width": 640,               // Image width in pixels
  "height": 416               // Image height in pixels
}

categories

Defines the object classes:

{
  "id": 0,                    // Unique category ID
  "name": "cat"               // Category name
}

annotations

Defines object instances:

{
  "image_id": 0,              // Reference to image
  "category_id": 2,           // Reference to category
  "bbox": [540.0, 295.0, 23.0, 18.0]  // [x, y, width, height] in absolute pixels
}
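
The three sections above combine into a single JSON object with the top-level keys images, categories, and annotations. The following is a minimal sketch of assembling such a file in Python with the standard library only. The helper name build_coco_dict is hypothetical, and the category uses id 2 here only so that the annotation's category_id reference resolves.

```python
import json


def build_coco_dict(images, categories, annotations):
    """Combine the three COCO sections into one dictionary."""
    return {
        "images": images,
        "categories": categories,
        "annotations": annotations,
    }


coco = build_coco_dict(
    images=[{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 416}],
    categories=[{"id": 2, "name": "cat"}],
    annotations=[
        # bbox is [x, y, width, height] in absolute pixels
        {"image_id": 0, "category_id": 2, "bbox": [540.0, 295.0, 23.0, 18.0]}
    ],
)

# Serialize to a JSON string; write it to disk with open(...) as needed.
coco_json = json.dumps(coco, indent=2)
```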

Directory Structure of RT-DETRv2 Dataset

dataset/
├── images/                   # Image files
│   ├── image1.jpg
│   └── image2.jpg
└── annotations.json         # Single JSON file containing all annotations
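
One quick consistency check for this layout is to confirm that every file_name listed in annotations.json exists under images/. The sketch below assumes the directory and file names shown above; find_missing_images is a hypothetical helper, not part of Labelformat.

```python
import json
from pathlib import Path


def find_missing_images(dataset_dir):
    """Return file names listed in annotations.json that are absent from images/."""
    dataset_dir = Path(dataset_dir)
    coco = json.loads((dataset_dir / "annotations.json").read_text())
    images_dir = dataset_dir / "images"
    return [
        image["file_name"]
        for image in coco["images"]
        if not (images_dir / image["file_name"]).is_file()
    ]
```

Calling find_missing_images("dataset") returns an empty list when the annotation file and the image folder agree.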

Benefits of RT-DETRv2 Format

  • Standard Compatibility: Uses the widely adopted COCO format, ensuring compatibility with existing tools and frameworks.
  • End-to-End Processing: Maintains the NMS-free architecture for stable and predictable inference performance.
  • Enhanced Performance: Higher accuracy than the original RT-DETR at the same inference speed.

Converting Annotations to RT-DETRv2 Format with Labelformat

Since RT-DETRv2 uses the standard COCO format, converting annotations to RT-DETRv2 format is equivalent to converting to COCO format.
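
When the source is a YOLO-style dataset, the main change is the bounding box encoding: YOLO label files store boxes as normalized center coordinates plus width and height, while COCO stores the absolute top-left corner plus width and height. The sketch below illustrates only that arithmetic. The function name yolo_to_coco_bbox is hypothetical, and this is not Labelformat's internal code.

```python
def yolo_to_coco_bbox(cx, cy, w, h, image_width, image_height):
    """Convert a normalized YOLO box (cx, cy, w, h) to COCO [x, y, width, height] in pixels."""
    box_w = w * image_width
    box_h = h * image_height
    # COCO anchors the box at its top-left corner, not its center.
    x_min = cx * image_width - box_w / 2
    y_min = cy * image_height - box_h / 2
    return [x_min, y_min, box_w, box_h]


bbox = yolo_to_coco_bbox(0.5, 0.5, 0.25, 0.5, image_width=640, image_height=416)
print(bbox)  # [240.0, 104.0, 160.0, 208.0]
```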

Installation

First, ensure that Labelformat is installed:

pip install labelformat

Conversion Example: YOLOv8 to RT-DETRv2

Step 1: Prepare Your Dataset

Ensure your dataset follows the standard YOLOv8 structure with data.yaml and label files.

Step 2: Run the Conversion Command

Use the Labelformat CLI to convert YOLOv8 annotations to RT-DETRv2 (COCO format):

labelformat convert \
    --task object-detection \
    --input-format yolov8 \
    --input-file dataset/data.yaml \
    --input-split train \
    --output-format rtdetrv2 \
    --output-file dataset/rtdetrv2_annotations.json

Step 3: Verify the Converted Annotations

After conversion, your dataset structure will be:

dataset/
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── rtdetrv2_annotations.json    # COCO format annotations for RT-DETRv2
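
To go beyond checking the file layout, the converted JSON can be loaded and summarized. The following sketch counts images, annotations, and annotations per category. The summarize_annotations helper is hypothetical, and the path assumes the output file from Step 2.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_annotations(coco):
    """Count images and annotations, and annotations per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    per_category = Counter(
        id_to_name.get(a["category_id"], "<unknown>") for a in coco["annotations"]
    )
    return {
        "num_images": len(coco["images"]),
        "num_annotations": len(coco["annotations"]),
        "per_category": dict(per_category),
    }


if __name__ == "__main__":
    path = Path("dataset/rtdetrv2_annotations.json")
    if path.is_file():
        print(summarize_annotations(json.loads(path.read_text())))
```

A category named "<unknown>" in the summary indicates an annotation whose category_id has no matching entry in categories.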

Python API Example

from pathlib import Path
from labelformat.formats import YOLOv8ObjectDetectionInput, RTDETRv2ObjectDetectionOutput

# Load YOLOv8 format
label_input = YOLOv8ObjectDetectionInput(
    input_file=Path("dataset/data.yaml"),
    input_split="train"
)

# Convert to RT-DETRv2 format
RTDETRv2ObjectDetectionOutput(
    output_file=Path("dataset/rtdetrv2_annotations.json")
).save(label_input=label_input)

RT-DETRv2 vs RT-DETR

RT-DETRv2 builds upon the foundation of RT-DETR with several key improvements:

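  • Refined Deformable Attention: Scale-specific numbers of sampling points and an optional discrete sampling operator in place of grid_sample
  • Improved Training: Dynamic data augmentation and optimizer hyperparameters tuned per model scale
  • Better Accuracy: Higher detection accuracy (S: +1.4 mAP, M: +1.0 mAP, L: +0.3 mAP) at the same inference speed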

Error Handling in Labelformat

Since RT-DETRv2 uses the COCO format, the same validation and error handling apply:

  • Invalid JSON Structure: Proper error reporting for malformed JSON files.
  • Missing Required Fields: Validation ensures all required COCO fields are present.
  • Reference Integrity: Checks that image_id and category_id references are valid.
  • Bounding Box Validation: Ensures bounding boxes are within image boundaries.

Example of a minimal annotation file with valid references and an in-bounds bounding box:

{
  "images": [{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{"image_id": 0, "category_id": 1, "bbox": [100, 120, 50, 80]}]
}
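
The checks listed in this section can be approximated in a few lines of Python. This is an illustrative sketch under the assumption that the file has already been parsed into a dictionary. validate_coco is a hypothetical helper and does not reproduce Labelformat's actual validation code or error messages.

```python
def validate_coco(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    for key in ("images", "categories", "annotations"):
        if key not in coco:
            problems.append(f"missing top-level field: {key}")
    if problems:
        return problems

    images = {img["id"]: img for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}

    for index, ann in enumerate(coco["annotations"]):
        # Reference integrity: image_id and category_id must resolve.
        image = images.get(ann.get("image_id"))
        if image is None:
            problems.append(f"annotation {index}: unknown image_id {ann.get('image_id')}")
        if ann.get("category_id") not in category_ids:
            problems.append(f"annotation {index}: unknown category_id {ann.get('category_id')}")

        bbox = ann.get("bbox")
        if not isinstance(bbox, list) or len(bbox) != 4:
            problems.append(f"annotation {index}: bbox must be [x, y, width, height]")
            continue

        # Bounding box validation: the box must lie inside its image.
        x, y, w, h = bbox
        if image is not None and (
            x < 0 or y < 0 or x + w > image["width"] or y + h > image["height"]
        ):
            problems.append(f"annotation {index}: bbox exceeds image boundaries")
    return problems
```

An empty list means none of these checks failed. Malformed JSON is already reported by json.loads, which raises json.JSONDecodeError before a function like this would run.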