Skip to content

RT-DETR Object Detection Format

Overview

RT-DETR (Real-Time DEtection TRansformer) is a groundbreaking end-to-end object detection framework introduced in the paper DETRs Beat YOLOs on Real-time Object Detection. RT-DETR represents the first real-time end-to-end object detector that successfully challenges the dominance of YOLO detectors in real-time applications. Unlike traditional detectors that require Non-Maximum Suppression (NMS) post-processing, RT-DETR eliminates NMS entirely while achieving superior speed and accuracy performance.

Info: RT-DETR was introduced through the academic paper "DETRs Beat YOLOs on Real-time Object Detection" published in 2023. For the full paper, see: arXiv:2304.08069 For implementation details and code, see: GitHub Repository: lyuwenyu/RT-DETR

Availability: RT-DETR is now available in multiple frameworks: - Hugging Face Transformers - Ultralytics

Key RT-DETR Model Features

RT-DETR uses the standard COCO annotation format while introducing revolutionary architectural innovations for real-time detection:

  • End-to-End Architecture: First real-time detector to completely eliminate NMS post-processing, providing more stable and predictable inference times.
  • Efficient Hybrid Encoder: Novel encoder design that decouples intra-scale interaction and cross-scale fusion to significantly reduce computational overhead.
  • Uncertainty-Minimal Query Selection: Advanced query initialization scheme that optimizes both classification and localization confidence for improved detection quality.
  • Flexible Speed Tuning: Supports adjustable inference speed by modifying the number of decoder layers without retraining.
  • Superior Performance: Achieves state-of-the-art results (e.g., RT-DETR-R50 reaches 53.1% mAP @ 108 FPS on T4 GPU, outperforming YOLOv8-L in both speed and accuracy).
  • Multiple Model Scales: Available in various scales (R18, R34, R50, R101) to accommodate different computational requirements.

These architectural innovations are handled internally by the model design and training pipeline, requiring no changes to the standard COCO annotation format described below.

Specification of RT-DETR Detection Format

RT-DETR uses the standard COCO format for annotations, ensuring seamless integration with existing COCO datasets and tools. The format consists of a single JSON file containing three main components:

images

Defines metadata for each image in the dataset:

{
  "id": 0,                    // Unique image ID
  "file_name": "image1.jpg",  // Image filename
  "width": 640,               // Image width in pixels
  "height": 416               // Image height in pixels
}

categories

Defines the object classes:

{
  "id": 0,                    // Unique category ID
  "name": "cat"               // Category name
}

annotations

Defines object instances:

{
  "image_id": 0,              // Reference to image
  "category_id": 2,           // Reference to category
  "bbox": [540.0, 295.0, 23.0, 18.0]  // [x, y, width, height] in absolute pixels
}

Directory Structure of RT-DETR Dataset

dataset/
├── images/                   # Image files
│   ├── image1.jpg
│   └── image2.jpg
└── annotations.json         # Single JSON file containing all annotations

Benefits of RT-DETR Format

  • Standard Compatibility: Uses the widely-adopted COCO format, ensuring compatibility with existing tools and frameworks.
  • Flexibility: Supports adjustable inference speeds without retraining, making it adaptable to various real-time scenarios.
  • Superior Accuracy: Achieves better accuracy than comparable YOLO detectors while maintaining competitive speed.

Converting Annotations to RT-DETR Format with Labelformat

Since RT-DETR uses the standard COCO format, converting annotations to RT-DETR format is equivalent to converting to COCO format.

Installation

First, ensure that Labelformat is installed:

pip install labelformat

Conversion Example: YOLOv8 to RT-DETR

Assume you have annotations in YOLOv8 format and wish to convert them to RT-DETR. Here's how you can achieve this using Labelformat.

Step 1: Prepare Your Dataset

Ensure your dataset follows the standard YOLOv8 structure with data.yaml and label files.

Step 2: Run the Conversion Command

Use the Labelformat CLI to convert YOLOv8 annotations to RT-DETR (COCO format):

labelformat convert \
    --task object-detection \
    --input-format yolov8 \
    --input-file dataset/data.yaml \
    --input-split train \
    --output-format rtdetr \
    --output-file dataset/rtdetr_annotations.json

Step 3: Verify the Converted Annotations

After conversion, your dataset structure will be:

dataset/
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── rtdetr_annotations.json    # COCO format annotations for RT-DETR

Python API Example

from pathlib import Path
from labelformat.formats import YOLOv8ObjectDetectionInput, RTDETRObjectDetectionOutput

# Load YOLOv8 format
label_input = YOLOv8ObjectDetectionInput(
    input_file=Path("dataset/data.yaml"),
    input_split="train"
)

# Convert to RT-DETR format
RTDETRObjectDetectionOutput(
    output_file=Path("dataset/rtdetr_annotations.json")
).save(label_input=label_input)

Error Handling in Labelformat

Since RT-DETR uses the COCO format, the same validation and error handling applies:

  • Invalid JSON Structure: Proper error reporting for malformed JSON files
  • Missing Required Fields: Validation ensures all required COCO fields are present
  • Reference Integrity: Checks that image_id and category_id references are valid
  • Bounding Box Validation: Ensures bounding boxes are within image boundaries

Example of a properly formatted annotation:

{
  "images": [{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{"image_id": 0, "category_id": 1, "bbox": [100, 120, 50, 80]}]
}