Object Detection With Transformers Using Facebook’s DETR

Ubajaka CJ
5 min read · Jul 16, 2020
DETR, Facebook

In May 2020, Facebook released a novel object detection model named DEtection TRansformer (DETR), which views object detection as a direct set prediction problem. The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction problem indirectly, by defining surrogate regression and classification tasks on a large set of proposals. DETR bypasses these surrogate tasks by taking a direct set prediction approach.

The two main ingredients of the DETR model are a set-based global loss and a transformer encoder-decoder architecture. The set-based loss forces unique predictions through bipartite matching: it assigns each prediction to at most one ground-truth object and is invariant to permutations of the predictions, which allows them to be emitted in parallel. The transformer encoder-decoder is well suited to set prediction because its self-attention mechanisms explicitly model all pairwise interactions between elements in a sequence, which helps with constraints such as removing duplicate predictions.

Given a fixed, small set of learned object queries, DETR reasons about the relations of the objects and the global image context, and directly outputs the final set of predictions in parallel. DETR makes all of its predictions at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects.

Set Prediction

One of the difficulties in set prediction is avoiding near-duplicates in problems where boxes are near-identical. Most current detectors address this with post-processing steps such as non-maximum suppression (NMS), whereas direct set prediction is post-processing-free.
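For reference, here is a minimal sketch of the NMS post-processing that DETR removes, using torchvision’s `nms` operator; the boxes, scores, and IoU threshold are arbitrary illustrations, not values from the paper.

```python
# Illustration of NMS: suppress near-duplicate boxes by IoU threshold.
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],     # (x1, y1, x2, y2)
                      [12., 12., 62., 62.],     # near-duplicate of the first box
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)    # indices of boxes kept after suppression
print(keep)                                     # tensor([0, 2]) — the duplicate is dropped
```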

The usual solution is to design a loss based on the Hungarian algorithm, which finds a bipartite matching between ground-truth and predicted objects. This enforces permutation-invariance and guarantees that each target element has a unique match. DETR uses this bipartite matching loss, but instead of autoregressive models (such as recurrent neural networks), it uses transformers with parallel decoding.
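The sketch below shows the idea of Hungarian matching with scipy’s `linear_sum_assignment`. The cost used here (negative class probability plus an L1 box cost) is a simplified stand-in for DETR’s full matching cost; the function name and weights are illustrative, not the paper’s exact formulation.

```python
# Bipartite matching between N predictions and M ground-truth objects.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: [N, num_classes], pred_boxes: [N, 4]
    # gt_labels:   [M],              gt_boxes:   [M, 4]
    prob = pred_logits.softmax(-1)                      # class probabilities
    cost_class = -prob[:, gt_labels]                    # [N, M] classification cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # [N, M] L1 box cost
    cost = (cost_class + cost_bbox).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # unique one-to-one assignment
    return pred_idx, gt_idx
```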

Transformers With Parallel Decoding

Transformers introduced self-attention layers that scan through each element of the sequence and update it by aggregating information from the whole sequence. One of the main advantages of attention-based models is their global computations and perfect memory.

Transformers were first used in autoregressive models, following early sequence-to-sequence models, generating output tokens one by one. However, the prohibitive inference cost led to the development of parallel sequence generation. DETR combines transformers and parallel decoding for a suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.
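A small sketch of what parallel decoding looks like in practice: N query embeddings attend to the encoder memory and are decoded together in a single forward pass, rather than generating outputs token by token. The shapes and hyperparameters here are illustrative only.

```python
import torch
import torch.nn as nn

d_model, num_queries, batch = 256, 100, 1
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.rand(600, batch, d_model)            # flattened image features from the encoder
queries = torch.rand(num_queries, batch, d_model)   # N object queries, no causal mask needed
out = decoder(queries, memory)                      # [100, 1, 256], all queries decoded at once
```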

Object Detection

Most modern object detection methods make predictions relative to some initial guesses, like proposals for two-stage detectors, or anchors for single-stage methods. Recent work demonstrates that the final performance of these systems heavily depends on the exact way these initial guesses are set.

The goal is to remove as much of the handcrafted process as possible and streamline the detection process by directly predicting the set of detections with absolute box predictions with respect to the input image rather than an anchor.

Several prior object detectors use a bipartite matching loss, but they still require hand-designed NMS post-processing to improve their performance.

Learnable NMS methods are post-processing-free thanks to direct set prediction, but they employ additional hand-crafted context features such as proposals.

Recurrent detectors, which perform end-to-end set prediction for object detection and instance segmentation, are closest to the DETR approach and also use a bipartite matching loss. However, they were only evaluated on small datasets and are based on autoregressive models, whereas DETR leverages recent transformers with parallel decoding.

DETR Model

The two essentials for direct set predictions in detection are: (1) a set prediction loss that forces unique matching between predicted and ground-truth boxes; (2) an architecture that predicts in a single pass a set of objects and models their relation.

The DETR architecture contains three main components: a CNN backbone, an encoder-decoder transformer, and a simple feed-forward network (FFN). A minimal code sketch of how these pieces fit together follows the list below.

  1. Backbone — A conventional CNN backbone generates a lower-resolution activation map from the initial image.
  2. Transformer encoder — The encoder expects a sequence as input, so the spatial dimensions of the feature map are collapsed into one. Since the transformer architecture is permutation-invariant, the input of each attention layer is supplemented with fixed positional encodings.
  3. Transformer decoder — It decodes the N objects in parallel at each decoder layer. Since the decoder is also permutation-invariant, the N input embeddings must be different to produce different results. These input embeddings are learnt positional encodings, referred to as object queries, and, similarly to the encoder, are added to the input of each attention layer. The N object queries are then independently decoded into box coordinates and class labels by a feed-forward network, resulting in N final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.
  4. Feed-Forward Networks (FFNs) — The final prediction is computed by a three-layer perceptron with ReLU activation function and hidden dimension d, and a linear projection layer. The FFN predicts the normalized center coordinates, height, and width of the box with respect to the input image, and the linear layer predicts the class labels using a softmax function.
  5. Auxiliary decoding losses — these help the model output the correct number of objects of each class. Prediction FFNs and the Hungarian loss are added after each decoder layer, and all prediction FFNs share their parameters. An additional shared layer-norm normalizes the inputs to the prediction FFNs from the different decoder layers.
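Below is a compact sketch of components 1–4 above — CNN backbone, transformer encoder-decoder with learned object queries and positional encodings, and the class/box prediction heads. It is loosely inspired by the minimal implementation in the DETR paper, but the hyperparameters, the learned 2D positional encoding, and the single-layer heads are simplified for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # 1. Backbone: ResNet-50 without its average pool and classifier
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)       # reduce channels to hidden_dim
        # 2–3. Transformer encoder-decoder
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # Learned object queries and a simple learned 2D positional encoding
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # 4. Prediction heads: class logits (+1 for "no object") and boxes
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))                  # [B, hidden_dim, H, W]
        B, _, H, W = h.shape
        pos = torch.cat([                                # [H*W, 1, hidden_dim]
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        src = pos + h.flatten(2).permute(2, 0, 1)        # flattened features + positions
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)  # object queries per image
        hs = self.transformer(src, tgt)                  # [num_queries, B, hidden_dim]
        return self.class_head(hs).softmax(-1), self.bbox_head(hs).sigmoid()
```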

DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. It achieves comparable results to an optimized Faster R-CNN baseline on the challenging COCO dataset.

DETR demonstrates significantly better performance on large objects.

The remaining challenges of DETR concern training, optimization, and performance on small objects, which, it is hoped, will be addressed by future contributions to the model.

The training code and pretrained models are available at https://github.com/facebookresearch/detr.
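A pretrained model can be loaded directly via torch.hub, following the usage shown in the repository’s README at the time of writing (the `detr_resnet50` entry point); the dummy image and shapes below are only for illustration.

```python
import torch

# Load a pretrained DETR (ResNet-50 backbone) from the official repo
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# Run inference on a dummy image; the output dict contains class logits and
# normalized (cx, cy, w, h) boxes for each of the 100 object queries.
img = torch.rand(1, 3, 800, 1066)
with torch.no_grad():
    outputs = model(img)
print(outputs['pred_logits'].shape, outputs['pred_boxes'].shape)
```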
