# Choosing a candidate model

Our task in this chapter is to choose a candidate architecture that allows us to use a pre-trained vision foundation model as its backbone feature extractor.

## State of the Art

A brief glimpse into the literature gives us some promising picks, but also some fundamental questions. The first thing we find is that there is no clear winner between CNN-based and ViT-based models, especially once latency/efficiency is factored into the equation. Moreover, neither CNN-based nor ViT-based models have a clear best architectural variant (e.g. vanilla ViT vs. EVA-02's TrV, ResNet vs. ResNeXt), and sometimes the backbone's architecture itself is modified to better suit the task at hand (e.g. Swin is a ViT with hierarchical features, useful for dense prediction tasks; the sketch below illustrates the difference). Finally, some backbones are fine-tuned on task-specific datasets, which improves task-specific performance at the expense of generality.
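To make the hierarchical-vs-plain distinction concrete, here is a minimal sketch contrasting the feature maps the two backbone families expose to a detection head. It is illustrative only; the strides and channel widths are typical values, not taken from any specific checkpoint.

```python
import torch

# Illustrative only: typical strides and channel widths for the two families.
# A plain ViT (patch size 16) exposes a single stride-16 feature map, while a
# hierarchical backbone such as Swin exposes a stride-4/8/16/32 pyramid, which
# dense-prediction decoders like Cascade Mask-RCNN are designed to consume.
H = W = 1024  # input resolution

plain_vit = {"stride16": torch.zeros(1, 768, H // 16, W // 16)}

hierarchical = {
    f"stride{s}": torch.zeros(1, c, H // s, W // s)
    for s, c in [(4, 128), (8, 256), (16, 512), (32, 1024)]
}

for name, feats in [("plain ViT", plain_vit), ("hierarchical (Swin-like)", hierarchical)]:
    print(name, {k: tuple(v.shape) for k, v in feats.items()})
```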

> **Tip:** More generally, the pretraining objective also matters: [PKH+23] shows that contrastive learning favors image classification, while masked image modelling favors dense prediction (object detection).

Table 1 categorizes these model choices and summarizes their performance on the COCO dataset. However, as noted above, these comparisons are often not fair, and a comprehensive evaluation would be needed to determine the best backbone across all main tasks, and thus the best candidate as a vision foundation model. This question was tackled by [GSN+23] in 2023, but its results are already outdated, as the most popular ViT-based foundation models (DINOv2 [ODM+23], EVA-02 [FSW+24]) were released afterwards. In any case, we want a model that is meant for general use, which narrows down the search.

Table 1: State of the art of object detection models

| Encoder | Model | Decoder | Size | COCO (minival) | FPS | Device | Code + Weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN | YOLOv10 | YOLOv10 | L | 53.4 | 137 | T4[1] | |
| | | | X | 54.4 | 93 | T4[1] | |
| | YOLOv8 | YOLOv8 | L | 52.9 | 81 | T4[1] | |
| | | | X | 53.9 | 59 | T4[1] | |
| | RT-DETR [ZLX+23] | DETR | R50 | 53.1 | 108 | T4[1] | |
| | | | R101 | 54.3 | 73 | T4[1] | |
| | ConvNeXt [LMW+22] | Cascade Mask-RCNN | ConvNeXt-B | 54.0 | 11.5 | A100 | |
| | | | ConvNeXt-L | 54.8 | 10 | A100 | |
| ViT (w/ task-specific modifications) | MViTv2 [LWF+22] | Cascade Mask-RCNN | L | 55.7 | 5 | A100 | |
| | | | H | 55.8 | 3 | A100 | |
| | Co-DETR [ZSL23] | Co-DETR | ViT-L | 65.9[2] | - | - | |
| | | | Swin-L | 64.1 | - | - | |
| | ConvNeXt [LMW+22] | Cascade Mask-RCNN | Swin-B | 53.0 | 10.7 | A100 | |
| | | | Swin-L | 53.9 | 9.2 | A100 | |
| ViT (w/ modifications) | EVA-02 [FSW+24] | Cascade Mask-RCNN | TrV-L | 59.2 / 62.3[2] / 64.1[2] | ~1.4x faster than ViT | A100 | |
| ViT (plain) | ViTDet [LMGH22] | Cascade Mask-RCNN | ViT-B | 54 | 11 | A100 | |
| | | | ViT-L | 57.6 / 59.6[2] / - | 7 | A100 | |
| | | | ViT-H | 58.7 / 60.4[2] / - | 5 | A100 | |
| | SimPLR [NOS24] | SimPLR | ViT-L | 58.7 | 9 | A100 | |
| | | | ViT-H | 59.8 | 7 | A100 | |

## Final decision

To finally arrive at a decision, it is useful to think back to the original motivation for using VFMs: to leverage the knowledge acquired by a model pre-trained with extensive data and compute. To keep ourselves future-proof, we chose DINOv2 [ODM+23] as the backbone, as it has the most support from the community and from the authors at Meta. By the same reasoning, we chose the ViTDet [LMGH22] adapter, which allows us to use almost any decoder head. To stay at the state of the art, we chose the DINO [ZLL+22] decoder. Fig. 4 shows the resulting architecture.

Fig. 4 Model architecture: DINOv2 backbone, ViTDet adapter, DINO decoder.
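As a rough illustration of how these three pieces fit together, here is a minimal sketch that wires a frozen DINOv2 backbone to a ViTDet-style simple feature pyramid and a placeholder head. It assumes the public `facebookresearch/dinov2` torch.hub entry point and its `get_intermediate_layers(..., reshape=True)` helper; the `SimpleFeaturePyramid` and `Detector` classes are schematic stand-ins for illustration, not the exact modules used here.

```python
import torch
import torch.nn as nn
from typing import Dict, Optional


class SimpleFeaturePyramid(nn.Module):
    """ViTDet-style neck: build a small multi-scale pyramid from one single-scale map."""

    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.scales = nn.ModuleDict({
            "p3": nn.ConvTranspose2d(in_dim, out_dim, kernel_size=2, stride=2),  # 2x upsample
            "p4": nn.Conv2d(in_dim, out_dim, kernel_size=1),                      # same scale
            "p5": nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2),            # 2x downsample
        })

    def forward(self, feat: torch.Tensor) -> Dict[str, torch.Tensor]:
        return {name: layer(feat) for name, layer in self.scales.items()}


class Detector(nn.Module):
    def __init__(self, head: Optional[nn.Module] = None):
        super().__init__()
        # Assumption: the public facebookresearch/dinov2 torch.hub entry point.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
        self.backbone.requires_grad_(False)  # keep the foundation model frozen
        self.neck = SimpleFeaturePyramid(in_dim=1024)  # ViT-L/14 embedding width
        # Placeholder head; the real model plugs a DINO (DETR-style) decoder in here.
        self.head = head if head is not None else nn.Identity()

    def forward(self, images: torch.Tensor):
        # Images must be sized to a multiple of the 14-pixel patch size (e.g. 518x518).
        # get_intermediate_layers(..., reshape=True) yields [B, C, H/14, W/14] maps.
        feat = self.backbone.get_intermediate_layers(images, n=1, reshape=True)[0]
        return self.head(self.neck(feat))
```

In practice the placeholder head is replaced by the DINO decoder, and the backbone can be kept frozen or unfrozen depending on the fine-tuning budget.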