We present ablations of the saliency baselines on PASCAL-Co for four randomly chosen image sets under several scenarios (see Tab. 3): DINO ResNet, supervised ResNet, DINO ViT, and supervised ViT. Our method localizes the common objects in the images best.
[Figure: four image sets, six images per set. Rows in each set, top to bottom: Image, DINO ResNet, Sup. ResNet, Sup. ViT, DINO ViT, Ours.]