We present ablations of the saliency baselines on PASCAL-Co for four randomly chosen image sets across several scenarios (see Tab. 3): DINO ResNet, Supervised ResNet, DINO ViT, and Supervised ViT. In these examples, our method localizes the common objects in the images best.
[Figure: qualitative saliency comparison on four randomly chosen PASCAL-Co sets, six images per set (images omitted). Rows in each set, top to bottom: Image, DINO ResNet, Sup. ResNet, Sup. ViT, DINO ViT, Ours.]