Deep ViT Features as Dense Visual Descriptors

Abstract

We leverage deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties: (i) the features encode powerful high-level information at high spatial resolution, i.e., they capture semantic object parts at fine spatial granularity, and (ii) the encoded semantic information is shared across related, yet different object categories (i.e., super-categories). These properties allow us to design powerful dense ViT descriptors that facilitate a variety of applications, including co-segmentation, part co-segmentation and correspondences -- all achieved by applying lightweight methodologies to deep ViT features (e.g., binning / clustering). We take these applications further to the realm of inter-class tasks -- demonstrating how objects from related categories can be commonly segmented into semantic parts, under significant pose and appearance changes. Our methods, extensively evaluated qualitatively and quantitatively, achieve state-of-the-art part co-segmentation results, and results competitive with recent supervised methods trained specifically for co-segmentation and correspondences.

PCA Visualization

We apply principal component analysis (PCA) to spatial descriptors across layers of a supervised ViT and a DINO-trained ViT. We find that early layers contain positionally biased representations that gradually become more semantic in deeper layers. Both ViTs produce semantic representations with high granularity, from which semantic object parts emerge. However, DINO-ViT representations are less noisy than supervised ViT representations.
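As a rough illustration of this kind of visualization, the sketch below projects a grid of per-patch descriptors onto its top three principal components and rescales them so they can be shown as an RGB image. It is not the authors' code: the function names, the 14 x 14 grid, the 384-dimensional descriptors and the random placeholder data are all assumptions made for the example. In practice the descriptors would come from a chosen ViT layer.

```python
import numpy as np


def pca_project(descriptors, n_components=3):
    """Project (N, D) descriptors onto their top principal components."""
    centered = descriptors - descriptors.mean(axis=0, keepdims=True)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def to_rgb_map(projected, grid_hw):
    """Min-max scale each component to [0, 1] and reshape to (H, W, C)."""
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    scaled = (projected - lo) / np.maximum(hi - lo, 1e-12)
    return scaled.reshape(grid_hw[0], grid_hw[1], -1)


# Placeholder input (hypothetical sizes): one descriptor per patch on a
# 14 x 14 grid, standing in for features taken from a single ViT layer.
rng = np.random.default_rng(0)
grid_hw = (14, 14)
descriptors = rng.normal(size=(grid_hw[0] * grid_hw[1], 384))

projected = pca_project(descriptors)
rgb = to_rgb_map(projected, grid_hw)
```

Repeating this per layer and viewing the resulting maps side by side is one way to compare how positional or semantic each layer's descriptors look.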

Part Co-segmentation examples

We apply clustering to deep ViT spatial features to co-segment common objects among a set of images, and then further co-segment the common regions into parts. The parts remain consistent under variations in appearance, pose, and scale, and across different yet related classes.

The method can be applied to as few as a pair of images and as many as thousands of images.
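The core idea of clustering descriptors jointly across images can be sketched as follows. This is a simplified stand-in, not the full method: it runs plain k-means (with a deterministic farthest-point initialization chosen here for reproducibility) on the stacked descriptors of all images and then splits the labels back per image, so that the same cluster index refers to the same group in every image. The function names and the synthetic two-blob data are assumptions for the example.

```python
import numpy as np


def assign(x, centers):
    """Index of the nearest center (squared Euclidean) for each row of x."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(x, k, n_iters=50):
    """Plain Lloyd's k-means with farthest-point initialization."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Next center: the point farthest from all centers chosen so far.
        d2 = ((x[:, None, :] - np.stack(centers)[None]) ** 2).sum(axis=2)
        centers.append(x[d2.min(axis=1).argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(n_iters):
        labels = assign(x, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(x, centers), centers


def co_cluster(per_image_descriptors, k):
    """Cluster all images' descriptors together; return labels per image."""
    stacked = np.concatenate(per_image_descriptors, axis=0)
    labels, _ = kmeans(stacked, k)
    sizes = [len(d) for d in per_image_descriptors]
    return np.split(labels, np.cumsum(sizes)[:-1])


# Synthetic stand-in for two images: each mixes descriptors from the same
# two well-separated groups ("part A" near 0, "part B" near 10).
rng = np.random.default_rng(0)
part_a = lambda n: rng.normal(loc=0.0, scale=0.1, size=(n, 8))
part_b = lambda n: rng.normal(loc=10.0, scale=0.1, size=(n, 8))
image_1 = np.concatenate([part_a(30), part_b(20)])
image_2 = np.concatenate([part_a(10), part_b(40)])

labels_1, labels_2 = co_cluster([image_1, image_2], k=2)
```

Because the clustering is shared, adding more images only grows the stacked array; nothing in the sketch is specific to a pair.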

Point Correspondences examples

We leverage Deep ViT features to automatically detect semantically corresponding points between images from different classes, under significant variations in appearance, pose and scale.
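One simple way to turn dense descriptors into candidate correspondences is to keep only mutual nearest neighbors under cosine similarity: a pair is kept when each descriptor is the other's best match. The sketch below illustrates that generic technique under stated assumptions and should not be read as the exact procedure used here. The function name and the synthetic data (the second set is a shuffled, slightly perturbed copy of the first) are hypothetical.

```python
import numpy as np


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return (i, j) pairs where a[i] and b[j] are each other's best match."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                  # cosine similarity, shape (Na, Nb)
    nn_ab = sim.argmax(axis=1)     # best match in b for every a
    nn_ba = sim.argmax(axis=0)     # best match in a for every b
    idx_a = np.arange(len(a))
    keep = nn_ba[nn_ab] == idx_a   # the match must hold in both directions
    return [(int(i), int(nn_ab[i])) for i in idx_a[keep]]


# Synthetic stand-in: descriptors of "image B" are a shuffled, slightly
# noisy copy of those of "image A", so the true matches are known.
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(5, 16))
perm = [2, 0, 3, 1, 4]             # desc_b[j] corresponds to desc_a[perm[j]]
desc_b = desc_a[perm] + rng.normal(scale=0.01, size=(5, 16))

matches = mutual_nearest_neighbors(desc_a, desc_b)
```

The mutual check discards one-sided matches, which tends to remove points that have no real counterpart in the other image.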

Video Part Co-segmentation examples

We apply our part co-segmentation method to a collection of video frames instead of a set of images, to obtain temporally consistent part co-segmentation. No temporal information is used.

Original Video	Part Co-segmentation

Paper

Deep ViT Features as Dense Visual Descriptors
Shir Amir, Yossi Gandelsman, Shai Bagon, Tali Dekel.
arXiv, 2021.

[paper]

Supplementary Material

[supplementary page]

Code

[code]

Bibtex

	
@article{amir2021deep,
  author  = {Shir Amir and Yossi Gandelsman and Shai Bagon and Tali Dekel},
  title   = {Deep ViT Features as Dense Visual Descriptors},
  journal = {ECCVW What is Motion For?},
  year    = {2022},
}

Acknowledgments

We thank Miki Rubinstein, Meirav Galun, Niv Haim and Kfir Aberman for their useful comments.