Deep ViT Features as Dense Visual Descriptors
1 Weizmann Institute of Science
2 Berkeley Artificial Intelligence Research
ECCVW 2022 "WIMF" Best Spotlight Presentation
Abstract
We leverage deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties: (i) the features encode powerful high-level information at high spatial resolution, i.e., they capture semantic object parts at fine spatial granularity, and (ii) the encoded semantic information is shared across related, yet different object categories (i.e., super-categories). These properties allow us to design powerful dense ViT descriptors that facilitate a variety of applications, including co-segmentation, part co-segmentation and correspondences, all achieved by applying lightweight methodologies (e.g., binning and clustering) to deep ViT features. We take these applications further into the realm of inter-class tasks, demonstrating how objects from related categories can be commonly segmented into semantic parts under significant pose and appearance changes. Our methods, extensively evaluated qualitatively and quantitatively, achieve state-of-the-art part co-segmentation results, and results competitive with recent supervised methods trained specifically for co-segmentation and correspondences.
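The descriptors come from token features inside the ViT. Below is a minimal sketch of one way to extract such a dense descriptor map, assuming PyTorch and the public facebookresearch/dino hub model; it reads the attention keys of the last block, whereas the paper also considers intermediate layers, so the layer choice here is illustrative.

```python
# Minimal sketch (not the authors' exact pipeline): extract a dense descriptor
# map from a self-supervised DINO-ViT by reading the attention keys of the
# last transformer block. Assumes PyTorch and the facebookresearch/dino hub.
import torch
import torch.nn.functional as F

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

feats = {}

def grab_keys(module, inputs, output):
    # qkv linear output: [B, tokens, 3 * embed_dim]; the middle third is the keys.
    B, N, C3 = output.shape
    feats['keys'] = output.reshape(B, N, 3, C3 // 3)[:, :, 1, :]

handle = model.blocks[-1].attn.qkv.register_forward_hook(grab_keys)

img = torch.randn(1, 3, 224, 224)  # placeholder; use an ImageNet-normalized image
with torch.no_grad():
    model(img)
handle.remove()

keys = feats['keys'][:, 1:, :]            # drop the [CLS] token
grid = 224 // 8                           # ViT-S/8 -> 28x28 patch grid
dense = keys.reshape(1, grid, grid, -1)   # [1, 28, 28, 384] descriptor map
dense = F.normalize(dense, dim=-1)        # unit norm, so dot products = cosine sim
```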
PCA Visualization
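One simple way to produce such a visualization, sketched below under the assumption that a dense descriptor map is already available as an [H, W, D] array: project the patch descriptors onto their first three principal components and display them as RGB. The helper name is illustrative.

```python
# Minimal sketch: visualize a dense descriptor map ([H, W, D] numpy array)
# by projecting each patch descriptor onto its first three principal
# components and mapping them to RGB. Assumes scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(desc_map):
    h, w, d = desc_map.shape
    comps = PCA(n_components=3).fit_transform(desc_map.reshape(-1, d))
    comps -= comps.min(axis=0)              # rescale each component to [0, 1]
    comps /= comps.max(axis=0) + 1e-8
    return comps.reshape(h, w, 3)           # display with e.g. plt.imshow(...)
```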
Part Co-segmentation examples
We apply clustering on Deep ViT spatial features to co-segment common objects among a set of images, and then further co-segment the common regions into parts. The parts remain consistent under variations in appearance, pose, and scale, and across different yet related classes. The method can be applied to as few as a pair of images or as many as thousands of images.
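A minimal sketch of the clustering step follows, assuming the hypothetical `extract_dense_descriptors` helper sketched above and scikit-learn's k-means; the full method additionally separates foreground from background and votes across images, which is omitted here.

```python
# Minimal sketch of the clustering idea: pool dense descriptors from all
# images and k-means them jointly, so each cluster acts as a common "part".
# `extract_dense_descriptors` is the hypothetical extractor sketched earlier
# (returns an [H, W, D] array per image).
import numpy as np
from sklearn.cluster import KMeans

def part_cosegment(images, extract_dense_descriptors, num_parts=4):
    desc_maps = [extract_dense_descriptors(img) for img in images]
    shapes = [d.shape[:2] for d in desc_maps]
    stacked = np.concatenate([d.reshape(-1, d.shape[-1]) for d in desc_maps])

    labels = KMeans(n_clusters=num_parts, n_init=10).fit_predict(stacked)

    # Split the joint labeling back into one part map per image.
    part_maps, offset = [], 0
    for h, w in shapes:
        part_maps.append(labels[offset:offset + h * w].reshape(h, w))
        offset += h * w
    return part_maps
```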
Point Correspondences examples
We leverage Deep ViT features to automatically detect semantically corresponding points between images from different classes, under significant variations in appearance, pose and scale.
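A minimal sketch of the matching idea on two precomputed, unit-normalized descriptor maps: keep only mutual nearest neighbors, a simplified take on the "best buddies" criterion; the full method applies further filtering on top of such candidate matches, and the function name is illustrative.

```python
# Minimal sketch of cross-image matching on two unit-normalized descriptor
# maps ([H, W, D] numpy arrays): keep only mutual nearest-neighbor pairs.
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    ha, wa, d = desc_a.shape
    hb, wb, _ = desc_b.shape
    sim = desc_a.reshape(-1, d) @ desc_b.reshape(-1, d).T  # cosine similarities

    nn_ab = sim.argmax(axis=1)      # best patch in B for each patch of A
    nn_ba = sim.argmax(axis=0)      # best patch in A for each patch of B

    matches = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] == i:           # keep only cycle-consistent pairs
            matches.append(((i // wa, i % wa), (j // wb, j % wb)))
    return matches
```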
Video Part Co-segmentation examples
We apply our part co-segmentation method on a collection of video frames instead of a set of images, obtaining temporally consistent part co-segmentation. No temporal information is used.
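As a usage sketch, the same hypothetical `part_cosegment` and `extract_dense_descriptors` helpers from above can be run directly on sampled frames (OpenCV used here for decoding; the frame stride is illustrative):

```python
# Minimal usage sketch: decode frames with OpenCV and feed them, as an
# unordered image set, to the hypothetical helpers sketched above.
import cv2

def cosegment_video(path, num_parts=4, stride=5):
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()
    return part_cosegment(frames, extract_dense_descriptors, num_parts)
```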
Paper
Deep ViT Features as Dense Visual Descriptors
Supplementary Material
Code
[code]
Bibtex
Acknowledgments
We thank Miki Rubinstein, Meirav Galun, Niv Haim and Kfir Aberman for their useful comments.