WildCLIP: Scene and Animal Attribute Retrieval from Camera Trap Data with Domain-Adapted Vision-Language Models

Cited by: 1
Authors
Gabeff, Valentin [1 ,2 ]
Russwurm, Marc [2 ,3 ]
Tuia, Devis [2 ]
Mathis, Alexander [1 ]
Affiliations
[1] EPFL, Brain Mind & NeuroX Inst, Sch Life Sci, Geneva, Switzerland
[2] EPFL, Environm Computat Sci & Earth Observat Lab ECEO, Sion, Switzerland
[3] WUR, Lab Geoinformat Sci & Remote Sensing, Wageningen, Netherlands
Keywords
Vision-language models; CLIP; Wildlife; Camera traps; Few-shot learning; Vocabulary replay
DOI
10.1007/s11263-024-02026-6
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Wildlife observation with camera traps has great potential for ethology and ecology, as it gathers data non-invasively and in an automated way. However, camera traps produce large amounts of uncurated data, which are time-consuming to annotate. Existing methods for labeling these data automatically commonly rely on a fixed, pre-defined set of distinctive classes and require many labeled examples per class for training. Moreover, the attributes of interest are sometimes rare and difficult to find in large data collections. Large pretrained vision-language models, such as contrastive language-image pretraining (CLIP), offer great promise for facilitating the annotation of camera-trap data: images can be described in greater detail, the set of classes is not fixed and can be extended on demand, and pretrained models can help retrieve rare samples. In this work, we explore the potential of CLIP to retrieve images according to environmental and ecological attributes. We create WildCLIP by fine-tuning CLIP on wildlife camera-trap images, and, to further increase its flexibility, we add an adapter module that extends to novel attributes in a few-shot manner. We quantify WildCLIP's performance and show that it can retrieve novel attributes in the Snapshot Serengeti dataset. Our findings outline new opportunities to facilitate annotation with complex, multi-attribute captions. The code is available at https://github.com/amathislab/wildclip.
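Illustrative sketch
The abstract describes retrieving camera-trap images from free-text attribute queries with a fine-tuned CLIP and a small few-shot adapter. The minimal Python sketch below is not the WildCLIP implementation (see the repository linked above); it assumes the Hugging Face transformers CLIP API with a generic checkpoint, and ResidualAdapter is a hypothetical CLIP-Adapter-style module whose design may differ from the paper's.

    import torch
    import torch.nn as nn
    from transformers import CLIPModel, CLIPProcessor

    # Frozen pretrained CLIP backbone (checkpoint choice is an assumption).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    class ResidualAdapter(nn.Module):
        # Bottleneck MLP blended with the frozen CLIP embedding
        # (CLIP-Adapter-style; WildCLIP's actual module may differ).
        def __init__(self, dim=512, bottleneck=128, alpha=0.2):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(dim, bottleneck), nn.ReLU(),
                nn.Linear(bottleneck, dim), nn.ReLU(),
            )
            self.alpha = alpha  # mixing weight: adapted vs. original features

        def forward(self, x):
            return self.alpha * self.mlp(x) + (1.0 - self.alpha) * x

    adapter = ResidualAdapter()  # would be trained on a few labeled examples

    @torch.no_grad()
    def retrieve(images, query, top_k=5):
        # Rank camera-trap images by cosine similarity to a free-text query.
        img_inputs = processor(images=images, return_tensors="pt")
        txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
        img_emb = adapter(model.get_image_features(**img_inputs))
        txt_emb = model.get_text_features(**txt_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        scores = (img_emb @ txt_emb.T).squeeze(-1)
        return scores.topk(min(top_k, len(images)))

    # Example: a multi-attribute query for a rare event.
    # values, indices = retrieve(my_images, "a lion standing in tall grass at night")

Because the set of queries is just free text, new attributes can be searched without retraining the backbone; only the small adapter would need a few labeled examples.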
Pages: 3770-3786
Page count: 17