Learning Text-to-Video Retrieval from Image Captioning

Authors
Lucas Ventura [1]
Cordelia Schmid [2]
Gül Varol [2]
Affiliations
[1] Univ Gustave Eiffel, LIGM, École des Ponts, CNRS
[2] PSL Research University, Inria, ENS, CNRS
Keywords
Text-to-video retrieval; Image captioning; Multimodal learning
DOI
10.1007/s11263-024-02202-8
Abstract
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample, from multiple video frames, the captions that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
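The two training-time steps named in the abstract (picking the captions that best match their frames, and pooling frames by their relevance to a caption) can be illustrated with a minimal sketch. It assumes frame and caption embeddings have already been computed by some CLIP-style encoder; the function names, the top-k selection rule, the softmax weighting, and the temperature value are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_captions(frame_emb, caption_emb, k):
    """Return indices of the k frames whose own caption matches them best.

    frame_emb:   (T, D) array, one embedding per sampled frame.
    caption_emb: (T, D) array, caption t is assumed to be generated from frame t.
    Assumption: "best match" is measured by cosine similarity.
    """
    scores = np.sum(l2_normalize(frame_emb) * l2_normalize(caption_emb), axis=1)
    return np.argsort(-scores)[:k]


def query_scored_pooling(frame_emb, text_emb, temperature=0.1):
    """Pool frame embeddings into one video vector, weighted by caption relevance.

    Assumption: weights are a softmax over frame-to-caption cosine similarities.
    Returns the pooled (D,) vector and the (T,) weights.
    """
    frames = l2_normalize(frame_emb)
    text = l2_normalize(text_emb)
    logits = (frames @ text) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ frames, weights


# Toy usage with random embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
frame_emb = rng.normal(size=(8, 16))
caption_emb = frame_emb + 0.5 * rng.normal(size=(8, 16))
best = select_captions(frame_emb, caption_emb, k=2)
video_vec, weights = query_scored_pooling(frame_emb, caption_emb[best[0]])
```

In this sketch, a lower temperature concentrates the pooled representation on the few frames most relevant to the caption, while a higher one approaches plain mean pooling.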
Pages: 1834-1854
Page count: 20