Learning Text-to-Video Retrieval from Image Captioning

Cited by: 0

Authors:

Lucas Ventura [1]

Cordelia Schmid [2]

Gül Varol [2]

Affiliations:

[1] Univ Gustave Eiffel, LIGM, École des Ponts, CNRS

[2] PSL Research University, Inria, ENS, CNRS

Source:

International Journal of Computer Vision | 2025, Vol. 133, Issue 4

Keywords:

Text-to-video retrieval; Image captioning; Multimodal learning

DOI:

10.1007/s11263-024-02202-8

Abstract:

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from the multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
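The two training-time steps mentioned in the abstract (keeping the captions that best match their frames, and pooling frame representations weighted by their relevance to a caption) can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax weighting, the `temperature` value, the cosine-similarity ranking, and the function names `select_captions` and `pool_frames` are all assumptions made for illustration, and the embeddings are toy vectors rather than CLIP features.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def softmax(scores, temperature):
    # Numerically stable softmax with a temperature (hypothetical value).
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_captions(frame_embs, caption_embs, top_k):
    # Assumption: each frame i has one generated caption i. Rank frames by
    # frame-caption cosine similarity and keep the indices of the top_k pairs.
    scores = [dot(normalize(f), normalize(c))
              for f, c in zip(frame_embs, caption_embs)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

def pool_frames(frame_embs, text_emb, temperature=0.1):
    # Assumption: relevance-weighted temporal pooling as a softmax-weighted
    # average of frame embeddings, scored against a single caption embedding.
    frames = [normalize(f) for f in frame_embs]
    text = normalize(text_emb)
    weights = softmax([dot(f, text) for f in frames], temperature)
    pooled = [sum(w * f[d] for w, f in zip(weights, frames))
              for d in range(len(text))]
    return pooled, weights

# Toy data: three frames and their (hypothetical) caption embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

kept = select_captions(frames, captions, top_k=2)
pooled, weights = pool_frames(frames, captions[kept[0]])
```

In this toy setup the frame whose embedding is closest to the caption receives the largest pooling weight, so the pooled video representation leans toward the frames the caption actually describes.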

Pages: 1834-1854

Number of pages: 20
