Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Cited by: 0
Authors
Tian, Kaibin [1 ]
Cheng, Yanhua [1 ]
Liu, Yi [1 ]
Hou, Xinglin [1 ]
Chen, Quan [1 ]
Li, Han [1 ]
Affiliation
[1] Kuaishou Technology, Beijing, People's Republic of China
Keywords
CLIP
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, CLIP-based text-to-video retrieval methods have developed rapidly. The primary direction of evolution has been to exploit a much wider range of visual and textual cues to achieve alignment. Concretely, the best-performing methods often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of its prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, so that during training the model captures visual content features ranging from abstract to detailed levels. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This design balances the coarse and fine granularity of the retrieved content, and also balances retrieval effectiveness against efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of the top-k candidates, which are then reranked using fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of the approach. Notably, our method achieves performance comparable to current state-of-the-art methods while being nearly 50 times faster.
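The two-stage retrieval described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`text_gated_pool`, `two_stage_retrieve`), the softmax temperature, and the use of text-to-frame similarity as the gating weight are all assumptions made for illustration, and the Pearson Constraint used during training is not modeled here. The sketch only shows the general pattern of recalling top-k candidates with one coarse vector per video and then reranking those candidates with a text-conditioned, parameter-free aggregation of frame features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_gated_pool(text_vec, frame_feats, temperature=0.01):
    """Hypothetical parameter-free text-gated aggregation (an assumption).

    text_vec:    (D,) unit-norm text embedding
    frame_feats: (N, F, D) unit-norm frame embeddings for N videos
    Returns (N, D): frames weighted by a softmax over text-frame similarity.
    """
    sims = (frame_feats @ text_vec) / temperature      # (N, F)
    sims = sims - sims.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(sims)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    pooled = (weights[..., None] * frame_feats).sum(axis=-2)
    return l2_normalize(pooled)


def two_stage_retrieve(text_vec, coarse_video, frame_feats, k=10):
    """Coarse recall of top-k videos, then fine-grained reranking.

    coarse_video: (N, D) one text-independent vector per video
    frame_feats:  (N, F, D) frame-level features for the same videos
    Returns (video indices, fine scores), best match first.
    """
    # Stage 1: one dot product per video against the whole gallery.
    coarse_scores = coarse_video @ text_vec
    k = min(k, coarse_scores.shape[0])
    candidates = np.argsort(-coarse_scores)[:k]

    # Stage 2: the costlier text-conditioned scoring runs on k videos only.
    fine = text_gated_pool(text_vec, frame_feats[candidates])
    fine_scores = fine @ text_vec
    order = np.argsort(-fine_scores)
    return candidates[order], fine_scores[order]
```

Under these assumptions, the coarse vectors can be precomputed offline because they do not depend on the query, while the text-conditioned step is limited to the k recalled candidates. This is the general source of speedup in a recall-then-rerank design; the specific figures reported in the abstract come from the authors' own experiments, not from this sketch.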
Pages: 5207-5214
Page count: 8