Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Cited by: 0
Authors
Tian, Kaibin [1 ]
Cheng, Yanhua [1 ]
Liu, Yi [1 ]
Hou, Xinglin [1 ]
Chen, Quan [1 ]
Li, Han [1 ]
Affiliation
[1] Kuaishou Technology, Beijing, People's Republic of China
Keywords
CLIP
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The main line of progress has been to exploit an ever wider range of visual and textual cues to achieve cross-modal alignment. Concretely, the best-performing methods often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of its prohibitive computational cost. Consequently, these approaches are suboptimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring that the model captures visual content from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage architecture for the retrieval phase that balances the coarse and fine granularity of the retrieved content and strikes an equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate both the efficiency and the effectiveness of our approach. Notably, our method achieves performance comparable to the current state-of-the-art methods while being nearly 50 times faster.
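The two-stage procedure described in the abstract can be illustrated with a minimal sketch. The code below is a hypothetical, simplified rendering, not the authors' implementation: the function names (`coarse_to_fine_retrieval`, `l2_normalize`), the candidate size `k`, and in particular the softmax-style text-gated frame aggregation standing in for the parameter-free TIB are all assumptions made for illustration.

```python
# Hypothetical sketch of coarse-to-fine text-to-video retrieval.
# Stage 1: fast coarse recall with one video-level embedding per video.
# Stage 2: rerank the top-k candidates with text-gated frame embeddings.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def coarse_to_fine_retrieval(text_emb, video_emb, frame_embs, k=50):
    """
    text_emb:   (D,)      query sentence embedding (e.g. from a CLIP text encoder)
    video_emb:  (N, D)    one coarse-grained embedding per video
    frame_embs: (N, F, D) fine-grained frame embeddings per video
    Returns all N video indices, candidates reranked coarse-to-fine.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)

    # Stage 1: coarse recall -- a single matrix-vector product.
    coarse_order = np.argsort(-(v @ t))
    topk = coarse_order[:k]

    # Stage 2: rerank candidates with text-gated fine-grained features
    # (assumed gating: softmax over text-frame similarities).
    f = l2_normalize(frame_embs[topk])                  # (k, F, D)
    frame_scores = f @ t                                # (k, F)
    gates = np.exp(frame_scores)
    gates /= gates.sum(axis=1, keepdims=True)
    gated_video = l2_normalize((gates[..., None] * f).sum(axis=1))  # (k, D)
    reranked = topk[np.argsort(-(gated_video @ t))]

    # Videos outside the candidate set keep their coarse order.
    seen = set(reranked.tolist())
    rest = np.array([i for i in coarse_order if i not in seen], dtype=int)
    return np.concatenate([reranked, rest]) if rest.size else reranked

# Toy usage with random features.
rng = np.random.default_rng(0)
ranking = coarse_to_fine_retrieval(rng.normal(size=512),
                                   rng.normal(size=(1000, 512)),
                                   rng.normal(size=(1000, 12, 512)),
                                   k=20)
print(ranking[:5])
```

The efficiency argument follows directly from this structure: stage 1 touches only one vector per video, while the more expensive frame-level interaction in stage 2 is restricted to the k recalled candidates.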
Pages: 5207-5214
Page count: 8
Related Papers
50 items in total
  • [21] Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation
    James, Stephen
    Wada, Kentaro
    Laidlow, Tristan
    Davison, Andrew J.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 13729 - 13738
  • [22] Coarse-to-Fine Robust Heterogeneous Network Representation Learning Without Metapath
    Chen, Lei
    Guo, Haomiao
    Lei, Yong
    Li, Yuan
    Liu, Zhaohua
    IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, 2024, 11 (06): : 5773 - 5789
  • [23] Coarse-to-Fine Construction for High-Resolution Representation in Visual Working Memory
    Gao, Zaifeng
    Ding, Xiaowei
    Yang, Tong
    Liang, Junying
    Shui, Rende
    PLOS ONE, 2013, 8 (02)
  • [24] C2F: An effective coarse-to-fine network for video summarization
    Jin, Ye
    Tian, Xiaoyan
    Zhang, Zhao
    Liu, Peng
    Tang, Xianglong
    IMAGE AND VISION COMPUTING, 2024, 144
  • [25] FROM VIDEO TO TEXT: SEMANTIC DRIVING SCENE UNDERSTANDING USING A COARSE-TO-FINE METHOD
    Fu, Huiyuan
    Ma, Huadong
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 1393 - 1396
  • [26] ACE: A Coarse-to-Fine Learning Framework for Reliable Representation Learning Against Label Noise
    Zhang, Chenbin
    Yang, Xiangli
    Liang, Jian
    Bai, Bing
    Bai, Kun
    King, Irwin
    Xu, Zenglin
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [27] CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
    Hou, Zhijian
    Zhong, Wanjun
    Ji, Lei
    Gao, Difei
    Yan, Kun
    Chan, Wing-Kwong
    Ngo, Chong-Wah
    Shou, Mike Zheng
    Duan, Nan
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 8013 - 8028
  • [28] FastClip: An Efficient Video Understanding System with Heterogeneous Computing and Coarse-to-fine Processing
    Zhao, Liming
    Sun, Siyang
    Zhang, Yanhao
    Zheng, Yun
    Pan, Pan
    COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2022, WWW 2022 COMPANION, 2022, : 67 - 71
  • [29] Learning Coarse-to-Fine Sparselets for Efficient Object Detection and Scene Classification
    Cheng, Gong
    Han, Junwei
    Guo, Lei
    Liu, Tianming
    2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 1173 - 1181
  • [30] Beyond the Parts: Learning Coarse-to-Fine Adaptive Alignment Representation for Person Search
    Huang, Wenxin
    Jia, Xuemei
    Zhong, Xian
    Wang, Xiao
    Jiang, Kui
    Wang, Zheng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (03)