Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Citations: 0
Authors
Tian, Kaibin [1]
Cheng, Yanhua [1]
Liu, Yi [1]
Hou, Xinglin [1]
Chen, Quan [1]
Li, Han [1]
Affiliations
[1] Kuaishou Technology, Beijing, People's Republic of China
Keywords
CLIP
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, the methods with the most impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, despite its prohibitive computational complexity. These approaches are therefore not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, so that during the training phase the model captures visual content features spanning from abstract to detailed levels. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This solution balances the coarse and fine granularity of the retrieved content, and it also balances retrieval effectiveness against efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of the top-k candidates, which are then reranked using fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, our method achieves performance comparable to the current state-of-the-art methods while being nearly 50 times faster.
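The recall-then-rerank idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formula for the TIB, so the fine-grained score below is a hypothetical stand-in (a softmax over text-frame similarities used to weight frames), and the names `two_stage_retrieve`, `text_gated_score`, and `temperature` are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def text_gated_score(text, frames, temperature=0.1):
    """Hypothetical fine-grained score (an assumption, not the paper's TIB).

    Frames are weighted by a softmax over their cosine similarity to the
    text, so frames relevant to the query dominate the final score.
    `text` is expected to be unit-normalized already.
    """
    sims = [dot(text, normalize(f)) for f in frames]
    m = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in sims]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, sims)) / total


def two_stage_retrieve(text, coarse, fine, k):
    """Recall top-k videos with coarse vectors, then rerank with frame vectors.

    coarse: one vector per video; fine: a list of frame vectors per video.
    Returns the indices of the k recalled videos, best match first.
    """
    text = normalize(text)
    # Stage 1: cheap recall using a single vector per video.
    coarse_scores = [(dot(text, normalize(c)), i) for i, c in enumerate(coarse)]
    candidates = [i for _, i in sorted(coarse_scores, reverse=True)[:k]]
    # Stage 2: the costlier frame-level score is computed for k videos only.
    return sorted(
        candidates,
        key=lambda i: text_gated_score(text, fine[i]),
        reverse=True,
    )
```

In this sketch the frame-level score is evaluated only for the k recalled candidates rather than for the whole collection, which is the kind of saving a two-stage design aims for; the actual speed-up depends on k and on the cost of the fine-grained scorer.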
Pages: 5207-5214
Page count: 8