Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Cited by: 32
Authors
Moon, WonJun [1]
Hyun, Sangeek [1]
Park, SangUk [2]
Park, Dongchan [2]
Heo, Jae-Pil [1]
Affiliations
[1] Sungkyunkwan Univ, Seoul, South Korea
[2] Pyler, Seoul, South Korea
Keywords
DOI
10.1109/CVPR52729.2023.02205
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, video moment retrieval and highlight detection (MR/HD) have been spotlighted as the demand for video understanding has drastically increased. The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, with respect to the given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in a given query. For example, the relevance between the text query and video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. As we observe the insignificant role of a given query in transformer architectures, our encoding module starts with cross-attention layers to explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which, in turn, encourages the model to estimate precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor which adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building query-dependent representations for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
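To make the two mechanisms described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation (see github.com/wjun0830/QD-DETR for that), of (i) a cross-attention encoder in which clip-wise video features attend to text-query tokens, and (ii) a toy negative-pair objective that pushes the saliency scores of mismatched video-query pairs below those of matched pairs. All class names, tensor shapes, and hyperparameters here (e.g., QueryInjectedEncoder, the margin of 1.0) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of (1) injecting text-query context into clip-wise video
# features via cross-attention, and (2) a margin-style negative-pair loss
# that suppresses saliency scores for irrelevant video-query pairs.
# Names, shapes, and hyperparameters are assumptions, not the paper's values.

import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryInjectedEncoder(nn.Module):
    """Video clips act as attention queries; text tokens act as keys/values."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        self.cross_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])
        self.saliency_head = nn.Linear(dim, 1)  # clip-wise saliency score

    def forward(self, video_feats, text_feats):
        # video_feats: (B, num_clips, dim), text_feats: (B, num_tokens, dim)
        x = video_feats
        for attn, norm in zip(self.cross_layers, self.norms):
            ctx, _ = attn(query=x, key=text_feats, value=text_feats)
            x = norm(x + ctx)  # residual fusion of query context into clip features
        saliency = self.saliency_head(x).squeeze(-1)  # (B, num_clips)
        return x, saliency


def negative_pair_saliency_loss(model, video_feats, text_feats, margin=1.0):
    """Mismatched pairs (text rolled within the batch) should score lower than
    matched pairs on every clip; a simple margin-ranking surrogate objective."""
    _, pos = model(video_feats, text_feats)                          # matched pairs
    _, neg = model(video_feats, torch.roll(text_feats, 1, dims=0))   # shuffled text
    return F.relu(margin - (pos - neg)).mean()


if __name__ == "__main__":
    B, clips, tokens, dim = 2, 75, 12, 256
    enc = QueryInjectedEncoder(dim)
    v, t = torch.randn(B, clips, dim), torch.randn(B, tokens, dim)
    loss = negative_pair_saliency_loss(enc, v, t)
    loss.backward()
    print(f"toy negative-pair loss: {loss.item():.4f}")
```

Rolling the text features within the batch is just one simple way to form irrelevant pairs; the paper's actual negative-pair construction and its input-adaptive saliency predictor are not reproduced in this sketch.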
Pages: 23023-23033
Number of pages: 11
Related Papers
50 items in total
  • [1] Learning Query-dependent Prefilters for Scalable Image Retrieval
    Torresani, Lorenzo
    Szummer, Martin
    Fitzgibbon, Andrew
CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, 2009: 2607 - +
  • [2] Clip-based similarity measure for query-dependent clip retrieval and video summarization
    Peng, Yuxin
    Ngo, Chong-Wah
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2006, 16 (05) : 612 - 627
  • [3] RANKOM: A QUERY-DEPENDENT RANKING SYSTEM FOR INFORMATION RETRIEVAL
    Jiang, Jung-Yi
    Lee, Lian-Wang
    Lee, Shie-Jue
INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2011, 7 (12): 6739 - 6756
  • [4] QDFA: Query-Dependent Feature Aggregation for Medical Image Retrieval
    Huang, Yonggang
    Ma, Dianfu
    Zhang, Jun
    Zhao, Yongwang
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2012, E95D (01) : 275 - 279
  • [5] Learning Query-Dependent Distance Metrics for Interactive Image Retrieval
    Han, Junwei
    McKenna, Stephen J.
    Wang, Ruixuan
    COMPUTER VISION SYSTEMS, PROCEEDINGS, 2009, 5815 : 374 - 383
  • [6] Multi-video summarization with query-dependent weighted archetypal analysis
    Ji, Zhong
    Zhang, Yuanyuan
    Pang, Yanwei
    Li, Xuelong
    Pan, Jing
    NEUROCOMPUTING, 2019, 332 : 406 - 416
  • [8] Task-Dependent and Query-Dependent Subspace Learning for Cross-Modal Retrieval
    Wang, Li
    Zhu, Lei
    Yu, En
    Sun, Jiande
    Zhang, Huaxiang
    IEEE ACCESS, 2018, 6 : 27091 - 27102
  • [9] Query-dependent learning to rank for cross-lingual information retrieval
    Ghanbari, Elham
    Shakery, Azadeh
    KNOWLEDGE AND INFORMATION SYSTEMS, 2019, 59 (03) : 711 - 743
  • [10] Query-aware video encoder for video moment retrieval
    Hao, Jiachang
    Sun, Haifeng
    Ren, Pengfei
    Wang, Jingyu
    Qi, Qi
    Liao, Jianxin
    NEUROCOMPUTING, 2022, 483 : 72 - 86