Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Cited by: 32
Authors
Moon, WonJun [1]
Hyun, Sangeek [1]
Park, SangUk [2]
Park, Dongchan [2]
Heo, Jae-Pil [1]
Affiliations
[1] Sungkyunkwan Univ, Seoul, South Korea
[2] Pyler, Seoul, South Korea
DOI
10.1109/CVPR52729.2023.02205
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has grown sharply. The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, with respect to the given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in a given query. For example, the relevance between the text query and the video content is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Having observed that the given query plays an insignificant role in transformer architectures, we begin our encoding module with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor which adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building the query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
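The negative-pair idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how irrelevant pairs are built or which loss is used, so the cyclic shift within a batch and the softplus-style penalty below are assumptions, and the names `make_negative_pairs` and `negative_saliency_loss` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def make_negative_pairs(videos: Sequence[str], queries: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each video with the query of the next sample in the batch.

    Assumption: queries belonging to other samples are irrelevant to the video.
    This fails if two samples in the batch share the same query.
    """
    if len(videos) != len(queries):
        raise ValueError("videos and queries must have the same length")
    if len(videos) < 2:
        raise ValueError("need at least two samples to build negative pairs")
    shifted = list(queries[1:]) + list(queries[:1])  # cyclic shift by one
    return list(zip(videos, shifted))


def negative_saliency_loss(scores: Sequence[float]) -> float:
    """Mean of -log(1 - sigmoid(s)) over clip-wise saliency logits.

    The value is small when the logits of a negative pair are low, so
    minimizing it pushes saliency scores of irrelevant pairs down.
    Computed as softplus(s) in a numerically stable form.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    total = 0.0
    for s in scores:
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total / len(scores)


# Usage: identifiers stand in for real video and text features.
pairs = make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"])
low = negative_saliency_loss([-5.0, -5.0])   # negative pair scored as non-salient
high = negative_saliency_loss([5.0, 5.0])    # negative pair wrongly scored as salient
```

In this sketch a negative pair that receives high saliency logits incurs a larger penalty than one that receives low logits, which is the behavior the abstract describes for irrelevant video-query pairs.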
Pages: 23023 - 23033
Number of pages: 11
Related papers
50 in total
  • [31] MS-DETR: Exploiting Modality Synergy for Moment Retrieval and Highlight Detection
    Chen, Luyuan
    Huang, Jing
    Kong, Ming
    Liang, Tian
    Zhu, Qiang
    Wu, Jianwu
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X, 2025, 15040 : 416 - 429
  • [32] Temporal refinement and multi-grained matching for moment retrieval and highlight detection
    Zhu, Cunjuan
    Zhang, Yanyi
    Jia, Qi
    Wang, Weimin
    Liu, Yu
    MULTIMEDIA SYSTEMS, 2025, 31 (01)
  • [33] Query-dependent banding (QDB) for faster RNA similarity searches
    Nawrocki, Eric P.
    Eddy, Sean R.
    PLOS COMPUTATIONAL BIOLOGY, 2007, 3 (03) : 540 - 554
  • [34] Query-dependent cross-domain ranking in heterogeneous network
    Bo Wang
    Jie Tang
    Wei Fan
    Songcan Chen
    Chenhao Tan
    Zi Yang
    Knowledge and Information Systems, 2013, 34 : 109 - 145
  • [35] Query-dependent cross-domain ranking in heterogeneous network
    Wang, Bo
    Tang, Jie
    Fan, Wei
    Chen, Songcan
    Tan, Chenhao
    Yang, Zi
    KNOWLEDGE AND INFORMATION SYSTEMS, 2013, 34 (01) : 109 - 145
  • [36] Video fingerprinting: Features for duplicate and similar video detection and query-based video retrieval
    Sarkar, Anindya
    Ghosh, Pratim
    Moxley, Emily
    Manjunath, B. S.
    MULTIMEDIA CONTENT ACCESS: ALGORITHMS AND SYSTEMS II, 2008, 6820
  • [37] Filling the Information Gap between Video and Query for Language-Driven Moment Retrieval
    Liu, Daizong
    Qu, Xiaoye
    Dong, Jianfeng
    Nan, Guoshun
    Zhou, Pan
    Xu, Zichuan
    Chen, Lixing
    Yan, He
    Cheng, Yu
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4190 - 4199
  • [38] Language-enhanced object reasoning networks for video moment retrieval with text query
    Wang, Gongmian
    Jiang, Xun
    Liu, Ning
    Xu, Xing
    COMPUTERS & ELECTRICAL ENGINEERING, 2022, 102
  • [39] Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme
    Qiao, Miao
    Cheng, Hong
    Chang, Lijun
    Yu, Jeffrey Xu
    2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 462 - 473
  • [40] Integrating Video Retrieval and Moment Detection in a Unified Corpus for Video Question Answering
    Luo, Hongyin
    Mohtarami, Mitra
    Glass, James
    Krishnamurthy, Karthik
    Richardson, Brigitte
    INTERSPEECH 2019, 2019, : 599 - 603