Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Cited by: 32
Authors
Moon, WonJun [1]
Hyun, Sangeek [1]
Park, SangUk [2]
Park, Dongchan [2]
Heo, Jae-Pil [1]
Affiliations
[1] Sungkyunkwan Univ, Seoul, South Korea
[2] Pyler, Seoul, South Korea
Keywords
DOI
10.1109/CVPR52729.2023.02205
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, video moment retrieval and highlight detection (MR/HD) have been spotlighted as the demand for video understanding has drastically increased. The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, with respect to the given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in a given query. For example, the relevance between the text query and video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. As we observe the insignificant role of a given query in transformer architectures, our encoding module starts with cross-attention layers to explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which, in turn, encourages the model to estimate precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor which adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building query-dependent representations for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
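To make the two mechanisms described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation (see github.com/wjun0830/QD-DETR for that), of (i) a cross-attention encoder in which clip-wise video features attend to text-query tokens, and (ii) a toy negative-pair objective that pushes the saliency scores of mismatched video-query pairs below those of matched pairs. All class names, tensor shapes, and hyperparameters here (e.g., QueryInjectedEncoder, the margin of 1.0) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of (1) injecting text-query context into clip-wise video
# features via cross-attention, and (2) a margin-style negative-pair loss
# that suppresses saliency scores for irrelevant video-query pairs.
# Names, shapes, and hyperparameters are assumptions, not the paper's values.

import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryInjectedEncoder(nn.Module):
    """Video clips act as attention queries; text tokens act as keys/values."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        self.cross_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])
        self.saliency_head = nn.Linear(dim, 1)  # clip-wise saliency score

    def forward(self, video_feats, text_feats):
        # video_feats: (B, num_clips, dim), text_feats: (B, num_tokens, dim)
        x = video_feats
        for attn, norm in zip(self.cross_layers, self.norms):
            ctx, _ = attn(query=x, key=text_feats, value=text_feats)
            x = norm(x + ctx)  # residual fusion of query context into clip features
        saliency = self.saliency_head(x).squeeze(-1)  # (B, num_clips)
        return x, saliency


def negative_pair_saliency_loss(model, video_feats, text_feats, margin=1.0):
    """Mismatched pairs (text rolled within the batch) should score lower than
    matched pairs on every clip; a simple margin-ranking surrogate objective."""
    _, pos = model(video_feats, text_feats)                          # matched pairs
    _, neg = model(video_feats, torch.roll(text_feats, 1, dims=0))   # shuffled text
    return F.relu(margin - (pos - neg)).mean()


if __name__ == "__main__":
    B, clips, tokens, dim = 2, 75, 12, 256
    enc = QueryInjectedEncoder(dim)
    v, t = torch.randn(B, clips, dim), torch.randn(B, tokens, dim)
    loss = negative_pair_saliency_loss(enc, v, t)
    loss.backward()
    print(f"toy negative-pair loss: {loss.item():.4f}")
```

Rolling the text features within the batch is just one simple way to form irrelevant pairs; the paper's actual negative-pair construction and its input-adaptive saliency predictor are not reproduced in this sketch.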
Pages: 23023-23033
Number of pages: 11
Related Papers
50 items in total
  • [1] Learning Query-dependent Prefilters for Scalable Image Retrieval
    Torresani, Lorenzo
    Szummer, Martin
    Fitzgibbon, Andrew
CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, 2009: 2607 - +
  • [2] Clip-based similarity measure for query-dependent clip retrieval and video summarization
    Peng, Yuxin
    Ngo, Chong-Wah
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2006, 16 (05) : 612 - 627
  • [3] RANKOM: A QUERY-DEPENDENT RANKING SYSTEM FOR INFORMATION RETRIEVAL
    Jiang, Jung-Yi
    Lee, Lian-Wang
    Lee, Shie-Jue
INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2011, 7 (12): 6739 - 6756
  • [4] QDFA: Query-Dependent Feature Aggregation for Medical Image Retrieval
    Huang, Yonggang
    Ma, Dianfu
    Zhang, Jun
    Zhao, Yongwang
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2012, E95D (01) : 275 - 279
  • [5] Learning Query-Dependent Distance Metrics for Interactive Image Retrieval
    Han, Junwei
    McKenna, Stephen J.
    Wang, Ruixuan
    COMPUTER VISION SYSTEMS, PROCEEDINGS, 2009, 5815 : 374 - 383
  • [6] Multi-video summarization with query-dependent weighted archetypal analysis
    Ji, Zhong
    Zhang, Yuanyuan
    Pang, Yanwei
    Li, Xuelong
    Pan, Jing
    NEUROCOMPUTING, 2019, 332 : 406 - 416
  • [8] Task-Dependent and Query-Dependent Subspace Learning for Cross-Modal Retrieval
    Wang, Li
    Zhu, Lei
    Yu, En
    Sun, Jiande
    Zhang, Huaxiang
    IEEE ACCESS, 2018, 6 : 27091 - 27102
  • [9] Query-dependent learning to rank for cross-lingual information retrieval
    Ghanbari, Elham
    Shakery, Azadeh
    KNOWLEDGE AND INFORMATION SYSTEMS, 2019, 59 (03) : 711 - 743
  • [10] Query-aware video encoder for video moment retrieval
    Hao, Jiachang
    Sun, Haifeng
    Ren, Pengfei
    Wang, Jingyu
    Qi, Qi
    Liao, Jianxin
    NEUROCOMPUTING, 2022, 483 : 72 - 86