Survey on Video Moment Retrieval

Cited by: 0
Authors
Wang Y. [1 ]
Zhan Y.-W. [1 ]
Luo X. [1 ]
Liu M. [2 ]
Xu X.-S. [1 ]
Affiliations
[1] School of Software, Shandong University, Jinan
[2] School of Computer Science and Technology, Shandong Jianzhu University, Jinan
Source
Ruan Jian Xue Bao/Journal of Software | 2023 / Vol. 34 / Issue 02
Keywords
artificial intelligence; deep learning; temporal activity localization via language; video moment retrieval; video understanding;
DOI
10.13328/j.cnki.jos.006707
Abstract
Given a natural language sentence as the query, video moment retrieval aims to localize the moment most relevant to that query in a long, untrimmed video. Given the rich visual, textual, and audio information contained in video, the crucial challenges of video moment retrieval are how to fully understand the visual information in the video, how to exploit the textual information in the query to enhance the generalization and robustness of the model, and how to align and interact cross-modal information. This study systematically reviews the work in the field of video moment retrieval and divides it into ranking-based methods and localization-based methods. Among them, ranking-based methods are further divided into methods with preset candidate clips and methods that generate candidate clips with guidance; localization-based methods are divided into one-shot localization methods and iterative localization methods. The datasets and evaluation metrics of this field are also summarized, and the latest advances are reviewed. Finally, a related extension task, moment localization from a video corpus, is introduced, and the survey concludes with a discussion of promising trends. © 2023 Chinese Academy of Sciences. All rights reserved.
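The abstract notes that the field's evaluation metrics are summarized in the survey. The metric conventionally reported in this literature, "R@n, IoU=m" (the fraction of queries for which at least one of the top-n predicted moments overlaps the ground-truth moment with temporal IoU at least m), can be sketched as follows. This is a minimal illustration, not code from the survey; the function and variable names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_n(ranked_moments, gt, n=1, iou_thresh=0.5):
    """Per-query 'R@n, IoU=m' indicator: 1.0 if any of the top-n ranked
    candidate moments reaches the IoU threshold against the ground truth."""
    return float(any(temporal_iou(m, gt) >= iou_thresh
                     for m in ranked_moments[:n]))
```

For example, with a ground-truth moment (5, 15), a prediction (6, 14) has IoU 8/10 = 0.8 and counts as a hit at IoU=0.5; the dataset-level score is the mean of this indicator over all queries.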
Pages: 985-1006
Page count: 21
相关论文
共 92 条
  • [91] Zhang B, Hu H, Lee J, Zhao M, Chammas S, Jain V, Ie E, Sha F., A hierarchical multi-modal encoder for moment localization in video corpus, (2020)
  • [92] Yuan Y, Lan X, Wang X, Chen L, Wang Z, Zhu W., A closer look at temporal sentence grounding in videos: Dataset and metric, Proc. of the 2nd Int’l Workshop on Human-centric Multimedia Analysis, Virtual Event, pp. 13-21, (2021)