Relation-aware Video Reading Comprehension for Temporal Language Grounding

Cited by: 0
Authors
Gao, Jialin [1 ,2 ]
Sun, Xin [1 ,2 ]
Xu, MengMeng [3 ]
Zhou, Xi [1 ,2 ]
Ghanem, Bernard [3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Cooperat Medianet Innovat Ctr, Shanghai, Peoples R China
[2] CloudWalk Technol Co Ltd, Shanghai, Peoples R China
[3] King Abdullah Univ Sci & Technol, Thuwal, Saudi Arabia
Keywords
LOCALIZATION;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary-regression task or as a span-extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match visual and textual information at both the sentence-moment and token-moment levels, yielding a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor leverages graph convolution to capture the dependencies among video moment choices for selecting the best choice. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code is available at https://github.com/Huntersxsx/RaNet.
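The multi-choice relation construction described above can be sketched as one graph-convolution-style message-passing step over candidate moments, where edges connect temporally overlapping candidates. This is a minimal illustration under stated assumptions, not the paper's implementation: the IoU threshold, the overlap-based adjacency, and the mean aggregation are all hypothetical simplifications of RaNet's learned graph convolution.

```python
# Hedged sketch of relation-aware message passing over moment candidates.
# Assumptions (not from the paper): edges link moments with temporal
# IoU above a threshold, and aggregation is a plain neighborhood mean.

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) moments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def relation_aware_step(moments, feats, iou_thresh=0.5):
    """One message-passing step: each moment's feature becomes the mean
    of the features of all moments it overlaps with (including itself)."""
    n = len(moments)
    out = []
    for i in range(n):
        neighbors = [feats[j] for j in range(n)
                     if temporal_iou(moments[i], moments[j]) > iou_thresh]
        dim = len(feats[i])
        out.append([sum(f[d] for f in neighbors) / len(neighbors)
                    for d in range(dim)])
    return out

# Two overlapping moments exchange information; the isolated one does not.
moments = [(0.0, 2.0), (0.5, 2.5), (8.0, 10.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
print(relation_aware_step(moments, feats))
# → [[0.5, 0.5], [0.5, 0.5], [4.0, 4.0]]
```

After this step, overlapping candidates share evidence, which is the intuition behind using choice-choice relations to disambiguate near-duplicate moment proposals before scoring them against the query.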
Pages: 3978-3988
Page count: 11
Related Papers
50 total
  • [1] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [2] Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
    Liu, Yongfei
    Wan, Bo
    Ma, Lin
    He, Xuming
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 5608 - 5617
  • [3] Video Captioning via Relation-Aware Graph Learning
    Zheng, Yi
    Jing, Heming
    Xie, Qiujie
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Gao, Shang
    [J]. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2023
  • [4] Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation
    Mai, Huayu
    Sun, Rui
    Wang, Yuan
    Zhang, Tianzhu
    Wu, Feng
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 4162 - 4170
  • [5] Unsupervised Video Summarization via Relation-Aware Assignment Learning
    Gao, Junyu
    Yang, Xiaoshan
    Zhang, Yingying
    Xu, Changsheng
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 3203 - 3214
  • [6] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    [J]. PATTERN RECOGNITION, 2023, 136
  • [7] Video Moment Retrieval via Comprehensive Relation-Aware Network
    Sun, Xin
    Gao, Jialin
    Zhu, Yizhe
    Wang, Xuan
    Zhou, Xi
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5281 - 5295
  • [8] Relation-aware Hierarchical Attention Framework for Video Question Answering
    Li, Fangtao
    Liu, Zihe
    Bai, Ting
    Yan, Chenghao
    Cao, Chenyu
    Wu, Bin
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 164 - 172
  • [9] Efficient Video Grounding With Which-Where Reading Comprehension
    Gao, Jialin
    Sun, Xin
    Ghanem, Bernard
    Zhou, Xi
    Ge, Shiming
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (10) : 6900 - 6913
  • [10] ReGR: Relation-aware graph reasoning framework for video question answering
    Wang, Zheng
    Li, Fangtao
    Ota, Kaoru
    Dong, Mianxiong
    Wu, Bin
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (04)