Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

Cited by: 22
Authors
Zhang, Zongmeng [1]
Han, Xianjing [1]
Song, Xuemeng [1]
Yan, Yan [2]
Nie, Liqiang [1]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266237, Peoples R China
[2] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Funding
National Natural Science Foundation of China
Keywords
Videos; Location awareness; Task analysis; Semantics; Syntactics; Convolution; Cognition; Temporal language localization; graph convolutional network; video and language; NEURAL-NETWORK;
DOI
10.1109/TIP.2021.3113791
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper focuses on tackling the problem of temporal language localization in videos, which aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video. However, it is non-trivial since it requires not only the comprehensive understanding of the video and sentence query, but also the accurate semantic correspondence capture between them. Existing efforts are mainly centered on exploring the sequential relation among video clips and query words to reason the video and sentence query, neglecting the other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among the query words). Towards this end, in this work, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video and sentence query to facilitate the understanding and semantic correspondence capture of the video and sentence query. In addition, we devise an adaptive context-aware localization method, where the context information is taken into the candidate moments and the multi-scale fully connected layers are designed to rank and adjust the boundary of the generated coarse candidate moments with different lengths. Extensive experiments on Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
Pages: 8265 - 8277
Number of pages: 13
Related papers
50 in total
  • [1] Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos
    Zatsarynna, Olga
    Abu Farha, Yazan
    Gall, Juergen
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021 : 2249 - 2258
  • [2] Multi-modal Graph Convolutional Network for Knowledge Graph Entity Alignment
    You, Yinghui
    Wei, Yuyang
    Zhang, Yanlong
    Chen, Wei
    Zhao, Lei
    WEB AND BIG DATA, PT I, APWEB-WAIM 2023, 2024, 14331 : 142 - 157
  • [3] Sparse graph matching network for temporal language localization in videos
    Wu, Guangli
    Xu, Tongjie
    Zhang, Jing
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 240
  • [4] Multi-Modal Sarcasm Detection via Cross-Modal Graph Convolutional Network
    Liang, Bin
    Lou, Chenwei
    Li, Xiang
    Yang, Min
    Gui, Lin
    He, Yulan
    Pei, Wenjie
    Xu, Ruifeng
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022 : 1767 - 1777
  • [5] Graph Convolutional Incomplete Multi-modal Hashing
    Shen, Xiaobo
    Chen, Yinfan
    Pan, Shirui
    Liu, Weiwei
    Zheng, Yuhui
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023 : 7029 - 7037
  • [6] Graph Convolutional Module for Temporal Action Localization in Videos
    Zeng, Runhao
    Huang, Wenbing
    Tan, Mingkui
    Rong, Yu
    Zhao, Peilin
    Huang, Junzhou
    Gan, Chuang
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 6209 - 6223
  • [7] Multi-Modal Multi-Instance Multi-Label Learning with Graph Convolutional Network
    Hang, Cheng
    Wang, Wei
    Zhan, De-Chuan
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [8] MKGCN: Multi-Modal Knowledge Graph Convolutional Network for Music Recommender Systems
    Cui, Xiaohui
    Qu, Xiaolong
    Li, Dongmei
    Yang, Yu
    Li, Yuxun
    Zhang, Xiaoping
    ELECTRONICS, 2023, 12 (12)
  • [9] Multi-Modal Graph Interaction for Multi-Graph Convolution Network in Urban Spatiotemporal Forecasting
    Zhang, Lingyu
    Geng, Xu
    Qin, Zhiwei
    Wang, Hongjun
    Wang, Xiao
    Zhang, Ying
    Liang, Jian
    Wu, Guobin
    Song, Xuan
    Wang, Yunhai
    SUSTAINABILITY, 2022, 14 (19)
  • [10] Ensemble Manifold Regularized Multi-Modal Graph Convolutional Network for Cognitive Ability Prediction
    Qu, Gang
    Xiao, Li
    Hu, Wenxing
    Wang, Junqi
    Zhang, Kun
    Calhoun, Vince D.
    Wang, Yu-Ping
    IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, 2021, 68 (12) : 3564 - 3573