VIDEO CAPTIONING WITH TEMPORAL AND REGION GRAPH CONVOLUTION NETWORK

Cited by: 3
Authors
Xiao, Xinlong [1 ]
Zhang, Yuejie [1 ]
Feng, Rui [1 ]
Zhang, Tao [2 ]
Gao, Shang [3 ]
Fan, Weiguo [4 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China
[2] Shanghai Univ Finance & Econ, Sch Informat Management & Engn, Shanghai, Peoples R China
[3] Deakin Univ, Sch Informat Technol, Geelong, Vic, Australia
[4] Univ Iowa, Tippie Coll Business, Dept Business Analyt, Iowa City, IA 52242 USA
Funding
National Natural Science Foundation of China;
Keywords
Video Captioning; Graph Convolution Network; Temporal Graph Network; Region Graph Network; Language Generation Model;
DOI
10.1109/icme46284.2020.9102967
CLC Classification
TP31 [Computer Software];
Subject Classification
081202 ; 0835 ;
Abstract
Video captioning aims to generate a natural language description for a given video clip, capturing not only spatial information but also temporal information. To better exploit the spatial-temporal information attached to videos, we propose a novel video captioning framework with a Temporal Graph Network (TGN) and a Region Graph Network (RGN). TGN focuses on exploiting the sequential information of frames that most existing methods ignore, while RGN is designed to explore the relationships among salient objects. Different from previous work, we introduce a Graph Convolution Network (GCN) to encode frames together with their sequential information, and build a region graph to exploit object information. We also adopt a stacked GRU decoder with a coarse-to-fine structure for caption generation. Very promising experimental results on two benchmark datasets (MSVD and MSR-VTT) show the effectiveness of our model.
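The GCN-based frame encoding the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the chain adjacency over frames, the feature dimensions, the symmetric normalization, and the ReLU activation are all assumptions chosen to show one plausible form of a temporal graph convolution.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph convolution layer: H = ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Toy temporal graph over 4 frames: each frame linked to its neighbors in time.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 8)   # 4 frame features, 8-dim each
W = np.random.rand(8, 8)   # learnable weight matrix
H = gcn_layer(A, X, W)
print(H.shape)             # (4, 8)
```

A region graph would use the same layer with nodes standing for detected objects and edges for their pairwise relations instead of temporal adjacency.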
Pages: 6
Related Papers (50 total)
  • [1] Using Spatial Temporal Graph Convolutional Network Dynamic Scene Graph for Video Captioning of Pedestrians Intention
    Cao, Dong
    Zhao, Qunhe
    Fu, Yunbin
    [J]. 2020 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2020, 2020, : 179 - 183
  • [2] Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning
    Li, Shun
    Zhang, Ze-Fan
    Ji, Yi
    Li, Ying
    Liu, Chun-Ping
    [J]. 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [3] Multimodal graph neural network for video procedural captioning
    Ji, Lei
    Tu, Rongcheng
    Lin, Kevin
    Wang, Lijuan
    Duan, Nan
    [J]. NEUROCOMPUTING, 2022, 488 : 88 - 96
  • [4] Exploring the Spatio-Temporal Aware Graph for video captioning
    Xue, Ping
    Zhou, Bing
    [J]. IET COMPUTER VISION, 2022, 16 (05) : 456 - 467
  • [5] Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
    Zhang, Junchao
    Peng, Yuxin
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8319 - 8328
  • [6] CMGNet: Collaborative multi-modal graph network for video captioning
    Rao, Qi
    Yu, Xin
    Li, Guang
    Zhu, Linchao
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 238
  • [7] Incorporating the Graph Representation of Video and Text into Video Captioning
    Lu, Min
    Li, Yuan
    [J]. 2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 396 - 401
  • [8] MIVCN: Multimodal interaction video captioning network based on semantic association graph
    Wang, Ying
    Huang, Guoheng
    Lin Yuming
    Yuan, Haoliang
    Pun, Chi-Man
    Ling, Wing-Kuen
    Cheng, Lianglun
    [J]. APPLIED INTELLIGENCE, 2022, 52 (05) : 5241 - 5260
  • [9] Reconstruction Network for Video Captioning
    Wang, Bairui
    Ma, Lin
    Zhang, Wei
    Liu, Wei
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7622 - 7631