Learning Hierarchical Modular Networks for Video Captioning

Cited by: 2
Authors
Li, Guorong [1]
Ye, Hanhua [1]
Qi, Yuankai [2]
Wang, Shuhui [3]
Qing, Laiyun [1]
Huang, Qingming [1]
Yang, Ming-Hsuan [4,5,6]
Affiliations
[1] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Key Lab Big Data Min & Knowledge Management, Beijing 100049, Peoples R China
[2] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA 5005, Australia
[3] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100045, Peoples R China
[4] Univ Calif Merced, Merced, CA 95343 USA
[5] Yonsei Univ, Seoul 03722, South Korea
[6] Google, Mountain View, CA 94043 USA
Keywords
Video captioning; hierarchical modular network; scene-graph reward; reinforcement learning; language
DOI
10.1109/TPAMI.2023.3327677
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect semantic alignment between visual and linguistic entities, which may negatively affect the generated captions. In this work, we propose a hierarchical modular network to bridge video representations and linguistic semantics at four granularities before generating captions: entity, verb, predicate, and sentence. Each level is implemented by one module that embeds the corresponding semantics into video representations. Additionally, we present a reinforcement learning module based on the scene graph of captions to better measure sentence similarity. Extensive experimental results show that the proposed method performs favorably against state-of-the-art models on three widely used benchmark datasets: the Microsoft Research Video Description Corpus (MSVD), MSR-Video to Text (MSR-VTT), and Video-and-TEXt (VATEX).
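The four-granularity design described in the abstract can be pictured with a short, hypothetical PyTorch sketch (not the authors' released implementation): one projection module per granularity (entity, verb, predicate, sentence), whose outputs are fused and passed to a recurrent caption decoder. All class names, dimensions, and the simple GRU decoder below are illustrative assumptions.

# Minimal sketch of a hierarchical modular network for video captioning.
# Names, dimensions, and fusion/decoding choices are assumptions for illustration only.
import torch
import torch.nn as nn


class GranularityModule(nn.Module):
    """Projects video features into one linguistic granularity (entity/verb/predicate/sentence)."""

    def __init__(self, video_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, video_dim) -> (batch, frames, embed_dim)
        return self.proj(video_feats)


class HierarchicalModularNetwork(nn.Module):
    """Hedged skeleton: one module per granularity; outputs are concatenated for the decoder."""

    def __init__(self, video_dim: int = 2048, embed_dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.levels = nn.ModuleDict({
            level: GranularityModule(video_dim, embed_dim)
            for level in ("entity", "verb", "predicate", "sentence")
        })
        # A plain GRU stands in for the caption decoder.
        self.decoder = nn.GRU(input_size=4 * embed_dim, hidden_size=embed_dim, batch_first=True)
        self.word_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # Embed the video at each granularity, then fuse by concatenation.
        fused = torch.cat([m(video_feats) for m in self.levels.values()], dim=-1)
        hidden, _ = self.decoder(fused)   # (batch, frames, embed_dim)
        return self.word_head(hidden)     # per-step vocabulary logits


if __name__ == "__main__":
    model = HierarchicalModularNetwork()
    logits = model(torch.randn(2, 26, 2048))  # 2 clips, 26 sampled frames
    print(logits.shape)                       # torch.Size([2, 26, 10000])

The scene-graph-based reinforcement learning module mentioned in the abstract is not sketched here; conceptually, it would supply a sentence-level reward computed by comparing scene graphs parsed from the generated and reference captions, rather than relying on word-by-word supervision alone.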
Pages: 1049-1064
Number of pages: 16
Related Papers
50 records in total
  • [1] Ye, Hanhua; Li, Guorong; Qi, Yuankai; Wang, Shuhui; Huang, Qingming; Yang, Ming-Hsuan. Hierarchical Modular Network for Video Captioning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 17918-17927.
  • [2] Wang, Xin; Chen, Wenhu; Wu, Jiawei; Wang, Yuan-Fang; Wang, William Yang. Video Captioning via Hierarchical Reinforcement Learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 4213-4222.
  • [3] Yu, Haonan; Wang, Jiang; Huang, Zhiheng; Yang, Yi; Xu, Wei. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 4584-4593.
  • [4] Xu, Jun; Yao, Ting; Zhang, Yongdong; Mei, Tao. Learning Multimodal Attention LSTM Networks for Video Captioning. Proceedings of the 2017 ACM Multimedia Conference (MM'17), 2017: 537-545.
  • [5] Wang, Junbo; Wang, Wei; Huang, Yan; Wang, Liang; Tan, Tieniu. Hierarchical Memory Modelling for Video Captioning. Proceedings of the 2018 ACM Multimedia Conference (MM'18), 2018: 63-71.
  • [6] Tan, Ganchao; Liu, Daqing; Wang, Meng; Zha, Zheng-Jun. Learning to Discretely Compose Reasoning Module Networks for Video Captioning. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020: 745-752.
  • [7] Thang Nguyen; Sah, Shagan; Ptucha, Raymond. Multistream Hierarchical Boundary Network for Video Captioning. 2017 IEEE Western New York Image and Signal Processing Workshop (WNYISPW), 2017.
  • [8] Dave, Jaivik; Padmavathi, S. Hierarchical Language Modeling for Dense Video Captioning. Inventive Computation and Information Technologies, ICICIT 2021, 2022, 336: 421-431.
  • [9] Yu, Mingjing; Zheng, Huicheng; Liu, Zehua. Dense Video Captioning with Hierarchical Attention-Based Encoder-Decoder Networks. 2021 International Joint Conference on Neural Networks (IJCNN), 2021.
  • [10] Qi, Mengshi; Wang, Yunhong; Li, Annan; Luo, Jiebo. Sports Video Captioning by Attentive Motion Representation based Hierarchical Recurrent Neural Networks. Proceedings of the 1st International Workshop on Multimedia Content Analysis in Sports (MMSports'18), 2018: 77-85.