Global semantic enhancement network for video captioning

Cited by: 1
Authors
Luo, Xuemei [1 ,3 ]
Luo, Xiaotong [1 ]
Wang, Di [1 ]
Liu, Jinhui [1 ]
Wan, Bo [1 ]
Zhao, Lin [2 ,3 ]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Jiangsu Key Lab Image & Video Understanding Social, Nanjing 210094, Peoples R China
[3] Xidian Univ, Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Feature aggregation; Semantic enhancement;
DOI
10.1016/j.patcog.2023.109906
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning aims to briefly describe the content of a video in accurate and fluent natural language, and is a hot research topic in multimedia processing. As a bridge between video and natural language, video captioning is a challenging task that requires a deep understanding of video content and effective use of diverse multimodal video information. Existing video captioning methods usually ignore the relative importance of different frames when aggregating frame-level video features, and neglect the global semantic correlations between videos and texts when learning visual representations, which makes the learned representations less effective. To address these problems, we propose a novel framework, the Global Semantic Enhancement Network (GSEN), to generate high-quality captions for videos. Specifically, a feature aggregation module based on a lightweight attention mechanism is designed to aggregate frame-level video features, highlighting the features of informative frames in the video representation. In addition, a global semantic enhancement module is proposed to strengthen the semantic correlations between video and language representations so as to generate semantically more accurate captions. Extensive qualitative and quantitative experiments on two public benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed GSEN achieves performance superior to state-of-the-art methods.
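The attention-based frame aggregation described in the abstract (weighting frames by learned relevance scores instead of uniform mean pooling) can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the function `attention_pool` and the single-vector scoring scheme are hypothetical stand-ins for the paper's lightweight attention module, whose exact form is not given in this record.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the frame axis
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frame_feats, w, b):
    """Aggregate frame-level features into a single video-level vector.

    frame_feats: (T, D) array, one D-dim feature per frame
    w: (D,) scoring vector, b: scalar bias (learned in practice)
    Returns a (D,) attention-weighted average that emphasizes
    informative frames, unlike uniform mean pooling.
    """
    scores = frame_feats @ w + b   # (T,) one relevance score per frame
    alpha = softmax(scores)        # (T,) attention weights summing to 1
    return alpha @ frame_feats     # (D,) weighted sum of frame features

# toy example: 4 frames with 3-dim features
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3))
w = rng.normal(size=3)
video_vec = attention_pool(feats, w, 0.0)
```

Frames whose features align with the scoring vector receive higher weights, so a few informative frames can dominate the pooled representation.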
Pages: 11
Related Papers
50 records in total
  • [1] Semantic Grouping Network for Video Captioning
    Ryu, Hobin
    Kang, Sunghun
    Kang, Haeyong
    Yoo, Chang D.
    Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2021, 35: 2514-2522
  • [2] Semantic guidance network for video captioning
    Guo, Lan
    Zhao, Hong
    Chen, Zhiwen
    Han, Zeyu
    Scientific Reports, 2023, 13 (01)
  • [3] Global-Local Combined Semantic Generation Network for Video Captioning
    Mao, Lin
    Gao, Hang
    Yang, Dawei
    Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (09): 1374-1382
  • [4] Multimodal Semantic Attention Network for Video Captioning
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE International Conference on Multimedia and Expo (ICME), 2019: 1300-1305
  • [5] Semantic Learning Network for Controllable Video Captioning
    Chen, Kaixuan
    Di, Qianji
    Lu, Yang
    Wang, Hanzi
    2023 IEEE International Conference on Image Processing (ICIP), 2023: 880-884
  • [6] Attentive Visual Semantic Specialized Network for Video Captioning
    Perez-Martin, Jesus
    Bustos, Benjamin
    Perez, Jorge
    2020 25th International Conference on Pattern Recognition (ICPR), 2021: 5767-5774
  • [7] Video Captioning with Semantic Guiding
    Yuan, Jin
    Tian, Chunna
    Zhang, Xiangnan
    Ding, Yuxuan
    Wei, Wei
    2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), 2018
  • [8] Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning
    Gui, Yuling
    Guo, Dan
    Zhao, Ye
    Proceedings of the 2nd Workshop on Multimedia for Accessible Human Computer Interfaces (MAHCI '19), 2019: 25-32
  • [9] Video Captioning with Transferred Semantic Attributes
    Pan, Yingwei
    Yao, Ting
    Li, Houqiang
    Mei, Tao
    30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 984-992