Adaptive semantic guidance network for video captioning

Cited by: 0
Authors
[1] Liu, Yuanyuan
[2] Zhu, Hong
[3] Wu, Zhong
[4] Du, Sen
[5] Wu, Shuning
[6] Shi, Jing
Keywords
Semantics; Visual languages
DOI
10.1016/j.cviu.2024.104255
Abstract
Video captioning aims to describe video content in natural language, and effectively integrating visual and textual information is crucial for generating accurate captions. However, we find that existing methods over-rely on the language priors acquired from text during training, so the model tends to output high-frequency fixed phrases. To address this problem, we extract high-quality semantic information from the multi-modal input and then build a semantic guidance mechanism that adapts the contributions of visual semantics and textual semantics to caption generation. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information in the visual and textual inputs. The ACD dynamically adjusts the contribution weights of visual and textual semantics for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome its over-reliance on language priors, resulting in more accurate video captions. Finally, we conduct extensive experiments on commonly used video captioning datasets. ASGNet reaches state-of-the-art performance on MSVD and MSR-VTT and also performs well on YouCookII. These experiments verify the advantages of our method. © 2024 Elsevier Inc.
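The abstract describes the ACD only at a high level, as a module that dynamically weights visual and textual semantics during word generation. As a rough illustration of that general idea, and not the authors' decoder, the sketch below blends two semantic vectors with a single sigmoid gate. The function name `adaptive_fuse`, the scalar gate, and the parameters `gate_weights` and `gate_bias` are all hypothetical choices made for this example.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(visual, textual, gate_weights, gate_bias=0.0):
    """Blend a visual and a textual semantic vector with one scalar gate.

    Hypothetical illustration only: the gate g is computed from the
    concatenation [visual; textual], and the fused vector is
    g * visual + (1 - g) * textual. A g near 1 favours visual semantics,
    a g near 0 favours textual semantics.
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual vectors must have equal length")
    features = list(visual) + list(textual)
    if len(gate_weights) != len(features):
        raise ValueError("gate_weights must match the concatenated length")
    # Linear score over the concatenated features, squashed into (0, 1).
    score = sum(w * f for w, f in zip(gate_weights, features)) + gate_bias
    g = sigmoid(score)
    fused = [g * v + (1.0 - g) * t for v, t in zip(visual, textual)]
    return g, fused


if __name__ == "__main__":
    # With zero weights and zero bias the gate is neutral (g = 0.5).
    g, fused = adaptive_fuse([1.0, 0.0], [0.0, 1.0], [0.0] * 4)
    print(g, fused)
```

In a trained model the gate parameters would be learned and recomputed at every decoding step, so the balance between the two sources could shift from word to word; here they are fixed inputs to keep the example small.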
Related papers
50 in total
  • [1] Semantic guidance network for video captioning
    Lan Guo
    Hong Zhao
    ZhiWen Chen
    ZeYu Han
    [J]. Scientific Reports, 13
  • [2] Semantic guidance network for video captioning
    Guo, Lan
    Zhao, Hong
    Chen, Zhiwen
    Han, Zeyu
    [J]. SCIENTIFIC REPORTS, 2023, 13 (01)
  • [3] Guidance Module Network for Video Captioning
    Zhang, Xiao
    Liu, Chunsheng
    Chang, Faliang
    [J]. 2021 PROCEEDINGS OF THE 40TH CHINESE CONTROL CONFERENCE (CCC), 2021, : 7955 - 7959
  • [4] Semantic Grouping Network for Video Captioning
    Ryu, Hobin
    Kang, Sunghun
    Kang, Haeyong
    Yoo, Chang D.
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2514 - 2522
  • [5] Global semantic enhancement network for video captioning
    Luo, Xuemei
    Luo, Xiaotong
    Wang, Di
    Liu, Jinhui
    Wan, Bo
    Zhao, Lin
    [J]. PATTERN RECOGNITION, 2024, 145
  • [6] Chained semantic generation network for video captioning
    Mao L.
    Gao H.
    Yang D.
    Zhang R.
    [J]. Guangxue Jingmi Gongcheng/Optics and Precision Engineering, 2022, 30 (24) : 3198 - 3209
  • [7] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [8] SEMANTIC LEARNING NETWORK FOR CONTROLLABLE VIDEO CAPTIONING
    Chen, Kaixuan
    Di, Qianji
    Lu, Yang
    Wang, Hanzi
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 880 - 884
  • [9] Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network
    Liu, Chunsheng
    Zhang, Xiao
    Chang, Faliang
    Li, Shuang
    Hao, Penghui
    Lu, Yansha
    Wang, Yinhai
    [J]. IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024, 25 (05) : 3615 - 3627
  • [10] Attentive Visual Semantic Specialized Network for Video Captioning
    Perez-Martin, Jesus
    Bustos, Benjamin
    Perez, Jorge
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 5767 - 5774