Global semantic enhancement network for video captioning

被引:1
|
作者
Luo, Xuemei [1 ,3 ]
Luo, Xiaotong [1 ]
Wang, Di [1 ]
Liu, Jinhui [1 ]
Wan, Bo [1 ]
Zhao, Lin [2 ,3 ]
机构
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Jiangsu Key Lab Image & Video Understanding Social, Nanjing 210094, Peoples R China
[3] Xidian Univ, Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
基金
中国国家自然科学基金;
关键词
Video captioning; Feature aggregation; Semantic enhancement;
D O I
10.1016/j.patcog.2023.109906
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Video captioning aims to briefly describe the content of a video in accurate and fluent natural language, which is a hot research topic in multimedia processing. As a bridge between video and natural language, video captioning is a challenging task that requires a deep understanding of video content and effective utilization of diverse video multimodal information. Existing video captioning methods usually ignore the relative importance between different frames when aggregating frame-level video features and neglect the global semantic correlations between videos and texts in learning visual representations, resulting in the learned representations less effective. To address these problems, we propose a novel framework, namely Global Semantic Enhancement Network (GSEN) to generate high-quality captions for videos. Specifically, a feature aggregation module based on a lightweight attention mechanism is designed to aggregate frame -level video features, which highlights features of informative frames in video representations. In addition, a global semantic enhancement module is proposed to enhance semantic correlations for video and language representations in order to generate semantically more accurate captions. Extensive qualitative and quantitative experiments on two public benchmark datasets MSVD and MSR-VTT demonstrate that the proposed GSEN can achieve superior performance than state-of-the-art methods.
引用
收藏
页数:11
相关论文
共 50 条
  • [21] Rethinking Network for Classroom Video Captioning
    Zhu, Mingjian
    Duan, Chenrui
    Yu, Changbin
    [J]. TWELFTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING SYSTEMS, 2021, 11719
  • [22] Guidance Module Network for Video Captioning
    Zhang, Xiao
    Liu, Chunsheng
    Chang, Faliang
    [J]. 2021 PROCEEDINGS OF THE 40TH CHINESE CONTROL CONFERENCE (CCC), 2021, : 7955 - 7959
  • [23] Video Captioning with Semantic Information from the Knowledge Base
    Wang, Dan
    Song, Dandan
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG KNOWLEDGE (IEEE ICBK 2017), 2017, : 224 - 229
  • [24] Structured Encoding Based on Semantic Disambiguation for Video Captioning
    Sun, Bo
    Tian, Jinyu
    Wu, Yong
    Yu, Lunjun
    Tang, Yuanyan
    [J]. COGNITIVE COMPUTATION, 2024, 16 (03) : 1032 - 1048
  • [25] GSENet: Global Semantic Enhancement Network for Lane Detection
    Su, Junhao
    Chen, Zhenghan
    He, Chenghao
    Guan, Dongzhi
    Cai, Changpeng
    Zhou, Tongxi
    Wei, Jiasheng
    Tian, Wenhua
    Xie, Zhihuai
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 13, 2024, : 15108 - 15116
  • [26] Video captioning with stacked attention and semantic hard pull
    Rahman, Md Mushfiqur
    Abedin, Thasin
    Prottoy, Khondokar S. S.
    Moshruba, Ayana
    Siddiqui, Fazlul Hasan
    [J]. PEERJ COMPUTER SCIENCE, 2021, 7 : 1 - 18
  • [27] Semantic Tag Augmented XlanV Model for Video Captioning
    Huang, Yiqing
    Xue, Hongwei
    Chen, Jiansheng
    Ma, Huimin
    Ma, Hongbing
    [J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4818 - 4822
  • [28] Richer Semantic Visual and Language Representation for Video Captioning
    Tang, Pengjie
    Wang, Hanli
    Wang, Hanzhang
    Xu, Kaisheng
    [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1871 - 1876
  • [29] Dense semantic embedding network for image captioning
    Xiao, Xinyu
    Wang, Lingfeng
    Ding, Kun
    Xiang, Shiming
    Pan, Chunhong
    [J]. PATTERN RECOGNITION, 2019, 90 : 285 - 296
  • [30] A Context Semantic Auxiliary Network for Image Captioning
    Li, Jianying
    Shao, Xiangjun
    [J]. INFORMATION, 2023, 14 (07)