Global semantic enhancement network for video captioning

Cited by: 1
Authors
Luo, Xuemei [1 ,3 ]
Luo, Xiaotong [1 ]
Wang, Di [1 ]
Liu, Jinhui [1 ]
Wan, Bo [1 ]
Zhao, Lin [2 ,3 ]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Jiangsu Key Lab Image & Video Understanding Social, Nanjing 210094, Peoples R China
[3] Xidian Univ, Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Feature aggregation; Semantic enhancement;
DOI
10.1016/j.patcog.2023.109906
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning aims to briefly describe the content of a video in accurate and fluent natural language, and is a hot research topic in multimedia processing. As a bridge between video and natural language, video captioning is a challenging task that requires a deep understanding of video content and effective use of diverse multimodal video information. Existing video captioning methods usually ignore the relative importance of different frames when aggregating frame-level video features, and neglect the global semantic correlations between videos and texts when learning visual representations, which makes the learned representations less effective. To address these problems, we propose a novel framework, the Global Semantic Enhancement Network (GSEN), to generate high-quality captions for videos. Specifically, a feature aggregation module based on a lightweight attention mechanism is designed to aggregate frame-level video features, highlighting the features of informative frames in the video representation. In addition, a global semantic enhancement module is proposed to strengthen the semantic correlations between video and language representations so as to generate semantically more accurate captions. Extensive qualitative and quantitative experiments on two public benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed GSEN achieves performance superior to state-of-the-art methods.
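The attention-based frame aggregation described in the abstract (weighting frames by learned relevance scores instead of uniform mean pooling) can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the function `attention_pool` and the single-vector scoring scheme are hypothetical stand-ins for the paper's lightweight attention module, whose exact form is not given in this record.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the frame axis
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frame_feats, w, b):
    """Aggregate frame-level features into a single video-level vector.

    frame_feats: (T, D) array, one D-dim feature per frame
    w: (D,) scoring vector, b: scalar bias (learned in practice)
    Returns a (D,) attention-weighted average that emphasizes
    informative frames, unlike uniform mean pooling.
    """
    scores = frame_feats @ w + b   # (T,) one relevance score per frame
    alpha = softmax(scores)        # (T,) attention weights summing to 1
    return alpha @ frame_feats     # (D,) weighted sum of frame features

# toy example: 4 frames with 3-dim features
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3))
w = rng.normal(size=3)
video_vec = attention_pool(feats, w, 0.0)
```

Frames whose features align with the scoring vector receive higher weights, so a few informative frames can dominate the pooled representation.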
Pages: 11
Related Papers
50 records in total
  • [1] Semantic Grouping Network for Video Captioning
    Ryu, Hobin
    Kang, Sunghun
    Kang, Haeyong
    Yoo, Chang D.
    Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2021, 35: 2514-2522
  • [2] Semantic guidance network for video captioning
    Guo, Lan
    Zhao, Hong
    Chen, Zhiwen
    Han, Zeyu
    Scientific Reports, 2023, 13 (01)
  • [3] Global-Local Combined Semantic Generation Network for Video Captioning
    Mao, Lin
    Gao, Hang
    Yang, Dawei
    Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (09): 1374-1382
  • [4] Multimodal Semantic Attention Network for Video Captioning
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE International Conference on Multimedia and Expo (ICME), 2019: 1300-1305
  • [5] Semantic Learning Network for Controllable Video Captioning
    Chen, Kaixuan
    Di, Qianji
    Lu, Yang
    Wang, Hanzi
    2023 IEEE International Conference on Image Processing (ICIP), 2023: 880-884
  • [6] Attentive Visual Semantic Specialized Network for Video Captioning
    Perez-Martin, Jesus
    Bustos, Benjamin
    Perez, Jorge
    2020 25th International Conference on Pattern Recognition (ICPR), 2021: 5767-5774
  • [7] Video Captioning with Semantic Guiding
    Yuan, Jin
    Tian, Chunna
    Zhang, Xiangnan
    Ding, Yuxuan
    Wei, Wei
    2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), 2018
  • [8] Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning
    Gui, Yuling
    Guo, Dan
    Zhao, Ye
    Proceedings of the 2nd Workshop on Multimedia for Accessible Human Computer Interfaces (MAHCI '19), 2019: 25-32
  • [9] Video Captioning with Transferred Semantic Attributes
    Pan, Yingwei
    Yao, Ting
    Li, Houqiang
    Mei, Tao
    30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 984-992