Semantic Enhanced Video Captioning with Multi-feature Fusion

Cited by: 0
Authors
Niu, Tian-Zi [1]
Dong, Shan-Shan [1]
Chen, Zhen-Duo [1]
Luo, Xin [1]
Guo, Shanqing [2]
Huang, Zi [3]
Xu, Xin-Shun [1]
Affiliations
[1] Shandong Univ, Sch Software, Jinan 250101, Peoples R China
[2] Shandong Univ, Sch Cyber Sci & Technol, Qingdao 266237, Peoples R China
[3] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Australia
Funding
National Natural Science Foundation of China
Keywords
Video captioning; semantic encoder; discrete selection; multi-feature fusion; NETWORK;
DOI
10.1145/3588572
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Video captioning aims to automatically describe a video clip with informative sentences. Deep learning-based models have become the mainstream approach to this task and achieve competitive results on public datasets. These methods usually leverage different types of features to generate sentences, e.g., semantic information and 2D or 3D visual features. However, some methods treat semantic information only as a complement to visual representations and cannot fully exploit it, and some ignore the relationships between different types of features. In addition, most of them select frames from a video with an equally spaced sampling scheme, which yields much redundant information. To address these issues, we present a novel video-captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion (SEMF). It optimizes the use of different types of features in three respects. First, a semantic encoder enhances meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module attends to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features, providing more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF achieves state-of-the-art results.
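The contrast the abstract draws between equally spaced sampling and discrete selection can be illustrated with a minimal sketch. This is not the SEMF module itself: the abstract does not specify how frames are scored or selected, so the per-frame importance scores here are assumed to be given, and the function names `equally_spaced_indices` and `top_k_indices` are hypothetical. In the paper the selection is learned and changes at each decoding step; the sketch only shows the general idea of a hard, score-driven choice versus a fixed stride.

```python
def equally_spaced_indices(num_frames: int, k: int) -> list[int]:
    """Baseline: pick k frame indices at a fixed stride, ignoring content."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the midpoint of each of the k equal-length segments.
    return [int(i * step + step / 2) for i in range(k)]


def top_k_indices(scores: list[float], k: int) -> list[int]:
    """Hard (discrete) selection: keep the k highest-scoring frames.

    `scores` is an assumed per-frame importance score; a real model
    would produce it, e.g. from an attention layer.
    """
    k = min(max(k, 0), len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the chosen frames in temporal order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical scores for an 8-frame clip.
    scores = [0.1, 0.9, 0.85, 0.2, 0.05, 0.7, 0.1, 0.3]
    print(equally_spaced_indices(len(scores), 4))
    print(top_k_indices(scores, 3))
```

With a fixed stride, the chosen frames depend only on the clip length, so near-duplicate frames can be kept while informative ones are skipped. A score-driven selection can instead concentrate on the frames judged most relevant.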
Pages: 21