Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

Cited by: 0
Authors
Chen, Jingwen [1]
Pan, Yingwei [2]
Li, Yehao [1]
Yao, Ting [2]
Chao, Hongyang [1, 3]
Mei, Tao [2]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou, Guangdong, Peoples R China
[2] JD AI Res, Beijing, Peoples R China
[3] Sun Yat Sen Univ, Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Guangdong, Peoples R China
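Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract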
It is widely believed that video captioning is a fundamental but challenging task in both computer vision and artificial intelligence. The prevalent approach is to map an input video to a variable-length output sentence in a sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless, RNN training still suffers to some degree from the vanishing/exploding gradient problem, which makes optimization difficult. Moreover, the inherently recurrent dependency in an RNN prevents parallelization within a sequence during training and therefore limits computational efficiency. In this paper, we present a novel design, Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed TDConvED), which fully employs convolutions in both the encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states of a fixed number of inputs, and we stack several blocks to capture long-term relationships. The structure in the encoder is further equipped with temporal deformable convolution to enable free-form deformation of temporal sampling. Our model also capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are conducted on both the MSVD and MSR-VTT video captioning datasets, and superior results are reported when compared to conventional RNN-based encoder-decoder techniques. Notably, TDConvED increases CIDEr-D performance from 58.8% to 67.2% on MSVD.
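As a rough illustration of what "free-form deformation of temporal sampling" could look like, the sketch below applies a 1-D convolution over frame features where each kernel tap reads from a fractionally shifted time position, using linear interpolation between the two nearest frames. This is a minimal sketch and not the authors' implementation: the function name `temporal_deformable_conv`, the tensor shapes, the clamping of out-of-range positions to the sequence boundary, and passing the offsets in as a given array (rather than predicting them from the input) are all assumptions made for illustration.

```python
import numpy as np

def temporal_deformable_conv(x, weight, offsets):
    """Hypothetical sketch of a temporal deformable convolution.

    x:       (T, D_in) frame features
    weight:  (K, D_in, D_out) kernel, one matrix per tap
    offsets: (T, K) fractional time shifts (assumed given, not learned here)
    """
    T, _ = x.shape
    K, _, d_out = weight.shape
    out = np.zeros((T, d_out))
    for t in range(T):
        for k in range(K):
            # Regular tap position plus its offset, clamped to the sequence
            # (boundary clamping is an assumption of this sketch).
            pos = np.clip(t + (k - K // 2) + offsets[t, k], 0, T - 1)
            lo = int(np.floor(pos))
            hi = min(lo + 1, T - 1)
            frac = pos - lo
            # Linear interpolation between the two nearest frames.
            sample = (1 - frac) * x[lo] + frac * x[hi]
            out[t] += sample @ weight[k]
    return out

# Usage: with a single identity tap and zero offsets the input is unchanged;
# an offset of 0.5 blends each frame with its successor.
x = np.arange(8, dtype=float).reshape(4, 2)
w = np.eye(2)[None]                      # shape (1, 2, 2)
same = temporal_deformable_conv(x, w, np.zeros((4, 1)))
shifted = temporal_deformable_conv(x, w, np.full((4, 1), 0.5))
```

With all offsets at zero this reduces to an ordinary temporal convolution, which is the sense in which the deformable variant generalizes the fixed sampling grid.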
Pages: 8167-8174
Page count: 8
Related papers
50 in total
  • [1] Retrieval Augmented Convolutional Encoder-decoder Networks for Video Captioning
    Chen, Jingwen
    Pan, Yingwei
    Li, Yehao
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (01)
  • [2] Dense Video Captioning with Hierarchical Attention-Based Encoder-Decoder Networks
    Yu, Mingjing
    Zheng, Huicheng
    Liu, Zehua
    [J]. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [3] Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning
    Gui, Yuling
    Guo, Dan
    Zhao, Ye
    [J]. PROCEEDINGS OF THE 2ND WORKSHOP ON MULTIMEDIA FOR ACCESSIBLE HUMAN COMPUTER INTERFACES (MAHCI '19), 2019, : 25 - 32
  • [4] Empirical autopsy of deep video captioning encoder-decoder architecture
    Aafaq, Nayyer
    Akhtar, Naveed
    Liu, Wei
    Mian, Ajmal
    [J]. ARRAY, 2021, 9
  • [5] Encoder-Decoder Model for Automatic Video Captioning Using Yolo Algorithm
    Alkalouti, Hanan Nasser
    Al Masre, Mayada Ahmed
    [J]. 2021 IEEE INTERNATIONAL IOT, ELECTRONICS AND MECHATRONICS CONFERENCE (IEMTRONICS), 2021, : 718 - 721
  • [6] Parallel encoder-decoder framework for image captioning
    Saeidimesineh, Reyhane
    Adibi, Peyman
    Karshenas, Hossein
    Darvishy, Alireza
    [J]. KNOWLEDGE-BASED SYSTEMS, 2023, 282
  • [7] Encoder-decoder with densely convolutional networks for monocular depth estimation
    Chen, Songnan
    Tang, Mengxia
    Kan, Jiangming
    [J]. JOURNAL OF THE OPTICAL SOCIETY OF AMERICA A-OPTICS IMAGE SCIENCE AND VISION, 2019, 36 (10) : 1709 - 1718
  • [8] Fetal electrocardiography extraction with residual convolutional encoder-decoder networks
    Zhong, Wei
    Liao, Lijuan
    Guo, Xuemei
    Wang, Guoli
    [J]. AUSTRALASIAN PHYSICAL & ENGINEERING SCIENCES IN MEDICINE, 2019, 42 (04) : 1081 - 1089
  • [9] Semantic Translation with Convolutional Encoder-decoder Networks for Viewpoint Estimation
    Zhang, Liangjun
    Gu, Changjian
    Gu, Chaochen
    Wu, Kaijie
    Guan, Xinping
    [J]. 2017 11TH ASIAN CONTROL CONFERENCE (ASCC), 2017, : 1660 - 1665
  • [10] Deep Hierarchical Encoder-Decoder Network for Image Captioning
    Xiao, Xinyu
    Wang, Lingfeng
    Ding, Kun
    Xiang, Shiming
    Pan, Chunhong
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (11) : 2942 - 2956