Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning

Cited by: 15
Authors
Shi, Botian [1 ]
Ji, Lei [2 ,3 ]
Niu, Zhendong [1 ]
Duan, Nan [4 ]
Zhou, Ming [4 ]
Chen, Xilin [3 ,5 ]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Microsoft Res Asia, Beijing, Peoples R China
[3] Inst Comp Technol CAS, Beijing, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
[5] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
video captioning; video summarization; semantic concept;
DOI
10.1145/3394171.3413498
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks that learn from low-level vision features and generate descriptive captions, but these struggle to recognize fine-grained objects and lack an understanding of crucial semantic concepts. According to DPC [19], such concepts are generally present in the narrative transcripts of instructional videos, and incorporating the transcript alongside the video can improve captioning performance. However, DPC directly concatenates the transcript embedding with video features, which fails to fuse language and vision features effectively and leads to temporal misalignment between transcript and video. This motivates us to 1) learn semantic concepts explicitly and 2) design a temporal alignment mechanism that better aligns the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone built on transformer models. First, we design a semantic concept prediction module, trained as an auxiliary task, to supervise the encoder. Then, we develop an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism that enables the decoder (generation) module to copy important concepts directly from the source transcript. Extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on the YouCookII dataset.
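The attention-based cross-modality temporal alignment mentioned in the abstract can be illustrated, in rough form, as video frames attending over transcript sentences via scaled dot-product attention. This is a minimal sketch, not the authors' implementation; all function names, shapes, and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_align(video_feats, text_feats):
    """Align video frames to transcript sentences with scaled
    dot-product cross-attention (video as queries, text as keys/values).

    video_feats: (T_v, d) frame features
    text_feats:  (T_t, d) transcript sentence embeddings
    Returns (T_v, d) text-aware video features and the (T_v, T_t)
    soft alignment weights.
    """
    d = video_feats.shape[-1]
    scores = video_feats @ text_feats.T / np.sqrt(d)   # (T_v, T_t) similarities
    weights = softmax(scores, axis=-1)                 # each frame's distribution over sentences
    aligned = weights @ text_feats                     # (T_v, d) attended text features
    return aligned, weights

# Toy usage: 4 frames, 3 transcript sentences, 8-dim features.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
t = rng.normal(size=(3, 8))
fused, w = cross_modal_align(v, t)
```

Each row of `w` is a soft alignment of one frame over all transcript sentences, so the fused features can then be concatenated with or added to the original frame features before decoding.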
Pages: 4337-4345
Page count: 9
Related Papers
50 records
  • [1] SEMANTIC LEARNING NETWORK FOR CONTROLLABLE VIDEO CAPTIONING
    Chen, Kaixuan
    Di, Qianji
    Lu, Yang
    Wang, Hanzi
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 880 - 884
  • [2] Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
    Lu, Yifan
    Zhang, Ziqi
    Yuan, Chunfeng
    Li, Peng
    Wang, Yan
    Li, Bing
    Hu, Weiming
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024, : 3909 - 3917
  • [3] Learning Temporal Dynamics from Cycles in Narrated Video
    Epstein, Dave
    Wu, Jiajun
    Schmid, Cordelia
    Sun, Chen
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1460 - 1469
  • [4] Center-enhanced video captioning model with multimodal semantic alignment
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    [J]. NEURAL NETWORKS, 2024, 180
  • [5] Fused GRU with semantic-temporal attention for video captioning
    Gao, Lianli
    Wang, Xuanhan
    Song, Jingkuan
    Liu, Yang
    [J]. NEUROCOMPUTING, 2020, 395 : 222 - 228
  • [6] Video Captioning with Semantic Guiding
    Yuan, Jin
    Tian, Chunna
    Zhang, Xiangnan
    Ding, Yuxuan
    Wei, Wei
    [J]. 2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018,
  • [7] Learning topic emotion and logical semantic for video paragraph captioning
    Li, Qinyu
    Wang, Hanli
    Yi, Xiaokai
    [J]. DISPLAYS, 2024, 83
  • [8] Semantic Object Alignment and Region-Aware Learning for Change Captioning
    Tian, Weidong
    Ren, Quan
    Zhao, Zhongqiu
    Tian, Ruihua
    [J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [9] Learning semantic visual concepts from video
    Liu, JC
    Bhanu, B
    [J]. 16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL II, PROCEEDINGS, 2002, : 1061 - 1064
  • [10] State-aware video procedural captioning
    Nishimura, Taichi
    Hashimoto, Atsushi
    Ushiku, Yoshitaka
    Kameko, Hirotaka
    Mori, Shinsuke
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 : 37273 - 37301