Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Cited by: 1
Authors
Hori, Chiori [1]
Hori, Takaaki [1]
Le Roux, Jonathan [1]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA
Source
Keywords
online video captioning; low-latency; audio-visual; transformer
DOI
10.21437/Interspeech.2021-1975
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Video captioning is an essential technology for understanding scenes and describing events in natural language. To apply it to real-time monitoring, a system must not only describe events accurately but also produce captions as soon as possible. Low-latency captioning is needed to realize such functionality, but online video captioning has not yet been pursued as a research area. This paper proposes a novel approach that optimizes each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of the video frames, and to mimic the outputs of a pre-trained Transformer that is given all the frames. A CNN-based timing detector is also trained to detect a proper output timing, at which the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or when it can be forecast. Experiments on the ActivityNet Captions dataset show that, using only the first 28% of the frames, our approach achieves 94% of the caption quality of the upper bound given by the pre-trained Transformer using the entire video clips.
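The output-timing idea in the abstract can be pictured as a simple online loop: frames arrive one at a time, a detector scores whether a caption built from the frames seen so far is likely to be close enough to the full-video caption, and the caption is emitted at the first step where that score crosses a threshold. The following is only a minimal sketch under stated assumptions, not the authors' implementation. The function name `emit_caption_online`, the callables `caption_fn` and `ready_fn`, and the `threshold` parameter are all hypothetical stand-ins for the audio-visual Transformer, the CNN-based timing detector, and its decision rule.

```python
from typing import Callable, Sequence, Tuple


def emit_caption_online(
    frames: Sequence[object],
    caption_fn: Callable[[Sequence[object]], str],
    ready_fn: Callable[[Sequence[object]], float],
    threshold: float = 0.5,
) -> Tuple[str, float]:
    """Return (caption, fraction of frames consumed before emitting).

    caption_fn: hypothetical stand-in for a captioning model that
        accepts a prefix of the video.
    ready_fn: hypothetical stand-in for a timing detector; returns a
        score in [0, 1] that the prefix caption is already good enough.
    """
    if not frames:
        raise ValueError("frames must be non-empty")

    total = len(frames)
    for t in range(1, total + 1):
        prefix = frames[:t]
        # Emit as soon as the detector is confident, or fall back to
        # the full clip when the last frame has been reached.
        if ready_fn(prefix) >= threshold or t == total:
            return caption_fn(prefix), t / total

    # Unreachable: the loop always returns on the final frame.
    raise AssertionError("no caption emitted")


# Toy stand-ins (assumptions for illustration only).
def toy_caption(prefix: Sequence[object]) -> str:
    return f"caption from {len(prefix)} frames"


def toy_ready(prefix: Sequence[object]) -> float:
    # Pretend the detector becomes confident after three frames.
    return 1.0 if len(prefix) >= 3 else 0.0


caption, used = emit_caption_online(list(range(10)), toy_caption, toy_ready)
```

With these toy stand-ins the loop stops after three of ten frames. Raising `threshold` (or using a more conservative detector) trades higher latency for a caption that is closer to the one produced from the whole clip, which is the latency versus quality trade-off the abstract describes.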
Pages: 586-590
Number of pages: 5