Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Cited by: 1
Authors
Hori, Chiori [1]
Hori, Takaaki [1]
Le Roux, Jonathan [1]
Affiliations
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
Source
INTERSPEECH 2021
Keywords
online video captioning; low-latency; audio-visual; transformer
DOI
10.21437/Interspeech.2021-1975
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Video captioning is an essential technology for understanding scenes and describing events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce the captions as soon as possible. Low-latency captioning is needed to realize such functionality, but online video captioning has not yet been pursued as a research area. This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames, and to mimic the outputs of a pre-trained Transformer that is given all the frames. A CNN-based timing detector is also trained to detect a proper output timing, at which the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or when it can be forecasted. Experiments with the ActivityNet Captions dataset show that our approach achieves 94% of the caption quality of the upper bound given by the pre-trained Transformer using the entire video clips, while using only the first 28% of the frames.
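The emission rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `emit_caption_online`, the two callables standing in for the audio-visual Transformer and the CNN-based timing detector, and the `threshold` and `step` parameters are all assumptions made for illustration. The sketch only shows the control flow: feed a growing prefix of frames, ask the detector whether a caption from that prefix is likely close enough to the full-clip caption, and emit as soon as it says yes.

```python
from typing import Callable, Sequence, Tuple


def emit_caption_online(
    frames: Sequence[object],
    caption_from_prefix: Callable[[Sequence[object]], str],
    ready_probability: Callable[[Sequence[object]], float],
    threshold: float = 0.5,
    step: int = 1,
) -> Tuple[str, float]:
    """Return (caption, fraction of frames consumed before emitting).

    caption_from_prefix: hypothetical stand-in for the low-latency captioner.
    ready_probability:   hypothetical stand-in for the timing detector; it
                         scores how likely the prefix caption already matches
                         the caption obtained from the whole clip.
    threshold, step:     assumed knobs; a higher threshold trades latency
                         for caption quality.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if step < 1:
        raise ValueError("step must be at least 1")

    total = len(frames)
    # Grow the observed prefix step by step, as frames would arrive online.
    for end in range(step, total + 1, step):
        prefix = frames[:end]
        if ready_probability(prefix) >= threshold:
            return caption_from_prefix(prefix), end / total

    # The detector never fired: fall back to captioning the whole clip.
    return caption_from_prefix(frames), 1.0


if __name__ == "__main__":
    # Toy stand-ins: the detector's confidence grows linearly with the prefix.
    clip = list(range(100))
    caption, used = emit_caption_online(
        clip,
        caption_from_prefix=lambda p: f"caption from {len(p)} frames",
        ready_probability=lambda p: len(p) / 100,
        threshold=0.25,
    )
    print(caption, used)
```

In this toy setup the fraction of frames consumed plays the role of latency, so sweeping `threshold` traces out a latency versus quality curve of the kind the abstract refers to.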
Pages: 586-590
Page count: 5