Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Cited by: 1
Authors
Hori, Chiori [1 ]
Hori, Takaaki [1 ]
Le Roux, Jonathan [1 ]
Affiliation
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
Keywords
online video captioning; low-latency; audio-visual; transformer;
DOI
10.21437/Interspeech.2021-1975
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification
100104 ; 100213 ;
Abstract
Video captioning is an essential technology for understanding scenes and describing events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce captions as soon as possible. Low-latency captioning is needed to realize such functionality, but this aspect of online video captioning has not yet been explored. This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames, and to mimic the outputs of a pre-trained Transformer that is given all the frames. A CNN-based timing detector is also trained to detect the proper output timing, at which the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or can be forecasted. Experiments on the ActivityNet Captions dataset show that our approach achieves 94% of the upper-bound caption quality of the pre-trained Transformer, which uses entire video clips, while using only the first 28% of frames.
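The output-timing idea in the abstract (emit a caption as soon as the caption generated from a partial frame prefix is sufficiently close to what the full-video model would produce) can be sketched as follows. This is a toy illustration under assumed names: `unigram_overlap` and `decide_output_frame` are hypothetical, and the unigram-overlap score merely stands in for the paper's trained CNN timing detector.

```python
def unigram_overlap(hyp: str, ref: str) -> float:
    """Toy similarity proxy: fraction of reference words covered by the
    hypothesis (the paper instead trains a CNN to judge closeness)."""
    hyp_words, ref_words = set(hyp.split()), set(ref.split())
    return len(hyp_words & ref_words) / max(1, len(ref_words))

def decide_output_frame(partial_captions, full_caption, threshold=0.9):
    """Scan captions generated from growing frame prefixes and return the
    earliest index whose caption is close enough to the full-video caption."""
    for i, cap in enumerate(partial_captions):
        if unigram_overlap(cap, full_caption) >= threshold:
            return i, cap
    # Fall back to the final caption if the threshold is never reached.
    return len(partial_captions) - 1, partial_captions[-1]

# Captions a student model might emit as more of the clip is observed.
partial_captions = [
    "a man",
    "a man is playing",
    "a man is playing a guitar",
    "a man is playing a guitar on stage",
]
full_caption = "a man is playing a guitar on stage"

frame, caption = decide_output_frame(partial_captions, full_caption, threshold=0.7)
print(frame, caption)  # emits at index 2, before the clip ends
```

Lowering the threshold trades caption quality for lower latency, which is exactly the trade-off the paper optimizes jointly rather than with a fixed heuristic like this one.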
Pages: 586-590
Page count: 5
Related papers
50 in total
  • [21] An audio-visual approach to web video categorization
    Bogdan Emanuel Ionescu
    Klaus Seyerlehner
    Ionuţ Mironică
    Constantin Vertan
    Patrick Lambert
    Multimedia Tools and Applications, 2014, 70 : 1007 - 1032
  • [22] ADVANCES IN ONLINE AUDIO-VISUAL MEETING TRANSCRIPTION
    Yoshioka, Takuya
    Abramovski, Igor
    Aksoylar, Cem
    Chen, Zhuo
    David, Moshe
    Dimitriadis, Dimitrios
    Gong, Yifan
    Gurvich, Ilya
    Huang, Xuedong
    Huang, Yan
    Hurvitz, Aviv
    Jiang, Li
    Koubi, Sharon
    Krupka, Eyal
    Leichter, Ido
    Liu, Changliang
    Parthasarathy, Partha
    Vinnikov, Alon
    Wu, Lingfeng
    Xiao, Xiong
    Xiong, Wayne
    Wang, Huaming
    Wang, Zhenghao
    Zhang, Jun
    Zhao, Yong
    Zhou, Tianyan
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 276 - 283
  • [23] Kansei enhancement for audio-visual contents using video production techniques
    Yamane, S
    Sato, M
    Mouri, T
    Mori, T
    Kasuga, M
    TENCON 2004 - 2004 IEEE REGION 10 CONFERENCE, VOLS A-D, PROCEEDINGS: ANALOG AND DIGITAL TECHNIQUES IN ELECTRICAL ENGINEERING, 2004, : A351 - A354
  • [24] Video clip recognition using joint audio-visual processing model
    Kulesh, V
    Petrushin, VA
    Sethi, IK
    16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL I, PROCEEDINGS, 2002, : 500 - 503
  • [25] Video clip recognition using joint audio-visual processing model
    Kulesh, Victor
    Petrushin, Valery A.
    Sethi, Ishwar K.
    Proceedings - International Conference on Pattern Recognition, 2002, 16 (01): : 500 - 503
  • [26] Vision Transformers are Parameter-Efficient Audio-Visual Learners
    Lin, Yan-Bo
    Sung, Yi-Lin
    Lei, Jie
    Bansal, Mohit
    Bertasius, Gedas
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2299 - 2309
  • [27] A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
    Tamura, Satoshi
    Ishikawa, Masato
    Hashiba, Takashi
    Takeuchi, Shin'ichi
    Hayamizu, Satoru
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2702 - +
  • [28] Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio
    Chao, Fang-Yi
    Ozcinar, Cagri
    Zhang, Lu
    Hamidouche, Wassim
    Deforges, Olivier
    Smolic, Aljosa
    2020 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2020, : 355 - 358
  • [29] Perceptual Quality of Audio-Visual Content with Common Video and Audio Degradations
    Becerra Martinez, Helard
    Hines, Andrew
    Farias, Mylene C. Q.
    APPLIED SCIENCES-BASEL, 2021, 11 (13):
  • [30] Combining text and audio-visual features in video indexing
    Chang, SF
    Manmatha, R
    Chua, TS
    2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 1005 - 1008