Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Cited by: 1

Authors:

Hori, Chiori [1]

Hori, Takaaki [1]

Le Roux, Jonathan [1]

Affiliations:

[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA

Source:

INTERSPEECH 2021 | 2021

Keywords:

online video captioning; low-latency; audio-visual; transformer

DOI:

10.21437/Interspeech.2021-1975

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

Video captioning is an essential technology to understand scenes and describe events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce the captions as soon as possible. Low-latency captioning is needed to realize such functionality, but this research area for online video captioning has not been pursued yet. This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames, and to mimic outputs of a pre-trained Transformer to which all the frames are given. A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or when it can be forecasted. Experiments with the ActivityNet Captions dataset show that our approach achieves 94% of the caption quality of the upper bound given by the pre-trained Transformer using the entire video clips, using only 28% of frames from the beginning.

Citation

Pages: 586-590

Page count: 5
