Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Cited by: 1

Authors:

Hori, Chiori [1]

Hori, Takaaki [1]

Le Roux, Jonathan [1]


Affiliations:

[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139, USA

Source:

INTERSPEECH 2021 | 2021

Keywords:

online video captioning; low-latency; audio-visual; transformer

DOI:

10.21437/Interspeech.2021-1975

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Subject classification code:

100104; 100213

Abstract:

Video captioning is an essential technology to understand scenes and describe events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce the captions as soon as possible. Low-latency captioning is needed to realize such functionality, but this research area for online video captioning has not been pursued yet. This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames, and to mimic outputs of a pre-trained Transformer to which all the frames are given. A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or when it can be forecasted. Experiments with the ActivityNet Captions dataset show that our approach achieves 94% of the caption quality of the upper bound given by the pre-trained Transformer using the entire video clips, using only 28% of frames from the beginning.
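
The early-output idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `emit_caption_online`, `toy_caption`, `toy_timing`, and the `threshold` parameter are hypothetical names, and the two toy functions merely stand in for the audio-visual Transformer and the CNN-based timing detector. The loop reads frames one at a time and emits a caption as soon as the (assumed) detector score reaches the threshold, falling back to the full clip otherwise.

```python
from typing import Callable, Sequence, Tuple


def emit_caption_online(
    frames: Sequence[int],
    caption_fn: Callable[[Sequence[int]], str],
    timing_fn: Callable[[Sequence[int]], float],
    threshold: float = 0.5,
) -> Tuple[str, float]:
    """Return (caption, fraction of frames consumed before output).

    caption_fn and timing_fn are placeholders (assumptions) for a
    captioning model and a timing detector; neither is a real API.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    total = len(frames)
    # Grow the observed prefix one frame at a time, as in online processing.
    for t in range(1, total):
        seen = frames[:t]
        # Emit early once the detector judges the prefix sufficient.
        if timing_fn(seen) >= threshold:
            return caption_fn(seen), t / total
    # No early trigger: caption from the entire clip.
    return caption_fn(frames), 1.0


def toy_caption(seen: Sequence[int]) -> str:
    # Hypothetical stand-in for the captioning Transformer.
    return f"caption from {len(seen)} frames"


def toy_timing(seen: Sequence[int]) -> float:
    # Hypothetical stand-in for the timing detector: fires after 28 frames.
    return 1.0 if len(seen) >= 28 else 0.0
```

Calling `emit_caption_online(list(range(100)), toy_caption, toy_timing)` with these toy stand-ins stops after 28 of 100 frames. Raising `threshold` in a real system would delay output in exchange for a caption closer to the full-clip one, which is the latency versus quality trade-off the abstract describes.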


Pages: 586-590

Number of pages: 5
