Identification of story units in audio-visual sequences by joint audio and video processing

Cited by: 0
Authors: Saraceno, C [1]; Leonardi, R [1]
Affiliation: [1] Univ Brescia, SCL Dept Elect Automat, I-25123 Brescia, Italy
Keywords: (none listed)
DOI: not available
Chinese Library Classification (CLC): TM (Electrical engineering); TN (Electronics and communication technology)
Discipline codes: 0808; 0809
Abstract
In this paper, a novel technique that uses joint audio-visual analysis for scene identification and characterization is proposed. The paper defines four scene types: dialogues, stories, actions, and generic scenes. It then explains how any audio-visual material can be decomposed into a series of scenes conforming to this classification by properly analyzing and then combining the underlying audio and visual information. A rule-based procedure is defined for this purpose. Before the rule-based decision can take place, a series of low-level pre-processing tasks are suggested to adequately measure audio and visual correlations. As far as visual information is concerned, it is proposed to measure similarities between non-consecutive shots using a Learning Vector Quantization approach. A possible implementation strategy for the overall scene identification task is outlined and validated through a series of experimental simulations on real audio-visual data.
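The abstract's visual-analysis step measures similarity between non-consecutive shots with a Learning Vector Quantization codebook. The paper's implementation details are not reproduced here, so the following is only a minimal illustrative LVQ1 sketch; the feature representation, the class labels, and all function names (`train_lvq1`, `shots_match`) are assumptions for illustration, not the authors' code:

```python
import numpy as np

def train_lvq1(features, labels, n_protos_per_class=2,
               learning_rate=0.1, epochs=30, seed=0):
    """Train an LVQ1 codebook: a few prototype vectors per class.

    `features` is an (n_samples, n_dims) array of shot feature vectors
    (e.g. color histograms); `labels` assigns each shot to a visual class.
    """
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(labels):
        # Initialize prototypes from random members of each class.
        idx = rng.choice(np.flatnonzero(labels == c), size=n_protos_per_class)
        protos.append(features[idx].astype(float))
        proto_labels += [c] * n_protos_per_class
    protos = np.vstack(protos)
    proto_labels = np.array(proto_labels)

    for epoch in range(epochs):
        lr = learning_rate * (1.0 - epoch / epochs)  # linearly decaying step
        for x, y in zip(features, labels):
            w = np.argmin(np.linalg.norm(protos - x, axis=1))  # winner
            # Attract the winning prototype if its label matches, repel if not.
            sign = 1.0 if proto_labels[w] == y else -1.0
            protos[w] += sign * lr * (x - protos[w])
    return protos, proto_labels

def shots_match(shot_a, shot_b, protos, proto_labels):
    """Call two shots 'similar' when their nearest codebook prototypes
    carry the same class label."""
    la = proto_labels[np.argmin(np.linalg.norm(protos - shot_a, axis=1))]
    lb = proto_labels[np.argmin(np.linalg.norm(protos - shot_b, axis=1))]
    return la == lb
```

Under this scheme, two shots that quantize to the same visual class can be linked even when they are not consecutive, which is the kind of visual-correlation cue the rule-based scene classifier would then combine with the audio analysis.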
Pages: 363-367 (5 pages)
Related papers (50 total)
  • [31] An audio-visual approach to web video categorization
    Bogdan Emanuel Ionescu
    Klaus Seyerlehner
    Ionuţ Mironică
    Constantin Vertan
    Patrick Lambert
    [J]. Multimedia Tools and Applications, 2014, 70 : 1007 - 1032
  • [32] Video concept detection by audio-visual grouplets
    Jiang, Wei
    Loui, Alexander C.
    [J]. INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2012, 1 (04) : 223 - 238
  • [33] An audio-visual distance for audio-visual speech vector quantization
    Girin, L
    Foucher, E
    Feng, G
    [J]. 1998 IEEE SECOND WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 1998, : 523 - 528
  • [34] Catching audio-visual mice:: The extrapolation of audio-visual speed
    Hofbauer, MM
    Wuerger, SM
    Meyer, GF
    Röhrbein, F
    Schill, K
    Zetzsche, C
    [J]. PERCEPTION, 2003, 32 : 96 - 96
  • [35] Learning word-like units from joint audio-visual analysis
    Harwath, David
    Glass, James
    [J]. PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 506 - 517
  • [36] Joint audio-video processing for biometric speaker identification
    Kanak, A
    Erzin, E
    Yemez, Y
    Tekalp, AM
    [J]. 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PROCEEDINGS: SPEECH II; INDUSTRY TECHNOLOGY TRACKS; DESIGN & IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS; NEURAL NETWORKS FOR SIGNAL PROCESSING, 2003, : 377 - 380
  • [37] Joint audio-video processing for biometric speaker identification
    Kanak, A
    Erzin, E
    Yemez, Y
    Tekalp, AM
    [J]. 2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL III, PROCEEDINGS, 2003, : 561 - 564
  • [38] Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
    Monfort, Mathew
    Jin, SouYoung
    Liu, Alexander
    Harwath, David
    Feris, Rogerio
    Glass, James
    Oliva, Aude
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 14866 - 14876
  • [39] Somatosensory contribution to audio-visual speech processing
    Ito, Takayuki
    Ohashi, Hiroki
    Gracco, Vincent L.
    [J]. CORTEX, 2021, 143 : 195 - 204
  • [40] Audio-visual interaction in the processing of location changes
    Schröger, E
    Widmann, A
    [J]. JOURNAL OF PSYCHOPHYSIOLOGY, 1998, 12 (03) : 322 - 323