ADVERSARIAL INPUT ABLATION FOR AUDIO-VISUAL LEARNING

Citations: 0
Authors
Xu, David [1 ]
Harwath, David [1 ]
Affiliations
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
Keywords
visually grounded speech; self-supervised representation learning; adversarial training;
DOI
10.1109/ICASSP43922.2022.9746436
CLC Classification Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
We present an adversarial data augmentation strategy for speech spectrograms, within the context of training a model to semantically ground spoken audio captions to the images they describe. Our approach uses a two-pass strategy during training: first, a forward pass through the model is performed in order to identify segments of the input utterance that have the highest similarity score to their corresponding image. These segments are then ablated from the speech signal, producing a new and more challenging training example. Our experiments on the SpokenCOCO dataset demonstrate that when using this strategy: 1) content-carrying words tend to be ablated, forcing the model to focus on other regions of the speech; 2) the resulting model achieves improved speech-to-image retrieval accuracy; 3) the number of words that can be accurately detected by the model increases.
Pages: 7742 - 7746
Page count: 5
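For illustration, below is a minimal PyTorch sketch of the ablation step described in the abstract: per-frame audio-image similarity scores from the first forward pass are used to locate the most image-aligned segments of the spectrogram, which are then zeroed out to produce a harder second-pass training example. The function name, segment width, and the choice of zeroing as the ablation operation are illustrative assumptions, not the authors' released code.

import torch

def ablate_top_segments(spectrogram, frame_scores, num_segments=2, segment_frames=20):
    """Zero out the spectrogram frames that score highest against the paired image.

    spectrogram  : (mel_bins, time_frames) tensor for one utterance
    frame_scores : (time_frames,) audio-image similarity from the first forward pass
    """
    ablated = spectrogram.clone()
    scores = frame_scores.clone()
    for _ in range(num_segments):
        center = int(torch.argmax(scores))           # most image-aligned frame
        start = max(0, center - segment_frames // 2)
        end = min(ablated.shape[1], center + segment_frames // 2)
        ablated[:, start:end] = 0.0                  # ablate (silence) the segment
        scores[start:end] = float("-inf")            # avoid re-selecting the same region
    return ablated

# Toy usage: random tensors stand in for real features and model similarity scores.
spec = torch.randn(80, 300)      # 80 mel bins, 300 frames
sims = torch.rand(300)           # per-frame similarity to the paired image
hard_example = ablate_top_segments(spec, sims)

In the training procedure described in the abstract, such an ablated spectrogram serves as a new, more challenging example, forcing the model to ground the image using the remaining, less image-aligned speech.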
Related Papers
50 items in total
  • [21] Catching audio-visual mice: The extrapolation of audio-visual speed
    Hofbauer, MM
    Wuerger, SM
    Meyer, GF
    Röhrbein, F
    Schill, K
    Zetzsche, C
    PERCEPTION, 2003, 32 : 96 - 96
  • [22] Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild
    He, Yibo
    Seng, Kah Phooi
    Ang, Li Minn
    SENSORS, 2023, 23 (04)
  • [23] Retraction Note: Detecting adversarial attacks on audio-visual speech recognition using deep learning method
    Rabie A. Ramadan
    International Journal of Speech Technology, 2022, 25 (Suppl 1) : 29 - 29
  • [24] RETRACTED ARTICLE: Detecting adversarial attacks on audio-visual speech recognition using deep learning method
    Rabie A. Ramadan
    International Journal of Speech Technology, 2022, 25 : 625 - 631
  • [25] An audio-visual speech recognition with a new Mandarin audio-visual database
    Liao, Wen-Yuan
    Pao, Tsang-Long
    Chen, Yu-Te
    Chang, Tsun-Wei
    INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007, : 19 - +
  • [26] AUDIO-VISUAL EDUCATION
    Brickman, William W.
    SCHOOL AND SOCIETY, 1948, 67 (1739): 320 - 326
  • [27] Audio-Visual Objects
    Kubovy M.
    Schutz M.
    Review of Philosophy and Psychology, 2010, 1 (1) : 41 - 61
  • [28] Audio-Visual Segmentation
    Zhou, Jinxing
    Wang, Jianyuan
    Zhang, Jiayi
    Sun, Weixuan
    Zhang, Jing
    Birchfield, Stan
    Guo, Dan
    Kong, Lingpeng
    Wang, Meng
    Zhong, Yiran
    COMPUTER VISION, ECCV 2022, PT XXXVII, 2022, 13697 : 386 - 403
  • [29] Audio-visual resources and learning improvement: an experimental analysis
    Magadan-Diaz, Marta
    Rivas-Garcia, Jesus I.
    INTERNATIONAL JOURNAL OF LEARNING TECHNOLOGY, 2023, 18 (01) : 79 - 93
  • [30] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134