ADVERSARIAL INPUT ABLATION FOR AUDIO-VISUAL LEARNING

Cited by: 0
Authors
Xu, David [1]
Harwath, David [1]
Affiliations
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
Keywords
visually grounded speech; self-supervised representation learning; adversarial training
DOI
10.1109/ICASSP43922.2022.9746436
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
We present an adversarial data augmentation strategy for speech spectrograms, within the context of training a model to semantically ground spoken audio captions to the images they describe. Our approach uses a two-pass strategy during training: first, a forward pass through the model is performed in order to identify segments of the input utterance that have the highest similarity score to their corresponding image. These segments are then ablated from the speech signal, producing a new and more challenging training example. Our experiments on the SpokenCOCO dataset demonstrate that when using this strategy: 1) content-carrying words tend to be ablated, forcing the model to focus on other regions of the speech; 2) the resulting model achieves improved speech-to-image retrieval accuracy; 3) the number of words that can be accurately detected by the model increases.
Pages: 7742-7746
Page count: 5
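
The abstract above outlines a two-pass training procedure: a first forward pass scores segments of the spoken caption against the paired image, and the highest-scoring segment is then ablated to create a harder training example. The sketch below is only an illustration of that idea, not the authors' implementation: the function name, tensor shapes, cosine-similarity scoring, and the fixed-length window heuristic are all assumptions introduced here.

```python
# Minimal sketch of two-pass adversarial input ablation (illustrative assumptions only).
import numpy as np

def ablate_most_similar_segment(spectrogram, frame_embeddings, image_embedding,
                                window_frames=20):
    """Zero out the spectrogram window whose frames best match the image.

    spectrogram:      (T, F) log-mel features for one utterance.
    frame_embeddings: (T, D) audio frame embeddings from a first forward pass.
    image_embedding:  (D,)   embedding of the paired image.
    window_frames:    length of the contiguous segment to ablate (assumed).
    """
    # Cosine similarity of every audio frame to the image embedding.
    a = frame_embeddings / (np.linalg.norm(frame_embeddings, axis=1, keepdims=True) + 1e-8)
    v = image_embedding / (np.linalg.norm(image_embedding) + 1e-8)
    frame_scores = a @ v                                   # shape (T,)

    # Score each contiguous window by its mean frame similarity.
    T = len(frame_scores)
    window_frames = min(window_frames, T)
    window_scores = np.convolve(frame_scores,
                                np.ones(window_frames) / window_frames,
                                mode="valid")              # shape (T - w + 1,)
    start = int(np.argmax(window_scores))

    # Ablate (silence) the highest-scoring segment, yielding a harder example.
    ablated = spectrogram.copy()
    ablated[start:start + window_frames, :] = 0.0
    return ablated, (start, start + window_frames)

# Toy usage: random arrays stand in for real model outputs.
rng = np.random.default_rng(0)
spec = rng.standard_normal((300, 80))        # 300 frames x 80 mel bins
frames = rng.standard_normal((300, 512))     # frame embeddings from pass 1
image = rng.standard_normal(512)             # paired image embedding
harder_spec, ablated_span = ablate_most_similar_segment(spec, frames, image)
print("ablated frames:", ablated_span)
```

In the paper's setting, the ablated spectrogram would be fed back through the grounding model as an additional, more challenging training example; here the ablation is shown in isolation.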