Cascaded multilingual audio-visual learning from videos

Authors
Rouditchenko, Andrew [1]
Boggust, Angie [1]
Harwath, David [2]
Thomas, Samuel [3]
Kuehne, Hilde [3]
Chen, Brian [4]
Panda, Rameswar [3]
Feris, Rogerio [3]
Kingsbury, Brian [3]
Picheny, Michael [5]
Glass, James [1]
Affiliations
[1] MIT CSAIL, United States
[2] UT Austin, United States
[3] IBM Research AI, United States
[4] Columbia University, United States
[5] NYU, United States
Source
arXiv | 2021
Abstract
Large dataset
Related papers
50 in total
  • [21] Synchronization of Multiple Camera Videos Using Audio-Visual Features
    Shrestha, Prarthana
    Barbieri, Mauro
    Weda, Hans
    Sekulovski, Dragan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2010, 12 (01) : 79 - 92
  • [22] A Multimodal Saliency Model for Videos With High Audio-Visual Correspondence
    Min, Xiongkuo
    Zhai, Guangtao
    Zhou, Jiantao
    Zhang, Xiao-Ping
    Yang, Xiaokang
    Guan, Xinping
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 3805 - 3819
  • [23] Transfer Learning from Audio-Visual Grounding to Speech Recognition
    Hsu, Wei-Ning
    Harwath, David
    Glass, James
    INTERSPEECH 2019, 2019, : 3242 - 3246
  • [25] BAUM-2: a multilingual audio-visual affective face database
    Erdem, Cigdem Eroglu
    Turan, Cigdem
    Aydin, Zafer
    MULTIMEDIA TOOLS AND APPLICATIONS, 2015, 74 (18) : 7429 - 7459
  • [26] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [27] Audio-Visual Learning for Multimodal Emotion Recognition
    Fan, Siyu
    Jing, Jianan
    Wang, Chongwen
    SYMMETRY-BASEL, 2025, 17 (03)
  • [28] Learning Bimodal Structure in Audio-Visual Data
    Monaci, Gianluca
    Vandergheynst, Pierre
    Sommer, Friedrich T.
    IEEE TRANSACTIONS ON NEURAL NETWORKS, 2009, 20 (12) : 1898 - 1910
  • [29] ADVERSARIAL INPUT ABLATION FOR AUDIO-VISUAL LEARNING
    Xu, David
    Harwath, David
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7742 - 7746
  • [30] AUDIO-VISUAL SPEECH INPAINTING WITH DEEP LEARNING
    Morrone, Giovanni
    Michelsanti, Daniel
    Tan, Zheng-Hua
    Jensen, Jesper
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6653 - 6657