Cascaded multilingual audio-visual learning from videos

Cited by: 0
Authors
Rouditchenko, Andrew [1]
Boggust, Angie [1]
Harwath, David [2]
Thomas, Samuel [3]
Kuehne, Hilde [3]
Chen, Brian [4]
Panda, Rameswar [3]
Feris, Rogerio [3]
Kingsbury, Brian [3]
Picheny, Michael [5]
Glass, James [1]
Affiliations
[1] MIT CSAIL, United States
[2] UT Austin, United States
[3] IBM Research AI, United States
[4] Columbia University, United States
[5] NYU, United States
Source
arXiv | 2021
Abstract
Large dataset
Related papers
50 in total
  • [1] Cascaded Multilingual Audio-Visual Learning from Videos
    Rouditchenko, Andrew
    Boggust, Angie
    Harwath, David
    Thomas, Samuel
    Kuehne, Hilde
    Chen, Brian
    Panda, Rameswar
    Feris, Rogerio
    Kingsbury, Brian
    Picheny, Michael
    Glass, James
    INTERSPEECH 2021, 2021, : 3006 - 3010
  • [2] AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
    Rouditchenko, Andrew
    Boggust, Angie
    Harwath, David
    Chen, Brian
    Joshi, Dhiraj
    Thomas, Samuel
    Audhkhasi, Kartik
    Kuehne, Hilde
    Panda, Rameswar
    Feris, Rogerio
    Kingsbury, Brian
    Picheny, Michael
    Torralba, Antonio
    Glass, James
    INTERSPEECH 2021, 2021, : 1584 - 1588
  • [3] Multilingual Audio-Visual Smartphone Dataset and Evaluation
    Mandalapu, Hareesh
    Reddy, P. N. Aravinda
    Ramachandra, Raghavendra
    Rao, Krothapalli Sreenivasa
    Mitra, Pabitra
    Prasanna, S. R. Mahadeva
    Busch, Christoph
    IEEE ACCESS, 2021, 9 : 153240 - 153257
  • [4] Audio-Visual Event Localization in Unconstrained Videos
    Tian, Yapeng
    Shi, Jing
    Li, Bochen
    Duan, Zhiyao
    Xu, Chenliang
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
  • [5] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [6] Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
    Sato, Tomoya
    Sugano, Yusuke
    Sato, Yoichi
    IEEE ACCESS, 2022, 10 : 94273 - 94284
  • [7] Audio-Visual Paths to Learning
    McClusky, F. D.
    EDUCATION, 1947, 68 (03): : 190 - 190
  • [8] AUDIO-VISUAL AIDS TO LEARNING
    Unknown
    BMJ-BRITISH MEDICAL JOURNAL, 1966, 2 (5521): : 1023 - +
  • [9] Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio
    Zhu, Dandan
    Zhang, Kaiwei
    Zhang, Nana
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 764 - 775
  • [10] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350