Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

Cited by: 9
Authors
Liang, Paul Pu [1 ]
Wu, Peter [1 ]
Liu Ziyin [2 ]
Morency, Louis-Philippe [1 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Univ Tokyo, Tokyo, Japan
Funding
US National Institutes of Health; US National Science Foundation
Keywords
Multimodal learning; Meta-learning; Cross-modal alignment; Cross-modal retrieval;
DOI
10.1145/3474085.3475247
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
How can we generalize to a new prediction task at test time when it also uses a new modality as input? More importantly, how can we do this with as little annotated data as possible? This problem of cross-modal generalization is a new research milestone with concrete impact on real-world applications. For example, can an AI system start understanding spoken language from mostly written text? Or can it learn the visual steps of a new recipe from only text descriptions? In this work, we formalize cross-modal generalization as a learning paradigm to train a model that can (1) quickly perform new tasks (from new domains) while (2) being originally trained on a different input modality. Such a learning paradigm is crucial for generalization to low-resource modalities such as spoken speech in rare languages while utilizing a different high-resource modality such as text. One key technical challenge that makes it different from other learning paradigms such as meta-learning and domain adaptation is the presence of different source and target modalities which will require different encoders. We propose an effective solution based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. This approach uses key ideas from cross-modal learning and meta-learning, and presents strong results on the cross-modal generalization problem. We benchmark several approaches on 3 real-world classification tasks: few-shot recipe classification from text to images of recipes, object classification from images to audio of objects, and language classification from text to spoken speech across 100 languages spanning many rare languages. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.
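The abstract describes meta-alignment only at a high level: modality-specific encoders mapped into a shared space, an alignment objective over paired cross-modal data, and few-shot adaptation in the low-resource target modality. The sketch below illustrates that general recipe in PyTorch. It is a minimal sketch, not the authors' released implementation: the encoder architectures, input dimensions, InfoNCE-style alignment loss, prototype-based few-shot loss, and equal loss weighting are all illustrative assumptions.

```python
# Minimal sketch of the meta-alignment idea from the abstract.
# All names, dimensions, and loss weights below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Modality-specific encoder mapping inputs into a shared embedding space."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def contrastive_alignment(z_src, z_tgt, temperature: float = 0.1):
    """InfoNCE-style loss over paired cross-modal data (strong or weak pairs)."""
    logits = z_src @ z_tgt.t() / temperature   # pairwise similarities
    labels = torch.arange(z_src.size(0))       # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)

def prototype_loss(z_sup, y_sup, z_qry, y_qry):
    """ProtoNet-style few-shot loss; assumes episode labels are relabeled 0..C-1."""
    protos = torch.stack([z_sup[y_sup == c].mean(0) for c in y_sup.unique()])
    return F.cross_entropy(-torch.cdist(z_qry, protos), y_qry)

# Hypothetical input dims: 300-d text features (source), 40-d audio features (target).
src_enc, tgt_enc = Encoder(300), Encoder(40)
opt = torch.optim.Adam(
    list(src_enc.parameters()) + list(tgt_enc.parameters()), lr=1e-3
)

def meta_step(x_src, x_tgt, x_sup, y_sup, x_qry, y_qry):
    """One meta-training step over a sampled cross-modal episode."""
    align = contrastive_alignment(src_enc(x_src), tgt_enc(x_tgt))
    fewshot = prototype_loss(tgt_enc(x_sup), y_sup, tgt_enc(x_qry), y_qry)
    loss = align + fewshot                     # assumed equal weighting
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

In an actual run, each episode would sample a new task: cross-modal pairs (e.g., text-audio) for the alignment term, plus a small labeled support/query split in the target modality, matching the 1-10 labeled samples per class reported in the abstract.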
Pages: 2680-2689
Page count: 10
Related Papers
(50 records in total; entries [31]-[40] shown)
  • [31] Robust cross-modal retrieval with alignment refurbishment
    Guo, Jinyi
    Ding, Jieyu
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2023, 24 (10) : 1403 - 1415
  • [32] Cross-Modal Search for Social Networks via Adversarial Learning
    Zhou, Nan
    Du, Junping
    Xue, Zhe
    Liu, Chong
    Li, Jinxuan
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2020, 2020
  • [33] Cross-Modal Data Augmentation for Tasks of Different Modalities
    Chen, Dong
    Zhuang, Yueting
    Shen, Zijin
    Yang, Carl
    Wang, Guoming
    Tang, Siliang
    Yang, Yi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7814 - 7824
  • [34] Cross-Modal Retrieval via Deep and Bidirectional Representation Learning
    He, Yonghao
    Xiang, Shiming
    Kang, Cuicui
    Wang, Jian
    Pan, Chunhong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2016, 18 (07) : 1363 - 1377
  • [35] Cross-Modal Learning to Rank via Latent Joint Representation
    Wu, Fei
    Jiang, Xinyang
    Li, Xi
    Tang, Siliang
    Lu, Weiming
    Zhang, Zhongfei
    Zhuang, Yueting
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2015, 24 (05) : 1497 - 1509
  • [36] Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information
    Li, Jialu
    Tan, Hao
    Bansal, Mohit
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021 : 1041 - 1050
  • [37] XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 13, 2024 : 14875 - 14885
  • [38] Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval
    Yang, Zhenguo
    Lin, Zehang
    Kang, Peipei
    Lv, Jianming
    Li, Qing
    Liu, Wenyin
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (01)
  • [39] A Computational Model of Concept Generalization in Cross-Modal Reference
    McCrae, Patrick
    Menzel, Wolfgang
    TSINGHUA SCIENCE AND TECHNOLOGY, 2011, 16 (02) : 113 - 120
  • [40] Cross-modal generalization of anomia treatment to reading in aphasia
    Madden, Elizabeth B.
    Torrence, Janaki
    Kendall, Diane L.
    APHASIOLOGY, 2021, 35 (07) : 875 - 899