Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

Cited: 9
Authors
Liang, Paul Pu [1 ]
Wu, Peter [1 ]
Liu Ziyin [2 ]
Morency, Louis-Philippe [1 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Univ Tokyo, Tokyo, Japan
Funding
US National Institutes of Health; US National Science Foundation
Keywords
Multimodal learning; Meta-learning; Cross-modal alignment; Cross-modal retrieval
DOI
10.1145/3474085.3475247
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
How can we generalize to a new prediction task at test time when it also uses a new modality as input? More importantly, how can we do this with as little annotated data as possible? This problem of cross-modal generalization is a new research milestone with concrete impact on real-world applications. For example, can an AI system start understanding spoken language from mostly written text? Or can it learn the visual steps of a new recipe from only text descriptions? In this work, we formalize cross-modal generalization as a learning paradigm to train a model that can (1) quickly perform new tasks (from new domains) while (2) being originally trained on a different input modality. Such a learning paradigm is crucial for generalization to low-resource modalities, such as speech in rare languages, while utilizing a different high-resource modality such as text. One key technical challenge that makes it different from other learning paradigms such as meta-learning and domain adaptation is the presence of different source and target modalities, which require different encoders. We propose an effective solution based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. This approach uses key ideas from cross-modal learning and meta-learning, and presents strong results on the cross-modal generalization problem. We benchmark several approaches on three real-world classification tasks: few-shot recipe classification from text to images of recipes, object classification from images to audio of objects, and language classification from text to speech across 100 languages spanning many rare languages. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.
Pages: 2680-2689
Page count: 10
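To make the meta-alignment idea from the abstract concrete, here is a minimal PyTorch sketch. It is an assumption-laden illustration, not the authors' released code: it combines two modality-specific encoders mapped into a shared embedding space, an InfoNCE-style contrastive loss on strongly paired cross-modal data, and an episodic few-shot loss in which source-modality class prototypes classify target-modality queries. All module names, feature dimensions, and the prototypical-network episode structure are illustrative choices.

# Minimal sketch of meta-alignment, under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Modality-specific encoder mapping raw features into a shared embedding space."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def contrastive_alignment_loss(src_emb, tgt_emb, temperature: float = 0.1):
    """InfoNCE-style loss pulling paired cross-modal embeddings together."""
    logits = src_emb @ tgt_emb.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(src_emb.size(0))        # i-th source pairs with i-th target
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def episode_loss(src_enc, tgt_enc, support_x, support_y, query_x, query_y):
    """Few-shot episode: build class prototypes from the source-modality support set,
    then classify target-modality queries against them in the shared space."""
    protos = torch.stack([src_enc(support_x[support_y == c]).mean(0)
                          for c in support_y.unique()])  # (n_way, emb_dim)
    logits = tgt_enc(query_x) @ protos.t()               # similarity to each prototype
    return F.cross_entropy(logits, query_y)

# Toy training step with synthetic data; dimensions are arbitrary assumptions
# (e.g. 300-d text features for the source, 40-d audio features for the target).
src_enc, tgt_enc = Encoder(in_dim=300), Encoder(in_dim=40)
opt = torch.optim.Adam(list(src_enc.parameters()) + list(tgt_enc.parameters()), lr=1e-3)

paired_src, paired_tgt = torch.randn(32, 300), torch.randn(32, 40)          # strongly paired data
sup_x, sup_y = torch.randn(15, 300), torch.arange(5).repeat_interleave(3)   # 5-way 3-shot support
qry_x, qry_y = torch.randn(10, 40), torch.arange(5).repeat(2)               # target-modality queries

loss = (contrastive_alignment_loss(src_enc(paired_src), tgt_enc(paired_tgt))
        + episode_loss(src_enc, tgt_enc, sup_x, sup_y, qry_x, qry_y))
opt.zero_grad(); loss.backward(); opt.step()

Jointly minimizing the alignment and episodic terms is one plausible reading of training the shared space so that few labeled target-modality samples suffice at test time; the paper's exact objective and weighting may differ.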
Related Papers
50 records in total
  • [1] Zhu, Lin; Wang, Xinbing; Zhou, Chenghu; Ye, Nanyang. Bayesian Cross-Modal Alignment Learning for Few-Shot Out-of-Distribution Generalization. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 9, 2023: 11461-11469.
  • [2] Ke, Xiao; Liu, Hao; Xu, Peirong; Lin, Xinru; Guo, Wenzhong. Text-based person search via cross-modal alignment learning. PATTERN RECOGNITION, 2024, 152.
  • [3] Wang, Xiaofan; Li, Xiuhong; Li, Zhe; Zhou, Chenyu; Chen, Fan; Yang, Dan. Enhancing Cross-Modal Alignment in Multimodal Sentiment Analysis via Prompt Learning. PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035: 541-554.
  • [4] He, Shiyuan; Wang, Weiyang; Wang, Zheng; Xu, Xing; Yang, Yang; Wang, Xiaoming; Shen, Heng Tao. Category Alignment Adversarial Learning for Cross-Modal Retrieval. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (05): 4527-4538.
  • [5] Ren, Shuhuai; Lin, Junyang; Zhao, Guangxiang; Men, Rui; Yang, An; Zhou, Jingren; Sun, Xu; Yang, Hongxia. Learning Relation Alignment for Calibrated Cross-modal Retrieval. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021: 514-524.
  • [6] Jiang, Xinyang; Wu, Fei; Li, Xi; Zhao, Zhou; Lu, Weiming; Tang, Siliang; Zhuang, Yueting. Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment. MM'15: PROCEEDINGS OF THE 2015 ACM MULTIMEDIA CONFERENCE, 2015: 69-78.
  • [7] Zhu, Yumeng; Xu, Yi; Ni, Bingbing; Zhang, Jie; Yang, Xiaokang. Enhancing Pulmonary Nodule Detection via Cross-Modal Alignment. 2017 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2017.
  • [8] Rajan, Vandana; Brutti, Alessio; Cavallaro, Andrea. Robust Latent Representations via Cross-Modal Translation and Alignment. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 4315-4319.
  • [9] Ades, C; Salles, JB. Cross-Modal Generalization of Habituation in the Rat. PERCEPTUAL AND MOTOR SKILLS, 1980, 50 (03): 1345-1346.
  • [10] Li, Jiaxing; Wong, Wai Keung; Jiang, Lin; Jiang, Kaihang; Fang, Xiaozhao; Xie, Shengli; Wen, Jie. Collaboratively Semantic Alignment and Metric Learning for Cross-Modal Hashing. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2025, 37 (05): 2311-2328.