Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

Cited by: 9
Authors
Liang, Paul Pu [1 ]
Wu, Peter [1 ]
Liu Ziyin [2 ]
Morency, Louis-Philippe [1 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Univ Tokyo, Tokyo, Japan
Funding
US National Institutes of Health; US National Science Foundation
Keywords
Multimodal learning; Meta-learning; Cross-modal alignment; Cross-modal retrieval
DOI
10.1145/3474085.3475247
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
How can we generalize to a new prediction task at test time when it also uses a new modality as input? More importantly, how can we do this with as little annotated data as possible? This problem of cross-modal generalization is a new research milestone with concrete impact on real-world applications. For example, can an AI system begin to understand spoken language from mostly written text? Or can it learn the visual steps of a new recipe from only text descriptions? In this work, we formalize cross-modal generalization as a learning paradigm that trains a model to (1) quickly perform new tasks (from new domains) while (2) being originally trained on a different input modality. Such a learning paradigm is crucial for generalization to low-resource modalities, such as speech in rare languages, while utilizing a different high-resource modality such as text. One key technical challenge that distinguishes it from other learning paradigms, such as meta-learning and domain adaptation, is the presence of different source and target modalities, which require different encoders. We propose an effective solution based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. This approach draws on key ideas from cross-modal learning and meta-learning, and presents strong results on the cross-modal generalization problem. We benchmark several approaches on 3 real-world classification tasks: few-shot recipe classification from text to images of recipes, object classification from images to audio of objects, and language classification from text to speech across 100 languages spanning many rare languages. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.
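The sketch below illustrates one plausible reading of the method the abstract describes: a contrastive alignment objective over paired cross-modal data, trained jointly with a prototypical-network episode loss, so that tasks meta-learned in a high-resource modality (e.g., text) transfer to a few-shot target modality (e.g., audio) through a shared embedding space. This is a minimal sketch in PyTorch, not the authors' implementation; all names (Encoder, alignment_loss, proto_loss), dimensions, and the synthetic tensors are illustrative assumptions.

```python
# Minimal sketch of cross-modal meta-alignment (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Modality-specific encoder mapping inputs into a shared embedding space."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings


def alignment_loss(z_src, z_tgt, tau: float = 0.1):
    """Symmetric InfoNCE: paired (source, target) embeddings attract,
    unpaired ones repel. The i-th source row pairs with the i-th target row."""
    logits = z_src @ z_tgt.t() / tau          # (B, B) similarity matrix
    target = torch.arange(z_src.size(0))
    return 0.5 * (F.cross_entropy(logits, target)
                  + F.cross_entropy(logits.t(), target))


def proto_loss(support_z, support_y, query_z, query_y, n_way: int):
    """Prototypical-network episode loss: classify queries by nearest
    class prototype in the shared space."""
    protos = torch.stack([support_z[support_y == c].mean(0)
                          for c in range(n_way)])   # (n_way, emb_dim)
    logits = -torch.cdist(query_z, protos)          # nearer => higher score
    return F.cross_entropy(logits, query_y)


# One meta-training step on synthetic data: a text episode supplies the
# classification signal, while paired (text, audio) features keep the two
# encoders aligned so few-shot audio episodes can work at test time.
text_enc, audio_enc = Encoder(in_dim=300), Encoder(in_dim=40)
opt = torch.optim.Adam(
    list(text_enc.parameters()) + list(audio_enc.parameters()), lr=1e-3
)

n_way, k_shot, n_query = 5, 5, 15
support_x = torch.randn(n_way * k_shot, 300)
support_y = torch.arange(n_way).repeat_interleave(k_shot)
query_x = torch.randn(n_way * n_query, 300)
query_y = torch.arange(n_way).repeat_interleave(n_query)
pair_text, pair_audio = torch.randn(32, 300), torch.randn(32, 40)

task = proto_loss(text_enc(support_x), support_y, text_enc(query_x), query_y, n_way)
align = alignment_loss(text_enc(pair_text), audio_enc(pair_audio))
loss = task + align
opt.zero_grad()
loss.backward()
opt.step()
```

In this sketch, weakly paired data could reuse alignment_loss with class-matched rather than instance-matched pairs, and at test time only the target-modality encoder and the few labeled target samples are needed to form prototypes.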
Pages: 2680-2689
Page count: 10