CLIPMulti: Explore the performance of multimodal enhanced CLIP for zero-shot text classification

Cited by: 0
Authors
Wang, Peng [1, 2]
Li, Dagang [1, 2]
Hu, Xuesi [1, 2, 3]
Wang, Yongmei [1, 2, 4]
Zhang, Youhua [1, 2]
Affiliations
[1] Anhui Agr Univ, Sch Informat & Artificial Intelligence, Hefei, Anhui, Peoples R China
[2] Macau Univ Sci & Technol, Fac Innovat Engn, Sch Comp Sci & Engn, Ave Wai Long, Taipa 999078, Macao, Peoples R China
[3] Anhui Agr Univ, Sch Econ & Management, Hefei, Anhui, Peoples R China
[4] Anhui Prov Engn Lab Beidou Precis Agr Informat, Hefei, Anhui, Peoples R China
Keywords
Zero-shot text classification; CLIP; Multimodality
DOI
10.1016/j.csl.2024.101748
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Zero-shot text classification is designed for text classification tasks that lack annotated training data, so it does not require large amounts of labeled data. Existing zero-shot text classification methods use either a text-text matching paradigm or a text-image matching paradigm, and both perform well on different benchmark datasets. However, these paradigms match text against only a single modality, and little attention has been paid to how multimodality can help text classification. To incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text-image&text matching paradigm to improve zero-shot text classification. We test three different combinations of image and text for their effects on zero-shot text classification, and further propose a matching method (Match-CLIPMulti) that automatically finds the corresponding text based on the classified images. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters on the Match-CLIPMulti experiments. We hope this work will prompt further thought and exploration on multimodal fusion in language tasks.
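The text-image&text matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how image and text are combined, so the weighted-average `fuse` function, the `image_weight` parameter, and the toy 3-dimensional vectors (standing in for CLIP encoder outputs) are all assumptions made for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def fuse(image_emb, text_emb, image_weight=0.5):
    """Hypothetical fusion: weighted average of the unit-length image
    and text embeddings of a label (an assumption, not the paper's rule)."""
    img, txt = normalize(image_emb), normalize(text_emb)
    return [image_weight * i + (1 - image_weight) * t for i, t in zip(img, txt)]


def classify(doc_emb, label_prototypes):
    """Pick the label whose fused prototype is most similar to the document."""
    scores = {label: cosine(doc_emb, proto) for label, proto in label_prototypes.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings; in practice these would come from CLIP's image and text encoders.
label_images = {"sports": [0.9, 0.1, 0.0], "politics": [0.1, 0.9, 0.1]}
label_texts = {"sports": [0.8, 0.0, 0.2], "politics": [0.0, 0.8, 0.3]}
prototypes = {label: fuse(label_images[label], label_texts[label]) for label in label_images}

doc = [0.7, 0.2, 0.1]  # toy embedding of the input text to classify
predicted, scores = classify(doc, prototypes)
print(predicted, scores)
```

With `image_weight=1.0` the sketch reduces to text-image matching, and with `image_weight=0.0` to text-text matching, which is one way to compare the single-modality paradigms against a combined one.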
Pages: 9
Related papers
50 in total
  • [41] FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
    Zhuang, Jiedong
    Hu, Jiaqi
    Mu, Lianrui
    Hu, Rui
    Liang, Xiaoyu
    Ye, Jiangnan
    Hu, Haoji
    COMPUTER VISION - ECCV 2024, PT X, 2025, 15068 : 236 - 253
  • [42] One size fits all: Enhanced zero-shot text classification for patient listening on social media
    Matoshi, Veton
    De Vuono, Maria Carmela
    Gaspari, Roberto
    Kroll, Mark
    Jantscher, Michael
    Nicolardi, Sara Lucia
    Mazzola, Giuseppe
    Rauch, Manuela
    Sabol, Vedran
    Salhofer, Eileen
    Mariani, Riccardo
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2025, 7
  • [43] Improving Zero-Shot Generalization for CLIP with Variational Adapter
    Lu, Ziqian
    Shen, Fengli
    Liu, Mushui
    Yu, Yunlong
    Li, Xi
    COMPUTER VISION - ECCV 2024, PT XX, 2025, 15078 : 328 - 344
  • [44] Zero-shot Learning Using Multimodal Descriptions
    Mall, Utkarsh
    Hariharan, Bharath
    Bala, Kavita
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022 : 3930 - 3938
  • [45] Zero-shot Generalization of Multimodal Dialogue Agents
    Tavares, Diogo
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022 : 6935 - 6939
  • [46] Multimodal Zero-Shot Hateful Meme Detection
    Zhu, Jiawen
    Lee, Roy Ka-Wei
    Chong, Wen-Haw
    PROCEEDINGS OF THE 14TH ACM WEB SCIENCE CONFERENCE, WEBSCI 2022, 2022 : 382 - 389
  • [47] Chart question answering with multimodal graph representation learning and zero-shot classification
    Farahani, Ali Mazraeh
    Adibi, Peyman
    Ehsani, Mohammad Saeed
    Hutter, Hans-Peter
    Darvishy, Alireza
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 270
  • [48] Latent Embeddings for Zero-shot Classification
    Xian, Yongqin
    Akata, Zeynep
    Sharma, Gaurav
    Nguyen, Quynh
    Hein, Matthias
    Schiele, Bernt
    2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016 : 69 - 77
  • [49] StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis
    Chen, Zhiyong
    Li, Xinnuo
    Ai, Zhiqi
    Xu, Shugong
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI, 2025, 15041 : 263 - 277
  • [50] Zero-shot Topic Classification via Automatic Tagging on Chinese Text Datasets
    Cai, Xinyi
    Tian, Jiao
    Yu, Ke
    Xiao, Hongwang
    Zhang, Kai
    Tsai, Pei-Wei
    2022 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING, ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM, 2022 : 482 - 488