VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning

Cited by: 0
Authors
Fei, Nanyi [1 ]
Jiang, Hao [2 ]
Lu, Haoyu [3 ]
Long, Jinqiang [3 ]
Dai, Yanqi [3 ]
Fan, Tuo [2 ]
Cao, Zhao [2 ]
Lu, Zhiwu [3 ]
Affiliations
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[2] Huawei Poisson Lab, Hangzhou, Zhejiang, Peoples R China
[3] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multi-modal model; multi-task learning; cross-modal search;
DOI
10.1007/978-3-031-56027-9_4
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal search is a fundamental task in multi-modal learning, yet hardly any work aims to solve multiple cross-modal search tasks at once. In this work, we propose a novel Versatile Elastic Multi-mOdal (VEMO) model for search-oriented multi-task learning. VEMO is versatile because it integrates cross-modal semantic search, named entity recognition, and scene text spotting into a unified framework, where the latter two can be further adapted to entity- and character-based image search tasks. VEMO is also elastic because the sub-modules of its flexible network architecture can be freely assembled for the corresponding tasks. Moreover, to offer more choices in the effectiveness-efficiency trade-off when performing cross-modal semantic search, we place multiple encoder exits. Experimental results show the effectiveness of VEMO with only 37.6% of the network parameters required by uni-task training. Further evaluations on entity- and character-based image search tasks also validate the superiority of search-oriented multi-task learning.
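The "multiple encoder exits" mentioned in the abstract describe an early-exit design: the encoder can stop at an intermediate depth and still emit a usable embedding, trading some search quality for speed. Below is a minimal, hypothetical PyTorch sketch of that general idea; the class name, layer counts, exit positions, and mean-pooling are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn

class MultiExitEncoder(nn.Module):
    """Hypothetical transformer encoder with several exits: inference may stop
    at an intermediate layer and still return a pooled embedding, trading a
    little embedding quality for speed (not the VEMO implementation)."""

    def __init__(self, dim=256, num_layers=6, num_heads=4, exit_layers=(2, 4, 6)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.exit_layers = set(exit_layers)
        # One projection head per exit so every exit yields a comparable embedding.
        self.heads = nn.ModuleDict({str(i): nn.Linear(dim, dim) for i in exit_layers})

    def forward(self, x, exit_at=None):
        if exit_at is None:
            exit_at = max(self.exit_layers)  # default: run the full stack
        assert exit_at in self.exit_layers, "exit_at must be a configured exit"
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i == exit_at:
                return self.heads[str(i)](x.mean(dim=1))  # mean-pool over tokens

enc = MultiExitEncoder()
tokens = torch.randn(8, 32, 256)      # (batch, sequence length, feature dim)
fast_emb = enc(tokens, exit_at=2)     # cheaper, earlier exit
full_emb = enc(tokens, exit_at=6)     # full-depth embedding

In such a design, each exit's head is typically trained jointly so that embeddings from different depths live in a comparable retrieval space; at serving time, the exit depth becomes a latency knob.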
Pages: 56-72
Page count: 17
Related Papers
50 records in total
  • [41] Multi-Modal Multi-Task (3MT) Road Segmentation
    Milli, Erkan
    Erkent, Ozgur
    Yılmaz, Asım Egemen
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (09) : 5408 - 5415
  • [42] Multi-Modal Fusion for Multi-Task Fuzzy Detection of Rail Anomalies
    Liyuan, Yang
    Osman, Ghazali
    Abdul Rahman, Safawi
    Mustapha, Muhammad Firdaus
    IEEE ACCESS, 2024, 12 : 73925 - 73935
  • [43] Traffic Sign Recognition via Multi-Modal Tree-Structure Embedded Multi-Task Learning
    Lu, Xiao
    Wang, Yaonan
    Zhou, Xuanyu
    Zhang, Zhenjun
    Ling, Zhigang
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2017, 18 (04) : 960 - 972
  • [44] Multi-Modal Multi-Task Deep Learning for Speaker and Emotion Recognition of TV-Series Data
    Novitasari, Sashi
    Do, Quoc Truong
    Sakti, Sakriani
    Lestari, Dessi
    Nakamura, Satoshi
    2018 ORIENTAL COCOSDA - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2018, : 37 - 42
  • [45] Multi-task Learning using Multi-modal Encoder-Decoder Networks with Shared Skip Connections
    Kuga, Ryohei
    Kanezaki, Asako
    Samejima, Masaki
    Sugano, Yusuke
    Matsushita, Yasuyuki
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2017), 2017, : 403 - 411
  • [46] Multi-Task Collaboration for Cross-Modal Generation and Multi-Modal Ophthalmic Diseases Diagnosis
    Yu, Yang
    Zhu, Hongqing
    Qian, Tianwei
    Hou, Tong
    Huang, Bingcang
    IET IMAGE PROCESSING, 2025, 19 (01)
  • [47] Large Margin Multi-Modal Multi-Task Feature Extraction for Image Classification
    Luo, Yong
    Wen, Yonggang
    Tao, Dacheng
    Gui, Jie
    Xu, Chao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2016, 25 (01) : 414 - 427
  • [48] Deep Elastic Networks with Model Selection for Multi-Task Learning
    Ahn, Chanho
    Kim, Eunwoo
    Oh, Songhwai
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6528 - 6537
  • [49] A Deep Multi-task Contextual Attention Framework for Multi-modal Affect Analysis
    Akhtar, Md Shad
    Chauhan, Dushyant Singh
    Ekbal, Asif
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2020, 14 (03)
  • [50] VMMP: Verifiable privacy-preserving multi-modal multi-task prediction
    Bian, Mingyun
    Ren, Yanli
    He, Gang
    Feng, Guorui
    Zhang, Xinpeng
    INFORMATION SCIENCES, 2024, 669