VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning

Cited by: 0
Authors
Fei, Nanyi [1 ]
Jiang, Hao [2 ]
Lu, Haoyu [3 ]
Long, Jinqiang [3 ]
Dai, Yanqi [3 ]
Fan, Tuo [2 ]
Cao, Zhao [2 ]
Lu, Zhiwu [3 ]
Affiliations
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[2] Huawei Poisson Lab, Hangzhou, Zhejiang, Peoples R China
[3] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multi-modal model; multi-task learning; cross-modal search;
DOI
10.1007/978-3-031-56027-9_4
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal search is a fundamental task in multi-modal learning, yet hardly any work aims to solve multiple cross-modal search tasks at once. In this work, we propose a novel Versatile Elastic Multi-mOdal (VEMO) model for search-oriented multi-task learning. VEMO is versatile because it integrates cross-modal semantic search, named entity recognition, and scene text spotting into a unified framework, where the latter two can be further adapted to entity- and character-based image search tasks. VEMO is also elastic because the sub-modules of its flexible network architecture can be freely assembled for the corresponding tasks. Moreover, to offer more choices in the effectiveness-efficiency trade-off when performing cross-modal semantic search, we place multiple encoder exits. Experimental results show the effectiveness of VEMO with only 37.6% of the network parameters required by uni-task training. Further evaluations on entity- and character-based image search tasks also validate the superiority of search-oriented multi-task learning.
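The "multiple encoder exits" mentioned in the abstract describe an early-exit design: the encoder can stop at an intermediate depth and still emit a usable embedding, trading some search quality for speed. Below is a minimal, hypothetical PyTorch sketch of that general idea; the class name, layer counts, exit positions, and mean-pooling are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn

class MultiExitEncoder(nn.Module):
    """Hypothetical transformer encoder with several exits: inference may stop
    at an intermediate layer and still return a pooled embedding, trading a
    little embedding quality for speed (not the VEMO implementation)."""

    def __init__(self, dim=256, num_layers=6, num_heads=4, exit_layers=(2, 4, 6)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.exit_layers = set(exit_layers)
        # One projection head per exit so every exit yields a comparable embedding.
        self.heads = nn.ModuleDict({str(i): nn.Linear(dim, dim) for i in exit_layers})

    def forward(self, x, exit_at=None):
        if exit_at is None:
            exit_at = max(self.exit_layers)  # default: run the full stack
        assert exit_at in self.exit_layers, "exit_at must be a configured exit"
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i == exit_at:
                return self.heads[str(i)](x.mean(dim=1))  # mean-pool over tokens

enc = MultiExitEncoder()
tokens = torch.randn(8, 32, 256)      # (batch, sequence length, feature dim)
fast_emb = enc(tokens, exit_at=2)     # cheaper, earlier exit
full_emb = enc(tokens, exit_at=6)     # full-depth embedding

In such a design, each exit's head is typically trained jointly so that embeddings from different depths live in a comparable retrieval space; at serving time, the exit depth becomes a latency knob.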
Pages: 56-72
Page count: 17
Related Papers
50 records in total
  • [41] Multi-Modal Multi-Task (3MT) Road Segmentation
    Milli, Erkan
    Erkent, Ozgur
    Yılmaz, Asım Egemen
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (09) : 5408 - 5415
  • [42] Multi-Modal Fusion for Multi-Task Fuzzy Detection of Rail Anomalies
    Liyuan, Yang
    Osman, Ghazali
    Abdul Rahman, Safawi
    Mustapha, Muhammad Firdaus
    IEEE ACCESS, 2024, 12 : 73925 - 73935
  • [43] Traffic Sign Recognition via Multi-Modal Tree-Structure Embedded Multi-Task Learning
    Lu, Xiao
    Wang, Yaonan
    Zhou, Xuanyu
    Zhang, Zhenjun
    Ling, Zhigang
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2017, 18 (04) : 960 - 972
  • [44] Multi-Modal Multi-Task Deep Learning for Speaker and Emotion Recognition of TV-Series Data
    Novitasari, Sashi
    Do, Quoc Truong
    Sakti, Sakriani
    Lestari, Dessi
    Nakamura, Satoshi
    2018 ORIENTAL COCOSDA - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2018, : 37 - 42
  • [45] Multi-task Learning using Multi-modal Encoder-Decoder Networks with Shared Skip Connections
    Kuga, Ryohei
    Kanezaki, Asako
    Samejima, Masaki
    Sugano, Yusuke
    Matsushita, Yasuyuki
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2017), 2017, : 403 - 411
  • [46] Multi-Task Collaboration for Cross-Modal Generation and Multi-Modal Ophthalmic Diseases Diagnosis
    Yu, Yang
    Zhu, Hongqing
    Qian, Tianwei
    Hou, Tong
    Huang, Bingcang
    IET IMAGE PROCESSING, 2025, 19 (01)
  • [47] Large Margin Multi-Modal Multi-Task Feature Extraction for Image Classification
    Luo, Yong
    Wen, Yonggang
    Tao, Dacheng
    Gui, Jie
    Xu, Chao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2016, 25 (01) : 414 - 427
  • [48] Deep Elastic Networks with Model Selection for Multi-Task Learning
    Ahn, Chanho
    Kim, Eunwoo
    Oh, Songhwai
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6528 - 6537
  • [49] A Deep Multi-task Contextual Attention Framework for Multi-modal Affect Analysis
    Akhtar, Md Shad
    Chauhan, Dushyant Singh
    Ekbal, Asif
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2020, 14 (03)
  • [50] VMMP: Verifiable privacy-preserving multi-modal multi-task prediction
    Bian, Mingyun
    Ren, Yanli
    He, Gang
    Feng, Guorui
    Zhang, Xinpeng
    INFORMATION SCIENCES, 2024, 669