Total: 50 records
- [1] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation [C]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 2188-2197
- [2] VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [C]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
- [3] Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation [C]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023: 14660-14679
- [4] CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training [C]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4515-4524
- [5] COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [C]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 15671-15680
- [6] Contrastive Vision-Language Pre-training with Limited Resources [C]. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696: 236-253
- [7] Vision-Language Pre-Training with Triple Contrastive Learning [C]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 15650-15659
- [9] PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting [C]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023: 2261-2265
- [10] Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training [C]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023: 5939-5958