VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification

被引:9
|
作者
Bakkali, Souhail [1 ]
Ming, Zuheng [1 ,2 ]
Coustaty, Mickael [1 ]
Rusinol, Marcal [3 ,4 ]
Ramos Terrades, Oriol [3 ]
机构
[1] La Rochelle Univ, L3i, La Rochelle, France
[2] Univ Sorbonne Paris Nord, L2TI, Villetaneuse, France
[3] Univ Autonoma Barcelona, CVC, Barcelona, Spain
[4] AllRead MLT, Barcelona, Spain
关键词
Multimodal document representation; learning; Document classification; Contrastive learning; Self-Attention; Transformers;
D O I
10.1016/j.patcog.2023.109419
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Multimodal learning from document data has achieved great success lately as it allows to pre-train se-mantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem by learning cross-modal representations through language and vi-sion cues, considering intra-and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra-and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive experiments on public benchmark datasets demonstrate the effectiveness and the generality of our model both on low-scale and large-scale datasets. (c) 2023 Elsevier Ltd. All rights reserved.
引用
收藏
页数:11
相关论文
共 50 条
  • [1] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation
    Wen, Keyu
    Xia, Jin
    Huang, Yuanyuan
    Li, Linyang
    Xu, Jiayan
    Shao, Jie
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2188 - 2197
  • [2] VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
    Wang, Teng
    Jiang, Wenhao
    Lu, Zhichao
    Zheng, Feng
    Cheng, Ran
    Yin, Chengguo
    Luo, Ping
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [3] Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
    Jiang, Chaoya
    Ye, Wei
    Xu, Haiyang
    Huang, Songfang
    Huang, Fei
    Zhang, Shikun
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 14660 - 14679
  • [4] CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
    Ma, Zhiyuan
    Li, Jianjun
    Li, Guohui
    Huang, Kaiyan
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4515 - 4524
  • [5] COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
    Lu, Haoyu
    Fei, Nanyi
    Huo, Yuqi
    Gao, Yizhao
    Lu, Zhiwu
    Wen, Ji-Rong
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15671 - 15680
  • [6] Contrastive Vision-Language Pre-training with Limited Resources
    Cui, Quan
    Zhou, Boyan
    Guo, Yu
    Yin, Weidong
    Wu, Hao
    Yoshie, Osamu
    Chen, Yubo
    [J]. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 236 - 253
  • [7] Vision-Language Pre-Training with Triple Contrastive Learning
    Yang, Jinyu
    Duan, Jiali
    Tran, Son
    Xu, Yi
    Chanda, Sampath
    Chen, Liqun
    Zeng, Belinda
    Chilimbi, Trishul
    Huang, Junzhou
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15650 - 15659
  • [8] Vision-language pre-training via modal interaction
    Cheng, Hang
    Ye, Hehui
    Zhou, Xiaofei
    Liu, Ximeng
    Chen, Fei
    Wang, Meiqing
    [J]. PATTERN RECOGNITION, 2024, 156
  • [9] PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
    Guo, Zixin
    Wang, Tzu-Jui Julius
    Pehlivan, Selen
    Radman, Abduljalil
    Laaksonen, Jorma
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 2261 - 2265
  • [10] Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training
    Li, Zejun
    Fan, Zhihao
    Chen, JingJing
    Zhang, Qi
    Huang, Xuanjing
    Wei, Zhongyu
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5939 - 5958