LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding

Cited by: 0
Authors
Xu, Yang [1 ]
Xu, Yiheng [2 ]
Lv, Tengchao [2 ]
Cui, Lei [2 ]
Wei, Furu [2 ]
Wang, Guoxin [3 ]
Lu, Yijuan [3 ]
Florencio, Dinei [3 ]
Zhang, Cha [3 ]
Che, Wanxiang [1 ]
Zhang, Min [4 ]
Zhou, Lidong [2 ]
Affiliations
[1] Harbin Inst Technol, Res Ctr Social Comp & Informat Retrieval, Harbin, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft Azure AI, Beijing, Peoples R China
[4] Soochow Univ, Suzhou, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2, an architecture with new pre-training tasks that model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction during the pre-training stage. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.8340 -> 0.8520), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672).
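The abstract's spatial-aware self-attention can be pictured as ordinary scaled dot-product attention whose logits receive learnable biases indexed by the relative 1-D token distance and the relative 2-D (x, y) positions of text-block boxes. The sketch below is a hypothetical single-head NumPy illustration only: the function name, the bias tables `b1d`/`bx`/`by`, and the simple clip-to-bucket scheme are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def spatial_aware_attention(Q, K, V, x, y, b1d, bx, by, max_rel=4):
    """Single-head attention with additive relative-position biases (sketch)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (n, n) content-based logits
    n = Q.shape[0]
    idx = np.arange(n)
    # Clip relative offsets into [-max_rel, max_rel] and shift to table indices.
    rel_1d = np.clip(idx[None, :] - idx[:, None], -max_rel, max_rel) + max_rel
    rel_x = np.clip(x[None, :] - x[:, None], -max_rel, max_rel) + max_rel
    rel_y = np.clip(y[None, :] - y[:, None], -max_rel, max_rel) + max_rel
    # Add learnable 1-D and 2-D relative-position biases to the logits.
    scores = scores + b1d[rel_1d] + bx[rel_x] + by[rel_y]
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # softmax over keys
    return w @ V

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
x = np.array([0, 1, 2, 0, 1])                # toy bucketed box x-coordinates
y = np.array([0, 0, 0, 1, 1])                # toy bucketed box y-coordinates
b1d, bx, by = (rng.standard_normal(2 * 4 + 1) * 0.1 for _ in range(3))
out = spatial_aware_attention(Q, K, V, x, y, b1d, bx, by)
print(out.shape)  # (5, 8)
```

In the full model these bias tables would be trained parameters shared per head; here they are random, since the point is only the extra additive terms in the attention logits.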
Pages: 2579-2591
Page count: 13