LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding

Cited by: 0

Authors:

Xu, Yang [1]

Xu, Yiheng [2]

Lv, Tengchao [2]

Cui, Lei [2]

Wei, Furu [2]

Wang, Guoxin [3]

Lu, Yijuan [3]

Florencio, Dinei [3]

Zhang, Cha [3]

Che, Wanxiang [1]

Zhang, Min [4]

Zhou, Lidong [2]

Affiliations:

[1] Harbin Inst Technol, Res Ctr Social Comp & Informat Retrieval, Harbin, Peoples R China

[2] Microsoft Res Asia, Beijing, Peoples R China

[3] Microsoft Azure AI, Beijing, Peoples R China

[4] Soochow Univ, Suzhou, Peoples R China

Source:

59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1 | 2021

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.8340 -> 0.8520), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672).
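The score pairs quoted in the abstract can be turned into absolute gains with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the abstract, while the names `scores` and `absolute_gains` are hypothetical. The abstract does not say what metric each benchmark uses or exactly which system the left-hand number belongs to, so the code treats each pair simply as (before, after).

```python
# Score pairs (before, after) copied from the abstract.
# Assumption: the left value is the earlier reference score and the
# right value is the LayoutLMv2 score; metrics differ per benchmark.
scores = {
    "FUNSD": (0.7895, 0.8420),
    "CORD": (0.9493, 0.9601),
    "SROIE": (0.9524, 0.9781),
    "Kleister-NDA": (0.8340, 0.8520),
    "RVL-CDIP": (0.9443, 0.9564),
    "DocVQA": (0.7295, 0.8672),
}


def absolute_gains(results):
    """Return the absolute difference (after - before) per benchmark."""
    return {
        name: round(after - before, 4)
        for name, (before, after) in results.items()
    }


gains = absolute_gains(scores)

# Print benchmarks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:13s} +{gain:.4f}")
```

Because the benchmarks use different metrics, the gains are best read per task and not compared directly with one another.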


Pages: 2579-2591

Page count: 13

50 items in total

[1] LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding
Tu, Yi
Guo, Ya
Chen, Huan
Tang, Jinyang
[J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15200 - 15212
[2] MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding
Li, Junlong
Xu, Yiheng
Cui, Lei
Wei, Furu
[J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 6078 - 6087
[3] VRDU: A Benchmark for Visually-rich Document Understanding
Wang, Zilong
Zhou, Yichao
Wei, Wei
Lee, Chen-Yu
Tata, Sandeep
[J]. PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 5184 - 5193
[4] WUKONG-READER: Multi-modal Pre-training for Fine-grained Visual Document Understanding
Bai, Haoli
Liu, Zhiguang
Meng, Xiaojun
Li, Wentao
Liu, Shuang
Luo, Yifeng
Xie, Nian
Zheng, Rongfu
Wang, Liangwei
Hou, Lu
Wei, Jiansheng
Jiang, Xin
Liu, Qun
[J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 13386 - 13401
[5] Multi-Modal Contrastive Pre-training for Recommendation
Liu, Zhuang
Ma, Yunpu
Schubert, Matthias
Ouyang, Yuanxin
Xiong, Zhang
[J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 99 - 108
[6] MMPT'21: International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding
Liu, Bei
Fu, Jianlong
Chen, Shizhe
Jin, Qin
Hauptmann, Alexander
Rui, Yong
[J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 694 - 695
[7] MULTI-MODAL PRE-TRAINING FOR AUTOMATED SPEECH RECOGNITION
Chan, David M.
Ghosh, Shalini
Chakrabarty, Debmalya
Hoffmeister, Bjorn
[J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 246 - 250
[8] MGeo: Multi-Modal Geographic Language Model Pre-Training
Ding, Ruixue
Chen, Boli
Xie, Pengjun
Huang, Fei
Li, Xin
Zhang, Qiang
Xu, Yao
[J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 185 - 194
[9] TableVLM: Multi-modal Pre-training for Table Structure Recognition
Chen, Leiyuan
Huang, Chengsong
Zheng, Xiaoqing
Lin, Jinshu
Huang, Xuanjing
[J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 2437 - 2449
[10] Enhancing Visually-Rich Document Understanding via Layout Structure Modeling
Li, Qiwei
Li, Zuchao
Cai, Xiantao
Du, Bo
Zhao, Hai
[J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4513 - 4523
