MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding

Cited: 0
Authors
Li, Junlong [1 ]
Xu, Yiheng [2 ]
Cui, Lei [2 ]
Wei, Furu [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
Keywords
DOI
N/A
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal pre-training with text, layout, and image has made significant progress for Visually Rich Document Understanding (VRDU), especially for fixed-layout documents such as scanned document images. However, there are still a large number of digital documents where the layout information is not fixed and must be rendered interactively and dynamically for visualization, which makes existing layout-based pre-training approaches difficult to apply. In this paper, we propose MarkupLM for document understanding tasks that use markup languages as the backbone, such as HTML/XML-based documents, where text and markup information are jointly pre-trained. Experimental results show that the pre-trained MarkupLM significantly outperforms existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.
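As a rough sketch of what "jointly pre-trained text and markup" means in practice, the snippet below encodes an HTML string so that each text token is paired with XPath-derived features and fed through the model. It assumes the Hugging Face transformers integration of MarkupLM; the class names and the "microsoft/markuplm-base" checkpoint id are not stated on this page and should be verified against the official release at https://aka.ms/markuplm.

# Minimal sketch (assumed API): encoding an HTML document with MarkupLM
# via the Hugging Face `transformers` integration.
from transformers import MarkupLMProcessor, MarkupLMModel

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html_string = """
<html>
  <body>
    <h1>MarkupLM</h1>
    <p>Joint pre-training of text and markup for HTML/XML documents.</p>
  </body>
</html>
"""

# The processor parses the HTML, extracts text nodes and their XPaths, and
# returns input_ids together with xpath_tags_seq / xpath_subs_seq features,
# which is how markup structure enters the model alongside the text.
encoding = processor(html_string, return_tensors="pt")
outputs = model(**encoding)

# Contextualized, markup-aware representations for each token.
print(outputs.last_hidden_state.shape)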
Pages: 6078 - 6087
Number of pages: 10