TableVLM: Multi-modal Pre-training for Table Structure Recognition

Cited by: 0
Authors
Chen, Leiyuan [1 ,2 ]
Huang, Chengsong [1 ,2 ]
Zheng, Xiaoqing [1 ,2 ]
Lin, Jinshu [3 ]
Huang, Xuanjing [1 ,2 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China
[3] Hundsun, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Tables are widely used in research and business, and are suitable for human consumption, but not easily machine-processable, particularly when tables are present in images. One of the main challenges in extracting data from images of tables is accurately recognizing table structures, especially for complex tables with cells spanning multiple rows and columns. In this study, we propose a novel multi-modal pre-training model for table structure recognition, named TableVLM. With a two-stream multi-modal transformer-based encoder-decoder architecture, TableVLM learns to capture rich table-structure-related features through multiple carefully designed unsupervised objectives inspired by the notion of masked visual-language modeling. To pre-train this model, we also created a dataset, called ComplexTable, which consists of 1,000K samples to be released publicly. Experimental results show that the model built on pre-trained TableVLM improves performance by up to 1.97% in tree-editing-distance score on ComplexTable.
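To make the architecture described in the abstract more concrete, the sketch below shows one plausible reading of a two-stream multi-modal encoder-decoder pre-trained with a masked visual-language modeling objective: one transformer stream encodes image-patch features, another encodes OCR/cell text tokens (some of which are masked), and a decoder attends to both streams to generate table-structure tokens. This is a minimal illustrative sketch, not the authors' implementation; all module names, dimensions, the toy vocabularies, and the specific loss combination are assumptions.

```python
# Minimal sketch (not the authors' code) of a two-stream multi-modal
# encoder-decoder for table structure recognition, pre-trained with a
# masked visual-language modeling objective. Sizes and vocabularies are
# illustrative assumptions.
import torch
import torch.nn as nn


class TwoStreamTableEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=512, struct_vocab_size=32,
                 d_model=256, nhead=8, num_layers=4, patch_dim=768):
        super().__init__()
        # Vision stream: project pre-extracted image-patch features.
        self.patch_proj = nn.Linear(patch_dim, d_model)
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Text stream: embed OCR/cell text tokens.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Decoder generates structure tokens (e.g., <tr>, <td>, </td>, ...),
        # attending to the concatenated vision + text memory.
        self.struct_embed = nn.Embedding(struct_vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.struct_head = nn.Linear(d_model, struct_vocab_size)
        # Head for the masked visual-language modeling objective:
        # recover the identity of masked text tokens.
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, text_ids, struct_ids):
        vis = self.vision_encoder(self.patch_proj(patch_feats))
        txt = self.text_encoder(self.text_embed(text_ids))
        memory = torch.cat([vis, txt], dim=1)   # fuse the two streams
        dec = self.decoder(self.struct_embed(struct_ids), memory)
        return self.struct_head(dec), self.mlm_head(txt)


if __name__ == "__main__":
    model = TwoStreamTableEncoderDecoder()
    patches = torch.randn(2, 196, 768)           # B x num_patches x feat_dim
    text = torch.randint(1, 512, (2, 40))        # OCR tokens per table
    # Masked visual-language modeling: replace ~15% of text tokens with [MASK]=0.
    mask = torch.rand(text.shape) < 0.15
    masked_text = text.masked_fill(mask, 0)
    struct = torch.randint(0, 32, (2, 60))       # structure-token targets
    struct_logits, mlm_logits = model(patches, masked_text, struct[:, :-1])
    # Combine structure-generation loss with the masked-token recovery loss.
    loss = (nn.functional.cross_entropy(
                struct_logits.reshape(-1, 32), struct[:, 1:].reshape(-1)) +
            nn.functional.cross_entropy(
                mlm_logits[mask], text[mask]))
    print(loss.item())
```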
Pages: 2437-2449
Page count: 13
Related papers
50 records in total
  • [41] Multi-Modal Face Recognition
    Shen, Haihong
    Ma, Liqun
    Zhang, Qishan
    [J]. 2010 8TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION (WCICA), 2010, : 720 - 723
  • [42] MMCL-CPI: A multi-modal compound-protein interaction prediction model incorporating contrastive learning pre-training
    Qian, Ying
    Li, Xinyi
    Wu, Jian
    Zhang, Qian
    [J]. COMPUTATIONAL BIOLOGY AND CHEMISTRY, 2024, 112
  • [43] Multi-stage Pre-training over Simplified Multimodal Pre-training Models
    Liu, Tongtong
    Feng, Fangxiang
    Wang, Xiaojie
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 2556 - 2565
  • [44] Uni4Eye++: A General Masked Image Modeling Multi-Modal Pre-Training Framework for Ophthalmic Image Classification and Segmentation
    Cai, Zhiyuan
    Lin, Li
    He, Huaqing
    Cheng, Pujin
    Tang, Xiaoying
    [J]. IEEE TRANSACTIONS ON MEDICAL IMAGING, 2024, 43 (12) : 4419 - 4429
  • [45] MULTI-MODAL LEARNING FOR GESTURE RECOGNITION
    Cao, Congqi
    Zhang, Yifan
    Lu, Hanqing
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2015,
  • [46] Multi-modal Sensing for Behaviour Recognition
    Wang, Ziwei
    Liu, Jiajun
    Arablouei, Reza
    Bishop-Hurley, Greg
    Matthews, Melissa
    Borges, Paulo
    [J]. PROCEEDINGS OF THE 2022 THE 28TH ANNUAL INTERNATIONAL CONFERENCE ON MOBILE COMPUTING AND NETWORKING, ACM MOBICOM 2022, 2022, : 900 - 902
  • [47] Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval
    Zhang, Liang
    Hu, Anwen
    Jin, Qin
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [48] PMMN: Pre-trained multi-Modal network for scene text recognition
    Zhang, Yu
    Fu, Zilong
    Huang, Fuyu
    Liu, Yizhi
    [J]. PATTERN RECOGNITION LETTERS, 2021, 151 : 103 - 111
  • [49] In Defense of Image Pre-Training for Spatiotemporal Recognition
    Li, Xianhang
    Wang, Huiyu
    Wei, Chen
    Mei, Jieru
    Yuille, Alan
    Zhou, Yuyin
    Xie, Cihang
    [J]. COMPUTER VISION, ECCV 2022, PT XXV, 2022, 13685 : 675 - 691
  • [50] Multi-task Pre-training for Lhasa-Tibetan Speech Recognition
    Liu, Yigang
    Zhao, Yue
    Xu, Xiaona
    Xu, Liang
    Zhang, Xubei
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT IX, 2023, 14262 : 78 - 90