End-to-End Compound Table Understanding with Multi-Modal Modeling

Cited by: 4
Authors
Li, Zaisheng [1]
Li, Yi [2]
Liang, Qiao [1,3]
Li, Pengfei [1]
Cheng, Zhanzhan [1]
Niu, Yi [1]
Pu, Shiliang [1]
Li, Xi [3]
Affiliations
[1] Hikvision Research Institute, Hangzhou, China
[2] ShanghaiTech University, Shanghai, China
[3] Zhejiang University, Hangzhou, China
Keywords
Dataset; Table Understanding; Multi-Modal Learning
DOI
10.1145/3503161.3547885
CLC Number
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Tables are a widely used format in webpages, spreadsheets, and PDFs for organizing and presenting structured data. Although table structure recognition has been successfully applied to convert image-based tables into digital structured formats, solving many real-world problems still requires a deeper understanding of the table, such as cell relationship extraction. Existing datasets related to table understanding are all based on digital formats. To boost research development, we release a new benchmark named ComFinTab with rich annotations that support both table recognition and table understanding tasks. Unlike previous datasets, which contain mostly basic tables, ComFinTab contains a large proportion of compound tables, which are much more challenging and require methods that exploit multiple information sources. Based on the dataset, we also propose a uniform, concise task formulation with an evaluation metric to better assess model performance on the table understanding task in compound tables. Finally, we propose a framework named CTUNet, which integrates visual, semantic, and position features with a graph attention network and solves the table recognition task and the challenging table understanding task as a whole. Experimental comparisons with previous advanced table understanding methods demonstrate the effectiveness of the proposed model. Code and dataset are available at https://github.com/hikopensource/DAVAR-Lab-OCR.
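The abstract's core idea, fusing per-cell visual, semantic, and position features and propagating them over a cell graph with attention, can be illustrated with a toy sketch. This is not the authors' CTUNet code; it is a minimal single-head graph-attention layer in the style of Velickovic et al., and all function names, feature dimensions, and the chain-shaped cell graph are illustrative assumptions.

```python
# Toy sketch of multi-modal cell-feature fusion + one graph attention layer.
# Assumptions (not from the paper): 8-dim features per modality, a 4-cell
# table whose cells form a simple chain graph, random weights.
import numpy as np

rng = np.random.default_rng(0)

def fuse_cell_features(visual, semantic, position):
    """Fuse the three modality vectors of each table cell by concatenation."""
    return np.concatenate([visual, semantic, position], axis=-1)

def graph_attention(h, adj, W, a):
    """One single-head GAT layer.
    h: (N, F) node features; adj: (N, N) 0/1 adjacency with self-loops;
    W: (F, Fp) projection; a: (2*Fp,) attention vector."""
    z = h @ W                                    # project nodes: (N, Fp)
    n = z.shape[0]
    # pairwise logits e_ij = LeakyReLU(a^T [z_i || z_j])
    e = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e[i, j] = np.concatenate([z[i], z[j]]) @ a
    e = np.where(e > 0, e, 0.2 * e)              # LeakyReLU
    e = np.where(adj > 0, e, -1e9)               # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over neighbors
    return alpha @ z                             # aggregate neighbor features

N, Fm = 4, 8                                     # 4 cells, 8 dims per modality
h = fuse_cell_features(rng.normal(size=(N, Fm)),
                       rng.normal(size=(N, Fm)),
                       rng.normal(size=(N, Fm)))            # (4, 24)
adj = np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)          # chain of cells
W = rng.normal(size=(h.shape[1], 16))
a = rng.normal(size=(32,))
out = graph_attention(h, adj, W, a)
print(out.shape)  # (4, 16)
```

In the actual paper the graph would connect cells by spatial or logical adjacency, and the fused features would come from trained visual, text, and position encoders rather than random vectors.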
Pages: 4112-4121 (10 pages)