End-to-End Compound Table Understanding with Multi-Modal Modeling

Cited by: 4

Authors:

Li, Zaisheng ^{[1]}

Li, Yi ^{[2]}

Liang, Qiao ^{[1,3]}

Li, Pengfei ^{[1]}

Cheng, Zhanzhan ^{[1]}

Niu, Yi ^{[1]}

Pu, Shiliang ^{[1]}

Li, Xi ^{[3]}

Affiliations:

[1] Hikvis Res Inst, Hangzhou, Peoples R China

[2] ShanghaiTech Univ, Shanghai, Peoples R China

[3] Zhejiang Univ, Hangzhou, Peoples R China

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Dataset; Table Understanding; Multi-Modal Learning;

DOI:

10.1145/3503161.3547885

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Tables are a widely used data form in webpages, spreadsheets, and PDFs for organizing and presenting structured data. Although table structure recognition has been used successfully to convert image-based tables into digital structured formats, solving many real problems still requires further understanding of the table, such as cell relationship extraction. Current datasets for table understanding are all based on the digital format. To boost research development, we release a new benchmark named ComFinTab with rich annotations that support both table recognition and table understanding tasks. Unlike previous datasets, which contain basic tables, ComFinTab contains a large proportion of compound tables, which are much more challenging and require methods that use multiple information sources. Based on the dataset, we also propose a uniform, concise task form with an evaluation metric to better evaluate model performance on table understanding in compound tables. Finally, we propose a framework named CTUNet that integrates visual, semantic, and position features with a graph attention network and can solve the table recognition task and the challenging table understanding task as a whole. Experimental comparisons with previous advanced table understanding methods demonstrate the effectiveness of the proposed model. Code and dataset are available at https://github.com/hikopensource/DAVAR-Lab-OCR.

Pages: 4112-4121

Page count: 10
