INFOSYNC: Information Synchronization across Multilingual Semi-structured Tables

Cited by: 0

Authors:

Khincha, Siddharth [1]

Jain, Chelsi [2]

Gupta, Vivek [3]

Kataria, Tushar [3]

Zhang, Shuo [4]

Affiliations:

[1] IIT Guwahati, Gauhati, India

[2] CTAE, Udaipur, Rajasthan, India

[3] Univ Utah, Salt Lake City, UT 84112 USA

[4] Bloomberg, New York, NY USA

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023 | 2023

Keywords:

WIKIPEDIA; BIAS;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Information synchronization of semi-structured data across languages is challenging. For instance, Wikipedia tables in one language should be synchronized with their counterparts in other languages. To address this problem, we introduce a new dataset, INFOSYNC, and a two-step method for tabular synchronization. INFOSYNC contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (approximately 3.5K pairs) is manually annotated. The proposed method includes 1) Information Alignment, which maps rows between tables, and 2) Information Update, which fills in missing or outdated information in aligned tables across languages. When evaluated on INFOSYNC, information alignment achieves an F1 score of 87.91 (en <-> non-en). To evaluate information update, we perform human-assisted Wikipedia edits on Infoboxes for 603 table pairs. Our approach obtains an acceptance rate of 77.28% on Wikipedia, showing the effectiveness of the proposed method.
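As a rough illustration of the two-step idea in the abstract (align rows, then propose updates), the toy sketch below works on two infoboxes represented as plain dictionaries. It is not the authors' method: the `KEY_MAP` lookup table, the function names, and the sample tables are all hypothetical, and a real system would need learned or multilingual matching rather than a fixed dictionary. The `alignment_f1` helper shows how an F1 score over aligned row pairs could be computed.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical key translations (French -> English), for illustration only.
KEY_MAP: Dict[str, str] = {"naissance": "born", "nationalité": "nationality"}


def align_rows(src: Dict[str, str], tgt: Dict[str, str],
               key_map: Dict[str, str]) -> List[Tuple[str, str]]:
    """Step 1 (alignment): pair each source key with its mapped target key,
    keeping only pairs whose target key is present in the target table."""
    return [(s, key_map[s]) for s in src if key_map.get(s) in tgt]


def propose_updates(src: Dict[str, str], tgt: Dict[str, str],
                    key_map: Dict[str, str]) -> Dict[str, str]:
    """Step 2 (update): collect mapped source rows that the target lacks.
    Values are copied untranslated; translating them is out of scope here."""
    return {key_map[s]: v for s, v in src.items()
            if s in key_map and key_map[s] not in tgt}


def alignment_f1(predicted: Iterable[Tuple[str, str]],
                 gold: Iterable[Tuple[str, str]]) -> float:
    """F1 over sets of aligned (source key, target key) pairs."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up sample tables.
fr_table = {"naissance": "1 janvier 1900", "nationalité": "française"}
en_table = {"born": "1 January 1900"}

aligned = align_rows(fr_table, en_table, KEY_MAP)
updates = propose_updates(fr_table, en_table, KEY_MAP)
```

Here `aligned` pairs the one row both tables share, while `updates` holds the row that the English table is missing and that a human editor could review before applying it.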


Pages: 2536-2559

Page count: 24
