INFOSYNC: Information Synchronization across Multilingual Semi-structured Tables

被引:0
|
作者
Khincha, Siddharth [1 ]
Jain, Chelsi [2 ]
Gupta, Vivek [3 ]
Kataria, Tushar [3 ]
Zhang, Shuo [4 ]
机构
[1] IIT Guwahati, Gauhati, India
[2] CTAE, Udaipur, Rajasthan, India
[3] Univ Utah, Salt Lake City, UT 84112 USA
[4] Bloomberg, New York, NY USA
关键词
WIKIPEDIA; BIAS;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Information Synchronization of semi-structured data across languages is challenging. For instance, Wikipedia tables in one language should be synchronized across languages. To address this problem, we introduce a new dataset INFOSYNC and a two-step method for tabular synchronization. INFOSYNC contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (similar to 3.5K pairs) are manually annotated. The proposed method includes 1) Information Alignment to map rows and 2) Information Update for updating missing/outdated information for aligned tables across multilingual tables. When evaluated on INFOSYNC, information alignment achieves an F1 score of 87.91 (en <-> non-en). To evaluate information updation, we perform human-assisted Wikipedia edits on Infoboxes for 603 table pairs. Our approach obtains an acceptance rate of 77.28% on Wikipedia, showing the effectiveness of the proposed method.
引用
收藏
页码:2536 / 2559
页数:24
相关论文
共 50 条
  • [21] Building Wikipedia Ontology with More Semi-structured Information Resources
    Kawakami, Tokio
    Morita, Takeshi
    Yamaguchi, Takahira
    SEMANTIC TECHNOLOGY, JIST 2017, 2017, 10675 : 3 - 18
  • [22] Learning information extraction rules for semi-structured and free text
    Soderland, S
    MACHINE LEARNING, 1999, 34 (1-3) : 233 - 272
  • [23] Unsupervised Extraction of Product Information from Semi-structured Sources
    Walther, Maximilian
    13TH IEEE INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND INFORMATICS (CINTI 2012), 2012, : 257 - 262
  • [24] Chinese resume information extraction based on semi-structured text
    Wentan, Yan
    Yupeng, Qiao
    Chinese Control Conference, CCC, 2017, : 11177 - 11182
  • [25] Supplementing domain knowledge to BERT with semi-structured information of documents
    Chen, Jing
    Wei, Zhihua
    Wang, Jiaqi
    Wang, Rui
    Gong, Chuanyang
    Zhang, Hongyun
    Miao, Duoqian
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 235
  • [26] Bootstrapping Information Extraction from Semi-structured Web Pages
    Carlson, Andrew
    Schafer, Charles
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PART I, PROCEEDINGS, 2008, 5211 : 195 - +
  • [27] An approach to semantic information retrieval in heterogeneous semi-structured documents
    Mrabet, Yassine
    Bennacer, Nacéra
    Pernelle, Nathalie
    Thiam, Mouhamadou
    CORIA 2010: Actes de la COnference en Recherche d'Information et Applications - Proceedings of the Conference on Information Retrieval and Applications, 2010, : 195 - 210
  • [28] Spatial Dependency Parsing for Semi-Structured Document Information Extraction
    Hwang, Wonseok
    Yim, Jinyeong
    Park, Seunghyun
    Yang, Sohee
    Seo, Minjoon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 330 - 343
  • [29] Recognition techniques for extracting information from semi-structured documents
    Della Ventura, A
    Gagliardi, I
    Zonta, B
    DOCUMENT RECOGNITION AND RETRIEVAL VIII, 2001, 4307 : 130 - 137
  • [30] Chinese resume information extraction based on semi-structured text
    Yan Wentan
    Qiao Yupeng
    PROCEEDINGS OF THE 36TH CHINESE CONTROL CONFERENCE (CCC 2017), 2017, : 11177 - 11182