Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction

Cited by: 1
Authors
Espana-Bonet, Cristina [1]
Barron-Cedeno, Alberto [2]
Marquez, Lluis [3]
Affiliations
[1] DFKI GmbH, Saarbrucken, Germany
[2] Univ Bologna, Forli, Italy
[3] Amazon, AWS AI Labs, Barcelona, Spain
Keywords
Comparable corpora; Wikipedia category graph; Domain-specific corpora; Domainness metrics; RETRIEVAL; WEB;
DOI
10.1007/s10115-022-01767-5
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We propose a language-independent graph-based method to build a-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopedia's category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph model reaches an average precision of 84% on in-domain articles, outperforming an alternative model based on information retrieval techniques. As manual evaluations are costly, we introduce the concept of domainness and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with human judgments, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities.
Pages: 1365-1397
Number of pages: 33
Related papers
50 in total
  • [21] Domain Specific Multiword Extraction for English Corpora
    Kumari, Lalita
    Shukla, V. N.
    ICAIE 2009: PROCEEDINGS OF THE 2009 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND EDUCATION, VOLS 1 AND 2, 2009, : 89 - 93
  • [22] Extraction of bilingual lexicons from comparable corpora specialty: study of the lexical context
    Hazem, Amir
    Morin, Emmanuel
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2014, 55 (01): : 13 - 44
  • [23] Bilingual Lexicon Extraction from Comparable Corpora Based on Closed Concepts Mining
    Chebel, Mohamed
    Latiri, Chiraz
    Gaussier, Eric
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2017, PT I, 2017, 10234 : 586 - 598
  • [24] Bilingual Lexicon Extraction with Temporal Distributed Word Representation from Comparable Corpora
    Zhang, Chunyue
    Zhao, Tiejun
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2015, 2015, 9362 : 380 - 387
  • [25] Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge
    Chu, Chenhui
    Nakazawa, Toshiaki
    Kurohashi, Sadao
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2014, PART II, 2014, 8404 : 296 - 309
  • [26] Using Wikipedia for term extraction in the biomedical domain: first experiences
    Vivaldi, Jorge
    Rodriguez, Horacio
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2010, (45): : 251 - 254
  • [27] Adapting Open Domain Fact Extraction and Verification to COVID-FACT through In-Domain Language Modeling
    Liu, Zhenghao
    Xiong, Chenyan
    Dai, Zhuyun
    Sun, Si
    Sun, Maosong
    Liu, Zhiyuan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 2395 - 2400
  • [28] Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering
    Tamber, Manveer Singh
    Pradeep, Ronak
    Lin, Jimmy
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2023, PT III, 2023, 13982 : 163 - 176
  • [29] Improved machine translation performance via parallel sentence extraction from comparable corpora
    Munteanu, DS
    Fraser, A
    Marcu, D
    HLT-NAACL 2004: HUMAN LANGUAGE TECHNOLOGY CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE MAIN CONFERENCE, 2004, : 265 - 272
  • [30] Bilingual Lexicon Extraction using Locally Weighted Linear Regression from Comparable Corpora
    Zhang, Chunyue
    Zhao, Tiejun
    PROCEEDINGS OF 2015 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2015, : 13 - 16