Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction

Cited by: 1
Authors
España-Bonet, Cristina [1]
Barrón-Cedeño, Alberto [2]
Màrquez, Lluís [3]
Affiliations
[1] DFKI GmbH, Saarbrücken, Germany
[2] Univ Bologna, Forlì, Italy
[3] Amazon, AWS AI Labs, Barcelona, Spain
Keywords
Comparable corpora; Wikipedia category graph; Domain-specific corpora; Domainness metrics; RETRIEVAL; WEB;
DOI
10.1007/s10115-022-01767-5
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a language-independent graph-based method to build a-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopedia's category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph model reaches an average precision of 84% on in-domain articles, outperforming an alternative model based on information retrieval techniques. As manual evaluations are costly, we introduce the concept of domainness and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with human judgments, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities.
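The abstract describes the core model only as an exploration of the encyclopedia's category graph. As a rough illustration of that idea, the sketch below walks a toy category graph breadth-first from a root category and gathers the articles attached to every category it reaches. The function name `collect_domain_articles`, the depth limit `max_depth`, and the toy data are assumptions made for this example; they are not taken from the paper or from the WikiTailor toolkit, whose actual stopping criterion is not stated here.

```python
from collections import deque


def collect_domain_articles(root, subcategories, articles, max_depth):
    """Breadth-first walk of a category graph starting at `root`.

    `subcategories` maps a category to its child categories and
    `articles` maps a category to the article titles filed under it.
    Both mappings and the depth cut-off are hypothetical stand-ins.
    """
    seen = {root}                 # guards against cycles in the graph
    queue = deque([(root, 0)])
    collected = set()
    while queue:
        category, depth = queue.popleft()
        collected.update(articles.get(category, ()))
        if depth == max_depth:    # do not expand below the cut-off
            continue
        for child in subcategories.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return collected


# Toy data, invented for illustration only.
SUBCATEGORIES = {
    "Astronomy": ["Planets", "Astronomers"],
    "Planets": ["Planets in fiction"],
    "Astronomers": [],
    "Planets in fiction": ["Astronomy"],  # deliberate cycle
}
ARTICLES = {
    "Astronomy": ["Telescope"],
    "Planets": ["Mars", "Venus"],
    "Astronomers": ["Galileo Galilei"],
    "Planets in fiction": ["Dune (novel)"],
}

if __name__ == "__main__":
    print(sorted(collect_domain_articles("Astronomy", SUBCATEGORIES, ARTICLES, 1)))
```

In this toy graph, a deeper cut-off pulls in "Planets in fiction", whose articles are arguably off-domain. That is the kind of precision loss that a manual evaluation or an automatic domainness metric would need to detect.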
Pages: 1365-1397
Number of pages: 33
Related papers
50 in total
  • [1] Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction
    Cristina España-Bonet
    Alberto Barrón-Cedeño
    Lluís Màrquez
    Knowledge and Information Systems, 2023, 65 : 1365 - 1397
  • [2] Wikipedia as Multilingual Source of Comparable Corpora
    Gamallo Otero, Pablo
    Gonzalez Lopez, Isaac
    LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010, : 21 - 25
  • [3] A Multilingual Dataset for Evaluating Parallel Sentence Extraction from Comparable Corpora
    Zweigenbaum, Pierre
    Sharoff, Serge
    Rapp, Reinhard
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3828 - 3833
  • [4] Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese-Japanese Wikipedia
    Chu, Chenhui
    Nakazawa, Toshiaki
    Kurohashi, Sadao
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2016, 15 (02)
  • [5] Terminology Extraction from Comparable Corpora for Latvian
    Gornostay, Tatiana
    Ramm, Anita
    Heid, Ulrich
    Morin, Emmanuel
    Harastani, Rima
    Planas, Emmanuel
    HUMAN LANGUAGE TECHNOLOGIES: THE BALTIC PERSPECTIVE, 2012, 247 : 66 - +
  • [6] Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia
    Goyal, Vishal
    Kumar, Ajit
    Lehal, Manpreet Singh
    INTERNATIONAL JOURNAL OF E-ADOPTION, 2020, 12 (01) : 42 - 51
  • [8] Vector Disambiguation for Translation Extraction from Comparable Corpora
    Apidianaki, Marianna
    Ljubesic, Nikola
    Fiser, Darja
    INFORMATICA-JOURNAL OF COMPUTING AND INFORMATICS, 2013, 37 (02) : 193 - 202
  • [9] Addressing polysemy in bilingual lexicon extraction from comparable corpora
    Fiser, Darja
    Ljubesic, Nikola
    Kubelka, Ozren
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 3031 - 3035
  • [10] French-English terminology extraction from comparable corpora
    Daille, B
    Morin, E
    NATURAL LANGUAGE PROCESSING - IJCNLP 2005, PROCEEDINGS, 2005, 3651 : 707 - 718