GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

Cited by: 0
Authors
Costa-jussa, Marta R. [1]
Lin, Pau Li [1]
Espana-Bonet, Cristina [2, 3]
Affiliations
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain
[2] DFKI GmbH, Saarbrucken, Germany
[3] Saarland Univ, Saarbrucken, Germany
Keywords
corpora; gender bias; Wikipedia; machine translation
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information, from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a gender-balanced corpus. While our toolkit is customizable to any number of languages (and to domains other than biographical entries), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims to be one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims to pave the way for standardized procedures to produce gender-balanced datasets.
Pages: 4081-4088
Number of pages: 8