GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

Cited by: 0

Authors:

Costa-jussa, Marta R. [1]

Lin, Pau Li [1]

Espana-Bonet, Cristina [2,3]

Affiliations:

[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain

[2] DFKI GmbH, Saarbrucken, Germany

[3] Saarland Univ, Saarbrucken, Germany

Source:

PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020

Keywords:

corpora; gender bias; Wikipedia; machine translation;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a gender-balanced corpus. While our toolkit is customizable to any number of languages (and to domains other than biographical entries), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardize procedures to produce gender-balanced datasets.

Pages: 4081-4088

Page count: 8
