Creating Indonesian-Javanese Parallel Corpora Using Wikipedia Articles

Authors
Trisedya, Bayu Distiawan [1 ]
Inastra, Dyah [1 ]
Affiliation
[1] Univ Indonesia, Fac Comp Sci, Depok, Indonesia
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Parallel corpora are necessary for multilingual research, especially in information retrieval (IR) and natural language processing (NLP). However, such corpora are hard to find, particularly for low-resource languages such as ethnic languages, for which parallel corpora have usually been collected manually. Meanwhile, Wikipedia, a free online encyclopedia, supports more languages each year, including ethnic languages of Indonesia, and has become one of the largest multilingual sites on the World Wide Web offering freely distributed articles. In this paper, we explore several sentence alignment methods that were previously used in other domains, to check whether Wikipedia can serve as a resource for collecting parallel corpora of Indonesian and Javanese, an ethnic language of Indonesia. We used two approaches to sentence alignment, treating Wikipedia as both a parallel corpus and a comparable corpus. In the parallel case, we used sentence-length-based and word-correspondence methods. In the comparable case, we used the characteristics of Wikipedia's hypertext links. The experiments show that Wikipedia is useful for this purpose: both approaches gave positive results.
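To illustrate what a sentence-length-based method can look like, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the algorithm, so this uses a simplified dynamic program that charges the absolute character-length difference of each aligned group plus a fixed penalty for anything other than a one-to-one match. The function name `align_by_length`, the `BEADS` table, and the penalty values are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Allowed alignment groups ("beads"): (source sentences, target sentences)
# mapped to a hypothetical fixed penalty. Values are illustrative only.
BEADS = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0, (2, 1): 3.0, (1, 2): 3.0}

Alignment = List[Tuple[List[str], List[str]]]


def align_by_length(src: List[str], tgt: List[str]) -> Alignment:
    """Align two sentence lists by minimising total length mismatch."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: best cost of aligning src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the aligned groups.
    pairs: Alignment = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Calling `align_by_length(indonesian_sentences, javanese_sentences)` on two sentence lists from a pair of articles returns a list of aligned groups, some of which may be one-to-two or two-to-one merges. A length-only cost of this kind presumes the two articles are close translations, which is why the abstract pairs it with a word-correspondence method and a separate link-based approach for the comparable-corpus case.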
Pages: 239-245
Page count: 7