Arabic vs. English: Comparative Statistical Study

Cited by: 3
Authors
Alotaiby, Fahad [1]
Foda, Salah [1]
Alkharashi, Ibrahim [2]
Affiliations
[1] King Saud Univ, Coll Engn, Dept Elect Engn, Riyadh, Saudi Arabia
[2] King Abdulaziz City Sci & Technol, Comp Res Inst, Riyadh, Saudi Arabia
Keywords
Arabic; English; Clitics; Tokenization; Unigram; Bigram; Trigram
DOI
10.1007/s13369-013-0665-3
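Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract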
Important research areas, such as automatic speech recognition, optical character recognition, and information retrieval, heavily depend on the presence of a good statistical representation of the used language. A more precise representation leads to more accurate systems. However, Arabic is a richer and more complex language than English. Moreover, clitics have a heavy presence in the Arabic language. They can be attached to a stem or to each other without orthographic marks such as an apostrophe. This raises the need to study key statistics of the Arabic language and the statistical differences between Arabic and English on a large scale. Therefore, two large Arabic and English corpora collected from newswire text data, consisting of 600 million words each, are utilized. Hence, the distribution of word length, paragraph length, punctuation marks, unigrams, bigrams and trigrams is presented. In addition, the distribution of clitics in Arabic and their statistical effect are shown. As a result, it has been shown that the number of Arabic word-types is 76 % more than in English. However, lexicon size in Arabic could be reduced by 24.54 % when applying clitics tokenization.
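The statistics named in the abstract (word types, unigrams, bigrams, trigrams, and the lexicon reduction from clitic tokenization) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the English sentence, the transliterated forms, the `+` boundary marker, and the helper names `ngrams` and `relative_reduction` are all assumptions made up for illustration, and none of the numbers come from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_reduction(before, after):
    """Percentage by which a count shrinks going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Toy corpus (illustrative only; not data from the paper).
tokens = "the cat sat on the mat and the cat ran".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Tokens are all running words; word types are the distinct word forms.
num_tokens = len(tokens)
num_types = len(set(tokens))

# Hypothetical, already-segmented surface forms: '+' marks an assumed
# clitic boundary. A real system would need an actual Arabic segmenter.
surface_forms = ["w+ktAb", "ktAb", "w+qlm", "qlm", "b+ktAb"]

# Lexicon before tokenization: each full surface form is one type.
types_before = {form.replace("+", "") for form in surface_forms}

# Lexicon after tokenization: clitics and stems are counted separately.
types_after = {part for form in surface_forms for part in form.split("+")}

reduction = relative_reduction(len(types_before), len(types_after))
```

On this toy input the lexicon shrinks because the stems and clitics recur across several surface forms, which is the effect the abstract reports for Arabic at corpus scale.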
Pages: 809-820
Page count: 12
Related papers
50 in total
  • [1] Arabic vs. English: Comparative Statistical Study
    Fahad Alotaiby
    Salah Foda
    Ibrahim Alkharashi
    Arabian Journal for Science and Engineering, 2014, 39 : 809 - 820
  • [2] Explication vs. implication in English-Arabic translation
    Al-Qinai, J
    THEORETICAL LINGUISTICS, 1999, 25 (2-3) : 235 - 255
  • [3] Empathy and Persona of English vs. Arabic Chatbots: A Survey and Future Directions
    Hamad, Omama
    Hamdi, Ali
    Shaban, Khaled
    TEXT, SPEECH, AND DIALOGUE (TSD 2022), 2022, 13502 : 525 - 537
  • [4] Directness vs. indirectness: Egyptian Arabic and US English communication style
    Nelson, GL
    Al Batal, M
    El Bakary, W
    INTERNATIONAL JOURNAL OF INTERCULTURAL RELATIONS, 2002, 26 (01) : 39 - 57
  • [5] Cross-Language Plagiarism Detection Method: Arabic vs. English
    Hattab, Ezz
    PROCEEDINGS 2015 INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ESYSTEMS ENGINEERING DESE 2015, 2015 : 141 - 144
  • [6] Adequacy in Machine vs. Human Translation: A Comparative Study of English and Persian Languages
    Farahani, Mehrdad Vasheghani
    APPLIED LINGUISTICS RESEARCH JOURNAL, 2020, 4 (05) : 84 - 104
  • [7] Global vs. Local features for Gender Identification using Arabic and English Handwriting
    Ibrahim, Ahmed S.
    Youssef, Amira E.
    Abbott, A. Lynn
    2014 IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT), 2014 : 155 - 160
  • [8] A Comparative and Contrastive Study of Arabic and English Metonymic Expressions
    Eid, Omar Abdullah Al-Haj
    Abu-Gub, Mohammed Nour
    Shureteh, Halla
    RUPKATHA JOURNAL ON INTERDISCIPLINARY STUDIES IN HUMANITIES, 2023, 15 (03):
  • [9] A Comparative Study of Political Discourse Features in English and Arabic
    Alduhaim, Asmaa
    INTERNATIONAL JOURNAL OF ENGLISH LINGUISTICS, 2019, 9 (06) : 148 - 159
  • [10] A comparative study of tense and aspect categories in Arabic and English
    Mudhsh, Badri Abdulhakim D. M.
    COGENT ARTS & HUMANITIES, 2021, 8 (01):