Arabic vs. English: Comparative Statistical Study

Cited by: 3
Authors
Alotaiby, Fahad [1]
Foda, Salah [1]
Alkharashi, Ibrahim [2]
Affiliations
[1] King Saud Univ, Coll Engn, Dept Elect Engn, Riyadh, Saudi Arabia
[2] King Abdulaziz City Sci & Technol, Comp Res Inst, Riyadh, Saudi Arabia
Keywords
Arabic; English; Clitics; Tokenization; Unigram; Bigram; Trigram;
DOI
10.1007/s13369-013-0665-3
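Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes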
07 ; 0710 ; 09 ;
Abstract
Research areas such as automatic speech recognition, optical character recognition, and information retrieval depend heavily on a good statistical representation of the language in use, and a more precise representation leads to more accurate systems. Arabic, however, is a richer and more complex language than English. Clitics are also very common in Arabic: they can be attached to a stem or to each other without orthographic marks such as an apostrophe. This motivates a large-scale study of the key statistics of Arabic and of the statistical differences between Arabic and English. To that end, two large corpora of newswire text, one Arabic and one English, of 600 million words each, are used. The distributions of word length, paragraph length, punctuation marks, unigrams, bigrams, and trigrams are presented, along with the distribution of clitics in Arabic and their statistical effect. The results show that Arabic has 76% more word types than English, although the Arabic lexicon size could be reduced by 24.54% by applying clitic tokenization.
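The kind of measurement the abstract describes (n-gram counts, word-type counts, and the lexicon reduction obtained from clitic tokenization) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' method: the transliterated toy tokens, the two-entry `PROCLITICS` list, and the naive prefix rule in `split_proclitic` are all hypothetical, and a real Arabic clitic tokenizer needs morphological analysis to avoid splitting stems that merely begin with the same letters.

```python
from collections import Counter

# Hypothetical toy proclitic list (transliterated); not taken from the paper.
PROCLITICS = ("wa", "bi")

def ngram_counts(tokens, n):
    """Count n-grams (as tuples) over a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def split_proclitic(token):
    """Naively split one leading proclitic if at least 3 characters remain."""
    for p in PROCLITICS:
        if token.startswith(p) and len(token) - len(p) >= 3:
            return [p + "+", token[len(p):]]
    return [token]

def tokenize_clitics(tokens):
    """Apply the naive proclitic split to every token."""
    return [piece for t in tokens for piece in split_proclitic(t)]

def reduction_percent(before, after):
    """Percentage reduction in the number of word types."""
    return 100.0 * (before - after) / before

# Toy transliterated text, an assumption for illustration only.
tokens = "kitab wakitab bikitab qalam waqalam kitab".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

types_before = len(set(tokens))
types_after = len(set(tokenize_clitics(tokens)))
print(types_before, types_after, reduction_percent(types_before, types_after))
```

On this toy input, splitting the proclitics merges surface forms that share a stem, so the number of distinct word types drops; the paper reports the corresponding corpus-scale effect as a 24.54% reduction in Arabic lexicon size.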
Pages: 809-820
Number of pages: 12