Arabic vs. English: Comparative Statistical Study

Cited by: 3
Authors
Alotaiby, Fahad [1]
Foda, Salah [1]
Alkharashi, Ibrahim [2]
Affiliations
[1] King Saud Univ, Coll Engn, Dept Elect Engn, Riyadh, Saudi Arabia
[2] King Abdulaziz City Sci & Technol, Comp Res Inst, Riyadh, Saudi Arabia
Keywords
Arabic; English; Clitics; Tokenization; Unigram; Bigram; Trigram
DOI
10.1007/s13369-013-0665-3
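Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract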
Research areas such as automatic speech recognition, optical character recognition, and information retrieval depend heavily on a good statistical representation of the language in use; a more precise representation leads to more accurate systems. Arabic, however, is a richer and more complex language than English, and clitics are especially common in it: they can attach to a stem or to each other without orthographic marks such as an apostrophe. This motivates a large-scale study of the key statistics of Arabic and of the statistical differences between Arabic and English. Two large corpora of newswire text, one Arabic and one English, of 600 million words each, are used. The distributions of word length, paragraph length, punctuation marks, unigrams, bigrams, and trigrams are presented, along with the distribution of clitics in Arabic and their statistical effect. The results show that Arabic has 76% more word types than English, but that the Arabic lexicon size could be reduced by 24.54% by applying clitic tokenization.
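The quantities named in the abstract (n-gram counts, the number of word types, and the lexicon reduction obtained by clitic tokenization) can be illustrated with a minimal sketch. It is not the authors' pipeline. The toy corpus, the transliterated words, the convention of marking proclitics as `w+` and enclitics as `+h`, and the helper names `ngram_counts`, `surface_form`, and `lexicon_reduction` are all assumptions made for illustration. The corpus is assumed to be already segmented, so no clitic segmentation algorithm is implied.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams (as tuples) over a flat token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def surface_form(segmented_word):
    """Rejoin a pre-segmented word ("w+ ktAb +h") into its written form ("wktAbh")."""
    return "".join(seg.strip("+") for seg in segmented_word.split())


def lexicon_reduction(segmented_words):
    """Compare the number of word types before and after clitic tokenization.

    Returns (types_before, types_after, percent_reduction).
    """
    surface_types = {surface_form(w) for w in segmented_words}
    # After tokenization, each clitic and each stem is its own lexicon entry.
    segment_types = {seg for w in segmented_words for seg in w.split()}
    before, after = len(surface_types), len(segment_types)
    return before, after, 100.0 * (before - after) / before


# Hypothetical toy corpus: each item is one written word, with clitic
# boundaries already marked (proclitics end in "+", enclitics start with "+").
corpus = [
    "ktAb", "w+ ktAb", "ktAb +h", "w+ ktAb +h",
    "qlm", "b+ qlm", "qlm +h", "w+ qlm",
]

surface_tokens = [surface_form(w) for w in corpus]
unigrams = ngram_counts(surface_tokens, 1)
bigrams = ngram_counts(surface_tokens, 2)
trigrams = ngram_counts(surface_tokens, 3)

before, after, pct = lexicon_reduction(corpus)
print(f"word types before: {before}, after: {after}, reduction: {pct:.1f}%")
```

On this toy input, eight distinct written forms collapse to five lexicon entries (two stems and three clitics). The percentage is a property of the toy data only and says nothing about the 24.54% reported for the 600-million-word corpus.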
Pages: 809-820
Page count: 12