Studying the history of the Arabic language: language technology and a large-scale historical corpus

Cited by: 8
Authors
Belinkov, Yonatan [1, 2]
Magidow, Alexander [3]
Barrón-Cedeño, Alberto [4]
Shmidman, Avi [5, 6]
Romanov, Maxim [7]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[2] Harvard John A Paulson Sch Engn & Appl Sci, Cambridge, MA 02138 USA
[3] Univ Rhode Isl, Dept Modern & Class Languages & Literatures, Kingston, RI 02881 USA
[4] HBKU, Qatar Comp Res Inst, Doha, Qatar
[5] Bar Ilan Univ, Dept Hebrew Literature, IL-5290002 Ramat Gan, Israel
[6] Dicta Israel Ctr Text Anal, Ve Olamo 8, IL-9546306 Jerusalem, Israel
[7] Univ Vienna, Dept Hist, Vienna, Austria
Funding
Israel Science Foundation;
Keywords
Arabic; Corpus; Periodization; Text reuse; Historical linguistics;
DOI
10.1007/s10579-019-09460-w
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.
Pages: 771 - 805
Page count: 35
Related papers
50 in total
  • [1] Studying the history of the Arabic language: language technology and a large-scale historical corpus
    Yonatan Belinkov
    Alexander Magidow
    Alberto Barrón-Cedeño
    Avi Shmidman
    Maxim Romanov
    [J]. Language Resources and Evaluation, 2019, 53 : 771 - 805
  • [2] Extracting answers to natural language questions from large-scale corpus
    Li, P
    Wang, XL
    Guan, Y
    Zhao, YM
    [J]. PROCEEDINGS OF THE 2005 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (IEEE NLP-KE'05), 2005, : 690 - 694
  • [3] LANS: Large-scale Arabic News Summarization Corpus
    Alhamadani, Abdulaziz
    Zhang, Xuchao
    He, Jianfeng
    Khatri, Aadyant
    Lu, Chang-Tien
    [J]. ArabicNLP 2023 - 1st Arabic Natural Language Processing Conference, Proceedings, 2023, : 89 - 100
  • [4] A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models
    Santos, David
    Auquilla, Andres
    Siguenza-Guzman, Lorena
    Pena, Mario
    [J]. INFORMATION AND COMMUNICATION TECHNOLOGIES (TICEC 2021), 2021, 1456 : 87 - 100
  • [5] A corpus-based connectionist architecture for large-scale natural language parsing
    Tepper, JA
    Powell, HM
    Palmer-Brown, D
    [J]. CONNECTION SCIENCE, 2002, 14 (02) : 93 - 114
  • [6] An Arabic Sign Language Corpus for Instructional Language in School
    Almohimeed, Abdulaziz
    Wald, Mike
    Damper, Robert
    [J]. LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010, : A7 - A10
  • [7] Large-scale distributed language modeling
    Emami, Ahmad
    Papineni, Kishore
    Sorensen, Jeffrey
    [J]. 2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 37 - +
  • [8] MOALLEMCorpus: A Large-Scale Multimedia Corpus for Children Education of Arabic Vocabularies
    Al-Maadeed, Somaya
    AlJa'am, Jihad
    Khalifa, Batoul
    Abou Elsaud, Samir
    [J]. PROCEEDINGS OF THE 2021 IEEE GLOBAL ENGINEERING EDUCATION CONFERENCE (EDUCON), 2021, : 891 - 896
  • [9] Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis
    Wu, Stephen T.
    Liu, Hongfang
    Li, Dingcheng
    Tao, Cui
    Musen, Mark A.
    Chute, Christopher G.
    Shah, Nigam H.
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2012, 19 (E1) : E149 - E156
  • [10] Large-Scale Network Involvement in Language Processing
    Wylie, Korey P.
    Regner, Michael F.
    [J]. JOURNAL OF NEUROSCIENCE, 2014, 34 (47) : 15505 - 15507