Moroccan Data-Driven Spelling Normalization Using Character Neural Embedding

Cited by: 2
Authors
Tachicart, Ridouane [1]
Bouzoubaa, Karim [1]
Affiliation
[1] Mohammed V Univ Rabat, Mohammadia Sch Engineers, Ave Ibn Sina BP 765 Agdal, Rabat 10090, Morocco
Keywords
Moroccan Arabic; lexicon; NLP; word embedding; neural networks; normalization
DOI
10.1142/S2196888821500044
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the growth of Web use in Morocco, the Internet has become an important source of information. On social media in particular, Moroccans communicate in several languages, leaving behind unstructured user-generated text (UGT) that presents several opportunities for Natural Language Processing (NLP). Among the languages found in this data, Moroccan Arabic (MA) stands out with a significant amount of content and several distinctive features. In this paper, we investigate written text generated online by Moroccan social media users, with an emphasis on Moroccan Arabic. To this end, we study the data in depth in several steps, using tools such as a language identification system. The most interesting findings are the use of code-switching and multiple scripts, and the low number of words in Moroccan UGT. Moreover, we used the investigated data to build a new Moroccan language resource: a lexicon of orthographic variants of Moroccan words, built with an unsupervised approach using character neural embedding. This lexicon can be useful for several NLP tasks, such as spelling normalization.
Pages: 113-131
Number of pages: 19