Correction of Spaces in Persian Sentences for Tokenization

Authors
Panahandeh, Mahnaz [1]
Ghanbari, Shirin [2]
Affiliations
[1] Amirkabir Univ Technol, Dept Comp Engn & Informat Technol, Tehran, Iran
[2] Univ Essex, Dept Comp Sci & Elect, Colchester, Essex, England
Keywords
tokenization; Persian; Natural Language Processing; space; normalization
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
The exponential growth of the Internet and its users, together with the emergence of Web 2.0, has produced a large volume of textual data. Automatic analysis of such data can support decision making. Because online text is written by different producers with different writing styles, pre-processing is necessary before any natural language processing task. An essential part of textual pre-processing, prior to recognizing the word vocabulary, is normalization, which includes the correction of spaces; in Persian, this covers both full spaces between words and half-spaces. A review of user comments on social media services shows that in many cases users do not adhere to the grammatical rules for inserting either form of space, which complicates the identification of words and hence reduces the accuracy of further processing of the text. This study examines current issues in the normalization and tokenization performed by Persian preprocessing tools and proposes a method for identifying and correcting the separation of words and the use of spaces. The results, compared with those of leading preprocessing tools, highlight the significance of the proposed methodology.
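To make the distinction between full spaces and half-spaces concrete, the sketch below shows one simple rule-based way such corrections could be applied. It is not the method proposed in the paper, which this record does not describe. The function name `correct_spaces` and the two rules (the verbal prefix "می"/"نمی" and the plural suffix "ها"/"های") are illustrative assumptions. The half-space is represented by the zero-width non-joiner, U+200C.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, used as the Persian half-space

# Hypothetical rules for illustration only; a real tool needs a far larger
# rule set or a lexicon, since e.g. "می" can also be a standalone word.
# Verbal prefix "می" or "نمی" written as a separate token before a word.
PREFIX_RE = re.compile(r"(?<!\S)(\u0646?\u0645\u06cc) (?=\S)")
# Plural suffix "ها" or "های" written as a separate token after a word.
SUFFIX_RE = re.compile(r"(?<=\S) (\u0647\u0627\u06cc?)(?!\S)")


def correct_spaces(text: str) -> str:
    """Normalize full spaces and insert half-spaces for two affix patterns."""
    # Collapse runs of whitespace into one full space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Join the prefix to the following word with a half-space.
    text = PREFIX_RE.sub(r"\1" + ZWNJ, text)
    # Join the suffix to the preceding word with a half-space.
    text = SUFFIX_RE.sub(ZWNJ + r"\1", text)
    return text


if __name__ == "__main__":
    sample = "\u0645\u0646  \u0645\u06cc \u0631\u0648\u0645"  # "من  می روم"
    # repr() makes the otherwise invisible U+200C visible in the output.
    print(repr(correct_spaces(sample)))
```

After such a pass, splitting on full spaces yields "می‌روم" as a single token rather than two, which is the kind of effect on tokenization that the abstract attributes to space correction.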
Pages: 670-674
Number of pages: 5