Correction of Spaces in Persian Sentences for Tokenization

Cited by: 0

Authors:

Panahandeh, Mahnaz [1]

Ghanbari, Shirin [2]

Affiliations:

[1] Amirkabir Univ Technol, Dept Comp Engn & Informat Technol, Tehran, Iran

[2] Univ Essex, Dept Comp Sci & Elect, Colchester, Essex, England

Source:

2019 IEEE 5TH CONFERENCE ON KNOWLEDGE BASED ENGINEERING AND INNOVATION (KBEI 2019) | 2019

Keywords:

tokenization; Persian; Natural Language Processing; space; normalization;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

The exponential growth of the Internet and its users, together with the emergence of Web 2.0, has produced a large volume of textual data. Automatic analysis of such data can support decision making. Because online text is written by many different authors with different writing styles, pre-processing is necessary before any natural language processing task. An essential part of textual pre-processing, prior to recognizing the word vocabulary, is normalization, which includes the correction of spaces. In Persian in particular, this covers both the full-spaces between words and the half-spaces within words. A review of user comments on social media services shows that in many cases users do not follow the rules for inserting either form of space, which complicates the identification of words and hence reduces the accuracy of further processing of the text. This study examines current issues in the normalization and tokenization performed by Persian preprocessing tools and proposes a method for identifying and correcting the separation of words, that is, the correction of spaces. The results, compared with those of leading preprocessing tools, highlight the significance of the proposed methodology.
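To make the two kinds of space concrete, the following is a minimal rule-based sketch in Python. It is not the method proposed in the paper, which the abstract does not describe in detail. It assumes that the half-space is encoded as the zero-width non-joiner (U+200C), and the function name `correct_spaces` and the short affix lists are hypothetical and far from exhaustive.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, commonly used as the Persian half-space

# Illustrative affix lists (assumptions for this sketch, not from the paper).
PREFIXES = ("می", "نمی")
SUFFIXES = ("ها", "های", "تر", "ترین")


def correct_spaces(text: str) -> str:
    """Apply a few heuristic full-space and half-space corrections."""
    # Collapse runs of whitespace into a single full space.
    text = re.sub(r"\s+", " ", text).strip()

    # A standalone prefix followed by a full space: join with a half-space.
    prefix_pat = r"(?<!\S)(" + "|".join(PREFIXES) + r") (?=\S)"
    text = re.sub(prefix_pat, r"\1" + ZWNJ, text)

    # A full space before a standalone suffix: replace it with a half-space.
    suffix_pat = r"(?<=\S) (" + "|".join(SUFFIXES) + r")(?!\S)"
    text = re.sub(suffix_pat, ZWNJ + r"\1", text)

    return text
```

Rules of this kind are only heuristics: a token that looks like a suffix can also be an independent word, which is one reason dictionary-based or statistical checks are usually needed on top of pattern matching.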


Pages: 670-674

Page count: 5
