Unsupervised Text Normalization Approach for Morphological Analysis of Blog Documents

Authors
Ikeda, Kazushi [1]
Yanagihara, Tadashi [1]
Matsumoto, Kazunori [1]
Takishima, Yasuhiro [1]
Affiliation
[1] KDDI R&D Labs Inc, Saitama 3568502, Japan
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose an algorithm that reduces the number of unknown words in blog documents by replacing peculiar expressions with formal expressions. Japanese blog documents contain many peculiar expressions that morphological analyzers treat as unknown sequences. Reducing these unknown sequences improves the accuracy of morphological analysis for blog documents. The conventional solution is to register peculiar expressions manually in the morphological dictionaries, which is costly and requires specialized knowledge. In our algorithm, substitution candidates for peculiar expressions are automatically retrieved from formally written documents such as newspapers and stored as substitution rules. To replace an expression correctly, a substitution rule is selected based on three criteria: its appearance frequency in the retrieval process, the edit distance between the substituted sequence and the original text, and the estimated improvement in word segmentation accuracy after the substitution. Experimental results show that our algorithm reduces the number of unknown words by 30.3%, twice the reduction rate of the conventional methods, while maintaining the same segmentation accuracy.
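The abstract names three selection criteria but does not say how they are combined. The following is a minimal sketch only, assuming a simple weighted sum; the `Rule` fields, the weights, the normalisation of each term, and the toy romanised strings are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@dataclass
class Rule:
    """Hypothetical substitution rule: peculiar expression -> formal expression."""
    source: str               # peculiar expression found in the blog text
    target: str               # formal candidate retrieved from formal documents
    frequency: int            # how often the candidate appeared during retrieval
    segmentation_gain: float  # assumed estimate in [0, 1] of segmentation improvement


def score(rule: Rule, max_frequency: int,
          w_freq: float = 1.0, w_dist: float = 1.0, w_gain: float = 1.0) -> float:
    """Assumed scoring: weighted sum of the three criteria (higher is better)."""
    freq_term = rule.frequency / max(1, max_frequency)
    # Smaller edit distance means the replacement stays closer to the original.
    dist_term = 1.0 / (1.0 + levenshtein(rule.source, rule.target))
    return w_freq * freq_term + w_dist * dist_term + w_gain * rule.segmentation_gain


def select_rule(candidates: List[Rule]) -> Optional[Rule]:
    """Return the highest-scoring candidate, or None if there are no candidates."""
    if not candidates:
        return None
    max_frequency = max(r.frequency for r in candidates)
    return max(candidates, key=lambda r: score(r, max_frequency))


if __name__ == "__main__":
    # Toy candidates for one peculiar expression (illustrative values only).
    candidates = [
        Rule("sugoooi", "sugoi", frequency=40, segmentation_gain=0.6),
        Rule("sugoooi", "sugomi", frequency=25, segmentation_gain=0.2),
    ]
    best = select_rule(candidates)
    print(best.target if best else None)
```

In this sketch both candidates are at the same edit distance from the source, so the frequency and segmentation terms decide the choice; a real system would need to estimate `segmentation_gain` by running a morphological analyzer before and after the substitution.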
Pages: 401-411
Number of pages: 11