Syllable n-gram approach for Identification and Classification of genres in Telugu language

Authors
Kumari, K. Pranitha [1]
Reddy, A. Venugopal [1]
Affiliations
[1] Osmania Univ, OUCE, Dept CSE, Hyderabad 500007, Andhra Pradesh, India
Keywords
Telugu web genres; genre identification; genre classification; syllable extraction; character n-gram features
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Internet use in India is growing steadily, and so is the amount of information available on the web in Indian languages, so web data needs to be classified to improve search results. Topic-based text classification is an active research area, but genre-based (non-topical) classification of Telugu web pages has not yet been considered. This work attempts to identify web genres in the Telugu language. Three web genres were identified from Telugu web pages on the basis of social acceptance and communicative purpose, i.e. discourse functionality. A syllable extraction algorithm is proposed for extracting character n-gram features. Classification was performed using SVM, Naive Bayes and Random Forest classifiers. The results show that the proposed algorithm gave better performance in terms of F-measure and accuracy.
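The abstract does not describe the syllable extraction algorithm itself, so the following is only a minimal sketch of one common way to split Telugu text into syllable-like units (aksharas) and build n-grams from them. The function names `syllables` and `syllable_ngrams`, the Unicode ranges used, and the rule that a consonant following a virama stays in the same unit are all assumptions for illustration, not the authors' method.

```python
VIRAMA = "\u0c4d"  # Telugu sign virama, joins consonants into a conjunct


def is_telugu(ch):
    # Assumption: anything in the Telugu Unicode block counts as Telugu.
    return "\u0c00" <= ch <= "\u0c7f"


def is_base(ch):
    # Assumption: independent vowels and consonants (U+0C05..U+0C39)
    # are the characters that can begin a new syllable.
    return "\u0c05" <= ch <= "\u0c39"


def syllables(word):
    """Split one word into syllable-like units (hypothetical helper)."""
    units = []
    for ch in word:
        if not is_telugu(ch):
            continue  # skip punctuation, digits, Latin text, joiners
        after_virama = bool(units) and units[-1].endswith(VIRAMA)
        if not units or (is_base(ch) and not after_virama):
            units.append(ch)   # start a new syllable
        else:
            units[-1] += ch    # vowel sign, virama, modifier, or conjunct part
    return units


def syllable_ngrams(text, n=2):
    """Return n-grams of consecutive syllables, computed per word."""
    grams = []
    for word in text.split():
        units = syllables(word)
        grams.extend("".join(units[i:i + n]) for i in range(len(units) - n + 1))
    return grams


if __name__ == "__main__":
    print(syllables("తెలుగు"))           # ['తె', 'లు', 'గు']
    print(syllable_ngrams("తెలుగు", 2))  # ['తెలు', 'లుగు']
```

Counts of such n-grams could then serve as features for classifiers like the SVM, Naive Bayes and Random Forest models named in the abstract; that step is omitted here because the abstract gives no details about it.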
Pages: 125-129
Page count: 5