Text categorization based on a new classification by thresholds

Authors
Walid Cherif
Abdellah Madani
Mohamed Kissi
Affiliations
[1] Rabat-Institutes, Laboratory SI2M, Department of Computer Science, National Institute of Statistics and Applied Economics
[2] University Chouaib Doukkali, Laboratory LAROSERI, Department of Computer Science, Faculty of Sciences
[3] University Hassan II Casablanca, Laboratory LIM, Department of Computer Science, Faculty of Sciences and Technology
Keywords
Natural language processing; Text mining; Automated text categorization; Feature selection; Machine learning; Classification by thresholds
DOI
Not available
Abstract
Automated text categorization attempts to provide an effective solution to today’s unprecedented growth of textual data. Because it can organize huge and varied collections of texts from which invaluable insights can be gained, it has become an active field of investigation for the research community. Several mathematical approaches have been studied to formalize the main components of a text categorization system, namely text representation, feature extraction, and the classification process. Nevertheless, such systems still face many difficulties, due both to the complex nature of text databases and to the high dimensionality of text representations. This paper introduces an alternative way to address this problem. First, it reduces the original set of features by using a newly proposed metric. Second, it automatically classifies a text without necessarily processing all of its features. Moreover, some standard preprocessing steps, such as stemming, can be omitted with this approach. The experimental results showed that this new text categorization method outperforms state-of-the-art methods: the F-measures obtained on the 20 Newsgroups, BBC News, Reuters, and AG News datasets were, respectively, 95.06%, 98.21%, 88.44%, and 95.70%, while standard approaches returned considerably lower scores.
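The abstract does not give the proposed metric or the threshold rule, so the following is only a minimal illustrative sketch of the general idea of classifying a text before all of its features have been read. It is not the authors' method. The functions `train_weights` and `classify`, the class-share weighting, and the `threshold` parameter are all assumptions introduced here for illustration.

```python
from collections import defaultdict


def train_weights(docs, labels):
    """Hypothetical weighting: for each word, the share of its
    occurrences that fall in each class (not the paper's metric)."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, label in zip(docs, labels):
        for word in text.lower().split():
            counts[word][label] += 1
    weights = {}
    for word, per_class in counts.items():
        total = sum(per_class.values())
        weights[word] = {c: n / total for c, n in per_class.items()}
    return weights


def classify(text, weights, threshold=1.5):
    """Accumulate class scores word by word and stop as soon as the
    leading class is ahead of the runner-up by at least `threshold`.
    Returns the predicted class and the number of words consumed."""
    scores = defaultdict(float)
    used = 0
    for word in text.lower().split():
        used += 1
        for c, w in weights.get(word, {}).items():
            scores[c] += w
        if scores:
            ranked = sorted(scores.values(), reverse=True)
            runner_up = ranked[1] if len(ranked) > 1 else 0.0
            if ranked[0] - runner_up >= threshold:
                break  # decision reached without reading the rest
    best = max(scores, key=scores.get) if scores else None
    return best, used


if __name__ == "__main__":
    # Toy training data, invented for this example only.
    w = train_weights(
        ["goal match team goal", "stock market bank stock"],
        ["sport", "business"],
    )
    print(classify("goal team match referee whistle crowd", w))
```

In this toy setting the decision is taken once one class leads by the chosen margin, so the remaining words of the text are never scored. A real system would need a principled feature metric and threshold selection, which the abstract mentions but does not specify.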
Pages: 433-447
Page count: 14