Automatic extraction of domain-specific stopwords from labeled documents

Authors
Makrehchi, Masoud [1]
Kamel, Mohamed S. [1]
Affiliation
[1] Univ Waterloo, Dept Elect & Comp Engn, Pattern Anal & Machine Intelligence Lab, Waterloo, ON N2L 3G1, Canada
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper discusses the automatic extraction of a domain-specific stopword list from a large labeled corpus. Most studies remove stopwords using a standard stopword list together with high and low document frequency thresholds. In this paper, a new approach to stopword extraction is proposed, based on the notion of backward filter-level performance and a sparsity measure of the training data. First, we discuss the motivation for updating existing lists or building new ones. Second, based on the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, a new method for building general and domain-specific stopword lists is proposed. The method assumes that a set of candidate stopwords must have minimal information content and prediction capacity, which can be estimated from classifier performance. The proposed approach is extensively compared with other methods, including inverse document frequency and information gain. According to the comparative study, the proposed approach offers more promising results, filtering out most stopwords while guaranteeing minimum information loss.
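
The abstract names inverse document frequency and information gain as baseline methods. The sketch below illustrates how those two baselines could rank candidate stopwords on a labeled corpus; it is not the paper's proposed backward filter-level method, whose details the abstract does not give. The toy corpus, the function names, and the tie-breaking rule (lowest information gain first, then lowest IDF) are all assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def idf(docs):
    """Plain inverse document frequency: log(N / df)."""
    n = len(docs)
    return {t: math.log(n / c) for t, c in document_frequency(docs).items()}


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """IG(term) = H(class) - H(class | term present or absent)."""
    n = len(docs)
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    h_class = entropy([class_counts[c] for c in classes])

    # For each term, count the labels of the documents that contain it.
    present = {}
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            present.setdefault(t, Counter())[label] += 1

    ig = {}
    for t, with_t in present.items():
        n_t = sum(with_t.values())
        without_t = [class_counts[c] - with_t[c] for c in classes]
        h_cond = (n_t / n) * entropy([with_t[c] for c in classes]) \
            + ((n - n_t) / n) * entropy(without_t)
        ig[t] = h_class - h_cond
    return ig


def stopword_candidates(docs, labels, k):
    """Return the k terms with the lowest information gain (ties: lowest IDF)."""
    ig = information_gain(docs, labels)
    scores = idf(docs)
    return sorted(ig, key=lambda t: (ig[t], scores[t], t))[:k]


# Hypothetical toy corpus: tokenized documents with class labels.
docs = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "team", "won", "the", "match"],
    ["the", "market", "fell", "in", "early", "trading"],
    ["the", "bank", "raised", "the", "rate"],
]
labels = ["sport", "sport", "finance", "finance"]

print(stopword_candidates(docs, labels, 2))
```

On this toy corpus, terms that occur evenly across both classes carry no class information and therefore rank first as candidates, while a class-specific term such as "match" receives the maximum gain and is kept.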
Pages: 222-233
Page count: 12