Using bigrams detection for text categorization in scientific domain

Cited by: 0
Authors
Montejo Raez, Arturo [1]
Perea Ortega, Jose Manuel [1]
Martin Valdivia, Maria Teresa [1]
Urena Lopez, L. Alfonso [1]
Affiliation
[1] Univ Jaen, Dept Informat, Escuela Politecn Super, E-23071 Jaen, Spain
Keywords
Bigrams; Text Categorization; Multi-words; HEP collection
DOI
Not available
Chinese Library Classification number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
This paper presents experiments on multi-word detection for text categorization in the scientific domain. We used part of the collection of High Energy Physics (HEP) scientific papers provided by the European Laboratory for Particle Physics (CERN). The supervised machine learning algorithms employed were Rocchio and PLAUM. Multi-word detection was limited to fixed sequences of at most two terms, known as bigrams. The aim of this study is to determine whether using frequent bigrams as unique features improves text categorization in this specific domain. Our conclusion is that multi-word detection should not be used for this task in the HEP domain.
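The abstract does not give the detection procedure, so the following is only a minimal sketch of what frequent-bigram detection could look like. The frequency threshold `min_count`, the whitespace tokenization, the underscore-joined feature names, and the choice to replace the two terms by one merged feature are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from typing import Iterable

def frequent_bigrams(docs: Iterable[list[str]], min_count: int = 2) -> set[tuple[str, str]]:
    """Return adjacent term pairs occurring at least min_count times in total.

    min_count is a hypothetical threshold, not a value from the paper.
    """
    counts: Counter = Counter()
    for tokens in docs:
        # zip(tokens, tokens[1:]) yields every pair of adjacent terms.
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}

def merge_bigrams(tokens: list[str], bigrams: set[tuple[str, str]]) -> list[str]:
    """Greedily scan left to right, replacing each detected pair by one feature."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both terms are consumed by the merged feature
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy documents, invented for illustration only (not from the HEP collection).
docs = [
    "higgs boson decay in the standard model".split(),
    "search for the higgs boson".split(),
    "standard model predictions".split(),
]
bigrams = frequent_bigrams(docs, min_count=2)
features = [merge_bigrams(d, bigrams) for d in docs]
```

The resulting token lists could then be fed to any bag-of-features classifier. Whether such merged features help is an empirical question; the abstract reports that they did not for the HEP collection.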
Pages: 91-98
Number of pages: 8