Knowledge Based Dimensionality Reduction for Technical Text Mining

Cited by: 0
Authors
Shalaby, Walid [1 ]
Zadrozny, Wlodek [1 ]
Gallagher, Sean [1 ]
Affiliations
[1] Univ North Carolina Charlotte, Dept Comp Sci, Charlotte, NC 28223 USA
Keywords
Dimensionality Reduction; Feature Selection; Text Classification; Patent Classification; Knowledge Bases
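DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract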
In this paper we propose a novel technique for dimensionality reduction using freely available online knowledge bases. The complexity of our method is linearly proportional to the size of the full feature set, so it can be applied efficiently to huge and complex datasets. We demonstrate this approach by investigating its effectiveness on patent data, the largest free technical text. We report empirical results on classification of the CLEF-IP 2010 dataset using bigram features supported by mentions in the Wikipedia, Wiktionary, and GoogleBooks knowledge bases. We achieve a 13-fold reduction in the number of bigram features and a 1.7% increase in classification accuracy over the unigram baseline. These results give concrete evidence that significant accuracy improvements and a massive reduction in dimensionality can be achieved using our approach, thereby helping to alleviate the tradeoff between task complexity and accuracy.
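The idea of keeping only bigram features that are supported by a knowledge base can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy sentence, and the tiny `KB_PHRASES` set (standing in for phrases mentioned in a resource such as Wikipedia or Wiktionary) are all assumptions made for illustration. Each candidate feature costs one set lookup, so the filter runs in time linear in the size of the full feature set, which matches the complexity claim in the abstract.

```python
from typing import Iterable, List, Set, Tuple

Bigram = Tuple[str, str]


def extract_bigrams(tokens: List[str]) -> List[Bigram]:
    """Return all adjacent token pairs, in order of appearance."""
    return list(zip(tokens, tokens[1:]))


def filter_by_knowledge_base(
    features: Iterable[Bigram], kb_phrases: Set[Bigram]
) -> List[Bigram]:
    """Keep only the features that appear in the knowledge-base phrase set.

    One hash lookup per feature, so the cost is linear in the number
    of candidate features.
    """
    return [feature for feature in features if feature in kb_phrases]


# Hypothetical stand-in for phrases mentioned in an online knowledge base.
KB_PHRASES: Set[Bigram] = {("fuel", "cell"), ("heat", "exchanger")}

# Toy example text (assumed, not taken from any patent dataset).
tokens = "the fuel cell feeds a heat exchanger".split()
all_bigrams = extract_bigrams(tokens)
kept = filter_by_knowledge_base(all_bigrams, KB_PHRASES)

print(len(all_bigrams), "candidate bigrams,", len(kept), "kept:", kept)
```

In this toy run, bigrams such as ("the", "fuel") or ("feeds", "a") are dropped because the assumed knowledge base has no matching entry, while the two technical phrases survive as features.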
Pages: 6
Related papers
50 in total
  • [31] Text and knowledge mining for coreference resolution
    Harabagiu, SM
    Bunescu, RC
    Maiorano, SJ
    2ND MEETING OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, 2001, : 55 - 62
  • [32] Text analysis and knowledge mining system
    Nasukawa, T
    Nagano, T
    IBM SYSTEMS JOURNAL, 2001, 40 (04) : 967 - 984
  • [33] Mining Knowledge Graphs From Text
    Pujara, Jay
    Singh, Sameer
    WSDM'18: PROCEEDINGS OF THE ELEVENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2018, : 789 - 790
  • [34] Scientific Text Mining and Knowledge Graphs
    Jiang, Meng
    Shang, Jingbo
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 3537 - 3538
  • [35] Dimensionality reduction in text classification using scatter method
    Saarikoski, Jyri
    Laurikkala, Jorma
    Jarvelin, Kalervo
    Siermala, Markku
    Juhola, Martti
    INTERNATIONAL JOURNAL OF DATA MINING MODELLING AND MANAGEMENT, 2014, 6 (01) : 1 - 21
  • [36] Text mining patents for biomedical knowledge
    Rodriguez-Esteban, Raul
    Bundschus, Markus
    DRUG DISCOVERY TODAY, 2016, 21 (06) : 997 - 1002
  • [37] Improving Knowledge-Based Systems with statistical techniques, text mining, and neural networks for non-technical loss detection
    Guerrero, Juan I.
    Leon, Carlos
    Monedero, Inigo
    Biscarri, Felix
    Biscarri, Jesus
    KNOWLEDGE-BASED SYSTEMS, 2014, 71 : 376 - 388
  • [38] Text classification based on nonlinear dimensionality reduction techniques and support vector machines.
    Shi, Lukui
    Zhang, Jun
    Liu, Enhai
    He, Pilian
    ICNC 2007: THIRD INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 1, PROCEEDINGS, 2007, : 674 - +
  • [39] A Novel Approach for Ontology-based Dimensionality Reduction for Web Text Document Classification
    Elhadad, Mohamed K.
    Badran, Khaled M.
    Salama, Gouda I.
    2017 16TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS 2017), 2017, : 373 - 378
  • [40] An Experimental Investigation on PCA Based on Cosine Similarity and Correlation for Text Feature Dimensionality Reduction
    Abdulhussain, Maysa I.
    Gan, John Q.
    2015 7TH COMPUTER SCIENCE AND ELECTRONIC ENGINEERING CONFERENCE (CEEC), 2015, : 1 - 4