Metric learning for text documents

Cited by: 80
Author
Lebanon, G [1 ]
Affiliation
[1] Purdue Univ, Dept Stat, W Lafayette, IN 47907 USA
Keywords
distance learning; text analysis; machine learning;
DOI
10.1109/TPAMI.2006.77
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many algorithms in machine learning rely on being given a good distance metric over the input space. Rather than using a default metric such as the Euclidean metric, it is desirable to obtain a metric based on the provided data. We consider the problem of learning a Riemannian metric associated with a given differentiable manifold and a set of points. Our approach to the problem involves choosing a metric from a parametric family that is based on maximizing the inverse volume of a given data set of points. From a statistical perspective, it is related to maximum likelihood under a model that assigns probabilities inversely proportional to the Riemannian volume element. We discuss in detail learning a metric on the multinomial simplex where the metric candidates are pull-back metrics of the Fisher information under a Lie group of transformations. When applied to text document classification, the resulting geodesic distances resemble, but outperform, the TF-IDF cosine similarity measure.
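As a rough illustration of the kind of distance the abstract refers to, the sketch below computes a geodesic distance on the multinomial simplex under the Fisher information metric, using the standard closed form 2·arccos(Σ √(p_i q_i)), after pulling both points back through a transformation. The abstract does not specify the transformation family, so the coordinate-wise rescaling followed by renormalization used here is an assumption, as are all function names and the weight vector `lam`. The weights are fixed by hand; the learning step (choosing them by maximizing inverse volume) is not shown.

```python
import math


def normalize(values):
    """Scale non-negative values so they sum to 1 (a point on the simplex)."""
    total = sum(values)
    return [v / total for v in values]


def pull_back(theta, lam):
    """Hypothetical transformation: rescale each coordinate by a positive
    weight, then renormalize so the result stays on the simplex."""
    return normalize([t * w for t, w in zip(theta, lam)])


def fisher_geodesic(p, q):
    """Geodesic distance under the Fisher information metric on the simplex."""
    affinity = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [0, 1] to guard against floating-point overshoot before acos.
    return 2.0 * math.acos(min(1.0, max(0.0, affinity)))


def pulled_back_distance(x, y, lam):
    """Distance between x and y measured after applying the transformation."""
    return fisher_geodesic(pull_back(x, lam), pull_back(y, lam))


# Two toy documents as term-frequency vectors over a 4-word vocabulary.
doc_a = normalize([3, 1, 0, 2])
doc_b = normalize([1, 0, 4, 1])

# Hand-picked weights (assumed, not learned): down-weight the first term.
lam = [0.2, 1.0, 1.0, 1.0]

print(fisher_geodesic(doc_a, doc_b))
print(pulled_back_distance(doc_a, doc_b, lam))
```

With uniform weights the transformation is the identity, so the pulled-back distance reduces to the plain Fisher geodesic distance. Non-uniform weights play a role loosely similar to term weighting in TF-IDF, which is consistent with the resemblance the abstract mentions.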
Pages: 497-508
Page count: 12
Related papers
50 in total
  • [1] Concept learning of text documents
    An, JY
    Chen, YPP
    IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE (WI 2004), PROCEEDINGS, 2004, : 698 - 701
  • [2] Text Document Clustering with Metric Learning
    Wang, Jinlong
    Wu, Shunyao
    Huy Quan Vu
    Li, Gang
    SIGIR 2010: PROCEEDINGS OF THE 33RD ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH DEVELOPMENT IN INFORMATION RETRIEVAL, 2010, : 783 - 784
  • [3] A novel statistical mechanics-based metric for characterization of text documents
    Biswas, Mainak
    Chakrabartty, Shubhro
    Suri, Jasjit S.
    Song, Hanjung
    COMPUTERS & ELECTRICAL ENGINEERING, 2020, 87
  • [4] Learning-based transformation for text documents
    Ma, LP
    Shepherd, J
    Wong, RK
    6TH WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL XVIII, PROCEEDINGS: INFORMATION SYSTEMS, CONCEPTS AND APPLICATIONS OF SYSTEMICS, CYBERNETICS AND INFORMATICS, 2002, : 180 - 185
  • [5] Learning a taxonomy from a set of text documents
    Paukkeri, Mari-Sanna
    Perez Garcia-Plaza, Alberto
    Fresno, Victor
    Martinez Unanue, Raquel
    Honkela, Timo
    APPLIED SOFT COMPUTING, 2012, 12 (03) : 1138 - 1148
  • [6] Learning statistics from raw text documents
    Ben Chaabene, Nour El Houda
    Mallek, Maha
    2018 5TH INTERNATIONAL CONFERENCE ON CONTROL, DECISION AND INFORMATION TECHNOLOGIES (CODIT), 2018, : 321 - 326
  • [7] Deep Metric Learning for Scene Text Detection
    Zhu, Qi-Hai
    Zhu, Rui
    Li, Ning
    Yang, Yu-Bin
    2017 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2017, : 1025 - 1029
  • [8] Incorporating virtual relevant documents for learning in text categorization
    Lee, KS
    Kageura, K
    DIGITAL LIBRARIES: TECHNOLOGY AND MANAGEMENT OF INDIGENOUS KNOWLEDGE FOR GLOBAL ACCESS, 2003, 2911 : 62 - 72
  • [9] Heavy Weight Ontology Learning using Text Documents
    Kumar, Vikas
    Chaudhary, Sanjay
    2014 INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING (PDGC), 2014, : 110 - 114
  • [10] Learning to classify text from labeled and unlabeled documents
    Nigam, K
    McCallum, A
    Thrun, S
    Mitchell, T
    FIFTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-98) AND TENTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE (IAAI-98) - PROCEEDINGS, 1998, : 792 - 799