Metric learning for text documents

Cited by: 80
Authors
Lebanon, G [1]
Affiliations
[1] Purdue Univ, Dept Stat, W Lafayette, IN 47907 USA
Keywords
distance learning; text analysis; machine learning
DOI
10.1109/TPAMI.2006.77
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many algorithms in machine learning rely on being given a good distance metric over the input space. Rather than using a default metric such as the Euclidean metric, it is desirable to obtain a metric based on the provided data. We consider the problem of learning a Riemannian metric associated with a given differentiable manifold and a set of points. Our approach to the problem involves choosing a metric from a parametric family that is based on maximizing the inverse volume of a given data set of points. From a statistical perspective, it is related to maximum likelihood under a model that assigns probabilities inversely proportional to the Riemannian volume element. We discuss in detail learning a metric on the multinomial simplex where the metric candidates are pull-back metrics of the Fisher information under a Lie group of transformations. When applied to text document classification, the resulting geodesic distances resemble, but outperform, the tfidf cosine similarity measure.
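As a point of reference for the abstract, the following is a minimal sketch of the standard (unlearned) Fisher geodesic distance on the multinomial simplex, d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)). It does not implement the learned pull-back metric the paper proposes. The function names `to_simplex` and `fisher_geodesic`, the smoothing constant, and the toy vocabulary are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def to_simplex(tokens, vocab, smoothing=1e-9):
    """Map a token list to a point on the multinomial simplex.

    The small additive smoothing is an assumption made here so that
    every coordinate is strictly positive.
    """
    counts = Counter(tokens)
    raw = [counts[w] + smoothing for w in vocab]
    total = sum(raw)
    return [x / total for x in raw]


def fisher_geodesic(p, q):
    """Geodesic distance under the Fisher information metric.

    Uses the closed form 2 * arccos(sum_i sqrt(p_i * q_i)); the inner
    sum is clamped to [0, 1] to guard against floating-point drift.
    """
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


# Toy documents over a hypothetical four-word vocabulary.
vocab = ["metric", "text", "image", "color"]
doc_a = to_simplex(["metric", "text", "text"], vocab)
doc_b = to_simplex(["metric", "metric", "text"], vocab)
doc_c = to_simplex(["image", "color", "color"], vocab)

# Documents sharing terms are closer than documents with disjoint terms.
print(fisher_geodesic(doc_a, doc_b) < fisher_geodesic(doc_a, doc_c))
```

The distance ranges from 0 for identical distributions to pi for distributions with disjoint support. A learned metric in the sense of the abstract would first transform the points before applying a distance of this form.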
Pages: 497-508
Page count: 12
Related papers
50 in total
  • [41] Locating text in color documents
    Strouthopoulos, C
    Papamarkos, N
    Atsalakis, A
    Chamzas, C
    2001 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS, 2001, : 1066 - 1069
  • [42] A database of Coptic text documents
    Delattre, Alain
    COPTIC STUDIES ON THE THRESHOLD OF A NEW MILLENNIUM, VOLS I AND II, 2004, 133 : 491 - 493
  • [43] Text Simplification of Patent Documents
    Kang, Jeongwoo
    Souili, Achille
    Cavallucci, Denis
    AUTOMATED INVENTION FOR SMART INDUSTRIES, 2018, 541 : 225 - 237
  • [44] Text identification in color documents
    Strouthopoulos, C
    Papamarkos, N
    Atsalakis, A
    Chamzas, C
    ISPA 2003: PROCEEDINGS OF THE 3RD INTERNATIONAL SYMPOSIUM ON IMAGE AND SIGNAL PROCESSING AND ANALYSIS, PTS 1 AND 2, 2003, : 702 - 705
  • [45] Text Summarization of Spanish Documents
    Umadevi, K. S.
    Chopra, Romansha
    Singh, Nivedita
    Aruru, Likitha
    Kannan, R. Jagadeesh
    2018 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2018, : 1793 - 1797
  • [46] Text binarization in color documents
    Badekas, Efthimios
    Nikolaou, Nikos
    Papamarkos, Nikos
    INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY, 2006, 16 (06) : 262 - 274
  • [47] Towards Ensemble-Based Imbalanced Text Classification Using Metric Learning
    Komamizu, Takahiro
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II, 2023, 14147 : 188 - 202
  • [48] Hierarchical clustering of text documents
    Lomakina, L. S.
    Rodionov, V. B.
    Surkova, A. S.
    Automation and Remote Control, 2014, 75 : 1309 - 1315
  • [49] Discriminative clustering of text documents
    Peltonen, J
    Sinkkonen, J
    Kaski, S
    ICONIP'02: PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON NEURAL INFORMATION PROCESSING: COMPUTATIONAL INTELLIGENCE FOR THE E-AGE, 2002, : 1956 - 1960
  • [50] Detecting Plagiarism in Text Documents
    Hariharan, Shanmugasundaram
    Kamal, Sirajudeen
    Faisal, Abdul Vadud Mohamed
    Azharudheen, Sheik Mohamed
    Raman, Bhaskaran
    INFORMATION PROCESSING AND MANAGEMENT, 2010, 70 : 497 - 500