Effectively Mining Wikipedia for Clustering Multilingual Documents

Authors
Kumar, N. Kiran [1]
Santosh, G. S. K. [1]
Varma, Vasudeva [1]
Affiliations
[1] Int Inst Informat Technol, Hyderabad, Andhra Pradesh, India
Keywords
Multilingual Document Clustering; Wikipedia; Document Representation
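DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract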
This paper presents Multilingual Document Clustering (MDC) on comparable corpora using Wikipedia. In particular, we use the cross-lingual links, categories, outlinks, and Infobox information present in Wikipedia to enrich the document representation. We use the Bisecting k-means algorithm to cluster multilingual documents based on document similarity. Experiments are conducted with the English and Hindi Wikipedia, on the English and Hindi datasets provided by FIRE'10 for the Ad-hoc Cross-Lingual document retrieval task on Indian languages. No language-specific tools are used, which makes the proposed approach easily extensible to other languages. The system is evaluated using F-score and Purity measures, and the results obtained are encouraging.
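The pipeline named in the abstract (bisecting k-means over document similarities, scored with purity) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each document has already been mapped to a language-independent bag of Wikipedia-derived features (the feature names and toy documents below are hypothetical), uses cosine similarity, and seeds each split with the first document and the document least similar to it, a deterministic choice made here only for reproducibility.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def two_means(vectors, iters=10):
    """Split vectors into two groups; returns two lists of local indices."""
    # Assumed seeding: first vector plus the vector least similar to it.
    first = vectors[0]
    far = min(range(len(vectors)), key=lambda i: cosine(first, vectors[i]))
    centers = [first, vectors[far]]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for i, v in enumerate(vectors):
            side = 0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
            groups[side].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split, nothing left to refine
        centers = [centroid([vectors[i] for i in g]) for g in groups]
    return groups

def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)
            break
        left, right = two_means([vectors[i] for i in biggest])
        if not left or not right:
            clusters.append(biggest)
            break
        clusters.append([biggest[i] for i in left])
        clusters.append([biggest[i] for i in right])
    return clusters

def purity(clusters, labels):
    """Fraction of documents that belong to their cluster's majority class."""
    n = sum(len(c) for c in clusters)
    hits = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
               for c in clusters if c)
    return hits / n

# Hypothetical documents, already expressed as shared Wikipedia-derived features.
docs = [
    {"Cricket": 2.0, "Sport": 1.0},
    {"Election": 2.0, "Politics": 1.0},
    {"Cricket": 1.0, "Sport": 2.0},
    {"Politics": 2.0, "Election": 1.0},
]
labels = ["sport", "politics", "sport", "politics"]

clusters = bisecting_kmeans(docs, k=2)
score = purity(clusters, labels)
```

Representing documents as sparse dicts keeps the sketch independent of any vocabulary size; a real system would more likely use a sparse matrix library and a weighting scheme such as TF-IDF, neither of which is specified in the abstract.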
Pages: 254-257
Number of pages: 4