Bengali Document Clustering using Word Movers Distance

Authors
Ahmad, Adnan [1]
Amin, Md. Ruhul [1]
Chowdhury, Farida [1]
Affiliations
[1] Shahjalal Univ Sci & Technol, Dept Comp Sci & Engn, Search Engine Pipilika, Sylhet, Bangladesh
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a pipeline architecture for Bengali document clustering and apply it to Bengali news documents using several clustering algorithms. Our goal is to cluster news from different online newspapers by topic, so that articles describing the same story fall into the same cluster. We use Word Mover's Distance (WMD), a relatively new algorithm for measuring document distance that is based on vector representations of words. We then cluster the Bengali documents with several algorithms, namely K-means, the Hierarchical Clustering Algorithm (HCA), and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). We evaluate the resulting clusters against a manually prepared set of clusters, which we treat as the ground truth. Our experiments show that HCA performs best, with an F1-score of 92%, and that its number of clusters and cluster memberships are the most similar to the ground truth. We have also released a live working demo in which the program collects recent news from popular online Bengali newspapers and creates clusters according to the news story.
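The abstract describes three stages: word-vector-based document distances, clustering, and F1 evaluation against ground truth. The following is a minimal sketch of that flow, not the authors' implementation. Several things here are assumptions made for illustration: the toy 2-D vectors in `EMB` (with English placeholder tokens instead of trained Bengali embeddings), the use of the relaxed WMD lower bound in place of exact WMD (which would need an optimal-transport solver), average linkage with a distance `threshold` as the hierarchical step, and a pairwise F1 as the evaluation measure, since the abstract does not say how its F1-score is computed.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy word vectors; a real pipeline would load trained Bengali embeddings.
EMB = {
    "cricket": (1.0, 0.0), "match": (0.9, 0.1), "wicket": (0.95, 0.05),
    "election": (0.0, 1.0), "vote": (0.1, 0.9), "minister": (0.05, 0.95),
}

def nbow(tokens):
    """Normalized bag-of-words weights; assumes at least one in-vocabulary token."""
    counts = Counter(t for t in tokens if t in EMB)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def one_sided(a, b):
    """Cost when each word in `a` sends all of its mass to its nearest word in `b`."""
    return sum(wt * min(math.dist(EMB[w], EMB[v]) for v in b) for w, wt in a.items())

def rwmd(doc_a, doc_b):
    """Relaxed WMD: the larger one-sided relaxation, a lower bound on exact WMD."""
    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided(a, b), one_sided(b, a))

def average_linkage(docs, threshold):
    """Agglomerative clustering: merge the closest pair until it exceeds `threshold`."""
    n = len(docs)
    dist = [[rwmd(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def link(c1, c2):
        # Average pairwise distance between two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return [sorted(c) for c in clusters]

def pairwise_f1(pred, truth):
    """F1 over document pairs that share a cluster (one of several possible definitions)."""
    def pairs(cl):
        return {frozenset(p) for c in cl for p in combinations(c, 2)}
    p, t = pairs(pred), pairs(truth)
    tp = len(p & t)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)

# Four toy "news documents": two about sport, two about politics.
DOCS = [
    ["cricket", "match", "wicket"],
    ["match", "wicket"],
    ["election", "vote"],
    ["minister", "vote", "election"],
]
CLUSTERS = average_linkage(DOCS, threshold=0.5)
GROUND_TRUTH = [[0, 1], [2, 3]]
print(CLUSTERS, pairwise_f1(CLUSTERS, GROUND_TRUTH))
```

The threshold of 0.5 is arbitrary and only suits the toy vectors. With real embeddings, the cut-off (or the number of clusters) would need to be tuned against labeled data such as the manually prepared clusters mentioned in the abstract.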
Pages: 6