CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA

Cited by: 2
Authors
Lydia, E. Laxmi [1 ]
Moses, G. Jose [2 ]
Varadarajan, Vijayakumar [3 ]
Nonyelu, Fredi [4 ]
Maseleno, Andino [5 ]
Perumal, Eswaran [6 ]
Shankar, K. [6 ]
Affiliations
[1] Vignans Inst Informat Technol, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[2] Raghu Engn Coll Autonomous, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[3] Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW, Australia
[4] Briteyellow Ltd, Bedford, England
[5] STMIK Pringsewu, Lampung, Indonesia
[6] Alagappa Univ, Dept Comp Applicat, Karaikkudi, Tamil Nadu, India
Keywords
Text Mining; Hadoop MapReduce; Indexing; Lucene; Clustering; NMF; K-means;
DOI
10.22452/mjcs.sp2020no1.8
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Big data is a challenging field in data processing because the information is retrieved from various search engines over the internet. Many large organizations that use document clustering fail to arrange the documents sequentially on their machines. Across the globe, advanced technology has contributed to high-speed internet access, but useful yet unorganized information in machine files makes the retrieval process confusing. Manual ordering of files has its own complications. In this paper, application software such as Apache Lucene and Hadoop is used for text mining, covering indexing and the parallel implementation of document clustering. Within an organization, the approach identifies the structure of the text data in computer files and its arrangement from files to folders, folders to subfolders, and on to higher-level folders. A deeper analysis of document clustering was performed by considering efficient algorithms such as LSI and SVD, which were compared with the newly proposed updated model of Non-Negative Matrix Factorization (NMF). The parallel implementation on Hadoop developed automatic clusters of similar documents. The MapReduce framework applied the K-means algorithm to all incoming documents. The final clusters were automatically organized into folders on the machines using Apache Lucene. The model was tested on the Newsgroup20 dataset of text documents. The paper thus presents an implementation for large-scale document collections that uses the parallel performance of MapReduce and Lucene to arrange documents automatically, which reduces computational time and speeds up document retrieval in any scenario.
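The abstract mentions running K-means under MapReduce. As a rough illustration only, and not the authors' implementation, a single K-means iteration can be split into a map step (assign each document vector to its nearest centroid) and a reduce step (average the vectors in each group). The sketch below runs in plain Python rather than on Hadoop; the function names `map_assign`, `reduce_update` and `kmeans_iteration`, the Euclidean distance, and the toy two-dimensional vectors are all assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_assign(doc_id, vector, centroids):
    """Map step (hypothetical name): emit (nearest centroid index, document)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: distance(vector, centroids[i]))
    return nearest, (doc_id, vector)


def reduce_update(members):
    """Reduce step (hypothetical name): average the vectors of one cluster."""
    vectors = [vector for _, vector in members]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def kmeans_iteration(docs, centroids):
    """Run one map/shuffle/reduce round and return new centroids and labels."""
    groups = defaultdict(list)  # the "shuffle": group map output by key
    for doc_id, vector in docs.items():
        key, value = map_assign(doc_id, vector, centroids)
        groups[key].append(value)
    # A centroid with no assigned documents keeps its previous position.
    new_centroids = [
        reduce_update(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
    assignment = {doc_id: key
                  for key, members in groups.items()
                  for doc_id, _ in members}
    return new_centroids, assignment


# Toy feature vectors standing in for document term weights (assumed data).
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
centroids = [[1.0, 0.0], [0.0, 1.0]]
new_centroids, assignment = kmeans_iteration(docs, centroids)
```

In a real Hadoop job the map and reduce steps would run on separate workers and the iteration would repeat until the centroids stop moving; here the grouping dictionary plays the role of the shuffle phase.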
Pages: 108-123
Page count: 16