CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA

Cited by: 2
Authors
Lydia, E. Laxmi [1]
Moses, G. Jose [2]
Varadarajan, Vijayakumar [3]
Nonyelu, Fredi [4]
Maseleno, Andino [5]
Perumal, Eswaran [6]
Shankar, K. [6]
Affiliations
[1] Vignans Inst Informat Technol, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[2] Raghu Engn Coll Autonomous, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[3] Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW, Australia
[4] Briteyellow Ltd, Bedford, England
[5] STMIK Pringsewu, Lampung, Indonesia
[6] Alagappa Univ, Dept Comp Applicat, Karaikkudi, Tamil Nadu, India
Keywords
Text Mining; Hadoop MapReduce; Indexing; Lucene; Clustering; NMF; K-means;
DOI
10.22452/mjcs.sp2020no1.8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Big data is a challenging field in data processing, since information is retrieved from various search engines over the internet. Many large organizations that use document clustering fail to arrange documents sequentially on their machines. Across the globe, advanced technology has contributed to high-speed internet access, but useful yet unorganized information in machine files makes retrieval confusing. Manual ordering of files has its own complications. In this paper, application software such as Apache Lucene and Hadoop is applied to text mining for indexing and for parallel implementation of document clustering. Within an organization, the approach identifies the structure of text data in computer files and its arrangement from files to folders, folders to subfolders, and on to higher-level folders. A deeper analysis of document clustering was performed by considering efficient algorithms such as LSI and SVD, which were compared with a newly proposed updated model of Non-Negative Matrix Factorization (NMF). The parallel Hadoop implementation developed automatic clusters of similar documents. The MapReduce framework applied the K-means algorithm to all incoming documents. The final clusters were automatically organized into folders on the machines using Apache Lucene. The model was tested on the 20 Newsgroups dataset of text documents. This paper thus presents an implementation for large-scale document collections that uses the parallel performance of MapReduce and Lucene to arrange documents automatically, which reduces computational time and speeds up document retrieval.
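As a rough illustration of the K-means step mentioned in the abstract, the following is a minimal single-machine sketch of one K-means iteration written as a map step and a reduce step. It is not the authors' implementation and does not use Hadoop or Lucene. All function names, the toy documents, and the choice of plain term-frequency vectors (in place of the NMF-derived features the paper describes) are assumptions made for illustration only.

```python
from collections import defaultdict


def tf_vector(text, vocab):
    """Term-frequency vector of `text` over a fixed vocabulary (assumed feature model)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(doc_id, vector, centroids):
    """Map step: emit (index of nearest centroid, (doc_id, vector))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(vector, centroids[i]))
    return nearest, (doc_id, vector)


def reduce_update(members):
    """Reduce step: the new centroid is the mean of the vectors in one cluster."""
    vectors = [vec for _, vec in members]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans_iteration(vectors, centroids):
    """One iteration: map every document, group by cluster key, reduce each group."""
    groups = defaultdict(list)
    for doc_id, vec in vectors.items():
        key, value = map_assign(doc_id, vec, centroids)
        groups[key].append(value)
    # An empty cluster keeps its previous centroid.
    new_centroids = [reduce_update(groups[i]) if i in groups else centroids[i]
                     for i in range(len(centroids))]
    assignment = {doc_id: key
                  for key, members in groups.items()
                  for doc_id, _ in members}
    return new_centroids, assignment


# Hypothetical toy corpus, not taken from the paper's dataset.
docs = {
    "d1": "hadoop mapreduce cluster",
    "d2": "mapreduce hadoop job",
    "d3": "lucene index search",
    "d4": "search index lucene query",
}
vocab = sorted({word for text in docs.values() for word in text.split()})
vectors = {doc_id: tf_vector(text, vocab) for doc_id, text in docs.items()}

# Seed the two centroids with two of the documents, then iterate a few times.
centroids = [vectors["d1"], vectors["d3"]]
for _ in range(5):
    centroids, assignment = kmeans_iteration(vectors, centroids)

print(assignment)
```

In a real MapReduce job, the grouping done here with a dictionary would be handled by the framework's shuffle phase, and the current centroids would have to be distributed to every mapper before each iteration.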
Pages: 108-123
Page count: 16
Related papers
50 items in total
  • [31] A Novel Indexing Technique for Web Documents using Hierarchical Clustering
    Gupta, Deepti
    Bhatia, Komal Kumar
    Sharma, A. K.
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2009, 9 (09): : 168 - 175
  • [32] Security framework using Hadoop for Big Data
    Johri, Prashant
    Kumar, Arun
    Das, Sanjoy
    Arora, Sanchita
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND AUTOMATION (ICCCA), 2017, : 268 - 272
  • [33] Big Data Compression using SPIHT in Hadoop
    Jati, Grafika
    Kusuma, Ilham
    Hilman, M. H.
    Jatmiko, Wisnu
    2016 INTERNATIONAL WORKSHOP ON BIG DATA AND INFORMATION SECURITY (IWBIS), 2016, : 133 - 137
  • [34] Using Hadoop on the Mainframe: A Big Solution for the Challenges of Big Data
    Seay, Cameron
    Agrawal, Rajeev
    Kadadi, Anirudh
    Barel, Yannick
    2015 12TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY - NEW GENERATIONS, 2015, : 765 - 769
  • [35] Clustering of Association Rules for Big Datasets using Hadoop MapReduce
    Moahmmed, Salahadin A.
    Alasow, Mohamed A.
    El-Alfy, El-Sayed M.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2021, 12 (03) : 536 - 545
  • [36] Big Data Analysis Using Hadoop Cluster
    Saldhi, Ankita
    Goel, Abhinav
    Yadav, Dipesh
    Saldhi, Ankur
    Saksena, Dhruv
    Indu, S.
    2014 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (IEEE ICCIC), 2014, : 572 - 575
  • [37] Automatic Feature Detection and Clustering Using Random Indexing
    Nakouri, Haifa
    Limam, Mohamed
    IMAGE AND SIGNAL PROCESSING, ICISP 2014, 2014, 8509 : 586 - 593
  • [38] Discovery multiple data structures in Big Data through global optimization and clustering methods
    Bifulco, Ida
    Cirillo, Stefano
    2018 22ND INTERNATIONAL CONFERENCE INFORMATION VISUALISATION (IV), 2018, : 117 - 121
  • [39] The development of a low-cost big data cluster using Apache Hadoop and Raspberry Pi. A complete guide
    Neto, Antonio Jose Alves
    Neto, Jose Aprigio Carneiro
    Moreno, Edward David
    COMPUTERS & ELECTRICAL ENGINEERING, 2022, 104
  • [40] Typhoon Quantitative Rainfall Prediction from Big Data Analytics by Using the Apache Hadoop Spark Parallel Computing Framework
    Wei, Chih-Chiang
    Chou, Tzu-Hao
    ATMOSPHERE, 2020, 11 (08)