CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA

Cited by: 2
Authors
Lydia, E. Laxmi [1 ]
Moses, G. Jose [2 ]
Varadarajan, Vijayakumar [3 ]
Nonyelu, Fredi [4 ]
Maseleno, Andino [5 ]
Perumal, Eswaran [6 ]
Shankar, K. [6 ]
Affiliations
[1] Vignans Inst Informat Technol, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[2] Raghu Engn Coll Autonomous, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[3] Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW, Australia
[4] Briteyellow Ltd, Bedford, England
[5] STMIK Pringsewu, Lampung, Indonesia
[6] Alagappa Univ, Dept Comp Applicat, Karaikkudi, Tamil Nadu, India
Keywords
Text Mining; Hadoop MapReduce; Indexing; Lucene; Clustering; NMF; K-means
DOI
10.22452/mjcs.sp2020no1.8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Big data is a challenging field in data processing because information is retrieved from many search engines over the internet. Many large organizations that use document clustering fail to arrange the documents sequentially on their machines. Across the globe, advanced technology has contributed to high-speed internet access, but useful yet unorganized information in machine files complicates the retrieval process, and manual ordering of files has its own complications. In this paper, application software such as Apache Lucene and Hadoop is applied to text mining, for indexing and for a parallel implementation of document clustering. Within an organization, the approach identifies the structure of the text data in computer files and arranges it from files to folders, folders to subfolders, and on to higher-level folders. A deeper analysis of document clustering was performed by considering efficient algorithms such as LSI and SVD, which were compared with the newly proposed updated model of Non-Negative Matrix Factorization (NMF). The parallel Hadoop implementation produced automatic clusters of similar documents: the MapReduce framework applied the K-means algorithm to all incoming documents, and the final clusters were automatically organized into folders on the machines using Apache Lucene. The model was tested on the Newsgroup20 dataset of text documents. The paper thus demonstrates large-scale document processing using the parallel performance of MapReduce and Lucene to arrange documents automatically, which reduces computational time and improves the speed of document retrieval.
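The K-means step that the abstract assigns to the MapReduce framework can be illustrated with a minimal single-process sketch. This is not the authors' Hadoop code: it runs in plain Python rather than on a cluster, the function names (`map_phase`, `reduce_phase`, `kmeans`, `assign`) are hypothetical, and it assumes each document has already been turned into a numeric feature vector (the feature extraction stage, such as NMF, is not shown).

```python
from collections import defaultdict
from math import dist


def map_phase(vectors, centroids):
    """Map: emit (index of nearest centroid, vector) for every document vector."""
    for vec in vectors:
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        yield nearest, vec


def reduce_phase(pairs, centroids):
    """Reduce: average the vectors grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, vec in pairs:
        groups[idx].append(vec)
    updated = list(centroids)  # a centroid with no members keeps its old position
    for idx, members in groups.items():
        updated[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated


def kmeans(vectors, centroids, max_iters=20):
    """Repeat map and reduce until the centroids stop moving (or max_iters is hit)."""
    for _ in range(max_iters):
        updated = reduce_phase(map_phase(vectors, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


def assign(vectors, centroids):
    """Return the cluster index of each vector under the given centroids."""
    return [idx for idx, _ in map_phase(vectors, centroids)]
```

In a real Hadoop job the map and reduce functions would run on separate nodes and each iteration would be a separate job. The cluster indices returned by `assign` are the kind of label that could then drive the folder arrangement the abstract describes.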
Pages: 108-123
Page count: 16