CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA

Cited by: 2
Authors
Lydia, E. Laxmi [1]
Moses, G. Jose [2]
Varadarajan, Vijayakumar [3]
Nonyelu, Fredi [4]
Maseleno, Andino [5]
Perumal, Eswaran [6]
Shankar, K. [6]
Affiliations
[1] Vignans Inst Informat Technol, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[2] Raghu Engn Coll Autonomous, Comp Sci & Engn, Visakhapatnam, Andhra Pradesh, India
[3] Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW, Australia
[4] Briteyellow Ltd, Bedford, England
[5] STMIK Pringsewu, Lampung, Indonesia
[6] Alagappa Univ, Dept Comp Applicat, Karaikkudi, Tamil Nadu, India
Keywords
Text Mining; Hadoop MapReduce; Indexing; Lucene; Clustering; NMF; K-means
DOI
10.22452/mjcs.sp2020no1.8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Big data is a challenging field in data processing because the information is retrieved from various search engines over the internet. Many large organizations that use document clustering fail to arrange documents sequentially on their machines. Across the globe, advanced technology has contributed to high-speed internet access, but useful yet unorganized information in machine files makes the retrieval process confusing. Manual ordering of files has its own complications. In this paper, application software such as Apache Lucene and Hadoop is used for text mining, covering indexing and a parallel implementation of document clustering. Within organizations, it identifies the structure of the text data in computer files and its arrangement from files to folders, folders to subfolders, and to higher-level folders. A deeper analysis of document clustering was performed by considering efficient algorithms such as LSI and SVD, which were compared with the newly proposed updated model of Non-Negative Matrix Factorization (NMF). The parallel implementation on Hadoop developed automatic clusters of similar documents. The MapReduce framework applied the K-means algorithm to all incoming documents. The final clusters were automatically organized into folders on the machines using Apache Lucene. The model was tested on the 20 Newsgroups (Newsgroup20) dataset of text documents. Thus, this paper presents an implementation for large-scale document collections that uses the parallel performance of MapReduce and Lucene to arrange documents automatically, which reduces computational time and speeds up document retrieval in any scenario.
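The abstract's K-means step under MapReduce can be illustrated with a minimal, single-process sketch. This is not the authors' implementation: it only mimics the map (assign each document to its nearest centroid) and reduce (average each group into a new centroid) phases in plain Python. The function names, the two-dimensional toy feature vectors, and the file names are all hypothetical; in the paper's setting the vectors would come from a feature-extraction step such as NMF and the phases would run as Hadoop jobs.

```python
from collections import defaultdict


def nearest(vec, centroids):
    """Return the index of the centroid closest to vec (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, centroids[i])),
    )


def map_phase(docs, centroids):
    """Map: emit (cluster_id, (doc_id, vector)) for every document."""
    for doc_id, vec in docs.items():
        yield nearest(vec, centroids), (doc_id, vec)


def reduce_phase(pairs, centroids):
    """Reduce: group by cluster id and average the vectors of each group."""
    groups = defaultdict(list)
    for cid, item in pairs:
        groups[cid].append(item)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for cid, items in groups.items():
        dim = len(items[0][1])
        new_centroids[cid] = [
            sum(vec[d] for _, vec in items) / len(items) for d in range(dim)
        ]
    members = {cid: sorted(doc_id for doc_id, _ in items)
               for cid, items in groups.items()}
    return new_centroids, members


def kmeans(docs, centroids, max_iter=20):
    """Repeat map and reduce until the centroids stop changing."""
    members = {}
    for _ in range(max_iter):
        new_centroids, members = reduce_phase(map_phase(docs, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, members


# Hypothetical toy feature vectors (two features per document).
docs = {
    "hockey_1.txt": [0.9, 0.1],
    "hockey_2.txt": [0.8, 0.2],
    "space_1.txt": [0.1, 0.9],
    "space_2.txt": [0.2, 0.8],
}
centroids, members = kmeans(docs, [[1.0, 0.0], [0.0, 1.0]])
print(members)
```

Each resulting cluster in `members` corresponds to one group of similar documents, which is the unit the abstract describes as being organized into a folder afterwards.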
Pages: 108-123
Number of pages: 16
Related papers
50 in total
  • [21] Chaotic Association Feature Extraction of Big Data Clustering Based on the Internet of Things
    Liu, Xiaomin
    Singh, Thipendra Pal
    Gupta, Rajeev Kumar
    Onyema, Edeh Michael
    INFORMATICA-AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS, 2022, 46 (03): : 333 - 342
  • [22] An Efficient Approach to Extract and Store Big Semantic Web Data Using Hadoop and Apache Spark GraphX
    Mohammed, Wria Mohammed Salih
    Maa, Alaa Khalil Ju
    ADCAIJ-ADVANCES IN DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE JOURNAL, 2024, 13
  • [23] Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop
    Yu, Yanwei
    Zhao, Jindong
    Wang, Xiaodong
    Wang, Qin
    Zhang, Yonggang
    INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS, 2015,
  • [24] A Comparative Study of Various Clustering Techniques on Big Data Sets using Apache Mahout
    Eluri, Venkateswara Reddy
    AL-Jabri, Amina Salim Mohd
    Ramesh, M.
    Jane, Mare
    2016 3RD MEC INTERNATIONAL CONFERENCE ON BIG DATA AND SMART CITY (ICBDSC), 2016, : 374 - 377
  • [25] Segmenting and Indexing Old Documents Using a Letter Extraction
    Coustaty, Mickael
    Dubois, Sloven
    Ogier, Jean-Marc
    Menard, Michel
    GRAPHICS RECOGNITION: ACHIEVEMENTS, CHALLENGES, AND EVOLUTION, 2010, 6020 : 142 - 149
  • [26] A Big Data Framework for Satellite Images Processing using Apache Hadoop and RasterFrames: A Case Study of Surface Water Extraction in Phu Tho, Viet Nam
    Dung Nguyen
    Hong Anh Le
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (12) : 780 - 786
  • [27] Processing Big Data with Apache Hadoop in the Current Challenging Era of COVID-19
    Azeroual, Otmane
    Fabre, Renaud
    BIG DATA AND COGNITIVE COMPUTING, 2021, 5 (01)
  • [28] Comparison of Feature Extraction Methods for Brazilian Legal Documents Clustering
    Lima, Joao Pedro
    Costa, Jose Alfredo
    Araujo, Diogenes Carlos
    2021 IEEE LATIN AMERICAN CONFERENCE ON COMPUTATIONAL INTELLIGENCE (LA-CCI), 2021,
  • [29] EXTRACTION OF CHARACTERS FROM FORM DOCUMENTS BY FEATURE POINT CLUSTERING
    FAN, KC
    LU, JM
    WANG, LS
    LIAO, HY
    PATTERN RECOGNITION LETTERS, 1995, 16 (09) : 963 - 970
  • [30] Big Data: Mining of Log File through Hadoop
    Kotiyal, Bina
    Kumar, Ankit
    Pant, Bhaskar
    Goudar, R. H.
    2013 INTERNATIONAL CONFERENCE ON HUMAN COMPUTER INTERACTIONS (ICHCI), 2013,