Content Based Audiobooks Indexing using Apache Hadoop Framework

Cited by: 0
Authors
Shetty, Sonal [1]
Sabarad, Akash [1]
Hebballi, Harish [1]
Husain, Moula [1]
Meena, S. M. [1]
Nagaralli, Shiddu [1]
Affiliation
[1] BVBCET, Vidya Nagar, Hubli, India
Keywords
Hadoop; MapReduce; tf-idf; CMU SPHINX-4
DOI
10.1145/2791405.2791485
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In recent years, content-based audio indexing has become a key research area, because the audio content describes the material more precisely and has a comparatively lower density. In this paper, we present the conversion of audiobooks into textual information using the CMU SPHINX-4 speech transcriber, and efficient indexing of audiobooks using term frequency-inverse document frequency (tf-idf) weights on the Apache Hadoop MapReduce framework. In the first phase, audiobook datasets are converted into text by training the CMU SPHINX-4 speech recognizer with acoustic models. In the next phase, the keywords present in the text file generated by the speech recognizer are filtered using tf-idf weights. Finally, we index the audio files based on the keywords extracted from the transcribed text. Because speech-to-text conversion and audio indexing are space- and time-intensive tasks, we ported the execution of these algorithms to the Hadoop MapReduce framework. Porting content-based indexing of audiobooks onto the Hadoop distributed framework resulted in a considerable improvement in time and space utilization. As the amount of data being uploaded and downloaded keeps growing, this approach can be further extended to the indexing of images, video, and other multimedia forms.
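The keyword-filtering and indexing phases described in the abstract can be illustrated with a small sketch. This is not the authors' Hadoop job: it is plain Python that imitates the map and reduce steps in a single process, it assumes the transcripts already exist as strings, it uses idf = log(N / df) as one common tf-idf variant, and every function name and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_term_counts(doc_id, text):
    """Map step (illustrative): emit ((doc_id, term), 1) for every token."""
    for term in text.lower().split():
        yield (doc_id, term), 1


def reduce_counts(pairs):
    """Reduce step (illustrative): sum the values that share a key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def tf_idf(transcripts):
    """Return {(doc_id, term): weight} using tf * log(N / df)."""
    pairs = (
        pair
        for doc_id, text in transcripts.items()
        for pair in map_term_counts(doc_id, text)
    )
    term_counts = reduce_counts(pairs)

    doc_lengths = Counter()  # total tokens per transcript
    doc_freq = Counter()     # number of transcripts containing each term
    for (doc_id, term), n in term_counts.items():
        doc_lengths[doc_id] += n
        doc_freq[term] += 1

    n_docs = len(transcripts)
    return {
        (doc_id, term): (n / doc_lengths[doc_id]) * math.log(n_docs / doc_freq[term])
        for (doc_id, term), n in term_counts.items()
    }


def build_index(transcripts, top_k=3):
    """Map each of the top_k positively weighted terms per transcript to its audio files."""
    per_doc = defaultdict(list)
    for (doc_id, term), weight in tf_idf(transcripts).items():
        per_doc[doc_id].append((weight, term))

    index = defaultdict(set)
    for doc_id, scored in per_doc.items():
        for weight, term in sorted(scored, reverse=True)[:top_k]:
            if weight > 0:  # terms present in every transcript get weight 0
                index[term].add(doc_id)
    return index
```

Calling `build_index({"book_a.wav": "the whale hunts the sea", "book_b.wav": "the detective solves the case"})` would index each file under its distinctive words, while a word shared by every transcript, such as "the", receives a weight of zero and is left out. In a real MapReduce deployment the counting steps would run as separate distributed jobs rather than in-process loops.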
Pages: 496-501
Page count: 6