Supervised Papers Classification on Large-Scale High-Dimensional Data with Apache Spark

Cited by: 0
Authors
Akritidis, Leonidas [1 ]
Bozanis, Panayiotis [1 ]
Fevgas, Athanasios [1 ]
Affiliations
[1] Univ Thessaly, Dept Elect & Comp Engn, Data Struct & Engn Lab, Volos, Greece
Keywords
TEXT;
DOI
10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00140
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Classifying a research article into one or more fields of science is particularly important for academic search engines and digital libraries. A robust classification algorithm offers users a wide variety of useful tools, such as refining search results, browsing articles by category, and recommending similar articles. Existing approaches in the literature attempt to address this problem without taking into consideration important parameters such as the previous history of the authors and the categorization of the scientific journals which publish the articles. In addition, existing works overlook the huge volume of the academic data involved. In this paper, we expand an existing effective algorithm for research article classification and parallelize it on Apache Spark, a parallelization framework capable of sharing large amounts of data in the main memory of the nodes of a cluster, to enable the processing of large academic datasets. Furthermore, we present data manipulation methodologies which are useful not only for this particular problem, but also for most parallel machine learning approaches. In our experimental evaluation, we demonstrate that our proposed algorithm is considerably more accurate than the supervised learning approaches implemented within the machine learning library of Spark, and that it also outperforms them in execution speed by a significant margin.
Pages: 987-994
Number of pages: 8
Related papers
50 in total
  • [1] A Supervised Learning Model for High-Dimensional and Large-Scale Data
    Peng, Chong
    Cheng, Jie
    Cheng, Qiang
    [J]. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2017, 8 (02)
  • [2] Large-Scale Data Pollution with Apache Spark
    Hildebrandt, Kai
    Panse, Fabian
    Wilcke, Niklas
    Ritter, Norbert
    [J]. IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 396 - 411
  • [3] Processing large-scale data with Apache Spark
    Ko, Seyoon
    Won, Joong-Ho
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2016, 29 (06) : 1077 - 1094
  • [4] A Large-Scale Sentiment Data Classification for Online Reviews Under Apache Spark
    Al-Saqqa, Samar
    Al-Naymat, Ghazi
    Awajan, Arafat
    [J]. 9TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS (EUSPN-2018) / 8TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE (ICTH-2018), 2018, 141 : 183 - 189
  • [5] Visualizing Large-scale and High-dimensional Data
    Tang, Jian
    Liu, Jingzhou
    Zhang, Ming
    Mei, Qiaozhu
    [J]. PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16), 2016, : 287 - 297
  • [6] High-Dimensional Signature Compression for Large-Scale Image Classification
    Sanchez, Jorge
    Perronnin, Florent
    [J]. 2011 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2011, : 1665 - 1672
  • [7] Filter Large-scale Engine Data using Apache Spark
    Pirozzi, Donato
    Scarano, Vittorio
    Begg, Steven
    De Sercey, Guillaume
    Fish, Andrew
    Harvey, Andrew
    [J]. 2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2016, : 1300 - 1305
  • [8] RECURSIVE REDUCTION NET FOR LARGE-SCALE HIGH-DIMENSIONAL DATA
    Ke, Tsung-Wei
    Liu, Tyng-Luh
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2016, : 1903 - 1907
  • [9] Feature screening with large-scale and high-dimensional survival data
    Yi, Grace Y.
    He, Wenqing
    Carroll, Raymond J.
    [J]. BIOMETRICS, 2022, 78 (03) : 894 - 907
  • [10] A fast classification strategy for SVM on the large-scale high-dimensional datasets
    Li, I-Jing
    Wu, Jiunn-Lin
    Yeh, Chih-Hung
    [J]. PATTERN ANALYSIS AND APPLICATIONS, 2018, 21 (04) : 1023 - 1038