Large-scale text processing pipeline with Apache Spark

Authors
Svyatkovskiy, A. [1, 2, 3]
Imai, K. [1]
Kroeger, M. [1]
Shiraito, Y. [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
[2] Princeton Univ, Dept Polit, Princeton, NJ 08544 USA
[3] Princeton Univ, Ctr Stat & Machine Learning, Princeton, NJ 08544 USA
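Keywords
Spark; Avro; Spark ML; Spark GraphFrames; innovations; diffusion; states
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812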
Abstract
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case is the detection of policy diffusion across U.S. state legislatures over time. Previous work on policy diffusion has been unable to compare all pairs of bills because of the computational cost; as a substitute, scholars have studied single topic areas. We implement this analysis workflow as a distributed text processing pipeline using Spark DataFrames and the Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
Pages: 3928-3935
Page count: 8
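
The all-pairs bill comparison mentioned in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark/Scala pipeline: the similarity measure (Jaccard similarity over word shingles), the function names, and the sample bill texts are all assumptions chosen for illustration. It mainly shows why the cost grows quickly, since n bills require n(n-1)/2 comparisons.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> frozenset:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if not words:
        return frozenset()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as a single shingle.
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def all_pairs_similarity(bills: dict, k: int = 3, threshold: float = 0.0) -> list:
    """Score every unordered pair of bills and keep those at or above threshold."""
    sets = {bill_id: shingles(text, k) for bill_id, text in bills.items()}
    results = []
    for (id_a, set_a), (id_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            results.append((id_a, id_b, score))
    return results


# Hypothetical bill identifiers and texts, invented for this example.
bills = {
    "A-1": "the department shall adopt rules to implement this act",
    "B-7": "the department shall adopt rules to implement this section",
    "C-3": "this act takes effect on the first day of july",
}

pairs = all_pairs_similarity(bills)
similar = all_pairs_similarity(bills, threshold=0.5)
```

Here `pairs` holds one score per unordered pair, and `similar` keeps only the pairs whose shingle sets overlap by at least half. In a distributed setting the same quadratic pair space is what has to be partitioned or pruned, which is the cost the abstract refers to.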