Large-scale text processing pipeline with Apache Spark

Authors
Svyatkovskiy, A. [1, 2, 3]
Imai, K. [1]
Kroeger, M. [1]
Shiraito, Y. [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
[2] Princeton Univ, Dept Polit, Princeton, NJ 08544 USA
[3] Princeton Univ, Ctr Stat & Machine Learning, Princeton, NJ 08544 USA
Keywords
Spark; Avro; Spark ML; Spark GraphFrames; INNOVATIONS; DIFFUSION; STATES
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on detecting policy diffusion across state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills because of the computational intensity involved. As a substitute, scholars have studied single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline using Spark DataFrames and the Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
Pages: 3928-3935
Page count: 8