Efficient Large-Scale GPS Trajectory Compression on Spark: A Pipeline-Based Approach

Cited by: 3
Authors
Xiong, Wen [1 ,2 ]
Wang, Xiaoxuan [1 ,2 ]
Li, Hao [1 ]
Affiliations
[1] Yunnan Normal Univ, Sch Informat, Kunming 650500, Peoples R China
[2] Engn Res Ctr Comp Vis & Intelligent Control Techno, Yunnan Prov Dept Educ, Kunming 650500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
trajectory compression; big data; Spark; parallelized algorithm; MapReduce;
DOI
10.3390/electronics12173569
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Every day, hundreds of thousands of vehicles, including buses, taxis, and ride-hailing cars, continuously generate GPS positioning records. At the same time, the traffic big data platforms of urban transportation systems have already collected large GPS trajectory datasets. These incremental and historical datasets demand ever more storage space, placing unprecedented cost pressure on the big data platform. It is therefore imperative to compress these large-scale GPS trajectory datasets efficiently, saving both storage and subsequent computing costs. However, the classical trajectory compression algorithms run single-threaded and are confined to a single-node environment, so they cannot compress incremental data, which often amounts to hundreds of gigabytes, within an acceptable time frame. This paper uses Spark, a popular big data processing engine, to parallelize a set of classical trajectory compression algorithms: DP (Douglas-Peucker), TD-TR (Top-Down Time-Ratio), SW (Sliding Window), SQUISH (Spatial Quality Simplification Heuristic), and V-DP (Velocity-Aware Douglas-Peucker). We systematically evaluate the parallelized algorithms on a very large GPS trajectory dataset containing 117.5 GB of data produced by 20,000 taxis. The experimental results show that (1) compressing this dataset takes only 438 s on a Spark cluster with 14 nodes, and (2) the parallelized algorithms reduce storage cost by 26% on average and by up to 40%. In addition, we design and implement a pipeline-based solution that automatically performs preprocessing and compression for continuous GPS trajectories on the Spark platform.
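The abstract does not reproduce the algorithms themselves, but the classical DP (Douglas-Peucker) scheme it parallelizes is well known: recursively keep the point farthest from the chord between a segment's endpoints, and drop the interior points once they all fall within a tolerance epsilon. The sketch below is an illustrative, single-threaded version in plain Python (function names and the planar-distance approximation are our assumptions, not the paper's code):

```python
import math

def _perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b (planar approximation)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / math.hypot(dx, dy)

def douglas_peucker(points, epsilon):
    """Classical DP line simplification: keep the farthest point from the chord,
    recurse on both halves, and collapse the segment if all deviations <= epsilon."""
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point
```

Because each trajectory is compressed independently, this routine parallelizes naturally on Spark, e.g. by applying it to the values of (vehicle_id, trajectory) pairs; the paper's TD-TR variant follows the same top-down recursion but measures time-synchronized rather than perpendicular distance.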
Pages: 21