Efficient Large-Scale GPS Trajectory Compression on Spark: A Pipeline-Based Approach

Cited by: 3
Authors
Xiong, Wen [1, 2]
Wang, Xiaoxuan [1, 2]
Li, Hao [1]
Affiliations
[1] Yunnan Normal Univ, Sch Informat, Kunming 650500, Peoples R China
[2] Engn Res Ctr Comp Vis & Intelligent Control Techno, Yunnan Prov Dept Educ, Kunming 650500, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
trajectory compression; big data; Spark; parallelized algorithm; MapReduce
DOI
10.3390/electronics12173569
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Every day, hundreds of thousands of vehicles, including buses, taxis, and ride-hailing cars, continuously generate GPS positioning records. At the same time, the traffic big data platforms of urban transportation systems have already accumulated large volumes of GPS trajectory data. These incremental and historical datasets require ever more storage space, placing unprecedented cost pressure on such platforms. It is therefore imperative to compress these large-scale GPS trajectory datasets efficiently, reducing both storage cost and subsequent computing cost. However, classical trajectory compression algorithms can only run single-threaded and on a single node, so they cannot compress the incremental data, which often amounts to hundreds of gigabytes, within an acceptable time frame. This paper uses Spark, a popular big data processing engine, to parallelize a set of classical trajectory compression algorithms: DP (Douglas-Peucker), TD-TR (Top-Down Time-Ratio), SW (Sliding Window), SQUISH (Spatial Quality Simplification Heuristic), and V-DP (Velocity-Aware Douglas-Peucker). We systematically evaluate the parallelized algorithms on a very large GPS trajectory dataset containing 117.5 GB of data produced by 20,000 taxis. The experimental results show that (1) compressing this dataset takes only 438 s on a Spark cluster with 14 nodes, and (2) the parallelized algorithms save 26% of storage cost on average and up to 40%. In addition, we design and implement a pipeline-based solution that automatically performs preprocessing and compression for continuous GPS trajectories on the Spark platform.
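As an illustration of the first algorithm named in the abstract, the following is a minimal sketch of Douglas-Peucker simplification for a single trajectory. It is not the authors' implementation: the function names, the planar (x, y) point representation, and the `epsilon` distance threshold are assumptions made for this example, and real GPS records would first need projecting from latitude/longitude to a planar coordinate system. The iterative form (an explicit stack instead of recursion) is a choice made here to avoid recursion-depth limits on long trajectories.

```python
import math
from typing import List, Tuple

# Assumption: points are already projected to planar coordinates (e.g. metres).
Point = Tuple[float, float]


def perpendicular_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the line through a and b (or to a if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len = math.hypot(dx, dy)
    if seg_len == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / seg_len


def douglas_peucker(points: List[Point], epsilon: float) -> List[Point]:
    """Keep the endpoints plus every point farther than epsilon from its chord."""
    n = len(points)
    if n < 3:
        return list(points)
    keep = [False] * n
    keep[0] = keep[-1] = True
    stack = [(0, n - 1)]  # index ranges still to be examined
    while stack:
        lo, hi = stack.pop()
        max_dist, max_idx = 0.0, -1
        for i in range(lo + 1, hi):
            d = perpendicular_distance(points[i], points[lo], points[hi])
            if d > max_dist:
                max_dist, max_idx = d, i
        # Split at the farthest point only if it exceeds the tolerance.
        if max_idx != -1 and max_dist > epsilon:
            keep[max_idx] = True
            stack.append((lo, max_idx))
            stack.append((max_idx, hi))
    return [p for p, k in zip(points, keep) if k]


trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 5.0), (3.0, 0.0), (4.0, 0.0)]
simplified = douglas_peucker(trajectory, epsilon=1.0)
```

Because each vehicle's trajectory can be simplified independently, one plausible way to parallelize such a function on Spark is to key the records by vehicle ID, group them into per-vehicle trajectories, and apply the function to each group (for example with the RDD operations `groupByKey` and `mapValues`). Whether the paper structures its jobs this way is not stated in the abstract.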
Pages: 21