Large-Scale Network Embedding in Apache Spark

Cited by: 10
Author
Lin, Wenqing [1]
Affiliation
[1] Tencent, Interact Entertainment Grp, Shenzhen, Guangdong, Peoples R China
Keywords
network embedding; distributed computing; graph partitioning;
DOI
10.1145/3447548.3467136
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Network embedding has been widely used in social recommendation and network analysis, for example in recommendation systems and graph-based anomaly detection. However, most previous approaches cannot handle large graphs efficiently, because (i) computation on graphs is often costly and (ii) the graph or the intermediate vector results can be prohibitively large, making them difficult to process on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark. The algorithm recursively partitions a graph into several small subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the node embeddings at linear cost. Experiments show that the proposed approach handles graphs with billions of edges within a few hours and is at least 4 times faster than state-of-the-art approaches. It also achieves improvements of up to 4.25% and 4.27% on link prediction and node classification tasks, respectively. Finally, we deploy the proposed algorithms in two online games of Tencent for friend recommendation and item recommendation, where they improve on the competitors by up to 91.11% in running time and up to 12.80% in the corresponding evaluation metrics.
Pages: 3271-3279
Page count: 9
Related papers
50 in total
  • [1] Large-Scale Data Pollution with Apache Spark
    Hildebrandt, Kai
    Panse, Fabian
    Wilcke, Niklas
    Ritter, Norbert
    [J]. IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 396 - 411
  • [2] Processing large-scale data with Apache Spark
    Ko, Seyoon
    Won, Joong-Ho
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2016, 29 (06) : 1077 - 1094
  • [3] Building a Large-Scale Microscopic Road Network Traffic Simulator in Apache Spark
    Fu, Zishan
    Yu, Jia
    Sarwat, Mohamed
    [J]. 2019 20TH INTERNATIONAL CONFERENCE ON MOBILE DATA MANAGEMENT (MDM 2019), 2019, : 320 - 328
  • [4] Large-scale text processing pipeline with Apache Spark
    Svyatkovskiy, A.
    Imai, K.
    Kroeger, M.
    Shiraito, Y.
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 3928 - 3935
  • [5] GeoMatch: Efficient Large-Scale Map Matching on Apache Spark
    Zeidan, Ayman
    Lagerspetz, Eemil
    Zhao, Kai
    Nurmi, Petteri
    Tarkoma, Sasu
    Vo, Huy T.
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 384 - 391
  • [6] Filter Large-scale Engine Data using Apache Spark
    Pirozzi, Donato
    Scarano, Vittorio
    Begg, Steven
    De Sercey, Guillaume
    Fish, Andrew
    Harvey, Andrew
    [J]. 2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2016, : 1300 - 1305
  • [7] Particle Swarm Optimization for Large-Scale Clustering on Apache Spark
    Sherar, Matthew
    Zulkernine, Farhana
    [J]. 2017 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2017, : 801 - 808
  • [8] GeoMatch: Efficient Large-scale Map Matching on Apache Spark
    Zeidan, Ayman
    Lagerspetz, Eemil
    Zhao, Kai
    Nurmi, Petteri
    Tarkoma, Sasu
    Vo, Huy T.
    [J]. ACM/IMS Transactions on Data Science, 2020, 1 (03):
  • [9] Large-scale virtual screening on public cloud resources with Apache Spark
    Capuccini, Marco
    Ahmed, Laeeq
    Schaal, Wesley
    Laure, Erwin
    Spjuth, Ola
    [J]. JOURNAL OF CHEMINFORMATICS, 2017, 9
  • [10] Large-scale digital forensic investigation for Windows registry on Apache Spark
    Lee, Jun-Ha
    Kwon, Hyuk-Yoon
    [J]. PLOS ONE, 2022, 17 (12):