Faster Parallel Training of Word Embeddings

Cited by: 0
Authors
Wszola, Eliza [1 ]
Jaggi, Martin [2 ]
Püschel, Markus [1 ]
Affiliations
[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland
[2] Ecole Polytech Fed Lausanne, Sch Comp & Commun Sci, Lausanne, Switzerland
Keywords
machine learning; natural language processing; parallel computing; performance; word2vec; fasttext;
DOI
10.1109/HiPC53243.2021.00017
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Word embeddings have gained increasing popularity in recent years due to the Word2vec library and its extension fastText, which uses subword information. In this paper, we aim at improving the execution speed of fastText training on homogeneous multi- and manycore CPUs while maintaining accuracy. We present a novel open-source implementation that flexibly incorporates various algorithmic variants, including negative sample sharing, batched updates, and a byte-pair encoding-based alternative for subword units. We build these novel variants over a fastText implementation that we carefully optimized for the architecture, memory hierarchy, and parallelism of current manycore CPUs. Our experiments on three languages demonstrate 3-20x speed-up in training time at competitive semantic and syntactic accuracy.
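To illustrate one of the algorithmic variants the abstract names, the sketch below shows skip-gram training with negative sampling where a single set of negative samples is drawn per batch and reused for every (center, context) pair in that batch. This is a minimal, illustrative rendering of the negative-sample-sharing idea only; the function and variable names are hypothetical and it is not the paper's optimized implementation (which targets manycore CPUs and fastText subwords).

```python
import math
import random

random.seed(0)

VOCAB, DIM = 20, 8
# Input ("center") and output ("context") embedding tables.
W_in = [[random.uniform(-0.5, 0.5) / DIM for _ in range(DIM)] for _ in range(VOCAB)]
W_out = [[0.0] * DIM for _ in range(VOCAB)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_batch(pairs, num_neg=5, lr=0.05):
    """SGD update for a batch of (center, context) word-index pairs.

    One set of negative samples is drawn per batch and shared by
    all pairs -- the 'negative sample sharing' variant. (A real
    implementation would sample negatives from a unigram^0.75
    distribution; uniform sampling here keeps the sketch short.)
    """
    negatives = [random.randrange(VOCAB) for _ in range(num_neg)]
    for center, context in pairs:
        v = W_in[center]
        grad_v = [0.0] * DIM
        # The true context word gets label 1; shared negatives get label 0.
        for target, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = W_out[target]
            score = sigmoid(sum(vi * ui for vi, ui in zip(v, u)))
            g = lr * (label - score)
            for i in range(DIM):
                grad_v[i] += g * u[i]   # accumulate gradient w.r.t. center vector
                u[i] += g * v[i]        # update output vector in place
        for i in range(DIM):
            v[i] += grad_v[i]

# Tiny usage example: three pairs share one negative sample set.
train_batch([(1, 2), (1, 3), (4, 2)])
```

Sharing negatives amortizes the cost of sampling and of touching the corresponding `W_out` rows across the whole batch, which improves cache locality at a small cost in gradient noise.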
Pages: 31-41
Page count: 11
Related Papers
50 records
  • [1] Parallel Data-Local Training for Optimizing Word2Vec Embeddings for Word and Graph Embeddings
    Moon, Gordon E.
    Newman-Griffis, Denis
    Kim, Jinsung
    Sukumaran-Rajam, Aravind
    Fosler-Lussier, Eric
    Sadayappan, P.
    [J]. PROCEEDINGS OF 2019 5TH IEEE/ACM WORKSHOP ON MACHINE LEARNING IN HIGH PERFORMANCE COMPUTING ENVIRONMENTS (MLHPC 2019), 2019, : 44 - 55
  • [2] Learning Word Embeddings in Parallel by Alignment
    Zubair, Sahil
    Zubair, Mohammad
    [J]. 2017 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2017, : 566 - 571
  • [3] Faster Training by Selecting Samples Using Embeddings
    Gonzalez, Santiago
    Landgraf, Joshua
    Miikkulainen, Risto
    [J]. 2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [4] Multilingual Training of Crosslingual Word Embeddings
    Duong, Long
    Kanayama, Hiroshi
    Ma, Tengfei
    Bird, Steven
    Cohn, Trevor
    [J]. 15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2017), VOL 1: LONG PAPERS, 2017, : 894 - 904
  • [5] A survey on training and evaluation of word embeddings
    Torregrossa, Francois
    Allesiardo, Robin
    Claveau, Vincent
    Kooli, Nihel
    Gravier, Guillaume
    [J]. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2021, 11 (02) : 85 - 103
  • [7] Training Temporal Word Embeddings with a Compass
    Di Carlo, Valerio
    Bianchi, Federico
    Palmonari, Matteo
    [J]. THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 6326 - 6334
  • [8] Quantifying Context Overlap for Training Word Embeddings
    Zhuang, Yimeng
    Xie, Jinghui
    Zheng, Yinhe
    Zhu, Xuan
    [J]. 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 587 - 593
  • [9] Unsupervised Joint Training of Bilingual Word Embeddings
    Marie, Benjamin
    Fujita, Atsushi
    [J]. 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3224 - 3230
  • [10] Asynchronous Training of Word Embeddings for Large Text Corpora
    Anand, Avishek
    Khosla, Megha
    Singh, Jaspreet
    Zab, Jan-Hendrik
    Zhang, Zijian
    [J]. PROCEEDINGS OF THE TWELFTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'19), 2019, : 168 - 176