Scaling Word2Vec on Big Corpus

Cited by: 0
Authors
Bofang Li
Aleksandr Drozd
Yuhe Guo
Tao Liu
Satoshi Matsuoka
Xiaoyong Du
Institutions
[1] Renmin University of China
[2] Tokyo Institute of Technology
[3] AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory
[4] RIKEN Center for Computational Science
Source
Data Science and Engineering
Keywords
Machine learning; Natural language processing; High performance computing; Word embeddings
DOI
Not available
Abstract
Word embedding is well established as an important feature in natural language processing (NLP). In particular, the Word2Vec model learns high-quality word embeddings and is widely used in a variety of NLP tasks. Training Word2Vec is sequential on a CPU because of the strong dependencies between word–context pairs. In this paper, we aim to scale Word2Vec on a GPU cluster. The main challenge is to reduce the dependencies inside a large training batch. We heuristically design a variation of Word2Vec which ensures that each word–context pair contains a non-dependent word and a uniformly sampled contextual word. During batch training, we "freeze" the context part and update only the non-dependent part to reduce conflicts. This variation also controls the number of training iterations directly, by fixing the number of samples, and treats high-frequency and low-frequency words equally. We conduct extensive experiments over a range of NLP tasks. The results show that the proposed model achieves a 7.5-fold speedup on 16 GPUs with no drop in accuracy. Moreover, using the high-level Chainer deep learning framework, we can easily implement Word2Vec variations such as CNN-based subword-level models and achieve similar scaling results.
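The abstract describes the key trick only at a high level, so the sketch below illustrates one batch step of the described variant in plain NumPy. Everything in it (the matrix names W_tgt and W_ctx, the hyperparameters, and the standard skip-gram negative-sampling gradient) is an illustrative assumption layered on the abstract's description, not the authors' implementation, which used Chainer on GPUs.

    # Sketch of one batch step of the Word2Vec variant from the abstract:
    # each target word is paired with one uniformly sampled contextual word,
    # and the context side is "frozen" within the batch so that only the
    # non-dependent (target) side is written, avoiding update conflicts.
    # All names and hyperparameters here are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim, neg_k, lr = 10_000, 100, 5, 0.025

    W_tgt = rng.normal(0.0, 0.01, (vocab_size, dim))  # target vectors (updated)
    W_ctx = rng.normal(0.0, 0.01, (vocab_size, dim))  # context vectors (frozen per batch)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_batch(targets, contexts):
        """targets, contexts: int arrays of shape (B,), one uniformly
        sampled contextual word per target word."""
        B = len(targets)
        # One positive context plus neg_k uniformly drawn negatives per pair.
        neg = rng.integers(0, vocab_size, size=(B, neg_k))
        ctx_ids = np.concatenate([contexts[:, None], neg], axis=1)  # (B, 1+neg_k)
        labels = np.zeros((B, 1 + neg_k))
        labels[:, 0] = 1.0

        v_t = W_tgt[targets]          # (B, dim)
        v_c = W_ctx[ctx_ids]          # (B, 1+neg_k, dim), read-only this batch
        scores = sigmoid(np.einsum("bd,bkd->bk", v_t, v_c))
        grad = labels - scores        # negative-sampling gradient signal

        # Only the target side is written; since no row writes to W_ctx,
        # the batch rows can be processed in parallel (e.g., on a GPU)
        # without conflicts. How W_ctx is refreshed between batches is not
        # specified in the abstract, so it is omitted from this sketch.
        np.add.at(W_tgt, targets, lr * np.einsum("bk,bkd->bd", grad, v_c))

Under this reading, drawing a fixed number of uniformly sampled contexts per word, rather than walking the corpus, is what lets the total sample count be fixed in advance: the iteration count is controlled directly, and frequent and rare words receive the same number of updates, matching the abstract's claim.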
Pages: 157-175
Number of pages: 18
Related Papers
50 records in total
  • [1] Scaling Word2Vec on Big Corpus
    Li, Bofang
    Drozd, Aleksandr
    Guo, Yuhe
    Liu, Tao
    Matsuoka, Satoshi
    Du, Xiaoyong
    [J]. DATA SCIENCE AND ENGINEERING, 2019, 4 (02) : 157 - 175
  • [2] Using Word2Vec to Process Big Text Data
    Ma, Long
    Zhang, Yanqing
    [J]. PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2015, : 2895 - 2897
  • [3] Modelling of Topic from Hindi Corpus using Word2Vec
    Panigrahi, Sabitra Sankalp
    Panigrahi, Narayan
    Paul, Biswajit
    [J]. 2018 SECOND INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, CONTROL AND COMMUNICATION TECHNOLOGY (IAC3T), 2018, : 97 - 100
  • [4] The Spectral Underpinning of word2vec
    Jaffe, Ariel
    Kluger, Yuval
    Lindenbaum, Ofir
    Patsenker, Jonathan
    Peterfreund, Erez
    Steinerberger, Stefan
    [J]. FRONTIERS IN APPLIED MATHEMATICS AND STATISTICS, 2020, 6
  • [5] Emerging Trends Word2Vec
    Church, Kenneth Ward
    [J]. NATURAL LANGUAGE ENGINEERING, 2017, 23 (01) : 155 - 162
  • [6] Stability of Word Embeddings Using Word2Vec
    Chugh, Mansi
    Whigham, Peter A.
    Dick, Grant
    [J]. AI 2018: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, 11320 : 812 - 818
  • [7] Word2vec for Arabic Word Sense Disambiguation
    Laatar, Rim
    Aloulou, Chafik
    Belghuith, Lamia Hadrich
    [J]. NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2018), 2018, 10859 : 308 - 311
  • [8] Corpus specificity in LSA and word2vec: the role of out-of-domain documents
    Altszyler, Edgar
    Sigman, Mariano
    Slezak, Diego Fernandez
    [J]. REPRESENTATION LEARNING FOR NLP, 2018, : 1 - 10
  • [9] Considerations about learning Word2Vec
    Di Gennaro, Giovanni
    Buonanno, Amedeo
    Palmieri, Francesco A. N.
    [J]. THE JOURNAL OF SUPERCOMPUTING, 2021, 77 (11) : 12320 - 12335