Scaling Word2Vec on Big Corpus

Cited by: 0
Authors
Bofang Li
Aleksandr Drozd
Yuhe Guo
Tao Liu
Satoshi Matsuoka
Xiaoyong Du
Affiliations
[1] Renmin University of China
[2] Tokyo Institute of Technology
[3] AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory
[4] RIKEN Center for Computational Science
Keywords
Machine learning; Natural language processing; High performance computing; Word embeddings
Abstract
Word embedding is well established as an important feature in natural language processing (NLP). In particular, the Word2Vec model learns high-quality word embeddings and is widely used in a variety of NLP tasks. Training Word2Vec is sequential on a CPU because of strong dependencies between word–context pairs. In this paper, we aim to scale Word2Vec on a GPU cluster. The main challenge is reducing dependencies within a large training batch. We heuristically design a variation of Word2Vec that ensures each word–context pair contains a non-dependent word and a uniformly sampled contextual word. During batch training, we "freeze" the context part and update only the non-dependent part, which reduces conflicts. This variation also directly controls the number of training iterations by fixing the number of samples, and it treats high-frequency and low-frequency words equally. We conduct extensive experiments over a range of NLP tasks. The results show that our proposed model achieves a 7.5-fold speedup on 16 GPUs with no drop in accuracy. Moreover, using the high-level Chainer deep learning framework, we can easily implement Word2Vec variations such as CNN-based subword-level models and achieve similar scaling results.
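The "frozen context" idea in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the vocabulary size, embedding dimension, learning rate, and uniform negative sampling are illustrative assumptions. Within a batch, the context matrix is treated as read-only, and only the rows of the input matrix indexed by the batch's target words are written, so parallel workers handling disjoint targets do not conflict.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 64                      # vocabulary size, embedding dimension (illustrative)
W_in = rng.normal(0, 0.1, (V, D))    # target ("non-dependent") word vectors
W_out = rng.normal(0, 0.1, (V, D))   # context word vectors, frozen within a batch

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_update(targets, contexts, n_neg=5, lr=0.025):
    """One skip-gram-with-negative-sampling batch step.

    W_out is read but never written during the batch ("frozen"),
    so only W_in rows indexed by `targets` are updated.
    """
    for t, c in zip(targets, contexts):
        # one positive context plus uniformly sampled negatives
        samples = np.concatenate(([c], rng.integers(0, V, n_neg)))
        labels = np.zeros(len(samples))
        labels[0] = 1.0
        ctx = W_out[samples]              # frozen snapshot of context vectors
        scores = sigmoid(ctx @ W_in[t])   # predicted probabilities
        grad = (labels - scores) @ ctx    # gradient w.r.t. W_in[t] only
        W_in[t] += lr * grad              # write only the target row

targets = rng.integers(0, V, 128)
contexts = rng.integers(0, V, 128)
W_in_before, W_out_before = W_in.copy(), W_out.copy()
batch_update(targets, contexts)
```

After the batch, `W_out` is bit-identical to its snapshot and only the target rows of `W_in` have moved, which is what makes the update safe to distribute across GPUs.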
Pages: 157-175 (18 pages)