Scaling Word2Vec on Big Corpus

Authors
Bofang Li
Aleksandr Drozd
Yuhe Guo
Tao Liu
Satoshi Matsuoka
Xiaoyong Du
Affiliations
[1] Renmin University of China
[2] Tokyo Institute of Technology
[3] AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory
[4] RIKEN Center for Computational Science
Keywords
Machine learning; Natural language processing; High performance computing; Word embeddings
DOI
Not available
Abstract
Word embeddings are well accepted as an important feature in natural language processing (NLP). Specifically, the Word2Vec model learns high-quality word embeddings and is widely used in various NLP tasks. The training of Word2Vec is sequential on a CPU due to strong dependencies between word–context pairs. In this paper, we aim to scale Word2Vec on a GPU cluster. The main challenge in doing so is reducing dependencies inside a large training batch. We heuristically design a variation of Word2Vec which ensures that each word–context pair contains a non-dependent word and a uniformly sampled contextual word. During batch training, we “freeze” the context part and update only the non-dependent part to reduce conflicts. This variation also directly controls the number of training iterations by fixing the number of samples, and it treats high-frequency and low-frequency words equally. We conduct extensive experiments over a range of NLP tasks. The results show that our proposed model achieves a 7.5-times speedup on 16 GPUs without a drop in accuracy. Moreover, by using the high-level Chainer deep learning framework, we can easily implement Word2Vec variations such as CNN-based subword-level models and achieve similar scaling results.
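The batch update described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (which the abstract says uses Chainer on a GPU cluster); it only shows one possible reading of "freeze the context part and update only the non-dependent part". The function name `frozen_context_batch_step`, the logistic loss, the learning rate, and the way contexts are drawn are all assumptions made for illustration.

```python
import numpy as np


def frozen_context_batch_step(W, C, word_ids, ctx_ids, labels, lr=0.1):
    """Hypothetical batch step: update word vectors only, keep contexts frozen.

    W        : (V, d) word embeddings, updated in place.
    C        : (V, d) context embeddings, only read here (frozen).
    word_ids : (B,) distinct word indices, so no two pairs write the same row of W.
    ctx_ids  : (B,) context indices, one per word.
    labels   : (B,) 1.0 for observed pairs, 0.0 for negative samples.
    Returns the mean logistic loss measured before the update.
    """
    # Assumption: "non-dependent" is modelled as "each word appears once per batch".
    assert len(np.unique(word_ids)) == len(word_ids)

    w = W[word_ids]                      # (B, d) copy of the rows to update
    c = C[ctx_ids]                       # (B, d) frozen context vectors
    probs = 1.0 / (1.0 + np.exp(-np.sum(w * c, axis=1)))

    eps = 1e-12
    loss = -np.mean(labels * np.log(probs + eps)
                    + (1.0 - labels) * np.log(1.0 - probs + eps))

    # Gradient of the logistic loss with respect to each word vector.
    grad_w = (probs - labels)[:, None] * c
    W[word_ids] = w - lr * grad_w        # rows are distinct, so writes never conflict
    return loss


# Toy demo with random data (all sizes are arbitrary).
rng = np.random.default_rng(0)
V, d, B = 50, 8, 16
W = rng.normal(scale=0.1, size=(V, d))
C = rng.normal(scale=0.1, size=(V, d))
word_ids = rng.choice(V, size=B, replace=False)   # each word once per batch
ctx_ids = rng.integers(0, V, size=B)              # stand-in for sampling a real context
labels = np.ones(B)

for step in range(3):
    print(step, frozen_context_batch_step(W, C, word_ids, ctx_ids, labels))
```

Because every pair in the batch writes to a different row of `W` and `C` is never written, the per-pair updates are independent of each other, which is the property that makes a large batch safe to process in parallel in this sketch.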
Pages: 157-175
Page count: 18