Improving Word Embeddings for Low Frequency Words by Pseudo Contexts

Cited by: 1
Authors
Li, Fang [1 ]
Wang, Xiaojie [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Comp, Beijing, Peoples R China
Keywords
Word embedding; Low-frequency word
DOI
10.1007/978-3-319-69005-6_4
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper investigates the relation between word semantic density and word frequency. The average similarity of a word, computed from distributed representations, is defined as the measure of its semantic density. We find that the average similarity of low-frequency words is always higher than that of high-frequency words, and that once word frequency reaches roughly 400 the average similarity tends to stabilize. The finding holds when the size of the training corpus, the dimension of the distributed representations, and the number of negative samples in the skip-gram model are varied. It also holds across 17 different languages. Based on this finding, we propose a pseudo-context skip-gram model, which makes use of the context words of the semantic nearest neighbors of target words. Experimental results show that our model achieves significant performance improvements on both word similarity and analogy tasks.
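
The two ideas in the abstract can be illustrated with a minimal sketch. The abstract gives neither the exact definition of average similarity nor the details of the pseudo-context model, so everything below is an assumption: average similarity is taken to be the mean cosine similarity between a word and every other word in the vocabulary, and pseudo contexts are taken to be the context words of a target's k nearest neighbors, paired with the target as extra training examples. The function names, the toy embeddings, and the parameter k are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(word, embeddings):
    """Assumed density measure: mean cosine similarity to all other words."""
    others = [vec for w, vec in embeddings.items() if w != word]
    if not others:
        return 0.0
    return sum(cosine(embeddings[word], vec) for vec in others) / len(others)


def pseudo_context_pairs(target, embeddings, contexts, k=1):
    """Assumed pseudo-context step: borrow the context words of the
    target's k nearest neighbors as extra (target, context) pairs."""
    neighbors = sorted(
        (w for w in embeddings if w != target),
        key=lambda w: cosine(embeddings[target], embeddings[w]),
        reverse=True,
    )[:k]
    return [(target, c) for n in neighbors for c in contexts.get(n, [])]


# Toy data, invented for illustration only.
embeddings = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "car": [0.0, 1.0]}
contexts = {"dog": ["barks", "pet"], "car": ["drives"]}

density = average_similarity("cat", embeddings)
pairs = pseudo_context_pairs("cat", embeddings, contexts, k=1)
```

In a real setting the extra pairs would presumably be fed to skip-gram training alongside the observed contexts of the low-frequency word; how the paper weights or selects them is not stated in the abstract.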
Pages: 37-47
Page count: 11