Further results on latent discourse models and word embeddings

Cited by: 0
Authors
Khalife, Sammy [1 ,4 ]
Goncalves, Douglas [2 ]
Allouah, Youssef [3 ,5 ]
Liberti, Leo [1 ]
Affiliations
[1] Inst Polytech Paris, Ecole Polytech, LIX, CNRS, F-91128 Palaiseau, France
[2] MTM CFM Univ Fed Santa Catarina, BR-88040900 Florianopolis, SC, Brazil
[3] Inst Polytech Paris, Ecole Polytech, F-91128 Palaiseau, France
[4] Johns Hopkins Univ, Dept Appl Math & Stat, Baltimore, MD 21218 USA
[5] Ecole Polytech Fed Lausanne EPFL, IC Sch, Lausanne, Switzerland
Keywords
Generative models; latent variable models; asymptotic concentration; natural language processing; matrix factorization
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
We discuss some properties of generative models for word embeddings. Namely, Arora et al. (2016) proposed a latent discourse model implying the concentration of the partition function of the word vectors. This concentration phenomenon leads to an asymptotic linear relation between the pointwise mutual information (PMI) of pairs of words and the scalar product of their vectors. Here, we first revisit this concentration phenomenon and prove it under slightly weaker assumptions, for a set of random vectors symmetrically distributed around the origin. Second, we empirically evaluate the relation between PMI and the scalar products of word vectors satisfying the concentration property. Our empirical results indicate that, in practice, this relation does not hold with arbitrarily small error. This observation is further supported by two theoretical results: (i) the error cannot be exactly zero, because the corresponding shifted PMI matrix cannot be positive semidefinite; (ii) under mild assumptions, there exist pairs of words for which the error cannot be close to zero. We deduce that either natural language does not follow the assumptions of the considered generative model, or the current word vector generation methods do not allow the construction of the hypothesized word embeddings.
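The relation tested in the abstract can be sketched numerically: if shifted PMI values were exactly scalar products of word vectors, the shifted PMI matrix would be a Gram matrix and hence positive semidefinite, so its smallest eigenvalue is a natural diagnostic. The sketch below uses a toy symmetric co-occurrence matrix (an assumption for illustration, not the authors' experimental setup or corpus) to build a PMI matrix and inspect its spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric co-occurrence counts for a 6-word vocabulary
# (stand-in for window-based corpus counts).
C = rng.integers(1, 50, size=(6, 6)).astype(float)
C = (C + C.T) / 2

total = C.sum()
p_joint = C / total              # joint probabilities p(w_i, w_j)
p_marg = C.sum(axis=1) / total   # marginal probabilities p(w_i)

# Pointwise mutual information: PMI_ij = log( p(w_i, w_j) / (p(w_i) p(w_j)) )
PMI = np.log(p_joint / np.outer(p_marg, p_marg))

# If PMI_ij ~ <v_i, v_j> + const held exactly, the shifted PMI matrix
# would be a Gram matrix, hence positive semidefinite: all eigenvalues >= 0.
eigvals = np.linalg.eigvalsh(PMI)
print("smallest eigenvalue of the PMI matrix:", eigvals.min())
```

A negative smallest eigenvalue (after accounting for the shift) means no set of vectors can reproduce the matrix exactly as scalar products, which is the obstruction formalized in result (i) of the abstract.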
Pages: 36
Related papers
50 in total
  • [1] Further results on latent discourse models and word embeddings
    Khalife, Sammy
    Gonçalves, Douglas
    Allouah, Youssef
    Liberti, Leo
    [J]. Journal of Machine Learning Research, 2021, 22
  • [2] LEWIS: Latent Embeddings for Word Images and their Semantics
    Gordo, Albert
    Almazan, Jon
    Murray, Naila
    Perronnin, Florent
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 1242 - 1250
  • [3] Decoupled Word Embeddings using Latent Topics
    Park, Heesoo
    Lee, Jongwuk
    [J]. PROCEEDINGS OF THE 35TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING (SAC'20), 2020, : 875 - 882
  • [4] Jointly Learning Word Embeddings and Latent Topics
    Shi, Bei
    Lam, Wai
    Jameel, Shoaib
    Schockaert, Steven
    Lai, Kwun Ping
    [J]. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017, : 375 - 384
  • [5] Integrating Word Embeddings into IBM Word Alignment Models
    Anh-Cuong Le
    Tuan-Phong Nguyen
    Quoc-Long Tran
    [J]. PROCEEDINGS OF 2018 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE), 2018, : 79 - 84
  • [6] Improving Implicit Discourse Relation Recognition with Discourse-specific Word Embeddings
    Wu, Changxing
    Shi, Xiaodong
    Chen, Yidong
    Su, Jinsong
    Wang, Boli
    [J]. PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, : 269 - 274
  • [7] Incorporating Latent Meanings of Morphological Compositions to Enhance Word Embeddings
    Xu, Yang
    Liu, Jiawei
    Yang, Wei
    Huang, Liusheng
    [J]. PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 1232 - 1242
  • [8] Gaussian LDA for Topic Models with Word Embeddings
    Das, Rajarshi
    Zaheer, Manzil
    Dyer, Chris
    [J]. PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1, 2015, : 795 - 804
  • [9] Neutralizing Gender Bias in Word Embeddings with Latent Disentanglement and Counterfactual Generation
    Shin, Seungjae
    Song, Kyungwoo
    Jang, JoonHo
    Kim, Hyemi
    Joo, Weonyoung
    Moon, Il-Chul
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020
  • [10] Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings
    Al-Sabahi, Kamal
    Zhang Zuping
    Kang, Yang
    [J]. KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2019, 13 (01): : 254 - 276