Word Embeddings as Statistical Estimators

Authors
Dey, Neil [1]
Singer, Matthew [1]
Williams, Jonathan P. [1, 2]
Sengupta, Srijan [1]
Affiliations
[1] North Carolina State Univ, Dept Stat, Raleigh, NC 27695 USA
[2] Norwegian Acad Sci & Letters, Ctr Adv Study, Oslo, Norway
Funding
National Institutes of Health (NIH), USA;
Keywords
Copula; Word2Vec; distributed representation; statistical linguistics; language modeling; missing values SVD;
DOI
10.1007/s13571-024-00331-1
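Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes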
020208 ; 070103 ; 0714 ;
Abstract
Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that, under this model, the now-classical Word2Vec method can be interpreted as a statistical method for estimating the theoretical pointwise mutual information (PMI). We further illustrate the utility of this statistical model by using it to develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon that of the truncation-based method proposed by Levy and Goldberg (Adv. Neural Inf. Process. Syst., 27, 2177-2185, 2014). The resulting estimator is also comparable to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set and a part-of-speech tagging task on the OntoNotes data set.
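For orientation, the pointwise mutual information of a word w and a context word c is log[ P(w, c) / (P(w) P(c)) ]. The following is a minimal Python sketch of a plug-in (empirical) PMI estimate computed from windowed co-occurrence counts. It is not the paper's copula-based model or its missing value-based estimator; the function name `empirical_pmi`, the toy corpus, and the window size are assumptions made purely for illustration. Pairs that are never observed are returned as `None` (missing) instead of being assigned a number.

```python
import math
from collections import Counter


def empirical_pmi(sentences, window=2):
    """Return a function giving plug-in PMI estimates from symmetric
    windowed co-occurrence counts (illustrative helper, not from the paper)."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    total = sum(pair_counts.values())

    # Marginal counts; the window is symmetric, so word and context
    # marginals coincide and one table suffices.
    word_counts = Counter()
    for (word, _), n in pair_counts.items():
        word_counts[word] += n

    def pmi(word, context):
        n = pair_counts.get((word, context), 0)
        if n == 0:
            return None  # unobserved pair: left as missing, not set to 0
        # log[ P(w, c) / (P(w) P(c)) ] with empirical frequencies
        return math.log(n * total / (word_counts[word] * word_counts[context]))

    return pmi


# Toy corpus (assumed for illustration only).
corpus = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "a great film".split(),
]
pmi = empirical_pmi(corpus, window=2)
print(pmi("movie", "was"))       # a finite PMI estimate
print(pmi("great", "terrible"))  # None: the pair never co-occurs
```

Returning `None` for unobserved pairs keeps the distinction between "PMI estimated as some value" and "no information about this pair" explicit, which is the kind of distinction a missing-value treatment relies on.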
Pages: 27
Related papers
50 in total
  • [21] Linguistic Information in Word Embeddings
    Basirat, Ali
    Tang, Marc
    [J]. AGENTS AND ARTIFICIAL INTELLIGENCE, ICAART 2018, 2019, 11352 : 492 - 513
  • [22] Evaluation of Croatian Word Embeddings
    Svoboda, Lukas
    Beliga, Slobodan
    [J]. PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 1512 - 1518
  • [23] Adaptive Compression of Word Embeddings
    Kim, Yeachan
    Kim, Kang-Min
    Lee, SangKeun
    [J]. 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3950 - 3959
  • [24] Exploring Numeracy in Word Embeddings
    Naik, Aakanksha
    Ravichander, Abhilasha
    Rose, Carolyn
    Hovy, Eduard
    [J]. 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3374 - 3380
  • [25] Word Embeddings with Limited Memory
    Ling, Shaoshi
    Song, Yangqiu
    Roth, Dan
    [J]. PROCEEDINGS OF THE 54TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2016), VOL 2, 2016, : 387 - 392
  • [26] Ontology Matching with Word Embeddings
    Zhang, Yuanzhe
    Wang, Xuepeng
    Lai, Siwei
    He, Shizhu
    Liu, Kang
    Zhao, Jun
    Lv, Xueqiang
    [J]. CHINESE COMPUTATIONAL LINGUISTICS AND NATURAL LANGUAGE PROCESSING BASED ON NATURALLY ANNOTATED BIG DATA, CCL 2014, 2014, 8801 : 34 - 45
  • [27] Word Embeddings for Comment Coherence
    Cimasa, Alfonso
    Corazza, Anna
    Coviello, Carmen
    Scanniello, Giuseppe
    [J]. 2019 45TH EUROMICRO CONFERENCE ON SOFTWARE ENGINEERING AND ADVANCED APPLICATIONS (SEAA 2019), 2019, : 244 - 251
  • [28] Chinese Word Embeddings with Subwords
    Yang, Gang
    Xu, Hongzhe
    Li, Wen
    [J]. 2018 INTERNATIONAL CONFERENCE ON ALGORITHMS, COMPUTING AND ARTIFICIAL INTELLIGENCE (ACAI 2018), 2018,
  • [29] Complementary Learning of Word Embeddings
    Song, Yan
    Shi, Shuming
    [J]. PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 4368 - 4374
  • [30] Unsupervised Multilingual Word Embeddings
    Chen, Xilun
    Cardie, Claire
    [J]. 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 261 - 270