A Comparative Study of Methods for Visualizable Semantic Embedding of Small Text Corpora

Cited by: 3
Authors
Choudhary, Rishabh [1]
Doboli, Simona [2]
Minai, Ali A. [1]
Affiliations
[1] Univ Cincinnati, Dept Elect Engn & Comp Sci, Cincinnati, OH 45221 USA
[2] Hofstra Univ, Dept Comp Sci, Hempstead, NY 11550 USA
Keywords
semantic spaces; text embedding; language models; semantic visualization; REPRESENTATIONS; BRAIN;
DOI
10.1109/IJCNN52387.2021.9534250
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text embedding has recently emerged as a very useful and successful method for semantic representation. Following initial word-level embedding methods such as Latent Semantic Analysis (LSA) and topic-based bag-of-words approaches like Latent Dirichlet Allocation (LDA), the focus has turned to language models and text encoders implemented as neural networks - ranging from word-level models to those embedding whole documents. The distinctive feature of these models is their ability to infer semantic spaces at all levels based purely on data, with no need for complexities such as syntactic analysis or ontology building. Many of these models are available pretrained on enormous amounts of data, providing downstream applications with general-purpose semantic spaces. In particular, embedding models at the sentence level or higher are most useful in applications because the meaning of text only becomes clear at that level. Most text embedding methods produce text embeddings in high-dimensional spaces, with a dimensionality ranging from a few hundred to thousands. However, it is often useful to visualize semantic spaces in very low dimension, which requires the use of dimensionality reduction methods. It is not clear what language models and what method of dimensionality reduction would work well in these cases. In this paper, we compare four text embedding methods in combination with three methods of dimensionality reduction to map three related real-world datasets comprising textual descriptions of items in a particular domain (sports) to a 2-dimensional semantic visualization space. The results provide several insights into the utility of these methods for data of this type.
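The pipeline the abstract describes (embed each text in a high-dimensional space, then reduce to 2 dimensions for plotting) can be sketched as follows. This is only an illustrative sketch, not the authors' method: the abstract does not name the four embedding models or the three reduction methods compared, so LSA (mentioned in the abstract as an early approach) and PCA are used here as stand-ins, and the sample sports descriptions and all function names are hypothetical.

```python
import numpy as np

def bag_of_words(docs):
    """Build a document-term count matrix from whitespace-tokenized text."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for tok in d.lower().split():
            counts[i, index[tok]] += 1
    return counts, vocab

def lsa_embed(counts, k):
    """Embed documents in a k-dimensional latent space via truncated SVD (LSA)."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]

def pca_2d(x):
    """Project embeddings onto their first two principal components."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Hypothetical item descriptions standing in for a small sports corpus.
docs = [
    "players kick the ball into the goal",
    "players hit the ball over the net with a racket",
    "swimmers race across the pool",
    "runners race around the track",
    "players throw the ball through the hoop",
    "divers jump into the pool",
]

counts, vocab = bag_of_words(docs)
emb = lsa_embed(counts, k=4)   # "high-dimensional" stand-in embedding
coords = pca_2d(emb)           # 2-D coordinates suitable for a scatter plot
```

Each row of `coords` gives the position of one description in the 2-dimensional visualization space. In practice the LSA step would be replaced by whichever pretrained text encoder is under study, and PCA by the reduction method being compared.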
Pages: 8