HistorEx: Exploring Historical Text Corpora Using Word and Document Embeddings

Cited by: 1
Authors
Mueller, Sven [1]
Brunzel, Michael [1]
Kaun, Daniela [1]
Biswas, Russa [1, 2]
Koutraki, Maria [3]
Tietz, Tabea [1, 2]
Sack, Harald [1, 2]
Affiliations
[1] Karlsruhe Inst Technol, Inst AIFB, Karlsruhe, Germany
[2] FIZ Karlsruhe, Leibniz Inst Informat Infrastruct, Karlsruhe, Germany
[3] Leibniz Univ Hannover, L3S Res Ctr, Hannover, Germany
Source
Keywords
Word embeddings; Document vectors; Wikidata; Cultural heritage; Visualization; Recommender system;
DOI
10.1007/978-3-030-32327-1_27
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Written text can be understood as a means to acquire insights into the nature of past and present cultures and societies. Numerous projects have been devoted to digitizing and publishing historical textual documents in digital libraries, which scientists can utilize as valuable resources for research. However, the extent of textual data available exceeds humans' abilities to explore the data efficiently. In this paper, a framework is presented which combines unsupervised machine learning techniques and natural language processing, using historical text documents on the 19th-century USA as an example. Named entities are extracted from semi-structured text, which is enriched with complementary information from Wikidata. Word embeddings are leveraged to enable further analysis of the text corpus, which is visualized in a web-based application.
Pages: 136-140
Number of pages: 5
Related papers
50 items in total
  • [41] Document Clustering Meets Topic Modeling with Word Embeddings
    Costa, Gianni
    Ortale, Riccardo
    [J]. PROCEEDINGS OF THE 2020 SIAM INTERNATIONAL CONFERENCE ON DATA MINING (SDM), 2020, : 244 - 252
  • [42] Helmholtz Principle on word embeddings for automatic document segmentation
    Krzeminski, Dominik
    Balinsky, Helen
    Balinsky, Alexander
    [J]. PROCEEDINGS OF THE ACM SYMPOSIUM ON DOCUMENT ENGINEERING (DOCENG 2018), 2018,
  • [43] A Digital Text Watermarking for Word Document
    Zhang, Shi-ru
    Meng, Xiao-chun
    Liu, Xin-fu
    Chen, Wen-Yuan
    [J]. INTERNATIONAL CONFERENCE MACHINERY, ELECTRONICS AND CONTROL SIMULATION, 2014, 614 : 347 - 351
  • [44] Emotion Detection from Text via Ensemble Classification Using Word Embeddings
    Herzig, Jonathan
    Shmueli-Scheuer, Michal
    Konopnicki, David
    [J]. ICTIR'17: PROCEEDINGS OF THE 2017 ACM SIGIR INTERNATIONAL CONFERENCE THEORY OF INFORMATION RETRIEVAL, 2017, : 269 - 272
  • [45] Comparison of Word Embeddings of Unaligned Audio and Text Data Using Persistent Homology
    Yessenbayev, Zhandos
    Kozhirbayev, Zhanibek
    [J]. SPEECH AND COMPUTER, SPECOM 2022, 2022, 13721 : 700 - 711
  • [46] Measuring document similarity with weighted averages of word embeddings
    Seegmiller, Bryan
    Papanikolaou, Dimitris
    Schmidt, Lawrence D. W.
    [J]. EXPLORATIONS IN ECONOMIC HISTORY, 2023, 87
  • [47] Text Similarity Function Based on Word Embeddings for Short Text Analysis
    Pascual, Adrian Jimenez
    Fujita, Sumio
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2017), PT I, 2018, 10761 : 391 - 402
  • [48] Knowledge-enhanced document embeddings for text classification
    Sinoara, Roberta A.
    Camacho-Collados, Jose
    Rossi, Rafael G.
    Navigli, Roberto
    Rezende, Solange O.
    [J]. KNOWLEDGE-BASED SYSTEMS, 2019, 163 : 955 - 971
  • [49] Inferring Concept Hierarchies from Text Corpora via Hyperbolic Embeddings
    Le, Matt
    Roller, Stephen
    Papaxanthos, Laetitia
    Kiela, Douwe
    Nickel, Maximilian
    [J]. 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3231 - 3241
  • [50] Word-class embeddings for multiclass text classification
    Alejandro Moreo
    Andrea Esuli
    Fabrizio Sebastiani
    [J]. Data Mining and Knowledge Discovery, 2021, 35 : 911 - 963