HistorEx: Exploring Historical Text Corpora Using Word and Document Embeddings

Cited by: 1
Authors
Mueller, Sven [1]
Brunzel, Michael [1]
Kaun, Daniela [1]
Biswas, Russa [1, 2]
Koutraki, Maria [3]
Tietz, Tabea [1, 2]
Sack, Harald [1, 2]
Affiliations
[1] Karlsruhe Inst Technol, Inst AIFB, Karlsruhe, Germany
[2] FIZ Karlsruhe, Leibniz Inst Informat Infrastruct, Karlsruhe, Germany
[3] Leibniz Univ Hannover, L3S Res Ctr, Hannover, Germany
Keywords
Word embeddings; Document vectors; Wikidata; Cultural heritage; Visualization; Recommender system
DOI
10.1007/978-3-030-32327-1_27
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Written text offers insight into the nature of past and present cultures and societies. Numerous projects have digitized historical documents and published them in digital libraries, which researchers can use as valuable resources. However, the volume of available text exceeds what humans can explore efficiently. This paper presents a framework that combines unsupervised machine learning techniques and natural language processing, using historical text documents about the 19th-century USA as an example. Named entities are extracted from semi-structured text, which is then enriched with complementary information from Wikidata. Word embeddings are leveraged to enable further analysis of the text corpus, which is visualized in a web-based application.
Pages: 136-140
Number of pages: 5