Finding Parallel Passages in Cultural Heritage Archives

Cited by: 11
Authors
Harris, Martyn [1]
Levene, Mark [1]
Zhang, Dell [1]
Levene, Dan [2]
Affiliations
[1] Birkbeck Univ London, Malet St, London WC1E 7HX, England
[2] Southampton Univ, Univ Rd, Southampton SO17 1BJ, Hants, England
Source
Funding
Wellcome Trust (UK);
Keywords
Digital archives; information retrieval; statistical language models; suffix trees; SEARCH;
DOI
10.1145/3195727
Chinese Library Classification (CLC) number
C [Social Sciences, General];
Discipline classification code
03 ; 0303 ;
Abstract
It is of great interest to researchers and scholars in many disciplines (particularly those working on cultural heritage projects) to study parallel passages (i.e., identical or similar pieces of text describing the same thing) in digital text archives. Although a few software tools exist for this purpose, they are restricted to a specific domain (e.g., the Bible) or a specific language (e.g., Hebrew). In this article, we present in detail how we build a digital infrastructure that can facilitate the search and discovery of parallel passages for any domain in any language. It is at the core of our Samtla (Search And Mining Tools with Linguistic Analysis) system, designed in collaboration with historians and linguists. The system has already been used to support research on five large text corpora that span a number of different domains and languages. The key to such a domain-independent and language-independent digital infrastructure is a novel combination of a character-based n-gram language model, a space-optimized suffix tree, and a generalized edit distance. A comprehensive evaluation through crowdsourcing shows that the effectiveness of our system's search functionality is on par with human-level performance.
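As a rough illustration of the first ingredient named in the abstract, the sketch below ranks documents against a query with a character-based n-gram language model. It is not the Samtla implementation: the class name `CharNgramModel`, the trigram order, the linear interpolation weight `lam`, the add-one smoothing of the collection model, and the two toy documents are all assumptions made for this example. It omits the suffix tree and the generalized edit distance entirely.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramModel:
    """Hypothetical query-likelihood ranker over character n-grams.

    Each document model is interpolated with a collection-wide model
    (an assumed smoothing choice, not taken from the article).
    """

    def __init__(self, docs, n=3, lam=0.5):
        self.n = n
        self.lam = lam  # weight of the document model; must be below 1
        self.doc_counts = {
            name: Counter(char_ngrams(text, n)) for name, text in docs.items()
        }
        self.doc_totals = {
            name: sum(counts.values()) for name, counts in self.doc_counts.items()
        }
        self.coll_counts = Counter()
        for counts in self.doc_counts.values():
            self.coll_counts.update(counts)
        self.coll_total = sum(self.coll_counts.values())
        self.vocab = len(self.coll_counts)

    def score(self, query, name):
        """Log-likelihood of the query's n-grams under one document."""
        counts = self.doc_counts[name]
        total = self.doc_totals[name]
        logp = 0.0
        for gram in char_ngrams(query, self.n):
            p_doc = counts[gram] / total if total else 0.0
            # Add-one smoothing keeps unseen n-grams at non-zero probability.
            p_coll = (self.coll_counts[gram] + 1) / (self.coll_total + self.vocab + 1)
            logp += math.log(self.lam * p_doc + (1 - self.lam) * p_coll)
        return logp

    def rank(self, query):
        """Document names sorted from most to least likely."""
        return sorted(
            self.doc_counts,
            key=lambda name: self.score(query, name),
            reverse=True,
        )


if __name__ == "__main__":
    # Toy corpus, invented for this example.
    docs = {
        "a": "in the beginning god created the heaven and the earth",
        "b": "the quick brown fox jumps over the lazy dog",
    }
    model = CharNgramModel(docs)
    # The query contains a spelling variant ("begining").
    print(model.rank("in the begining"))
```

Because the model works on characters rather than words, it needs no tokenizer or stemmer, and a query with a spelling variant still shares most of its n-grams with the matching passage. This is one plausible reading of why a character-level model suits a language-independent system.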
Pages: 1 / 24
Page count: 24
Related papers
50 in total
  • [1] Comparing "parallel passages" in digital archives
    Harris, Martyn
    Levene, Mark
    Zhang, Dell
    Levene, Dan
    [J]. JOURNAL OF DOCUMENTATION, 2020, 76 (01) : 271 - 289
  • [2] Archives and children's cultural heritage
    Sparrman, Anna
    Sjoberg, Johanna
    Hrechaniuk, Yelyzaveta
    Kopsell, Linn
    Isaksson, Karin
    Eriksson, Maria
    Orrmalm, Alex
    Venalainen, Paeivi
    Agren, Ylva
    Coulter, Natalie
    Kjellman, Ulrika
    Aarsand, Pal
    Tesar, Marek
    Sanchez-Eppler, Karen
    Wells, Elizabeth
    [J]. ARCHIVES AND RECORDS-THE JOURNAL OF THE ARCHIVES AND RECORDS ASSOCIATION, 2023,
  • [3] Digital Archives and Cultural Heritage: The Inatheque
    Andolfi, Lea
    [J]. AMERICAN JOURNALISM, 2023, 40 (02) : 258 - 259
  • [4] The management of intangible cultural heritage archives of art from the perspective of cultural heritage
    Yan, Wenming
    [J]. BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2020, 126 : 173 - 174
  • [5] THE PERSONAL ARCHIVES AND THEIR IMPORTANCE AS DOCUMENTARY AND CULTURAL HERITAGE
    Svicero, Thais Jeronimo
    [J]. HISTORIA E CULTURA, 2013, 2 (01) : 221 - 237
  • [6] The Repository of Cultural Heritage Research Information: "E-Connect to Cultural Heritage Knowledge" (Archives of Cultural Heritage Research Information)
    Baek, Ju-hyun
    [J]. REVIEW OF KOREAN STUDIES, 2023, 26 (02) : 199 - 220
  • [7] A Computational Framework for Organizing and Querying Cultural Heritage Archives
    de Mooij, Jan
    Kurtan, Can
    Baas, Jurian
    Dastani, Mehdi
    [J]. ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE, 2022, 15 (03):
  • [8] Information searching in cultural heritage archives: a user study
    Borlund, Pia
    Pharo, Nils
    Liu, Ying-Hsang
    [J]. JOURNAL OF DOCUMENTATION, 2024, 80 (04) : 978 - 1002
  • [9] CULTURAL HERITAGE - COMMON DENOMINATOR FOR LIBRARIES, ARCHIVES AND MUSEUMS
    Karun, Breda
    [J]. BOSNIACA-JOURNAL OF THE NATIONAL AND UNIVERSITY LIBRARY OF BOSNIA AND HERZEGOVINA, 2007, (12) : 63 - 67
  • [10] Citizen Experiences in Cultural Heritage Archives: A Data Journey
    Daga, Enrico
    [J]. KNOWLEDGE ORGANIZATION, 2024, 51 (05) : 310 - 319