Building A Large Collection of Multi-domain Electronic Theses and Dissertations

Cited by: 5
Authors
Uddin, Sami [1]
Banerjee, Bipasha [2]
Wu, Jian [1]
Ingram, William A. [3]
Fox, Edward A. [2]
Affiliations
[1] Old Dominion Univ, Comp Sci, Norfolk, VA 23529 USA
[2] Virginia Polytech Inst & State Univ, Comp Sci, Blacksburg, VA USA
[3] Virginia Polytech Inst & State Univ, Univ Lib, Blacksburg, VA USA
Keywords
ETD; OAI-PMH; Big data
DOI
10.1109/BigData52589.2021.9672058
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we report our progress on building a collection of over 450k Electronic Theses and Dissertations (ETDs), including full text and metadata. Our goal is to narrow the accessibility gap between long-text and short-text documents, and to create a new research opportunity for the scholarly community. To that end, we developed an ETD Ingestion Framework (EIF) that automatically harvests metadata and PDFs of ETDs from university libraries. We faced multiple challenges and learned many lessons during the process, which led us to propose solutions that overcome or mitigate the limitations of the current data. We also describe the data that we have collected. We hope our methods will be useful for building similar collections from university libraries and that the data can be used for research and education.
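The abstract names OAI-PMH as a keyword but does not detail how EIF harvests records. The following is a minimal sketch of one harvesting step, not the authors' implementation: it parses a single OAI-PMH `ListRecords` response in the standard `oai_dc` format. The function name `parse_list_records`, the `SAMPLE` response, and the `example.edu` identifiers are hypothetical. Treating any `dc:identifier` that starts with `http` as a candidate landing-page or PDF link is also an assumption; repositories differ in where they expose full-text URLs.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_list_records(xml_text):
    """Parse one ListRecords response (hypothetical helper).

    Returns (records, resumption_token). The token is None when the
    response carries no further pages.
    """
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        # Deleted records carry a header but no metadata; skip them.
        if header is None or header.get("status") == "deleted":
            continue
        records.append({
            "identifier": header.findtext(
                "oai:identifier", default="", namespaces=NS),
            "titles": [e.text or "" for e in rec.iterfind(".//dc:title", NS)],
            # Assumption: URL-valued dc:identifier fields point at the ETD.
            "links": [e.text for e in rec.iterfind(".//dc:identifier", NS)
                      if (e.text or "").startswith("http")],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


# Hypothetical response used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.edu:etd-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Thesis</dc:title>
          <dc:identifier>https://example.edu/etd-1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted"><identifier>oai:example.edu:etd-2</identifier></header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
```

In a full harvester, the first request to a repository's base URL would typically use `verb=ListRecords&metadataPrefix=oai_dc`, and each later request would pass the returned `resumptionToken` until none is returned.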
Pages: 6043 - 6045
Number of pages: 3
Related papers
50 in total
  • [21] Modeling and Simulation Semantics for Building Large-Scale Multi-Domain Embedded Systems
    Carl, Joshua D.
    Lattmann, Zsolt
    Biswas, Gautam
    PROCEEDINGS 27TH EUROPEAN CONFERENCE ON MODELLING AND SIMULATION ECMS 2013, 2013, : 93 - +
  • [22] Morphing metadata: maximizing access to electronic theses and dissertations
    McCutcheon, Sevim
    Kreyche, Michael
    Maurer, Margaret Beecher
    Nickerson, Joshua
    LIBRARY HI TECH, 2008, 26 (01) : 41 - 57
  • [23] Integrated framework for electronic theses and dissertations in Korean contexts
    Park, Eun G.
    Nam, Young-joon
    Oh, Sanghee
    JOURNAL OF ACADEMIC LIBRARIANSHIP, 2007, 33 (03) : 338 - 346
  • [24] Nurse scholars' knowledge and use of electronic theses and dissertations
    Goodfellow, L. M.
    Macduff, C.
    Leslie, G.
    Copeland, S.
    Nolfi, D.
    Blackwood, D.
    INTERNATIONAL NURSING REVIEW, 2012, 59 (04) : 511 - 518
  • [25] Long-term retention of electronic theses and dissertations
    Teper, TH
    Kraemer, B
    COLLEGE & RESEARCH LIBRARIES, 2002, 63 (01) : 61 - 72
  • [26] Towards Multi-Domain and Multi-Physical Electronic Design
    Crepaldi, Marco
    Sanginario, Alessandro
    Ros, Paolo Motto
    Grosso, Michelangelo
    Sassone, Alessandro
    Poncino, Massimo
    Macii, Enrico
    Rinaudo, Salvatore
    Gangemi, Giuliana
    Demarchi, Danilo
    IEEE CIRCUITS AND SYSTEMS MAGAZINE, 2015, 15 (03) : 18 - 43
  • [27] Large Scale, Multi-domain Language Identification
    Jauhiainen, Tommi
    Zampieri, Marcos
    Baldwin, Timothy
    Lindén, Krister
    Synthesis Lectures on Human Language Technologies, 2024, Part F2039 : 117 - 135
  • [28] Building multi-domain conversational systems from single domain resources
    Griol, David
    Molina, Jose Manuel
    NEUROCOMPUTING, 2018, 271 : 59 - 69
  • [29] Using large language models to write theses and dissertations
    O'Leary, Daniel E.
    INTELLIGENT SYSTEMS IN ACCOUNTING FINANCE & MANAGEMENT, 2023, 30 (04) : 228 - 234
  • [30] LabTablet: Semantic Metadata Collection on a Multi-domain Laboratory Notebook
    Amorim, Ricardo Carvalho
    Castro, Joao Aguiar
    da Silva, Joao Rocha
    Ribeiro, Cristina
    METADATA AND SEMANTICS RESEARCH, MTSR 2014, 2014, 478 : 193 - 205