SPECTRa-T: Machine-Based Data Extraction and Semantic Searching of Chemistry e-Theses

Times cited: 6
Authors
Downing, Jim [1]
Harvey, Matt J. [4]
Morgan, Peter B. [2]
Murray-Rust, Peter [1]
Rzepa, Henry S. [3]
Stewart, Diana C. [1]
Tonge, Alan P. [1]
Townsend, Joe A. [1]
Affiliations
[1] Univ Cambridge, Dept Chem, Unilever Ctr Mol Informat, Cambridge CB2 1EW, England
[2] Cambridge Univ Lib, Cambridge CB3 9DR, England
[3] Univ London Imperial Coll Sci Technol & Med, Dept Chem, London SW7 2AZ, England
[4] Univ London Imperial Coll Sci Technol & Med, ICT, High Performance Comp Unit, London SW7 2AZ, England
Keywords
CHEMICAL MARKUP; WEB; XML
DOI
10.1021/ci9003688
Chinese Library Classification number
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
The SPECTRa-T project has developed text-mining tools to extract named chemical entities (NCEs), such as chemical names and terms, and chemical objects (COs), e.g., experimental spectral assignments and physical chemistry properties, from electronic theses (e-theses). Although NCEs were readily identified within the two major document formats studied, only the use of structured documents enabled identification of chemical objects and their association with the relevant chemical entity (e.g., systematic chemical name). A corpus of theses was analyzed and it is shown that a high degree of semantic information can be extracted from structured documents. This integrated information has been deposited in a persistent Resource Description Framework (RDF) triple-store that allows users to conduct semantic searches. The strengths and weaknesses of several document formats are reviewed.
Pages: 251-261
Page count: 11