A Context-free Markup Language for Semi-structured Text

Cited by: 4

Authors:

Xi, Qian [1]

Walker, David [1]

Affiliation:

[1] Princeton Univ, Princeton, NJ 08544 USA

Source:

PLDI '10: PROCEEDINGS OF THE 2010 ACM SIGPLAN CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION | 2010

Keywords:

Domain-specific Languages; Tool Generation; Ad Hoc Data; PADS; ANNE

DOI:

10.1145/1806596.1806622

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns. The ANNE system uses a combination of user annotations and the raw data itself to extract a context-free grammar from the document. This context-free grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. In addition, the ANNE system generates a PADS/ML description [19], which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools. In addition to designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior by drawing upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.
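The annotate-then-extract pipeline described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the node classes `Raw`, `Const`, `Opt`, and `Seq`, the helper `token_class`, and the function `extract` are hypothetical and do not reflect ANNE's actual annotation syntax, its grammar extraction algorithm, or its PADS/ML output. The sketch only shows the general idea that annotated spans become grammar productions, while unannotated text is generalized from the raw data itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


# Hypothetical annotation nodes (NOT ANNE's real syntax or API).
@dataclass
class Raw:
    """Unannotated text; its token class is guessed from the data."""
    text: str


@dataclass
class Const:
    """Text the user marked as a constant literal."""
    text: str


@dataclass
class Opt:
    """Data the user marked as optional."""
    body: "Node"


@dataclass
class Seq:
    """A named sequence of annotated or raw items."""
    name: str
    items: List["Node"]


Node = Union[Raw, Const, Opt, Seq]
Grammar = Dict[str, List[List[str]]]  # nonterminal -> list of alternatives


def token_class(text: str) -> str:
    """Generalize raw text to a coarse token class (an assumed heuristic)."""
    if text.isdigit():
        return "INT"
    if text.isalpha():
        return "WORD"
    return "STRING"


def extract(node: Node, rules: Grammar) -> str:
    """Return the grammar symbol for `node`, adding productions to `rules`."""
    if isinstance(node, Const):
        return repr(node.text)  # quoted literal terminal
    if isinstance(node, Raw):
        return token_class(node.text)
    if isinstance(node, Opt):
        inner = extract(node.body, rules)
        name = f"opt_{len(rules)}"
        rules[name] = [[inner], []]  # either the body or the empty string
        return name
    if isinstance(node, Seq):
        rules[node.name] = [[extract(item, rules) for item in node.items]]
        return node.name
    raise TypeError(f"unknown node: {node!r}")


# A made-up log line "ERROR 404 retry" with hypothetical annotations.
doc = Seq("entry", [Const("ERROR"), Raw("404"), Opt(Raw("retry"))])
rules: Grammar = {}
start = extract(doc, rules)

for lhs, alternatives in rules.items():
    shown = " | ".join(" ".join(alt) if alt else "ε" for alt in alternatives)
    print(f"{lhs} ::= {shown}")
```

In this sketch the constant annotation pins down the literal `ERROR`, while the unannotated `404` is generalized to an integer token, mirroring the abstract's statement that the grammar comes from a combination of user annotations and the raw data.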

Pages: 221-232

Page count: 12
