A Context-free Markup Language for Semi-structured Text

Cited by: 4
Authors
Xi, Qian [1]
Walker, David [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Keywords
Domain-specific Languages; Tool Generation; Ad Hoc Data; PADS; ANNE
DOI
10.1145/1806596.1806622
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns. The ANNE system uses a combination of user annotations and the raw data itself to extract a context-free grammar from the document. This context-free grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. In addition, the ANNE system generates a PADS/ML description [19], which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools. In addition to designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior by drawing upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.
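The grammar extraction step described in the abstract can be illustrated with a rough sketch. This is not ANNE's actual annotation syntax or algorithm: the node names (`Const`, `Raw`, `Opt`, `Seq`), the token classes (`INT`, `WORD`), and the sample entry `port=8080 tcp` are all hypothetical, chosen only to show how user annotations and the raw data might jointly determine a context-free grammar.

```python
# Toy illustration only. The node names and token classes below are
# assumptions made for this sketch; they are not ANNE's real syntax.
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Const:          # annotated constant: the literal text is kept as-is
    text: str


@dataclass
class Raw:            # unannotated data: its class is inferred from the text
    text: str


@dataclass
class Opt:            # annotated optional data
    body: "Node"


@dataclass
class Seq:            # annotated, named sequence of parts
    name: str
    parts: List["Node"]


Node = Union[Const, Raw, Opt, Seq]


def token_class(text: str) -> str:
    """Infer a terminal class from raw data (a stand-in for base types)."""
    return "INT" if re.fullmatch(r"[0-9]+", text) else "WORD"


def extract(node: Node, rules: Dict[str, list]) -> str:
    """Return the grammar symbol for `node`, adding productions to `rules`."""
    if isinstance(node, Const):
        return repr(node.text)              # quoted literal terminal
    if isinstance(node, Raw):
        return token_class(node.text)       # terminal inferred from the data
    if isinstance(node, Opt):
        name = f"Opt{len(rules)}"
        rules[name] = None                  # reserve the name before recursing
        rules[name] = [[extract(node.body, rules)], []]  # body or empty
        return name
    rules[node.name] = None                 # Seq: reserve, then fill in
    rules[node.name] = [[extract(p, rules) for p in node.parts]]
    return node.name


def show(rules: Dict[str, list]) -> List[str]:
    """Format each nonterminal's alternatives as one production line."""
    return [
        f"{lhs} -> " + " | ".join(" ".join(alt) if alt else "ε" for alt in alts)
        for lhs, alts in rules.items()
    ]


# Hypothetical annotated form of the raw text "port=8080 tcp".
doc = Seq("Entry", [Const("port="), Raw("8080"), Opt(Raw("tcp"))])
rules: Dict[str, list] = {}
start = extract(doc, rules)
print("\n".join(show(rules)))
```

In this sketch the annotations supply the structure (a constant, a sequence, an optional part), while the unannotated text supplies the terminal classes, which loosely mirrors the combination of user annotations and raw data that the abstract describes.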
Pages: 221-232
Page count: 12