Lightweight structured text processing

Cited by: 0
Authors
Miller, RC [1]
Myers, BA [1]
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
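Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract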
Text is a popular storage and distribution format for information, partly due to generic text-processing tools like Unix grep and sort. Unfortunately, existing generic tools make assumptions about text format (e.g., each line is a record) that limit their applicability. Custom-built tools are one alternative, but they require substantial time investment and programming expertise. We describe a new approach, lightweight structured text processing, which overcomes these difficulties by enabling users to define text structure interactively and manipulate the structure with generic tools. Our prototype system, LAPIS, is a web browser that can highlight, filter, and sort text regions described by the user. LAPIS has several advantages over other systems: (1) the ability to define custom structure with a simple, intuitive pattern language; (2) interactive specification, showing pattern matches in context and letting users choose the most convenient combination of manual selection and pattern matching; and (3) external parsers for standard text formats. The pattern language in LAPIS, text constraints, describes text structure in high-level terms, with region relationships like before, after, in, and contains. We describe an implementation of text constraints using a novel, compact representation of region sets as collections of rectangles, or region intervals. We also illustrate some examples of applying LAPIS to web pages, text files, and source code.
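
The region relationships named in the abstract (before, after, in, contains) can be illustrated with a minimal Python sketch. This is not LAPIS code or its pattern language: the `Region` class, the helper functions, and the sample text are all hypothetical, and region sets are held in plain lists rather than the rectangle-based representation the paper describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A span of text as character offsets (hypothetical type, not from LAPIS)."""
    start: int  # inclusive
    end: int    # exclusive


def before(a: Region, b: Region) -> bool:
    """True if a ends at or before the point where b starts."""
    return a.end <= b.start


def after(a: Region, b: Region) -> bool:
    """True if a starts at or after the point where b ends."""
    return a.start >= b.end


def inside(a: Region, b: Region) -> bool:
    """True if a lies entirely within b (the 'in' relationship)."""
    return b.start <= a.start and a.end <= b.end


def contains(a: Region, b: Region) -> bool:
    """True if b lies entirely within a."""
    return inside(b, a)


def line_regions_of(text: str) -> list[Region]:
    """One region per line, excluding the newline character."""
    regions, pos = [], 0
    for line in text.split("\n"):
        regions.append(Region(pos, pos + len(line)))
        pos += len(line) + 1  # skip past the newline
    return regions


def literal_regions_of(text: str, word: str) -> list[Region]:
    """One region per occurrence of a literal string."""
    regions, i = [], text.find(word)
    while i != -1:
        regions.append(Region(i, i + len(word)))
        i = text.find(word, i + 1)
    return regions


def select(candidates, relation, others):
    """Keep each candidate that stands in `relation` to at least one other region."""
    return [c for c in candidates if any(relation(c, o) for o in others)]


text = "alice 30\nbob 25\ncarol 41"
line_regions = line_regions_of(text)
bob_regions = literal_regions_of(text, "bob")

# "Lines containing 'bob'", expressed as a region relationship.
matches = select(line_regions, contains, bob_regions)
print([text[r.start:r.end] for r in matches])  # prints ['bob 25']
```

Here the record structure (lines) and the pattern (a literal string) are both just sets of regions, so the same `select` helper works with any of the four relationships. A naive list scan like this takes time proportional to the product of the two set sizes; the abstract indicates that LAPIS uses a more compact representation instead.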
Pages: 131-144
Number of pages: 14