An Annotated Corpus and Method for Analysis of Ad-Hoc Structures Embedded in Text

Times cited: 0
Authors
Yeh, Eric [1]
Niekrasz, John [1]
Freitag, Dayne [1]
Rohwer, Richard [1]
Affiliations
[1] SRI Int, 333 Ravenswood Ave, Menlo Pk, CA 94025 USA
Keywords
table recognition; semistructured data; information extraction; INFORMATION;
DOI
Not available
Chinese Library Classification number
H [Language, Writing];
Discipline classification code
05;
Abstract
We describe a method for identifying and performing functional analysis of structured regions that are embedded in natural language documents, such as tables or key-value lists. Such regions often encode information according to ad hoc schemas and avail themselves of visual cues in place of natural language grammar, presenting problems for standard information extraction algorithms. Unlike previous work in table extraction, which assumes a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of naturally occurring structure types. Our approach has three main parts. First, we collect and annotate a diverse sample of "naturally" occurring structures from several sources. Second, we use probabilistic text segmentation techniques, featurized by skip bigrams over spatial and token category cues, to automatically identify contiguous regions of structured text that share a common schema. Finally, we identify the records and fields within each structured region using a combination of distributional similarity and sequence alignment methods, guided by minimal supervision in the form of a single annotated record. We evaluate the last two components individually, and conclude with a discussion of further work.
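As a rough illustration of what "skip bigrams over spatial and token category cues" could look like, the sketch below maps a line of text to coarse categories and counts ordered category pairs that are at most a few positions apart. It is not the authors' implementation: the category inventory (`GAP`, `NUM`, `WORD`, `PUNCT`), the regular expression, and the names `token_categories`, `skip_bigrams`, and `max_skip` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: tabs or runs of 2+ spaces count as a spatial cue (GAP);
# single spaces are ignored; everything else is a number, a word, or punctuation.
_TOKEN = re.compile(r"\t| {2,}|\d+(?:[.,]\d+)*|[A-Za-z]+|[^\sA-Za-z\d]")

def token_categories(line):
    """Map a line to a list of coarse category labels (assumed inventory)."""
    cats = []
    for match in _TOKEN.finditer(line):
        tok = match.group()
        if tok[0] in " \t":
            cats.append("GAP")
        elif tok[0].isdigit():
            cats.append("NUM")
        elif tok[0].isalpha():
            cats.append("WORD")
        else:
            cats.append("PUNCT")
    return cats

def skip_bigrams(cats, max_skip=2):
    """Count ordered category pairs separated by at most max_skip positions."""
    feats = Counter()
    for i, left in enumerate(cats):
        for j in range(i + 1, min(i + 2 + max_skip, len(cats))):
            feats[(left, cats[j])] += 1
    return feats

if __name__ == "__main__":
    cats = token_categories("Name:  Alice   42")
    print(cats)
    print(skip_bigrams(cats, max_skip=1))
```

Feature counts of this kind could then feed whatever probabilistic segmentation model is used; the abstract does not specify that model, so none is shown here.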
Pages: 2063-2070
Page count: 8