Automatic segmentation of text into structured records

Authors
Borkar, V [1]
Deshmukh, K [1]
Sarawagi, S [1]
Affiliation
[1] Indian Inst Technol, Bombay 400076, Maharashtra, India
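Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract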
In this paper we present a method for automatically segmenting unformatted text records into structured elements. Many useful data sources today are human-generated as continuous text, whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem: transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded an accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
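As a rough illustration of the HMM idea mentioned in the abstract, the sketch below labels each token of an address with a field and decodes the most likely label sequence with the Viterbi algorithm. It is a minimal toy under stated assumptions, not DATAMOLD itself: the three states, all probabilities, the coarse token classes, and the helper names (`token_class`, `viterbi`, `segment`) are hypothetical and hand-set, whereas the paper's tool learns its model from training examples and also uses length distributions and an optional dictionary, which are omitted here.

```python
import math

# Hypothetical toy model: states, probabilities, and token classes are
# illustrative assumptions, not parameters learned by DATAMOLD.
STATES = ["HouseNo", "Street", "City"]

START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street": {"HouseNo": 0.05, "Street": 0.55, "City": 0.4},
    "City": {"HouseNo": 0.05, "Street": 0.05, "City": 0.9},
}
# Emission probabilities over coarse token classes rather than raw words.
EMIT = {
    "HouseNo": {"digits": 0.9, "street_word": 0.05, "other": 0.05},
    "Street": {"digits": 0.1, "street_word": 0.5, "other": 0.4},
    "City": {"digits": 0.05, "street_word": 0.05, "other": 0.9},
}
STREET_WORDS = {"street", "st", "road", "rd", "avenue", "ave", "lane"}


def token_class(token):
    """Map a raw token to one of the coarse observation classes."""
    t = token.lower().strip(".,")
    if t.isdigit():
        return "digits"
    if t in STREET_WORDS:
        return "street_word"
    return "other"


def viterbi(tokens):
    """Return the most likely state sequence for the tokens (log space)."""
    if not tokens:
        return []
    obs = [token_class(t) for t in tokens]
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one back-pointer table per token after the first
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def segment(text):
    """Split whitespace-separated text into fields keyed by decoded state."""
    tokens = text.split()
    fields = {}
    for tok, label in zip(tokens, viterbi(tokens)):
        fields.setdefault(label, []).append(tok)
    return {label: " ".join(toks) for label, toks in fields.items()}
```

With these hand-set numbers, `segment("221 Baker Street London")` returns `{"HouseNo": "221", "Street": "Baker Street", "City": "London"}`. Different probabilities or token classes would change the result, which is why a trained model is needed in practice.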
Pages: 175-186
Number of pages: 12