Web information extraction using generalized hidden Markov model

Cited by: 0
Authors
Zhong, Ping [1]
Chen, Jinlin [2]
Cook, Terry [1]
Affiliations
[1] CUNY, Grad Ctr, Dept Comp Sci, New York, NY 10021 USA
[2] CUNY, Grad Ctr, Queens Coll, Dept Comp Sci, New York, NY 10021 USA
Keywords
hidden Markov model; information extraction; layout analysis; web
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The Hidden Markov Model (HMM) is an important approach to information extraction (IE). When applied to Web IE, however, HMM-based approaches suffer from several problems because they do not take Web-specific features into account. In this paper we present a Generalized Hidden Markov Model (GHMM) that extends traditional HMMs by exploiting Web-specific information for Web IE. First, we use the Web content block, rather than the term, as the basic extraction unit. Second, instead of using the traditional sequential state transition order, we determine the state transition order of the GHMM from the layout structure of the corresponding Web page. Third, we use multiple emission features instead of a single emission feature. In this way the GHMM can better accommodate Web IE. Experiments show promising results compared with traditional HMM-based Web IE.
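The three ideas in the abstract (content blocks as units, a layout-derived block order, and several emission features per block) can be illustrated with a small sketch. This is not the authors' implementation: the state names, the feature names `font` and `length`, all probabilities, the top-to-bottom sort used as a stand-in for layout analysis, and the assumption that features are independent given the state are hypothetical choices made only for illustration.

```python
import math

# Hypothetical label set and parameters; none of these values come from the paper.
STATES = ["title", "body"]
START = {"title": 0.8, "body": 0.2}
TRANS = {
    "title": {"title": 0.1, "body": 0.9},
    "body": {"title": 0.2, "body": 0.8},
}
# One emission table per feature (assumed conditionally independent given the state).
EMIT = {
    "font": {
        "title": {"large": 0.9, "small": 0.1},
        "body": {"large": 0.2, "small": 0.8},
    },
    "length": {
        "title": {"short": 0.8, "long": 0.2},
        "body": {"short": 0.3, "long": 0.7},
    },
}


def layout_order(blocks):
    """Order content blocks top-to-bottom, then left-to-right.

    A simple stand-in for layout analysis; the paper's actual ordering
    procedure is not described in the abstract.
    """
    return sorted(blocks, key=lambda b: (b["y"], b["x"]))


def emission_logp(state, block):
    """Sum the log-probabilities of every emission feature of one block."""
    return sum(math.log(EMIT[f][state][block[f]]) for f in EMIT)


def viterbi(blocks):
    """Return the most likely state sequence for an ordered list of blocks."""
    scores = {s: math.log(START[s]) + emission_logp(s, blocks[0]) for s in STATES}
    back = []
    for block in blocks[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            scores[s] = prev[best] + math.log(TRANS[best][s]) + emission_logp(s, block)
            pointers[s] = best
        back.append(pointers)
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    page = [
        {"x": 0, "y": 100, "font": "small", "length": "long"},
        {"x": 0, "y": 0, "font": "large", "length": "short"},
    ]
    print(viterbi(layout_order(page)))
```

Each block is labeled as a whole rather than term by term, and the decoder sees the blocks in layout order rather than in source order, which is the part a real system would replace with proper layout analysis.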
Pages: 142 / +
Number of pages: 2