An Analysis of Characters and Structures of Web Pages Based on Regular Expressions

Citations: 0
Author
Xu, Lei [1]
Affiliation
[1] Hubei Univ, Fac Phys & Elect Sci, Wuhan, Peoples R China
Keywords
information extraction; HTML; regular expressions
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
This paper introduces a method for analyzing the characters and structures of web pages using regular expressions. Characters in web pages are counted one by one, from the encoding declaration through to the HTML elements. The tool's effectiveness is demonstrated in experiments on more than one hundred real-world web pages. The work is intended as groundwork for large-scale web information extraction.
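The abstract does not give the paper's actual patterns, so the following is only a minimal sketch of what regex-based page analysis of this kind might look like. The patterns `CHARSET_RE`, `TAG_RE`, and `ANY_TAG_RE`, the function `analyze_page`, and the sample page are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the paper's own expressions are not given.
CHARSET_RE = re.compile(r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
TAG_RE = re.compile(r'<\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>')  # opening tags only
ANY_TAG_RE = re.compile(r'<[^>]+>')                          # any tag, open or close


def analyze_page(html: str) -> dict:
    """Report the declared encoding, per-element counts, and character totals."""
    match = CHARSET_RE.search(html)
    charset = match.group(1).lower() if match else None

    # Count how often each element name appears as an opening tag.
    tag_counts = Counter(name.lower() for name in TAG_RE.findall(html))

    # Split the character total into markup and everything else.
    markup_chars = sum(len(tag) for tag in ANY_TAG_RE.findall(html))
    return {
        "charset": charset,
        "tag_counts": tag_counts,
        "total_chars": len(html),
        "markup_chars": markup_chars,
        "text_chars": len(html) - markup_chars,
    }


sample = ('<html><head><meta charset="utf-8"></head>'
          '<body><p>Hi</p><p>Web</p></body></html>')
report = analyze_page(sample)
```

Regular expressions are not a full HTML parser: comments, scripts, and attribute values containing `>` would skew these counts, so a sketch like this suits quick surveys of page composition rather than exact parsing.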
Pages: 4