Automatic wrapper generation for multilingual Web resources

Cited by: 0
Authors
Yamada, Y [1]
Ikeda, D
Hirokawa, S
Affiliations
[1] Kyushu Univ, Grad Sch Informat Sci & Elect Engn, Fukuoka 8128581, Japan
[2] Kyushu Univ, Comp & Commun Ctr, Fukuoka 8128581, Japan
DOI
Not available
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
We present a wrapper generation system that extracts the contents of semi-structured documents containing instances of a record. Wrappers are generated automatically from general assumptions about the structure of instances. The system outputs a set of pairs of left and right delimiters, each pair surrounding the instances of a field. In addition to the input documents, the system receives a set of symbols with which a delimiter must begin or end. Because it treats semi-structured documents simply as strings, it depends on neither the markup language nor the natural language. It requires no training examples showing where instances are. We show experimental results on both static and dynamic pages gathered from 13 Web sites, marked up in HTML or XML and written in four natural languages. In addition to the usual contents, the generated wrappers extract useful information hidden in comments or tags, which other wrapper generation algorithms ignore. Some generated delimiters contain whitespace or multibyte characters.
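To make the output format concrete, the following is a minimal sketch of how a wrapper of the kind the abstract describes (one pair of left and right delimiters per field) could be applied to a document treated as a plain string. It illustrates only the extraction step, not the authors' generation algorithm. The function name `extract_records`, the sample document, and the delimiters are all hypothetical, and non-empty delimiters are assumed.

```python
from typing import List, Tuple

# A wrapper is assumed here to be an ordered list of (left, right)
# delimiter pairs, one pair per field of a record.
Wrapper = List[Tuple[str, str]]


def extract_records(document: str, wrapper: Wrapper) -> List[Tuple[str, ...]]:
    """Scan `document` left to right and return one tuple per record.

    The document is handled purely as a string: no HTML/XML parsing and
    no language-specific tokenization is performed.
    """
    if any(not left or not right for left, right in wrapper):
        raise ValueError("delimiters must be non-empty")

    records: List[Tuple[str, ...]] = []
    pos = 0
    while True:
        fields = []
        for left, right in wrapper:
            start = document.find(left, pos)
            if start == -1:
                return records  # no further (complete) record
            start += len(left)
            end = document.find(right, start)
            if end == -1:
                return records
            fields.append(document[start:end])
            pos = end + len(right)  # continue after the right delimiter
        records.append(tuple(fields))


# Hypothetical input: a multibyte field value and a value hidden in a comment.
doc = (
    "<li><b>東京</b> <!-- id:42 --></li>"
    "<li><b>Fukuoka</b> <!-- id:7 --></li>"
)
# Hypothetical wrapper; the second right delimiter begins with whitespace.
wrapper = [("<li><b>", "</b>"), ("<!-- id:", " -->")]

records = extract_records(doc, wrapper)
print(records)
```

Because matching is done on raw strings, the same function handles delimiters that contain whitespace or multibyte characters, and it can reach values inside comments that a tag-based parser would typically discard.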
Pages: 332 - 339
Page count: 8
Related papers
  • [1] Automatic wrapper generation for Web search engines
    Chidlovskii, B
    Ragetli, J
    de Rijke, M
    [J]. WEB-AGE INFORMATION MANAGEMENT, PROCEEDINGS, 2000, 1846 : 399 - 410
  • [2] Automatic generation of wrapper for data extraction from the Web
    Zhang, SZ
    Lu, ZD
    [J]. WEB ENGINEERING, PROCEEDINGS, 2003, 2722 : 390 - 394
  • [3] Semi-automatic wrapper generation for commercial web sources
    Pan, A
    Raposo, J
    Alvarez, M
    Hidalgo, J
    Viña, A
    [J]. ENGINEERING INFORMATION SYSTEMS IN THE INTERNET CONTEXT, 2002, 103 : 265 - 283
  • [4] Wrapper generation for automatic data extraction from large web sites
    Jindal, N
    [J]. DATABASES IN NETWORKED INFORMATION SYSTEMS, PROCEEDINGS, 2005, 3433 : 34 - 53
  • [5] Automatic Generation of Geospatial Metadata for Web Resources
    Florczyk, Aneta J.
    Lopez-Pellicer, Francisco J.
    Nogueras-Iso, Javier
    Zarazaga-Soria, Javier
    [J]. INTERNATIONAL JOURNAL OF SPATIAL DATA INFRASTRUCTURES RESEARCH, 2012, 7 : 151 - 172
  • [6] Multilingual access to web resources: an overview
    Large, A
    Moukdad, H
    [J]. PROGRAM-ELECTRONIC LIBRARY AND INFORMATION SYSTEMS, 2000, 34 (01) : 43 - 58
  • [7] Automatic Detection of Multilingual Dictionaries on the Web
    Grigonyte, Gintare
    Baldwin, Timothy
    [J]. PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2, 2014, : 93 - 98
  • [8] Wrapper generation for Web accessible data sources
    Gruser, JR
    Raschid, L
    Vidal, ME
    Bright, L
    [J]. 3RD IFCIS INTERNATIONAL CONFERENCE ON COOPERATIVE INFORMATION SYSTEMS - PROCEEDINGS, 1998, : 14 - 23
  • [9] Automatic support for the alignment of multilingual Web sites
    Tonella, Paolo
    Ricca, Filippo
    Pianta, Emanuele
    Girardi, Christian
    [J]. JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION-RESEARCH AND PRACTICE, 2006, 18 (03) : 153 - 179
  • [10] Automatic generation of embedded memory wrapper for multiprocessor SoC
    Gharsalli, F
    Meftali, S
    Rousseau, F
    Jerraya, AA
    [J]. 39TH DESIGN AUTOMATION CONFERENCE, PROCEEDINGS 2002, 2002, : 596 - 601