A STRUCTURAL APPROACH TO EXTRACTING CHINESE POSITION RELATIONS FROM WEB PAGES

Authors
Jin, Peiquan [1]
Yang, Jia [1]
Zhao, Jie [2]
Liu, Yanhong [1]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Anhui Univ, Hefei, Peoples R China
Source
JOURNAL OF WEB ENGINEERING | 2013 / Vol. 12 / No. 05
Funding
National Science Foundation (United States);
Keywords
Position Relation; Relation Extraction; Structural File Segment;
DOI
Not available
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
The use of position relations, which describe the positions people hold in an organization, can serve enterprises as a significant competitive intelligence method. The rapid growth of data on the Web brings new opportunities to extract position relations of interest. In this paper, we propose a new algorithm to extract position relations from the Web. Our algorithm is based on the structural feature of position relations on the Web, i.e., a position relation is usually presented in Web pages as a table or a list. To define the structural feature of Web content, we first introduce a structural coefficient for each Web page, which is then used to generate structural file segments for Web pages. A structural file segment consists of all candidate position relations that share a similar structure. We then employ a pattern-matching method to extract position relations from the structural file segments. Finally, we conduct experiments on a real data set of 6028 Chinese Web pages gathered through the Baidu search engine and evaluate the precision and recall of our approach. The experimental results show that our algorithm achieves a precision over 96% and a recall over 87%.
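As a rough illustration of the idea that position relations tend to appear in tables and can be picked out by pattern matching, the following minimal Python sketch collects table rows and keeps two-cell rows whose second cell ends in a known position title. It is not the paper's algorithm: the structural coefficient and structural file segments are not specified in the abstract and are not implemented here, and the names `POSITION_PATTERN`, `RowCollector`, and `extract_position_relations`, the title vocabulary, and the sample HTML are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical vocabulary of position titles (assumption; the paper's
# actual patterns are not given in the abstract).
POSITION_PATTERN = re.compile(r"(董事长|总经理|副总经理|总监|经理)$")


class RowCollector(HTMLParser):
    """Collects each <tr> as a list of stripped cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_position_relations(html):
    """Return (person, position) pairs from two-cell rows whose second cell
    ends with a title from POSITION_PATTERN."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [
        (row[0], row[1])
        for row in parser.rows
        if len(row) == 2 and POSITION_PATTERN.search(row[1])
    ]


# Made-up sample page fragment: a header row followed by two data rows.
sample = (
    "<table>"
    "<tr><th>姓名</th><th>职务</th></tr>"
    "<tr><td>张三</td><td>董事长</td></tr>"
    "<tr><td>李四</td><td>总经理</td></tr>"
    "</table>"
)
relations = extract_position_relations(sample)
print(relations)
```

The header row is skipped because its second cell does not match the title pattern; a real system would also need to handle lists, nested tables, and rows with more than two cells.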
Pages: 363-382
Page count: 20