A STRUCTURAL APPROACH TO EXTRACTING CHINESE POSITION RELATIONS FROM WEB PAGES

Cited by: 0
Authors
Jin, Peiquan [1]
Yang, Jia [1]
Zhao, Jie [2]
Liu, Yanhong [1]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Anhui Univ, Hefei, Peoples R China
Source
JOURNAL OF WEB ENGINEERING | 2013 / Vol. 12 / No. 05
Funding
National Science Foundation (United States);
Keywords
Position Relation; Relation Extraction; Structural File Segment;
DOI
Not available
Chinese Library Classification number
TP31 [Computer software];
Discipline classification codes
081202; 0835;
Abstract
Position relations, which describe the positions that people hold in an organization, are a valuable source of competitive intelligence for enterprises. The rapid growth of data on the Web creates new opportunities to extract position relations of interest. In this paper, we propose a new algorithm for extracting position relations from the Web. The algorithm exploits a structural feature of position relations on the Web: they are usually presented in Web pages as a table or a list. To capture this structural feature, we first introduce a structural coefficient for each Web page and use it to generate structural file segments. A structural file segment consists of all candidate position relations that share a similar structure. We then apply a pattern-matching method to extract position relations from the structural file segments. Finally, we conduct experiments on a real data set of 6028 Chinese Web pages gathered through the Baidu search engine and evaluate the precision and recall of our approach. The experimental results show that our algorithm achieves a precision of over 96% and a recall of over 87%.
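The general idea of pattern matching over table rows and list items can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not define the structural coefficient or the matching patterns, so the sketch skips the coefficient entirely and treats each `<tr>` or `<li>` as one candidate. The title vocabulary `POSITION_TITLES`, the class `RowCollector`, and the function `extract_relations` are hypothetical names introduced only for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical title vocabulary; the paper's real patterns are not given in the abstract.
POSITION_TITLES = ("董事长", "副总经理", "总经理", "财务总监", "董事")
# Try longer titles first so that "副总经理" is preferred over "总经理".
TITLE_RE = re.compile("|".join(sorted(POSITION_TITLES, key=len, reverse=True)))


class RowCollector(HTMLParser):
    """Collects the text cells of every <tr> and <li> element."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._cells = None  # cells of the row being read, None outside a row
        self._buf = []      # text fragments of the current cell

    def _flush_cell(self):
        text = "".join(self._buf).strip()
        self._buf = []
        if text and self._cells is not None:
            self._cells.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in ("tr", "li"):
            self._cells = []
            self._buf = []
        elif tag in ("td", "th"):
            self._flush_cell()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "li") and self._cells is not None:
            self._flush_cell()
            if self._cells:
                self.rows.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        if self._cells is not None:
            self._buf.append(data)


def extract_relations(html):
    """Return (person, position) pairs from rows that contain a known title."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    relations = []
    for cells in parser.rows:
        text = " ".join(cells)
        match = TITLE_RE.search(text)
        if not match:
            continue
        # Assumption: what remains of the row, minus separators, is the person's name.
        person = (text[:match.start()] + text[match.end():]).strip(" ::,,、\t")
        if person:
            relations.append((person, match.group()))
    return relations


sample = (
    "<table><tr><th>姓名</th><th>职务</th></tr>"
    "<tr><td>张三</td><td>董事长</td></tr>"
    "<tr><td>李四</td><td>副总经理</td></tr></table>"
    "<ul><li>财务总监:王五</li><li>联系我们</li></ul>"
)
print(extract_relations(sample))
```

On this sample the header row and the "联系我们" item are skipped because they contain no known title, while the two table rows and the first list item each yield one pair. A real system would need a far larger title vocabulary and proper name recognition; the sketch only shows why tabular and list structure makes candidates easy to isolate.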
Pages: 363-382
Page count: 20