Unsupervised Keyphrase Extraction for Web Pages

Cited by: 3
Authors
Haarman, Tim [1, 2]
Zijlema, Bastiaan [2]
Wiering, Marco [1]
Affiliations
[1] Univ Groningen, Bernoulli Inst, Dept Artificial Intelligence, POB 407, NL-9700 AK Groningen, Netherlands
[2] Dataprovidercom, Helperpk 292, NL-9723 ZA Groningen, Netherlands
Keywords
unsupervised keyphrase extraction; sequence embeddings; web pages; WebEmbedRank
DOI
10.3390/mti3030058
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Keyphrase extraction is an important part of natural language processing (NLP) research, although little research has been done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often applied only to clean corpora such as abstracts and articles from academic journals or sets of scraped texts from a single domain. However, textual data from web pages differs from normal text documents, as it is structured using HTML elements and often consists of many small fragments. These elements are furthermore used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that they did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make use of structural information in web pages in a robust manner. We compared this novel method to other baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.
Pages: 12