Automatic extraction of useful facet hierarchies from text databases

Cited by: 41
Authors
Dakka, Wisam [1]
Ipeirotis, Panagiotis G. [2]
Affiliations
[1] Columbia Univ, Dept Comp Sci, 1214 Amsterdam Ave, New York, NY 10027 USA
[2] NYU, Dept Informat Operat & Management Sci, New York, NY 10012 USA
DOI
10.1109/ICDE.2008.4497455
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Faceted interfaces represent a powerful new paradigm that has proved to be a successful complement to keyword searching. Thus far, the identification of the facets has either been a manual procedure or relied on a priori knowledge of the facets that can potentially appear in the underlying collection. In this paper, we present an unsupervised technique for automatic extraction of facets useful for browsing text databases. In particular, we observe, through a pilot study, that facet terms rarely appear in text documents, showing that we need external resources to identify useful facet terms. For this, we first identify important phrases in each document. Then, we expand each phrase with "context" phrases using external resources, such as WordNet and Wikipedia, causing facet terms to appear in the expanded database. Finally, we compare the term distributions in the original database and the expanded database to identify the terms that can be used to construct browsing facets. Our extensive user studies, using the Amazon Mechanical Turk service, show that our techniques produce facets with high precision and recall that are superior to existing approaches and help users locate interesting items faster.
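The three steps named in the abstract (identify important phrases, expand them with context phrases, compare term distributions) can be illustrated with a toy sketch. This is not the authors' implementation: the `CONTEXT` dictionary is a hypothetical stand-in for an external resource such as WordNet or Wikipedia, phrase identification is reduced to a dictionary lookup, and the "distribution comparison" is simplified to a difference in document frequency. All function names are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for an external resource (e.g. WordNet, Wikipedia):
# maps an important phrase to broader "context" phrases.
CONTEXT = {
    "jacques chirac": ["politician", "france"],
    "angela merkel": ["politician", "germany"],
    "paris": ["city", "france"],
}

def important_phrases(doc):
    """Toy phrase identification: keep phrases that have a CONTEXT entry."""
    text = doc.lower()
    return [p for p in CONTEXT if p in text]

def expand(doc):
    """Return the document's phrases plus their context phrases."""
    terms = set(important_phrases(doc))
    for phrase in list(terms):
        terms.update(CONTEXT[phrase])
    return terms

def document_frequency(term_sets):
    """Count in how many documents each term occurs."""
    df = Counter()
    for terms in term_sets:
        df.update(terms)
    return df

def candidate_facet_terms(docs):
    """Rank terms by how much their document frequency grows after expansion."""
    original = document_frequency(set(important_phrases(d)) for d in docs)
    expanded = document_frequency(expand(d) for d in docs)
    shift = {t: expanded[t] - original[t] for t in expanded}
    # Keep only terms that became more frequent; break ties alphabetically.
    return sorted((t for t in shift if shift[t] > 0), key=lambda t: (-shift[t], t))

docs = [
    "Jacques Chirac met Angela Merkel in Paris.",
    "Angela Merkel gave a speech.",
    "Paris hosted the summit.",
]
print(candidate_facet_terms(docs))  # ['city', 'france', 'germany', 'politician']
```

In this toy data, none of the surfaced terms occurs in the original documents, which mirrors the abstract's observation that facet terms rarely appear in the text itself and only emerge after expansion.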
Pages: 466 / +
Page count: 2
Related papers
50 in total
  • [21] Automatic extraction of acronym-meaning pairs from MEDLINE databases
    Pustejovsky, J
    Castaño, J
    Cochran, B
    Kotecki, M
    Morrell, M
    MEDINFO 2001: PROCEEDINGS OF THE 10TH WORLD CONGRESS ON MEDICAL INFORMATICS, PTS 1 AND 2, 2001, 84 : 371 - 375
  • [22] A text mining approach on automatic generation of web directories and hierarchies
    Yang, HC
    Lee, CH
    EXPERT SYSTEMS WITH APPLICATIONS, 2004, 27 (04) : 645 - 663
  • [23] A text mining approach on automatic generation of web directories and hierarchies
    Yang, HC
    Lee, CH
    IEEE/WIC INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, PROCEEDINGS, 2003, : 625 - 628
  • [24] Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus
    Khairova, Nina
    Petrasova, Svitlana
    Lewoniewski, Wlodzimierz
    Mamyrbayev, Orken
    Mukhsina, Kuralai
    PROCEEDINGS OF THE 2018 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), 2018, : 485 - 488
  • [25] Automatic problem extraction and analysis from unstructured text in IT tickets
    Agarwal, S.
    Aggarwal, V.
    Akula, A. R.
    Dasgupta, G. B.
    Sridhara, G.
    IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 2017, 61 (01) : 41 - 52
  • [26] Automatic Definition Extraction and Crosswords Generation From News Text
    Esteche, Jennifer
    Romero, Romina
    Chiruzzo, Luis
    Rosa, Aiala
    PROCEEDINGS OF THE 2016 XLII LATIN AMERICAN COMPUTING CONFERENCE (CLEI), 2016,
  • [27] An Approach of Automatic Extraction of Domain Keywords from the Kazakh Text
    Alimzhanov, Yermek
    Mansurova, Madina
    COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2016, PT II, 2016, 9876 : 555 - 562
  • [28] Automatic extraction of persistent topics from social text streams
    Yongwook Shin
    Chuhyeop Ryo
    Jonghun Park
    World Wide Web, 2014, 17 : 1395 - 1420
  • [29] Automatic Open Domain Information Extraction from Indonesian Text
    Gultom, Yohanes
    Wibowo, Wahyu Catur
    2017 INTERNATIONAL WORKSHOP ON BIG DATA AND INFORMATION SECURITY (IWBIS 2017), 2017, : 23 - 30
  • [30] Automatic extraction of persistent topics from social text streams
    Shin, Yongwook
    Ryo, Chuhyeop
    Park, Jonghun
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2014, 17 (06) : 1395 - 1420