Information Retrieval from Unstructured Web Text Document Based on Automatic Learning of the Threshold

Cited by: 0
Authors
Fkih, Fethi [1]
Omri, Mohamed Nazih [1]
Affiliations
[1] Univ Monastir, Fac Sci Monastir, MARS Res Unit, Monastir, Tunisia
Keywords
Binary Classification; Collocation Retrieval; Performance Evaluation; Precision; Recall; Receiver Operating Characteristic (ROC) Curves; Statistical Threshold; Youden Index
DOI
10.4018/ijirr.2012100102
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
A collocation is a sequence of lexical tokens that habitually co-occur. Collocations are widely used in applications such as information retrieval, document indexing, machine translation, and lexicography. Many techniques have therefore been developed for automatically retrieving collocations from textual documents. These techniques use statistical measures based on joint frequency counts to quantify the strength of association between the tokens of a candidate collocation. Relevant and irrelevant collocations are then separated using a threshold fixed a priori. This threshold is generally estimated manually by a domain expert, and this supervised estimation is an additional cost that reduces system performance. In this paper, the authors propose a new technique for automatically learning the threshold in order to retrieve information from web text documents. The technique relies mainly on standard performance evaluation measures (such as ROC and Precision-Recall curves). The results show that a statistical threshold can be estimated automatically, independently of the corpus being processed.
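The abstract does not spell out the learning procedure, but the keywords mention ROC curves and the Youden index. As a rough illustration only, and not the authors' method, the sketch below picks the score cut-off that maximizes Youden's J = TPR - FPR over a small labeled sample. The function name `youden_threshold` and the scores and labels are hypothetical values made up for this example.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A candidate is classified as a relevant collocation when its
    association score is >= threshold. Assumption: labels are 1 for
    relevant and 0 for irrelevant candidates.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one relevant and one irrelevant candidate")

    best_t, best_j = None, -1.0
    # Every distinct observed score is a candidate cut-off (one ROC point each).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        j = tp / positives - fp / negatives  # TPR - FPR
        if j > best_j:  # on ties, keep the lowest such threshold
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical association scores and manual relevance labels.
scores = [0.2, 1.1, 2.5, 3.0, 4.2, 5.7]
labels = [0, 0, 1, 0, 1, 1]
threshold, j = youden_threshold(scores, labels)
print(threshold, round(j, 3))
```

With real data, `scores` would come from whichever statistical association measure is in use, and `labels` from a small annotated sample.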
Pages: 12-30
Number of pages: 19