Information Retrieval from Unstructured Web Text Document Based on Automatic Learning of the Threshold

Cited by: 0
Authors
Fkih, Fethi [1 ]
Omri, Mohamed Nazih [1 ]
Affiliations
[1] Univ Monastir, Fac Sci Monastir, MARS Res Unit, Monastir, Tunisia
Keywords
Binary Classification; Collocation Retrieval; Performance Evaluation; Precision; Recall; Receiver Operating Characteristic (ROC) Curves; Statistical Threshold; Youden Index;
DOI
10.4018/ijirr.2012100102
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
A collocation is defined as a sequence of lexical tokens that habitually co-occur. This type of information is widely used in applications such as Information Retrieval, document indexing, machine translation, and lexicography. Many techniques have therefore been developed for the automatic retrieval of collocations from textual documents. These techniques use statistical measures based on a joint frequency calculation to quantify the connection strength between the tokens of a candidate collocation. The discrimination between relevant and irrelevant collocations is performed using an a priori fixed threshold. Generally, the discrimination threshold is estimated manually by a domain expert. This supervised estimation is an additional cost that reduces system performance. In this paper, the authors propose a new technique for automatically learning the threshold in order to retrieve information from web text documents. This technique is mainly based on the usual performance evaluation measures (such as ROC and Precision-Recall curves). The results show that a statistical threshold can be estimated automatically, independently of the corpus being processed.
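The abstract does not spell out the learning procedure, but the keywords name the Youden index, a standard way to read a threshold off a ROC curve. The following is a minimal sketch of that general idea, not the authors' method: it assumes each candidate collocation already has an association score and a relevant/irrelevant label, and it picks the score that maximizes J = TPR - FPR. The function name `youden_threshold` and the sample scores and labels are hypothetical, made up for illustration.

```python
from typing import List, Tuple


def youden_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A candidate is predicted relevant when its score >= threshold.
    labels: 1 = relevant collocation, 0 = irrelevant.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one relevant and one irrelevant example")

    best_t, best_j = scores[0], -1.0
    # Each distinct score is one candidate threshold (one ROC operating point).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / positives - fp / negatives  # TPR - FPR
        if j > best_j:  # on ties, the lowest threshold is kept
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical association scores for seven candidate collocations.
scores = [0.2, 0.5, 1.1, 1.8, 2.4, 3.0, 3.5]
labels = [0, 0, 1, 0, 1, 1, 1]
threshold, j = youden_threshold(scores, labels)
print(threshold, j)  # 2.4 0.75
```

With these made-up values, a threshold of 2.4 keeps three of the four relevant candidates and rejects all three irrelevant ones. The exhaustive scan is quadratic in the number of candidates, which is acceptable for a sketch; a sorted single pass would be the usual choice for a large corpus.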
Pages: 12-30
Page count: 19
Related papers
50 in total
  • [41] A Semantic and Feature Aggregated Information Retrieval Technique for Efficient Geospatial Text Document Retrieval
    Uma, R.
    Muneeswaran, K.
    [J]. JOURNAL OF MULTIPLE-VALUED LOGIC AND SOFT COMPUTING, 2017, 28 (06) : 547 - 569
  • [42] Text Retrieval Based on Syntactic Information
    Yongwei Z.
    Ting L.
    Chang L.
    Bingxin W.
    Jingsong Y.
    [J]. Data Analysis and Knowledge Discovery, 2022, 6 (11) : 25 - 37
  • [43] AIWIRT: An adaptive and intelligent Web information retrieval tool for web-based learning
    Li, LZ
    Liu, YH
    Zhang, X
    Zhang, W
    [J]. APPLICATIONS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN EDUCATION AND TRAINING, 2004, : 278 - 283
  • [44] Voice-based Information Retrieval - how far are we from the text-based information retrieval ?
    Lee, Lin-shan
    Pan, Yi-cheng
    [J]. 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), 2009, : 26 - 43
  • [45] Ontology-based Unstructured Information Organization and Retrieval
    Zhang, Peiyun
    Xie, Rongjian
    [J]. 2009 WRI WORLD CONGRESS ON SOFTWARE ENGINEERING, VOL 1, PROCEEDINGS, 2009, : 408 - +
  • [46] Image-based document vectors for text retrieval
    Yu, ZH
    Tan, CL
    [J]. 15TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 4, PROCEEDINGS: APPLICATIONS, ROBOTICS SYSTEMS AND ARCHITECTURES, 2000, : 393 - 396
  • [47] Towards automatic multilevel indexing for Thai text information retrieval
    Kawtrakul, A
    Thumkanon, C
    McFetridge, P
    [J]. APCCAS '98 - IEEE ASIA-PACIFIC CONFERENCE ON CIRCUITS AND SYSTEMS: MICROELECTRONICS AND INTEGRATING SYSTEMS, 1998, : 551 - 554
  • [48] The problem of automatic understanding of full text documents in information retrieval
    Zabezhailo, MI
    [J]. JOURNAL OF COMPUTER AND SYSTEMS SCIENCES INTERNATIONAL, 1998, 37 (05) : 822 - 830
  • [49] Towards automatic multilevel indexing for Thai text information retrieval
    Kasetsart Univ, Bangkok, Thailand
    [J]. IEEE Asia Pac Conf Circuits Syst Proc, (551-554):
  • [50] Event Information Retrieval from Text
    Sankepally, Rashmi
    [J]. PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 1447 - 1447