A common practice in comparative evaluation of information retrieval (IR) systems is to create a test collection comprising a set of topics (queries), a document corpus, and relevance judgments, and to monitor the performance of retrieval systems over such a collection. A typical evaluation of a system involves computing a performance metric, e.g., Average Precision (AP), for each topic and then using the average of these metrics, e.g., Mean Average Precision (MAP), to express overall system performance. However, averages do not capture all the important aspects of system performance and, used alone, may not thoroughly express system effectiveness. For example, the average can mask large variations in individual topic effectiveness. The author's hypothesis is that, in addition to average performance, attention needs to be paid to how system performance varies across topics. We refer to this performance variation as volatility. The main purpose of the thesis is to introduce the concept of performance volatility and apply it to information retrieval.

There are several ways in which volatility might be defined. One obvious definition is the standard deviation (SD) of the per-topic AP values from their MAP. An alternative definition might be to compute the expected performance using a subset of queries, and then measure the deviation of held-out queries from this prediction. Another definition could be based on the interquartile range [1]. Our initial investigation has used SD as a measure of volatility. Using SD has the benefit that it is a well-understood and well-studied quantity. However, our preliminary experiments, which calculated a straightforward SD of per-topic performance scores, highlighted a problem. Typically, scores are bounded in [0,1]. As a result, we observed that systems with low MAP exhibited lower volatility. This bias can be eliminated by applying a score standardization [3] or a logit transformation to the AP values, in which case the range of values becomes (-infinity, +infinity).

One application of volatility is in the evaluation of system effectiveness. Following standard practice in experimental analysis, it is beneficial to consider both the mean and the volatility of performance (e.g., AP) across topics. Of course, variance is routinely used within IR to assess the statistical significance of measurements. However, two systems can have statistically equivalent mean performance yet exhibit quite different variances. In such a situation, we may prefer the system with lower or higher volatility, depending on the circumstances. For example, we can set a minimum acceptable level of average performance, say a MAP threshold. If a system's MAP falls below this threshold, we prefer the more volatile system, hoping to obtain satisfactory AP scores for at least some of the topics; if its MAP exceeds the threshold, we prefer the less volatile system. Such a strategy is consistent with the TREC Robust track [2], where the main goal was to improve the consistency of system evaluation by considering the impact of good and poorly performing topics equally.
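To make the measure concrete, the following Python sketch computes volatility as the SD of per-topic AP scores, optionally after a logit transformation to remove the bias that compresses the SD of low-MAP systems. The helper names and the two example systems are illustrative assumptions, not results from the thesis.

import numpy as np

def logit(p, eps=1e-4):
    # Map AP scores from (0, 1) onto the real line; clipping guards against
    # per-topic AP values of exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def volatility(ap_scores, transform=True):
    # Volatility of a system: standard deviation of its per-topic AP scores,
    # computed after a logit transformation unless transform=False.
    scores = logit(ap_scores) if transform else np.asarray(ap_scores, dtype=float)
    return float(scores.std(ddof=1))

# Two hypothetical systems with identical MAP but different consistency across topics.
system_a = [0.30, 0.32, 0.29, 0.31, 0.33]   # steady performer
system_b = [0.05, 0.62, 0.10, 0.55, 0.23]   # erratic performer

for name, ap in (("A", system_a), ("B", system_b)):
    print(name, "MAP:", round(float(np.mean(ap)), 3),
          "volatility:", round(volatility(ap), 3))

Under the thresholding strategy described above, system B would be preferred when the MAP of 0.31 falls below the acceptable threshold, and system A otherwise.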
Another application of volatility may be in performance prediction. Here, performance volatility is due to several factors. Performance prediction involves a measurement step followed by a prediction step. During the measurement step, we are given a collection, a set of queries, and corresponding results together with relevance judgments. During the prediction step, we can consider three different scenarios. The first scenario predicts system performance on a different topic set (queries) but the same document collection as used during the measurement step. The second scenario predicts performance on a different document collection but the same topic set. The third scenario predicts system performance for both a different topic set and a different document collection. Volatility may be useful in judging the quality of these predictions.
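As an illustration of the first scenario, the sketch below estimates a system's expected performance (mean AP) from a measured subset of topics and reports how far the held-out topics deviate from that estimate, averaged over random topic splits. The split sizes and the synthetic AP scores are assumptions made purely for illustration; the thesis does not prescribe this protocol.

import numpy as np

rng = np.random.default_rng(0)

def prediction_error(ap_scores, n_measured=30, n_trials=1000):
    # Scenario 1: estimate expected performance from a measured subset of topics
    # on the same collection, then report the average absolute deviation of the
    # held-out topics from that estimate.
    ap = np.asarray(ap_scores, dtype=float)
    errors = []
    for _ in range(n_trials):
        order = rng.permutation(len(ap))
        measured, held_out = ap[order[:n_measured]], ap[order[n_measured:]]
        errors.append(abs(held_out.mean() - measured.mean()))
    return float(np.mean(errors))

# Hypothetical per-topic AP scores for a single system over 50 topics.
ap_scores = rng.beta(2.0, 5.0, size=50)
print("average prediction error:", round(prediction_error(ap_scores), 4))

Intuitively, a highly volatile system will tend to show larger deviations under such splits, which is one way volatility could inform the quality of these predictions.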