How significant is statistically significant? The case of audio music similarity and retrieval

Julián Urbano, Brian McFee, J Stephen Downie, Markus Schedl

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The principal goal of the annual Music Information Retrieval Evaluation eXchange (MIREX) experiments is to determine which systems perform well and which systems perform poorly on a range of MIR tasks. However, there has been no systematic analysis regarding how well these evaluation results translate into real-world user satisfaction. For most researchers, reaching statistical significance in the evaluation results is usually the most important goal, but in this paper we show that indicators of statistical significance (i.e., small p-value) are eventually of secondary importance. Researchers who want to predict the real-world implications of formal evaluations should properly report upon practical significance (i.e., large effect-size). Using data from the 18 systems submitted to the MIREX 2011 Audio Music Similarity and Retrieval task, we ran an experiment with 100 real-world users that allows us to explicitly map system performance onto user satisfaction. Based upon 2,200 judgments, the results show that absolute system performance needs to be quite large for users to be satisfied, and differences between systems have to be very large for users to actually prefer the supposedly better system. The results also suggest a practical upper bound of 80% on user satisfaction with the current definition of the task. Reflecting upon these findings, we make some recommendations for future evaluation experiments and the reporting and interpretation of results in peer-reviewing.

Original language: English (US)
Title of host publication: Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012
Pages: 181-186
Number of pages: 6
State: Published - Dec 1 2012
Event: 13th International Society for Music Information Retrieval Conference, ISMIR 2012 - Porto, Portugal
Duration: Oct 8 2012 → Oct 12 2012

Publication series

Name: Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012

Other

Other: 13th International Society for Music Information Retrieval Conference, ISMIR 2012
Country: Portugal
City: Porto
Period: 10/8/12 → 10/12/12

ASJC Scopus subject areas

  • Music
  • Information Systems

  • Cite this

    Urbano, J., McFee, B., Downie, J. S., & Schedl, M. (2012). How significant is statistically significant? The case of audio music similarity and retrieval. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012 (pp. 181-186). (Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012).