TY - GEN
T1 - Assessor differences and user preferences in tweet timeline generation
AU - Wang, Yulu
AU - Sherman, Garrick
AU - Lin, Jimmy
AU - Efron, Miles
PY - 2015/8/9
Y1 - 2015/8/9
N2 - In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, is the difference statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system's goal is to construct an informative summary of non-redundant tweets that addresses the user's information need. Central to the evaluation methodology are human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.
KW - Microblog search
KW - TREC evaluation
KW - User study
UR - http://www.scopus.com/inward/record.url?scp=84953729491&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84953729491&partnerID=8YFLogxK
U2 - 10.1145/2766462.2767699
DO - 10.1145/2766462.2767699
M3 - Conference contribution
AN - SCOPUS:84953729491
T3 - SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 615
EP - 624
BT - SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015
Y2 - 9 August 2015 through 13 August 2015
ER -