Building evaluation datasets for information retrieval is a time-consuming and exhausting activity. To evaluate research over novel corpora, researchers are increasingly turning to crowdsourcing to distribute the creation of evaluation datasets efficiently among many workers. However, there has been little investigation into the effect of instrument design on data quality in crowdsourced evaluation datasets. We pursue this question through a case study of music similarity judgments in a music digital library evaluation, where we find that, even with trusted graders, song pairs are not rated consistently. We find that much of this low intra-coder consistency can be attributed to task design and judge effects, and we conclude with recommendations for achieving reliable evaluation judgments for music similarity and other normative judgment tasks.