Counting with the crowd

Adam Marcus, David Karger, Samuel Madden, Robert Miller, Sewoong Oh

Research output: Contribution to journal, Conference article, peer-review

Abstract

In this paper, we address the problem of selectivity estimation in a crowdsourced database. Specifically, we develop several techniques for using workers on a crowdsourcing platform like Amazon's Mechanical Turk to estimate the fraction of items in a dataset (e.g., a collection of photos) that satisfy some property or predicate (e.g., photos of trees). We do this without explicitly iterating through every item in the dataset. This is important in crowdsourced query optimization to support predicate ordering and in query evaluation, when performing a GROUP BY operation with a COUNT or AVG aggregate. We compare sampling item labels, a traditional approach, to showing workers a collection of items and asking them to estimate how many satisfy some predicate. Additionally, we develop techniques to eliminate spammers and colluding attackers trying to skew selectivity estimates when using this count estimation approach. We find that for images, counting can be much more effective than sampled labeling, reducing the amount of work necessary to arrive at an estimate that is within 1% of the true fraction by up to an order of magnitude, with lower worker latency. We also find that sampled labeling outperforms count estimation on a text processing task, presumably because people are better at quickly processing large batches of images than they are at reading strings of text. Our spammer detection technique, which is applicable to both the label- and count-based approaches, can improve accuracy by up to two orders of magnitude.
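The two estimation strategies compared in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the batch size, the uniform worker-noise model, and the simulated dataset are all assumptions made for illustration, and the spammer-detection technique is not modeled because the abstract does not describe it.

```python
import random


def estimate_by_labels(labels):
    """Sampled labeling: fraction of sampled items that workers labeled True."""
    return sum(labels) / len(labels)


def estimate_by_counts(counts, batch_size):
    """Count estimation: average the per-batch counts reported by workers,
    each normalized by the number of items shown in the batch."""
    return sum(c / batch_size for c in counts) / len(counts)


def simulate(true_fraction=0.3, n_items=10_000, batch_size=100,
             n_batches=10, seed=0):
    """Hypothetical simulation; the noise model is an assumption."""
    rng = random.Random(seed)
    # Synthetic dataset: True means the item satisfies the predicate.
    items = [rng.random() < true_fraction for _ in range(n_items)]
    actual = sum(items) / n_items

    # Sampled labeling: one label per sampled item (assumed error-free here).
    label_est = estimate_by_labels(rng.sample(items, batch_size))

    # Count estimation: each worker sees a batch and reports an eyeballed
    # count, modeled as the true batch count plus bounded uniform noise.
    counts = []
    for _ in range(n_batches):
        batch = rng.sample(items, batch_size)
        noisy = sum(batch) + rng.randint(-5, 5)
        counts.append(min(batch_size, max(0, noisy)))
    count_est = estimate_by_counts(counts, batch_size)

    return actual, label_est, count_est


if __name__ == "__main__":
    actual, label_est, count_est = simulate()
    print(f"actual={actual:.3f} labels={label_est:.3f} counts={count_est:.3f}")
```

In this sketch a single count response covers a whole batch of items, whereas a label response covers one item. That difference in items covered per unit of worker effort is the intuition behind the reduction in work the abstract reports for images; the sketch does not reproduce the paper's measurements.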

Original language: English (US)
Pages (from-to): 109-120
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 6
Issue number: 2
State: Published - Dec 2012
Externally published: Yes
Event: 39th International Conference on Very Large Data Bases, VLDB 2013 - Trento, Italy
Duration: Aug 26 2013 to Aug 30 2013

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science
