Accelerating Queries over Unstructured Data with ML

Research output: Contribution to conference › Paper › peer-review

Abstract

Queries over unstructured data (e.g., videos and text) are becoming increasingly common due to large data volumes and machine learning (ML). A common method of answering such queries is to use a deep neural network (DNN) or human labeler (which we collectively refer to as oracle methods) to extract structured information from this unstructured data. For example, an ecologist interested in understanding hummingbird behavior may extract all bird positions from a video. Unfortunately, these methods can be costly: labeling 100 days of video via human annotators can cost hundreds of thousands of dollars. Thus, to reduce the cost of executing queries, recent work has proposed using proxy models: cheap approximations to oracle methods. Proxy models have primarily been studied in the context of approximating binary predicates, in which the proxy model produces a score between 0 and 1 and records above some ad-hoc score threshold are assumed to satisfy the predicate [1, 3, 6]. However, this prior work on binary predicates leaves major concerns unaddressed: 1) existing query processing algorithms do not provide statistical guarantees on query results and 2) they cannot share work between queries efficiently. To address these issues, we have been developing indexing and query processing algorithms for unstructured data using ML in a system called MEME. We describe our recent developments and some applications below.
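
As a rough illustration of the binary-predicate setting described in the abstract, the following minimal Python sketch applies an ad-hoc threshold to proxy scores and then checks the selection against oracle labels. It is not the MEME implementation: the function names, scores, labels, and the threshold of 0.5 are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def select_by_proxy(scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices of records whose proxy score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def precision_recall(
    selected: Sequence[int], oracle_labels: Sequence[bool]
) -> Tuple[float, float]:
    """Compare a proxy-based selection against oracle labels."""
    selected_set = set(selected)
    true_pos = sum(1 for i in selected_set if oracle_labels[i])
    total_pos = sum(1 for y in oracle_labels if y)
    # By convention, report 1.0 when the denominator is empty.
    precision = true_pos / len(selected_set) if selected_set else 1.0
    recall = true_pos / total_pos if total_pos else 1.0
    return precision, recall


# Hypothetical proxy scores in [0, 1] and the oracle's answers
# for the same six records (illustrative values, not real data).
proxy_scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
oracle_labels = [True, True, False, True, False, False]

# Ad-hoc threshold: nothing here bounds the resulting error.
selected = select_by_proxy(proxy_scores, threshold=0.5)
precision, recall = precision_recall(selected, oracle_labels)
print(selected, precision, recall)
```

In this toy data the thresholded selection includes one record the oracle rejects and misses one it accepts, which is the kind of uncontrolled error that motivates the statistical guarantees mentioned in the abstract.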

Original language: English (US)
State: Published - 2021
Event: 11th Annual Conference on Innovative Data Systems Research, CIDR 2021 - Virtual, Online
Duration: Jan 11 2021 to Jan 15 2021

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Information Systems and Management
  • Hardware and Architecture
