TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data

Daniel Kang, John Guibas, Peter D. Bailis, Tatsunori Hashimoto, Matei Zaharia

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, e.g., object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, and SUPG) train a query-specific proxy model to approximate large target labelers (i.e., these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index (trainable semantic index, TASTI) that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity across records in a given dataset. Specifically, it produces embeddings for each record such that records with close embeddings have similar target labeler outputs. TASTI then generates high-quality proxy scores via embeddings without needing to train a per-query proxy. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation and selection). We theoretically analyze TASTI and show that a low embedding training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on five video, text, and speech datasets, and three query types. We show that TASTI's indexes can be 10x less expensive to construct than generating annotations for current proxy-based methods, and accelerate queries by up to 24x.
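
The step from embeddings to proxy scores can be illustrated with a minimal sketch. It assumes that a small set of records has already been annotated by the target labeler and that scores reach the remaining records through inverse-distance-weighted k-nearest-neighbor propagation. That propagation rule, the function name `proxy_scores`, and its parameters are illustrative assumptions, not details stated in the abstract.

```python
import math


def proxy_scores(embeddings, rep_ids, rep_labels, k=2, eps=1e-9):
    """Propagate labels from annotated records to all records (hypothetical helper).

    embeddings: list of equal-length float vectors, one per record
    rep_ids:    indices of records already annotated by the target labeler
    rep_labels: numeric target-labeler output for each annotated record
    k:          number of nearest annotated records to average over
    eps:        small constant that avoids division by zero at distance 0
    """
    scores = []
    for vec in embeddings:
        # Distance from this record to every annotated record, keep the k closest.
        nearest = sorted(
            (math.dist(vec, embeddings[r]), label)
            for r, label in zip(rep_ids, rep_labels)
        )[:k]
        # Closer annotated records receive larger weights.
        weights = [1.0 / (d + eps) for d, _ in nearest]
        weighted_sum = sum(w * label for w, (_, label) in zip(weights, nearest))
        scores.append(weighted_sum / sum(weights))
    return scores


# Toy data: two tight groups of records; one record per group is annotated
# with a made-up numeric label (e.g., an object count).
embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
scores = proxy_scores(embeddings, rep_ids=[0, 2], rep_labels=[0.0, 3.0])
print(scores)
```

In this toy setting, each unannotated record receives a score close to the label of the annotated record it sits next to in embedding space, which is the property the abstract relies on: records with close embeddings have similar target labeler outputs.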

Original language: English (US)
Title of host publication: SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1934-1947
Number of pages: 14
ISBN (Electronic): 9781450392495
State: Published - Jun 10 2022
Externally published: Yes
Event: 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022 - Virtual, Online, United States
Duration: Jun 12 2022 - Jun 17 2022

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
Country/Territory: United States
City: Virtual, Online
Period: 6/12/22 - 6/17/22

Keywords

  • index
  • proxy-based algorithms
  • query processing with proxies
  • semantic index

ASJC Scopus subject areas

  • Software
  • Information Systems
