TY - GEN
T1 - TASTI
T2 - 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD 2022
AU - Kang, Daniel
AU - Guibas, John
AU - Bailis, Peter D.
AU - Hashimoto, Tatsunori
AU - Zaharia, Matei
N1 - Funding Information:
This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook,
Publisher Copyright:
© 2022 ACM.
PY - 2022/6/10
Y1 - 2022/6/10
N2 - Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, e.g., object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, SUPG, etc.) train a query-specific proxy model to approximate large target labelers (i.e., these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index (trainable semantic index, TASTI) that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity across records in a given dataset. Specifically, it produces embeddings for each record such that records with close embeddings have similar target labeler outputs. TASTI then generates high-quality proxy scores via embeddings without needing to train a per-query proxy. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation, selection, etc.). We theoretically analyze TASTI and show that a low embedding training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on five video, text, and speech datasets, and three query types. We show that TASTI's indexes can be 10x less expensive to construct than generating annotations for current proxy-based methods, and accelerate queries by up to 24x.
AB - Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, e.g., object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, SUPG, etc.) train a query-specific proxy model to approximate large target labelers (i.e., these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index (trainable semantic index, TASTI) that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity across records in a given dataset. Specifically, it produces embeddings for each record such that records with close embeddings have similar target labeler outputs. TASTI then generates high-quality proxy scores via embeddings without needing to train a per-query proxy. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation, selection, etc.). We theoretically analyze TASTI and show that a low embedding training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on five video, text, and speech datasets, and three query types. We show that TASTI's indexes can be 10x less expensive to construct than generating annotations for current proxy-based methods, and accelerate queries by up to 24x.
KW - index
KW - proxy-based algorithms
KW - query processing with proxies
KW - semantic index
UR - http://www.scopus.com/inward/record.url?scp=85132787623&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132787623&partnerID=8YFLogxK
U2 - 10.1145/3514221.3517897
DO - 10.1145/3514221.3517897
M3 - Conference contribution
AN - SCOPUS:85132787623
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1934
EP - 1947
BT - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 12 June 2022 through 17 June 2022
ER -