AIDB: a Sparsely Materialized Database for Queries using Machine Learning

Tengjun Jin, Akash Mittal, Chenghao Mo, Jiahao Fang, Chengsong Zhang, Timothy Dai, Daniel Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Analysts and scientists are interested in automatically analyzing the semantic contents of unstructured, non-tabular data (videos, images, text, and audio). These analysts have turned to unstructured data systems that leverage machine learning (ML). The most common way to use ML in analytics systems is to call models as user-defined functions (UDFs). Unfortunately, UDFs can be difficult for query optimizers to reason over. Furthermore, they can be difficult to implement and unintuitive to application users. Instead of specifying ML models via UDFs, we propose specifying mappings between virtual columns in a structured table, where virtual rows are sparsely materialized via ML models. Querying sparsely materialized tables has unique challenges: even the cardinality of tables is unknown ahead of time, rendering a wide range of standard optimization techniques unusable. To address these challenges, we propose novel optimizations for accelerating approximate and exact queries over sparsely materialized tables, providing up to 350x cheaper queries. We implement our techniques in AIDB and deploy them on four real-world datasets. Several of these datasets were constructed with collaborators, including law professors studying court cases, showing AIDB's wide applicability.
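The idea of a sparsely materialized table can be illustrated with a minimal sketch. This is not AIDB's implementation or API: `SparseTable`, `toy_detector`, and the sampling-based estimate are hypothetical names and simplifications chosen for illustration. The sketch assumes that a model maps each input blob (for example, a video frame) to zero or more output rows, so the table's cardinality is only known after the model has run, and that an approximate count can be obtained by materializing a uniform sample of blobs.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str]  # (blob_id, label): one output row of the virtual table


class SparseTable:
    """Hypothetical virtual table whose rows are produced on demand and cached."""

    def __init__(self, blob_ids: List[int], model: Callable[[int], List[str]]):
        self.blob_ids = blob_ids
        self.model = model
        self.cache: Dict[int, List[Row]] = {}  # materialized rows, keyed by blob
        self.model_calls = 0

    def rows_for(self, blob_id: int) -> List[Row]:
        # Run the model for a blob only the first time its rows are needed.
        if blob_id not in self.cache:
            self.model_calls += 1
            self.cache[blob_id] = [(blob_id, lab) for lab in self.model(blob_id)]
        return self.cache[blob_id]

    def exact_count(self, label: str) -> int:
        # Exact answer: every blob has to be materialized.
        return sum(
            1 for b in self.blob_ids for _, lab in self.rows_for(b) if lab == label
        )

    def approx_count(self, label: str, sample_size: int, seed: int = 0) -> float:
        # Approximate answer: materialize a uniform sample of blobs and scale up.
        rng = random.Random(seed)
        sample = rng.sample(self.blob_ids, sample_size)
        hits = sum(1 for b in sample for _, lab in self.rows_for(b) if lab == label)
        return hits * len(self.blob_ids) / sample_size


def toy_detector(frame_id: int) -> List[str]:
    # Stand-in for an ML model (an assumption): a frame yields 0, 1, or 2 rows.
    return ["car"] * (frame_id % 3)


table = SparseTable(list(range(300)), toy_detector)
estimate = table.approx_count("car", sample_size=30)
calls_after_estimate = table.model_calls  # only the sampled frames were processed
exact = table.exact_count("car")          # forces the remaining frames through the model
```

In this toy setting the approximate query invokes the model on 30 frames while the exact query needs all 300, which is the kind of cost gap the abstract's optimizations target. The rows cached by the approximate query are reused by the exact one.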

Original language: English (US)
Title of host publication: Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - In conjunction with the 2024 ACM SIGMOD/PODS Conference
Publisher: Association for Computing Machinery
Pages: 23-28
Number of pages: 6
ISBN (Electronic): 9798400706110
State: Published - Jun 9 2024
Externally published: Yes
Event: 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - Santiago, Chile
Duration: Jun 9 2024 → …

Publication series

Name: Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - In conjunction with the 2024 ACM SIGMOD/PODS Conference

Conference

Conference: 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024
Country/Territory: Chile
City: Santiago
Period: 6/9/24 → …

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Hardware and Architecture
  • Sociology and Political Science
