TY - GEN
T1 - AIDB
T2 - 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024
AU - Jin, Tengjun
AU - Mittal, Akash
AU - Mo, Chenghao
AU - Fang, Jiahao
AU - Zhang, Chengsong
AU - Dai, Timothy
AU - Kang, Daniel
N1 - Publisher Copyright: © 2024 Owner/Author.
PY - 2024/6/9
Y1 - 2024/6/9
AB - Analysts and scientists are interested in automatically analyzing the semantic contents of unstructured, non-tabular data (videos, images, text, and audio). These analysts have turned to unstructured data systems leveraging machine learning (ML). The most common method of using ML in analytics systems is to call them as user-defined functions (UDFs). Unfortunately, UDFs can be difficult for query optimizers to reason over. Furthermore, they can be difficult to implement and unintuitive to application users. Instead of specifying ML models via UDFs, we propose specifying mappings between virtual columns in a structured table, where virtual rows are sparsely materialized via ML models. Querying sparsely materialized tables has unique challenges: even the cardinality of tables is unknown ahead of time, rendering a wide range of standard optimization techniques unusable. We propose novel optimizations for accelerating approximate and exact queries over sparsely materialized tables to address these challenges, providing up to 350x cheaper queries. We implement our techniques in AIDB and deploy them in four real-world datasets. Several of these datasets were constructed with collaborators including law professors studying court cases, showing AIDB's wide applicability.
UR - http://www.scopus.com/inward/record.url?scp=85196634912&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85196634912&partnerID=8YFLogxK
U2 - 10.1145/3650203.3663329
DO - 10.1145/3650203.3663329
M3 - Conference contribution
AN - SCOPUS:85196634912
T3 - Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - In conjunction with the 2024 ACM SIGMOD/PODS Conference
SP - 23
EP - 28
BT - Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - In conjunction with the 2024 ACM SIGMOD/PODS Conference
PB - Association for Computing Machinery
Y2 - 9 June 2024
ER -