TY - GEN
T1 - S/C
T2 - 39th IEEE International Conference on Data Engineering, ICDE 2023
AU - Li, Zhaoheng
AU - Pi, Xinyu
AU - Park, Yongjoo
N1 - This work is supported in part by Microsoft Azure.
PY - 2023
Y1 - 2023
N2 - With data pipeline tools and the expressiveness of SQL, managing interdependent materialized views (MVs) is becoming increasingly easy. These MVs are updated repeatedly upon new data ingestion (e.g., daily), from which database admins can observe performance metrics (e.g., refresh time of each MV, size on disk) in a consistent way for different types of updates (full vs. incremental) and for different systems (single node, distributed, cloud-hosted). One missed opportunity is that existing data systems treat those MV updates as independent SQL statements without fully exploiting their dependency information and performance metrics. However, if we know that the result of a SQL statement will be consumed immediately afterward by subsequent operations, those subsequent operations do not have to wait until the early results are fully materialized on storage because the results are already readily available in memory. Of course, this may come at a cost because keeping those results in memory (even temporarily) will reduce the amount of available memory; thus, our decision should be careful. In this paper, we introduce a new system, called S/C, which tackles this problem through efficient creation and update of a set of MVs with acyclic dependencies among them. S/C judiciously uses bounded memory to reduce the end-to-end MV refresh time by short-circuiting expensive reads and writes; S/C's objective function accurately estimates the time savings from keeping intermediate data in memory for particular periods. Our solution jointly optimizes an MV refresh order, what data to keep in memory, and when to release the data from memory. At a high level, S/C still materializes all data exactly as defined in MV definitions; thus, it does not impact any service-level agreements. In our experiments with TPC-DS datasets (up to 1TB), we show that S/C's optimization can speed up end-to-end runtime by 1.04×-5.08× with (only) 1.6GB memory.
AB - With data pipeline tools and the expressiveness of SQL, managing interdependent materialized views (MVs) is becoming increasingly easy. These MVs are updated repeatedly upon new data ingestion (e.g., daily), from which database admins can observe performance metrics (e.g., refresh time of each MV, size on disk) in a consistent way for different types of updates (full vs. incremental) and for different systems (single node, distributed, cloud-hosted). One missed opportunity is that existing data systems treat those MV updates as independent SQL statements without fully exploiting their dependency information and performance metrics. However, if we know that the result of a SQL statement will be consumed immediately afterward by subsequent operations, those subsequent operations do not have to wait until the early results are fully materialized on storage because the results are already readily available in memory. Of course, this may come at a cost because keeping those results in memory (even temporarily) will reduce the amount of available memory; thus, our decision should be careful. In this paper, we introduce a new system, called S/C, which tackles this problem through efficient creation and update of a set of MVs with acyclic dependencies among them. S/C judiciously uses bounded memory to reduce the end-to-end MV refresh time by short-circuiting expensive reads and writes; S/C's objective function accurately estimates the time savings from keeping intermediate data in memory for particular periods. Our solution jointly optimizes an MV refresh order, what data to keep in memory, and when to release the data from memory. At a high level, S/C still materializes all data exactly as defined in MV definitions; thus, it does not impact any service-level agreements. In our experiments with TPC-DS datasets (up to 1TB), we show that S/C's optimization can speed up end-to-end runtime by 1.04×-5.08× with (only) 1.6GB memory.
KW - Caching
KW - Materialized-View
KW - Scheduling
UR - http://www.scopus.com/inward/record.url?scp=85150330016&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85150330016&partnerID=8YFLogxK
U2 - 10.1109/ICDE55515.2023.00393
DO - 10.1109/ICDE55515.2023.00393
M3 - Conference contribution
AN - SCOPUS:85150330016
T3 - Proceedings - International Conference on Data Engineering
SP - 1981
EP - 1994
BT - Proceedings - 2023 IEEE 39th International Conference on Data Engineering, ICDE 2023
PB - IEEE Computer Society
Y2 - 3 April 2023 through 7 April 2023
ER -