TY - GEN
T1 - Airphant: Cloud-oriented Document Indexing
T2 - 38th IEEE International Conference on Data Engineering, ICDE 2022
AU - Chockchowwat, Supawit
AU - Sood, Chaitanya
AU - Park, Yongjoo
N1 - Acknowledgement: This work is supported in part by Microsoft Azure and Google Cloud Platform.
PY - 2022
Y1 - 2022
N2 - Modern data warehouses can scale compute nodes independently of storage. These systems persist their data on cloud storage, which is always available and cost-efficient. Ad-hoc compute nodes then fetch necessary data on demand from cloud storage. This ability to quickly scale or shrink data systems is highly beneficial when query workloads change over time. We apply this new architecture to search engines with a focus on optimizing their latencies in cloud environments. However, simply placing existing search engines (e.g., Apache Lucene) on top of cloud storage significantly increases their end-to-end query latencies (more than 6 seconds on average in one of our studies). This is because their indexes can incur multiple network round-trips due to their hierarchical structure (e.g., skip lists, B-trees, learned indexes). To address this issue, we develop a new statistical index, called IoU Sketch. For lookup, IoU Sketch makes multiple asynchronous network requests in parallel. While IoU Sketch may fetch more bytes than existing indexes, it significantly reduces the index lookup time because parallel requests do not block each other. Based on IoU Sketch, we built an end-to-end search engine called Airphant; we describe how Airphant builds, optimizes, and manages IoU Sketch, and ultimately supports keyword-based querying. In our experiments with four real datasets, Airphant's average end-to-end latencies are between 13 milliseconds and 300 milliseconds, up to 8.97× faster than Apache Lucene and 113.39× faster than Elasticsearch.
AB - Modern data warehouses can scale compute nodes independently of storage. These systems persist their data on cloud storage, which is always available and cost-efficient. Ad-hoc compute nodes then fetch necessary data on demand from cloud storage. This ability to quickly scale or shrink data systems is highly beneficial when query workloads change over time. We apply this new architecture to search engines with a focus on optimizing their latencies in cloud environments. However, simply placing existing search engines (e.g., Apache Lucene) on top of cloud storage significantly increases their end-to-end query latencies (more than 6 seconds on average in one of our studies). This is because their indexes can incur multiple network round-trips due to their hierarchical structure (e.g., skip lists, B-trees, learned indexes). To address this issue, we develop a new statistical index, called IoU Sketch. For lookup, IoU Sketch makes multiple asynchronous network requests in parallel. While IoU Sketch may fetch more bytes than existing indexes, it significantly reduces the index lookup time because parallel requests do not block each other. Based on IoU Sketch, we built an end-to-end search engine called Airphant; we describe how Airphant builds, optimizes, and manages IoU Sketch, and ultimately supports keyword-based querying. In our experiments with four real datasets, Airphant's average end-to-end latencies are between 13 milliseconds and 300 milliseconds, up to 8.97× faster than Apache Lucene and 113.39× faster than Elasticsearch.
KW - Airphant
KW - Cloud storage
KW - Database as a service
KW - Indexing
KW - Information retrieval
KW - Inverted Index
KW - IoU Sketch
KW - Multi-layer Hash Table
KW - Physical database design
KW - Separation of compute and storage
KW - Sketch data structure
UR - http://www.scopus.com/inward/record.url?scp=85136376381&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85136376381&partnerID=8YFLogxK
U2 - 10.1109/ICDE53745.2022.00107
DO - 10.1109/ICDE53745.2022.00107
M3 - Conference contribution
AN - SCOPUS:85136376381
T3 - Proceedings - International Conference on Data Engineering
SP - 1368
EP - 1381
BT - Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
PB - IEEE Computer Society
Y2 - 9 May 2022 through 12 May 2022
ER -