Abstract
Queries over unstructured data (e.g., videos and text) are becoming increasingly common due to large data volumes and machine learning (ML). A common method of answering such queries is to use a deep neural network (DNN) or a human labeler (which we collectively refer to as oracle methods) to extract structured information from the unstructured data. For example, an ecologist interested in understanding hummingbird behavior may extract all bird positions from a video. Unfortunately, oracle methods can be costly: labeling 100 days of video via human annotators can cost hundreds of thousands of dollars. Thus, to reduce the cost of executing queries, recent work has proposed using proxy models: cheap approximations to oracle methods. Proxy models have primarily been studied in the context of approximating binary predicates, in which the proxy model produces a score between 0 and 1 and records above some ad-hoc score threshold are assumed to satisfy the predicate [1, 3, 6]. However, this prior work on binary predicates leaves major concerns unaddressed: 1) existing query processing algorithms do not provide statistical guarantees on query results and 2) they cannot share work between queries efficiently. To address these issues, we have been developing indexing and query processing algorithms for unstructured data using ML in a system called MEME. We describe our recent developments and some applications below.
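As an illustration of the ad-hoc thresholding that the abstract attributes to prior work, the following is a minimal sketch and not code from MEME. The function names, the proxy scores, and the oracle labels are all hypothetical toy values chosen for the example. It selects records whose proxy score exceeds a fixed threshold and then compares the selection against oracle labels, which shows that a hand-picked threshold by itself says nothing about the precision or recall of the result.

```python
from typing import List, Sequence, Tuple


def select_by_proxy(scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices of records whose proxy score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def precision_recall(
    selected: Sequence[int], oracle_labels: Sequence[bool]
) -> Tuple[float, float]:
    """Compare a selection against oracle labels (assumed to be ground truth)."""
    chosen = set(selected)
    true_pos = sum(1 for i in chosen if oracle_labels[i])
    total_pos = sum(1 for label in oracle_labels if label)
    precision = true_pos / len(chosen) if chosen else 1.0
    recall = true_pos / total_pos if total_pos else 1.0
    return precision, recall


# Hypothetical toy data: proxy scores in [0, 1] and the matching oracle labels.
proxy_scores = [0.95, 0.80, 0.62, 0.40, 0.15, 0.05]
oracle_labels = [True, True, False, True, False, False]

# An ad-hoc threshold of 0.5 admits one false positive (index 2)
# and misses one true positive (index 3).
selected = select_by_proxy(proxy_scores, threshold=0.5)
precision, recall = precision_recall(selected, oracle_labels)
```

In practice the oracle labels are the expensive part, so they would not be available for every record; the sketch uses them only to make the missing guarantee visible.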
Original language | English (US) |
---|---|
State | Published - 2021 |
Event | 11th Annual Conference on Innovative Data Systems Research, CIDR 2021 - Virtual, Online |
Duration | Jan 11 2021 → Jan 15 2021 |
Conference
Conference | 11th Annual Conference on Innovative Data Systems Research, CIDR 2021 |
---|---|
City | Virtual, Online |
Period | 1/11/21 → 1/15/21 |
ASJC Scopus subject areas
- Artificial Intelligence
- Information Systems
- Information Systems and Management
- Hardware and Architecture