Kandula (Microsoft Research): Approximate Answers for Complex Parallel Queries
Despite decades of research, approximations are not widely used in data analytics platforms. To understand why, consider that an ideal approximate analytics system has to meet at least four goals: cover a large class of queries, offer much better latency and/or throughput, have small overhead, and offer accuracy guarantees. Whether such a system exists remains an open question. In this talk, I will describe alternative approaches that (a) introduce samplers as native SQL operators, including samplers that can sample before a join and a group-by, and (b) extend a cost-based query optimizer to improve the performance of plans with samplers without changing their accuracy. These techniques are used within Microsoft and are publicly available in the Azure Data Lake Analytics platform.