TY - GEN
T1 - Automatic Reliability Testing for Cluster Management Controllers
AU - Sun, Xudong
AU - Luo, Wenqing
AU - Gu, Jiawei Tyler
AU - Ganesan, Aishwarya
AU - Alagappan, Ramnatthan
AU - Gasch, Michael
AU - Suresh, Lalith
AU - Xu, Tianyin
N1 - We thank the anonymous reviewers and our shepherd, Ran-jita Bhagwan, for their insightful comments. We thank Mar-cos Aguilera, Sujata Banerjee, Mihai Budiu, Jon Howell, Rob Johnson, Matthew Lentz, Darko Marinov, Davanum Srinivas, Chaitanya Bhandari, Yinfang Chen, Lilia Tang, and Shuai Wang for valuable feedback and discussions that helped shape this work. We thank all the controller developers who engaged with us and reviewed our reports and patches. This work was funded in part by NSF SHF-1816615, CNS-2130560, CNS-2145295, and a VMware Research Gift.
PY - 2022
Y1 - 2022
N2 - Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers - we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks. We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller's view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state's evolution with and without perturbations to detect safety and liveness issues. Sieve's design is powered by a fundamental opportunity in state-reconciliation systems - these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully-automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers with a low false-positive rate of 3.5%.
AB - Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers - we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks. We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller's view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state's evolution with and without perturbations to detect safety and liveness issues. Sieve's design is powered by a fundamental opportunity in state-reconciliation systems - these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully-automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers with a low false-positive rate of 3.5%.
UR - http://www.scopus.com/inward/record.url?scp=85141074754&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141074754&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85141074754
T3 - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
SP - 143
EP - 159
BT - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
PB - USENIX Association
T2 - 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Y2 - 11 July 2022 through 13 July 2022
ER -