TY - GEN
T1 - Multi-Grained Specifications for Distributed System Model Checking and Verification
AU - Ouyang, Lingzhi
AU - Sun, Xudong
AU - Tang, Ruize
AU - Huang, Yu
AU - Jivrajani, Madhav
AU - Ma, Xiaoxing
AU - Xu, Tianyin
N1 - We thank the anonymous reviewers and our shepherd, Serdar Tasiran, for their insightful comments. Huang’s group is supported by the National Natural Science Foundation of China (62025202, 62372222), the CCF-Huawei Populus Grove Fund (CCF-HuaweiFM202304), the Cooperation Fund of Huawei-NJU Next Generation Programming Innovation Lab (YBN2019105178SW38), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX24_0235). Xu’s group is supported in part by NSF CNS-2130560, CNS-2145295, and a VMware Research Gift.
PY - 2025/3/30
Y1 - 2025/3/30
N2 - This paper presents our experience specifying and verifying the correctness of ZooKeeper, a complex and evolving distributed coordination system. We use TLA+ to model fine-grained behaviors of ZooKeeper and use the TLC model checker to verify its correctness properties; we also check conformance between the model and code. The fundamental challenge is to balance the granularity of specifications and the scalability of model checking: fine-grained specifications lead to state-space explosion, while coarse-grained specifications introduce model-code gaps. To address this challenge, we write specifications with different granularities for composable modules, and compose them into mixed-grained specifications based on specific scenarios. For example, to verify code changes, we compose fine-grained specifications of changed modules and coarse-grained specifications that abstract away details of unchanged code with preserved interactions. We show that writing multi-grained specifications is a viable practice and can cope with model-code gaps without untenable state space, especially for evolving software where changes are typically local and incremental. We detected six severe bugs that violate five types of invariants and verified their code fixes; the fixes have been merged to ZooKeeper. We also improve the protocol design to make it easy to implement correctly.
AB - This paper presents our experience specifying and verifying the correctness of ZooKeeper, a complex and evolving distributed coordination system. We use TLA+ to model fine-grained behaviors of ZooKeeper and use the TLC model checker to verify its correctness properties; we also check conformance between the model and code. The fundamental challenge is to balance the granularity of specifications and the scalability of model checking: fine-grained specifications lead to state-space explosion, while coarse-grained specifications introduce model-code gaps. To address this challenge, we write specifications with different granularities for composable modules, and compose them into mixed-grained specifications based on specific scenarios. For example, to verify code changes, we compose fine-grained specifications of changed modules and coarse-grained specifications that abstract away details of unchanged code with preserved interactions. We show that writing multi-grained specifications is a viable practice and can cope with model-code gaps without untenable state space, especially for evolving software where changes are typically local and incremental. We detected six severe bugs that violate five types of invariants and verified their code fixes; the fixes have been merged to ZooKeeper. We also improve the protocol design to make it easy to implement correctly.
KW - Distributed systems
KW - model checking
KW - reliability
UR - http://www.scopus.com/inward/record.url?scp=105002232781&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105002232781&partnerID=8YFLogxK
U2 - 10.1145/3689031.3696069
DO - 10.1145/3689031.3696069
M3 - Conference contribution
AN - SCOPUS:105002232781
T3 - EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems
SP - 379
EP - 395
BT - EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems
PB - Association for Computing Machinery
T2 - 20th European Conference on Computer Systems, EuroSys 2025, co-located with the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
Y2 - 30 March 2025 through 3 April 2025
ER -