For large industrial applications, system test cases are still often described in natural language (NL), and their number can reach thousands. Test automation is to automatically execute the test cases. Achieving test automation typically requires substantial manual effort for creating executable test scripts from these NL test cases. In particular, given that each NL test case consists of a sequence of NL test steps, testers first implement a test API method for each test step and then write a test script for invoking these test API methods sequentially for test automation. Across different test cases, multiple test steps can share semantic similarities, supposedly mapped to the same API method. However, due to numerous test steps in various NL forms under manual inspection, testers may not realize those semantically similar test steps and thus waste effort to implement duplicate test API methods for them. To address this issue, in this paper, we propose a new approach based on natural language processing to cluster similar NL test steps together such that the test steps in each cluster can be mapped to the same test API method. Our approach includes domain-specific word embedding training along with measurement based on Relaxed Word Mover'sDistance to analyze the similarity of test steps. Our approach also includes a technique to combine hierarchical agglomerative clustering and K-means clustering post-refinement to derive high-quality and manually-adjustable clustering results. The evaluation results of our approach on a large industrial mobile app, WeChat, show that our approach can cluster the test steps with high accuracy, substantially reducing the number of clusters and thus reducing the downstream manual effort. In particular, compared with the baseline approach, our approach achieves 79.8% improvement on cluster quality, reducing 65.9% number of clusters, i.e., the number of test API methods to be implemented.