BEEP is a state-of-the-art subgraph enumerator that delivers high performance by combining balanced, parallel GPU processing with novel algorithmic improvements. With rapidly increasing demand for fast tools on large graphs, GPU-based subgraph enumerators are of growing interest. Most existing GPU enumerators are based on Breadth First Search (BFS), whose excessive memory requirements often strain hardware resources. PARSEC was the first GPU enumerator to adopt Depth First Search (DFS), demonstrating impressive speedups and adaptability to hardware with limited memory resources. However, PARSEC's DFS implementation suffers from computational inefficiencies and load imbalance. BEEP introduces novel search space reduction techniques and load balancing strategies that address these challenges in DFS-based parallelization, achieving exceptional performance and scalability. Experimental results indicate that BEEP outperforms PARSEC with geometric mean speedups of up to 10.52× across disparate data graphs and up to 7.28× across various queries, with a maximum speedup of 33.46×. This makes BEEP the fastest subgraph enumerator to date. Furthermore, a multi-GPU implementation is developed that scales almost linearly with the number of devices.