We propose a state-based approach to gesture learning and recognition. Using spatial clustering and temporal alignment, each gesture is defined to be an ordered sequence of states in spatialoral space. The 2D image positions of the centers of the head and both hands of the user are used as features; these are located by a color-based tracking method. From training data of a given gesture, we first learn the spatial information and then group the data into segments that are automatically aligned temporally. The temporal information is further integrated to build a finite state machine (FSM) recognizer. Each gesture has a FSM corresponding to it. The computational efficiency of the FSM recognizers allows us to achieve real-time on-line performance. We apply this technique to build an experimental system that plays a game of "Simon Says" with the user.