We present a novel generative model that represents video as a mixture of transformed video scenes. The learning procedure automatically clusters video frames into scenes and objects, and is based on a hierarchical, on-line EM algorithm. The fast Fourier transform (FFT) enables rapid computation in both the E and M steps. We use the model to: 1. perform video clustering by grouping video frames that are similar up to translation and scale into clusters; 2. robustly stabilize video by inferring the translation and scale of each frame. We believe that video scene modeling of this kind is essential to bridging the "semantic gap" in video understanding. We demonstrate the approach with several results that are strong in both speed and accuracy.
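To give a flavor of why the FFT makes the E step fast, the following is a minimal sketch (not the paper's implementation; the names `frame` and `template` are hypothetical) of evaluating all 2-D translations of a scene template against a frame with a single FFT-based cross-correlation, rather than an explicit loop over shifts:

```python
import numpy as np

def translation_scores(frame, template):
    """Return a score map over every cyclic shift of `template`.

    The squared error between the frame and a shifted template expands into
    terms that are either shift-independent or a cross-correlation, so all
    shifts can be scored at once via the FFT instead of looping over them.
    """
    F = np.fft.fft2(frame)
    T = np.fft.fft2(template)
    # Circular cross-correlation of frame with template over all shifts.
    return np.real(np.fft.ifft2(F * np.conj(T)))

def best_translation(frame, template):
    """Return the (row, column) shift that maximizes the correlation."""
    corr = translation_scores(frame, template)
    return np.unravel_index(np.argmax(corr), corr.shape)

# Usage: recover a known cyclic shift of a random image.
rng = np.random.default_rng(0)
template = rng.standard_normal((64, 64))
frame = np.roll(template, shift=(5, 12), axis=(0, 1))
print(best_translation(frame, template))  # -> (5, 12)
```

This reduces the per-frame cost of scoring all N translations of an N-pixel template from O(N^2) to O(N log N); scale, handled in the full model, would add an outer loop over a discrete set of scalings.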