This paper presents a fast and scalable method for analyzing the activities of construction equipment involved in earthmoving operations, using highly varying long-sequence videos captured by fixed cameras. A common approach to characterizing equipment activities consists of detecting and tracking the equipment within the video volume, detecting interest points and describing them locally, and then classifying activities with a bag-of-words representation. While successful results have been achieved in each of detection, tracking, and activity recognition, the high intra-class variability of resources, occlusions and scene clutter, the difficulty of defining visually distinct activities, and long computation times have challenged the scalability of current solutions. In this paper, we present a new end-to-end automated method that recognizes equipment activities by simultaneously detecting and tracking features and characterizing the spatial kinematics of those features via a decision tree. The method is tested on an unprecedented dataset of 5-hour-long real-world videos of interacting pairs of excavators and trucks. The experimental results show that the method recognizes activities with an accuracy of 88.91%, with a computation time shorter than the duration of each video (a processing-to-video time ratio below one-to-one). The benefits of the proposed method for root-cause assessment of performance deviations are discussed.