With the ever-increasing urbanization process, systematically modeling people’s activities in the urban space is being recognized as a crucial socioeconomic task. This task was nearly impossible years ago due to the lack of reliable data sources, yet the emergence of geo-tagged social media (GTSM) data sheds new light on it. Recently, there have been fruitful studies on discovering geographical topics from GTSM data. However, their high computational costs and strong distributional assumptions about the latent topics hinder them from fully unleashing the power of GTSM. To bridge the gap, we present CrossMap, a novel cross-modal representation learning method that uncovers urban dynamics with massive GTSM data. CrossMap first employs an accelerated mode seeking procedure to detect spatiotemporal hotspots underlying people’s activities. Those detected hotspots not only address spatiotemporal variations, but also largely alleviate the sparsity of the GTSM data. With the detected hotspots, CrossMap then jointly embeds all spatial, temporal, and textual units into the same space using two different strategies: one is reconstruction-based and the other is graph-based. Both strategies capture the correlations among the units by encoding their co-occurrence and neighborhood relationships, and learn low-dimensional representations to preserve such correlations. Our experiments demonstrate that CrossMap not only significantly outperforms state-of-the-art methods for activity recovery and classification, but also achieves much better efficiency.