Spatiotemporal activity modeling, which aims to model users' activities at different locations and times from user behavioral data, is an important task for applications such as urban planning and mobile advertising. State-of-the-art methods for this task use cross-modal embedding to map units from different modalities (location, time, text) into the same latent space. However, such methods rely on sufficient data and may fail to learn quality embeddings when user behavioral data is scarce. To address this problem, we propose BRANCHNET, a spatiotemporal activity model that transfers knowledge from external sources to alleviate data scarcity. BRANCHNET adopts a graph-regularized cross-modal embedding framework. At its core is a main embedding space, which is shared by the main task of reconstructing user behaviors and the auxiliary graph embedding tasks for external sources, thus allowing external knowledge to guide the cross-modal embedding process. In addition to the main embedding space, the auxiliary tasks also have branched task-specific embedding spaces. The branched embeddings capture the discrepancies between the main task and the auxiliary ones, and free the main embeddings from encoding information for all the tasks. Our empirical evaluation shows that BRANCHNET effectively transfers knowledge from external sources to learn better spatiotemporal activity models and outperforms strong baseline methods.
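The split between a shared main space and branched task-specific spaces can be illustrated with a minimal sketch. It is not the BRANCHNET implementation: the abstract does not state the scoring function, loss, or how main and branched embeddings are combined, so the sketch assumes a logistic loss on positive pairs and a score that sums dot products over the spaces involved. All names (`init_space`, `pair_step`, the unit labels, `DIM`) are hypothetical.

```python
import math
import random

DIM = 4  # illustrative embedding size (assumption, not from the paper)

def init_space(units, rng):
    """Create a small random embedding for each unit."""
    return {u: [rng.uniform(-0.5, 0.5) for _ in range(DIM)] for u in units}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_step(spaces, u, v, lr=0.1):
    """One SGD step that raises the score of a positive pair (u, v).

    The score sums dot products over every space in `spaces`, which is
    the same as a dot product of the concatenated embeddings (assumed
    combination rule). Returns the logistic loss before the update.
    """
    score = sum(dot(s[u], s[v]) for s in spaces)
    g = sigmoid(score) - 1.0  # derivative of -log(sigmoid(score)) w.r.t. score
    for s in spaces:
        eu, ev = s[u][:], s[v][:]
        s[u] = [a - lr * g * b for a, b in zip(eu, ev)]
        s[v] = [a - lr * g * b for a, b in zip(ev, eu)]
    return -math.log(sigmoid(score))

rng = random.Random(0)
units = ["loc:downtown", "time:evening", "word:concert"]  # toy units
main = init_space(units, rng)    # shared by the main and auxiliary tasks
branch = init_space(units, rng)  # branched space for one external graph (assumed)

# Main task: units co-occurring in a user record update the main space only.
main_loss = pair_step([main], "loc:downtown", "time:evening")
# Auxiliary task: an edge from an external graph updates main and branch.
aux_loss = pair_step([main, branch], "word:concert", "loc:downtown")
```

In this sketch the auxiliary step still moves the main embeddings, which is how external knowledge would reach the main task, while any signal specific to the external graph can be absorbed by `branch` instead of the shared space.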