Object tracking plays an important role in multimedia surveillance systems, where the major types of data are video and audio captured by cameras and microphone arrays. In this paper, we describe a systematic approach to audiovisual object tracking, originally proposed by Beal et al., based on graphical models that combine audio and video variables within a single probabilistic framework. We improve this approach in three respects. First, we introduce background-subtraction preprocessing of the video data. Second, we modify the video model so that the background is excluded from the transformation. Third, we extend the joint model to a dynamic Bayesian network. These improvements yield satisfactory results for single-person tracking in a noisy outdoor environment with far-field background road traffic, and they handle situations in which the target is temporarily lost due to occlusion.
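The background-subtraction preprocessing step mentioned above can be sketched as follows. This is only an illustrative running-average frame-differencing scheme in plain Python; the function name, the learning rate `alpha`, and the `threshold` value are assumptions for demonstration, not the preprocessing actually used in the paper.

```python
# Illustrative running-average background subtraction (hypothetical
# parameters; the paper's actual preprocessing may differ).

def background_subtract(frames, alpha=0.05, threshold=30):
    """Maintain a running-average background model and return, for each
    frame, a binary foreground mask (1 = moving object, 0 = background).
    Each frame is a flat list of pixel intensities."""
    background = [float(p) for p in frames[0]]  # initialize from first frame
    masks = []
    for frame in frames:
        # a pixel is foreground if it differs enough from the background model
        mask = [1 if abs(p - b) > threshold else 0
                for p, b in zip(frame, background)]
        # slowly adapt the background model toward the current frame
        background = [(1 - alpha) * b + alpha * p
                      for p, b in zip(frame, background)]
        masks.append(mask)
    return masks

# Tiny synthetic example: 4-pixel frames; a bright object enters at index 2
frames = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 200, 10],
]
print(background_subtract(frames)[-1])  # → [0, 0, 1, 0]
```

Masking out static background pixels in this way lets the downstream video model concentrate its transformation on the moving target rather than on the whole frame.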