Organizations of all sizes are shifting their IT infrastructures to the cloud because of its cost efficiency and convenience. Because of the on-demand nature of the Infrastructure as a Service (IaaS) clouds, hundreds of thousands of virtual machines (VMs) may be deployed and terminated in a single large cloud data center each day. In this paper, we propose a content-based scheduling algorithm for the placement of VMs in data centers. We take advantage of the fact that it is possible to find identical disk blocks in different VM disk images with similar operating systems by scheduling VMs with high content similarity on the same hosts. That allows us to reduce the amount of data transferred when deploying a VM on a destination host. In this paper, we first present our study of content similarity between different VMs, based on a large set of VMs with different operating systems that represent the majority of popular operating systems in use today. Our analysis shows that content similarity between VMs with the same operating system and close version numbers (e.g., Ubuntu 12.04 vs. Ubuntu 11.10) can be as high as 60%. We also show that there is close to zero content similarity between VMs with different operating systems. Second, based on the above results, we designed a content-based scheduling algorithm that lowers the network traffic associated with transfer of VM disk images inside data centers. Our experimental results show that the amount of data transfer associated with deployment of VMs and transfer of virtual disk images can be lowered by more than 70%, resulting in significant savings in data center network utilization and congestion.