This paper studies the emerging and important problem of identifying misleading COVID-19 short videos, where the misleading content is jointly expressed in the visual, audio, and textual content of a video. Existing solutions for misleading video detection mainly focus on the authenticity of video or audio content against AI-based forgery (e.g., deepfakes) or video manipulation, and are insufficient for our problem, where most videos are user-generated and intentionally edited. Two critical challenges exist in solving our problem: i) how to effectively extract information from the distracting and manipulated visual content of TikTok videos, and ii) how to efficiently aggregate heterogeneous information across the different modalities of short videos. To address these challenges, we develop TikTec, a multimodal misinformation detection framework that explicitly exploits video captions to accurately capture the key information in the distracting visual content, and effectively learns the composed misinformation that is jointly conveyed by the visual and audio content. We evaluate TikTec on a real-world COVID-19 video dataset collected from TikTok. Evaluation results show that TikTec achieves significant performance gains over state-of-the-art baselines in accurately detecting misleading COVID-19 short videos.