This paper addresses the problem of explainable multimodal COVID-19 misinformation detection, where the goal is to accurately detect misleading information in multimodal COVID-19 news articles and to provide the reasons or evidence that explain the detection results. Our work is motivated by the lack of careful study of the association between different modalities (e.g., text and image) of COVID-19 news content in current solutions. We present a generative approach that detects multimodal COVID-19 misinformation by investigating the cross-modal association between the visual and textual content that is deeply embedded in the news content. Two critical challenges exist in developing our solution: 1) how to accurately assess the consistency between the visual and textual content of a multimodal COVID-19 news article, and 2) how to effectively retrieve useful information from unreliable user comments to explain the misinformation detection results. To address these challenges, we develop a duo-generative explainable misinformation detection (DGExplain) framework that explicitly explores the cross-modal association between news content in different modalities and effectively exploits user comments to detect and explain misinformation in multimodal COVID-19 news articles. We evaluate DGExplain on two real-world multimodal COVID-19 news datasets. Evaluation results demonstrate that DGExplain significantly outperforms state-of-the-art baselines in terms of both the accuracy of multimodal COVID-19 misinformation detection and the explainability of the detection results.