TY - JOUR
T1 - VEF
T2 - A variant filtering tool based on ensemble methods
AU - Zhang, Chuanyi
AU - Ochoa, Idoia
N1 - Funding Information:
This work has been partially supported by grant 2018-182799 from the Chan Zuckerberg Initiative DAF, an advised fund SVCF, and a Strategic Research Initiatives (SRI) grant and a CompGen fellowship from UIUC.
Publisher Copyright:
© 2019 The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
PY - 2020/4/15
Y1 - 2020/4/15
N2 - Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known 'true' variants, i.e. gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics). Results: For the analysis, we used whole genome sequencing (WGS) Human datasets for which the gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared with VQSR (4 versus 50 min approximately for filtering the single nucleotide polymorphisms of a WGS Human sample).
AB - Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known 'true' variants, i.e. gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics). Results: For the analysis, we used whole genome sequencing (WGS) Human datasets for which the gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared with VQSR (4 versus 50 min approximately for filtering the single nucleotide polymorphisms of a WGS Human sample).
UR - http://www.scopus.com/inward/record.url?scp=85084027016&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084027016&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btz952
DO - 10.1093/bioinformatics/btz952
M3 - Article
C2 - 31873730
AN - SCOPUS:85084027016
SN - 1367-4803
VL - 36
SP - 2328
EP - 2336
JO - Bioinformatics
JF - Bioinformatics
IS - 8
ER -