Better tools are needed to enable researchers to quickly identify and explore effective and interpretable feature-based explanations for discriminating multi-class genomic datasets, e.g., healthy versus diseased samples. We develop an interactive exploration tool, GENVISAGE, which rapidly discovers the most discriminative feature pairs that separate two classes of genomic objects and then displays the corresponding visualizations. Since quickly finding top feature pairs is computationally challenging, especially for large numbers of objects and features, we propose a suite of optimizations to make GENVISAGE responsive at scale and demonstrate that our optimizations lead to a 400× speedup over competitive baselines for multiple biological datasets. We apply our rapid and interpretable tool to identify literature-supported pairs of genes whose transcriptomic responses significantly discriminate several chemotherapy drug treatments. With its generalizable optimizations and framework, GENVISAGE opens up real-time feature-based explanation generation to data from massive sequencing efforts, as well as many other scientific domains.

A fundamental task in the analysis of genomics datasets is identifying features that can explain the difference between two groups of biological samples. As studies and data repositories that enable simultaneous analysis of thousands of samples become widespread, it is imperative that feature identification tools return interpretable and significant results rapidly, allowing researchers to interactively generate and explore hypotheses on these massive datasets. Our tool, GENVISAGE, is built around a framework that identifies pairs of features that strongly separate samples of different classes. An extensive suite of optimization techniques enables us to extract literature-supported feature pairs with accompanying interpretable visualizations from exceptionally large genomic datasets in real time. The GENVISAGE optimizations and webserver instance provide a blueprint for future online tools that offer interactive feature exploration in massive datasets from genomics and other domains.

Identifying features that most strongly separate samples from two biological classes is fundamental in the analysis of genomic datasets. This task is typically addressed by finding (1) single features using univariate statistical methods or (2) multi-feature combinations using time-intensive machine learning methods. Here we present GENVISAGE, a tool that enables researchers to interactively identify visually interpretable and significant feature pairs that separate the classes. With this highly optimized tool, researchers can generate and explore hypotheses on massive genomic datasets in real time.
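To make the feature-pair separability task concrete, the unoptimized search can be sketched as an exhaustive scan over all feature pairs. This is a minimal illustration only, not the GENVISAGE implementation: the text above does not state the separability metric or the optimizations, so the nearest-centroid scoring rule, the function names `pair_separability` and `top_feature_pairs`, and the toy data are all assumptions made for this example.

```python
from itertools import combinations

import numpy as np


def pair_separability(x, y, labels):
    """Score one feature pair (hypothetical metric, assumed for illustration).

    Returns the fraction of samples assigned to the correct class by a
    nearest-centroid rule in the 2-D plane spanned by the two features.
    """
    points = np.column_stack([x, y])
    centroid_pos = points[labels].mean(axis=0)
    centroid_neg = points[~labels].mean(axis=0)
    dist_pos = np.linalg.norm(points - centroid_pos, axis=1)
    dist_neg = np.linalg.norm(points - centroid_neg, axis=1)
    predicted_pos = dist_pos < dist_neg  # ties are assigned to the negative class
    return float(np.mean(predicted_pos == labels))


def top_feature_pairs(matrix, labels, k=3):
    """Exhaustively score every feature pair and return the k best.

    matrix: samples x features array; labels: boolean class membership.
    Cost grows quadratically with the number of features, which is why a
    naive scan becomes slow on large datasets.
    """
    labels = np.asarray(labels, dtype=bool)
    scored = [
        (pair_separability(matrix[:, i], matrix[:, j], labels), i, j)
        for i, j in combinations(range(matrix.shape[1]), 2)
    ]
    # Highest score first; break ties by feature indices for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored[:k]


# Toy data (made up): features 0 and 2 separate the classes only jointly.
matrix = np.array([
    [0, 5, 1, 0],
    [1, 5, 2, 1],
    [2, 5, 3, 0],
    [3, 5, 4, 1],
    [1, 5, 0, 0],
    [2, 5, 1, 1],
    [3, 5, 2, 0],
    [4, 5, 3, 1],
], dtype=float)
labels = np.array([True, True, True, True, False, False, False, False])

best = top_feature_pairs(matrix, labels, k=2)
```

In the toy data neither feature 0 nor feature 2 separates the two classes on its own, but together they do, which is the kind of pattern a pairwise search can surface and a univariate test would miss.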
- DSML 2: Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem
- feature pair
- separability problem
ASJC Scopus subject areas
- Decision Sciences(all)