Feature Selection Metrics: Similarities, Differences, and Characteristics of the Selected Models

Debopam Sanyal, Nigel Bosch, Luc Paquette

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Supervised machine learning has become one of the most important methods for developing educational and intelligent tutoring software; it is the backbone of many educational data mining methods for estimating knowledge, emotion, and other aspects of learning. Hence, to ensure optimal use of computing resources and effective analysis of models, it is essential that researchers know which evaluation metrics are best suited to educational data. In this article, we focus on the problem of wrapper feature selection, where predictors are added to models based on how much they improve model accuracy in terms of a given metric. We compared commonly used machine learning algorithms, including naive Bayes, support vector machines, logistic regression, and random forests, on 11 diverse learning-related datasets. We optimized feature selection based on nine different metrics, then evaluated each to address research questions about how effective each metric was in terms of the others (e.g., does optimizing for precision also result in good F1?) as well as calibration (i.e., are predictions produced by models accurate probabilities of correctness?). We provide empirical evidence that the Matthews correlation coefficient (MCC) produced the overall best results across the other metrics, but that root mean squared error (RMSE) selected the best-calibrated models. Finally, we discuss issues related to the number of features selected when optimizing for each metric, as well as the types of datasets for which certain metrics were more effective.
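The wrapper approach described in the abstract can be illustrated with a minimal sketch of greedy forward selection scored by MCC. This is not the authors' implementation: the nearest-centroid classifier, the toy data, the `min_gain` stopping threshold, and all function names are assumptions made purely for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken to be 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def centroid_predict(X_train, y_train, X_eval, features: List[int]) -> List[int]:
    """Toy nearest-centroid classifier using only the given feature indices
    (a stand-in for a real learner; an assumption of this sketch)."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    preds = []
    for x in X_eval:
        dist = {
            label: sum((x[f] - c) ** 2 for f, c in zip(features, centroids[label]))
            for label in (0, 1)
        }
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds


def forward_select(
    n_features: int,
    evaluate: Callable[[List[int]], float],
    min_gain: float = 1e-9,
) -> Tuple[List[int], float]:
    """Greedy forward wrapper selection: repeatedly add the feature that most
    improves the metric, and stop when no candidate improves it by min_gain."""
    selected: List[int] = []
    best = float("-inf")
    remaining = list(range(n_features))
    while remaining:
        round_best, round_feat = float("-inf"), None
        for f in remaining:
            score = evaluate(selected + [f])
            if score > round_best:
                round_best, round_feat = score, f
        if round_best <= best + min_gain:
            break
        selected.append(round_feat)
        remaining.remove(round_feat)
        best = round_best
    return selected, best


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X_train = [[0.0, 0.5], [0.1, 0.9], [1.0, 0.4], [0.9, 0.8]]
y_train = [0, 0, 1, 1]
X_val = [[0.2, 0.1], [0.0, 0.9], [0.8, 0.9], [1.0, 0.2]]
y_val = [0, 0, 1, 1]


def evaluate_subset(features: List[int]) -> float:
    """Score a feature subset by validation MCC of the toy classifier."""
    return mcc(y_val, centroid_predict(X_train, y_train, X_val, features))


selected, best = forward_select(2, evaluate_subset)
print(selected, best)
```

Because the metric enters only through the `evaluate` callback, swapping in a different scoring function (for example precision, or negated RMSE on predicted probabilities) changes the selection objective without touching the search loop, which is the kind of comparison the abstract describes.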

Original language: English (US)
Title of host publication: Proceedings of the 13th International Conference on Educational Data Mining, EDM 2020
Editors: Anna N. Rafferty, Jacob Whitehill, Cristobal Romero, Violetta Cavalli-Sforza
Publisher: International Educational Data Mining Society
Pages: 212-223
Number of pages: 12
ISBN (Electronic): 9781733673617
State: Published - 2020
Event: 13th International Conference on Educational Data Mining, EDM 2020 - Virtual, Online
Duration: Jul 10 2020 - Jul 13 2020

Keywords

  • Feature selection
  • Machine learning
  • Metrics
  • Student models

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
