Recent advances in natural language processing (NLP) have generated interest in using computers to assist in coding and analyzing students’ short-answer responses for PER or classroom applications. We train a state-of-the-art NLP system, IBM’s Watson, and test its agreement with human coders in three experimental cases. By exploring these cases, we begin to understand how Watson behaves with idealized and more realistic data, across different levels of training, and across different types of categorization tasks. We find that Watson’s self-reported confidence in categorizing samples is reasonably well aligned with its accuracy, although this alignment can be affected by features of the data being analyzed. Based on these results, we discuss implications and suggest potential applications of this technology for education research.