Stopwords and keywords for manual field assignment for the STI 2023 paper Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science

Dataset

Description

We used the following keywords files to identify categories for journals and conferences not in Scopus, for our STI 2023 paper "Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science".

The first four text files each contain keywords/content words in the form: 'keyword1', 'keyword2', 'keyword3', .... The file name indicates the category:
file1: healthscience_words.txt
file2: lifescience_words.txt
file3: physicalscience_words.txt
file4: socialscience_words.txt
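
As an illustration of the file format described above, a list of this form could be read with a short Python sketch like the following. This is not the authors' code. The function name `parse_quoted_list` and the sample string are hypothetical, and the sketch assumes that every item is wrapped in single quotes and that no keyword itself contains an apostrophe.

```python
import re
from pathlib import Path


def parse_quoted_list(text):
    """Return the single-quoted items found in a string such as "'a', 'b', 'c'"."""
    # Assumption: items are wrapped in single quotes and contain no apostrophes.
    return [item.strip() for item in re.findall(r"'([^']*)'", text) if item.strip()]


def load_words(path):
    """Read one of the word files as UTF-8 text and return its items as a set."""
    return set(parse_quoted_list(Path(path).read_text(encoding="utf-8")))


# Hypothetical sample content, not taken from the dataset files.
sample = "'keyword1', 'keyword2', 'keyword3'"
print(parse_quoted_list(sample))
```

Reading the files as UTF-8 matters because Version 2 of the keyword lists includes UTF-8 terminology.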

The first four files were generated from a combination of software and manual review in an iterative process in which we:
- Manually reviewed venue titles that we were not able to automatically categorize using the Scopus categorization or by extending it as a resource.
- Iteratively reviewed uncategorized venue titles to manually curate additional keywords as content words indicating that a venue title could be classified in the category healthscience, lifescience, physicalscience, or socialscience. We used English content words and added words that we could automatically translate to identify content words. NOTE: Terms that have multiple potential meanings, or that contain non-English words that did not yield useful automatic translations (e.g., Al-Masāq), were not selected as content words.

The fifth text file is a list of stopwords in the form: 'stopword1', 'stopword2', 'stopword3', ...
file5: stopwords.txt
This file contains manually curated stopwords from venue titles, used to handle non-content words such as 'conference' and 'journal'.
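
To show how the stopword and keyword lists might work together, here is a minimal Python sketch. It is not the procedure used in the paper. It assumes a simple approach: lowercase the venue title, split it into tokens, drop stopwords, and report every category whose keyword set contains a remaining token. The function names and all sample words other than 'conference' and 'journal' are hypothetical.

```python
import re


def tokenize(title):
    """Lowercase a venue title and split it on non-word characters."""
    # In Python 3, \W treats accented and other Unicode letters as word characters.
    return [token for token in re.split(r"\W+", title.lower()) if token]


def candidate_categories(title, keywords_by_category, stopwords):
    """Return the sorted categories whose keyword set matches a content word in the title."""
    content_words = [t for t in tokenize(title) if t not in stopwords]
    return sorted(
        category
        for category, words in keywords_by_category.items()
        if any(word in words for word in content_words)
    )


# Hypothetical sample data; the real lists are in the five text files.
stopwords = {"journal", "conference", "of", "the", "on"}
keywords_by_category = {
    "healthscience": {"nursing", "cardiology"},
    "socialscience": {"sociology", "education"},
}

print(candidate_categories("Journal of Nursing Education", keywords_by_category, stopwords))
```

In this sketch a title can match more than one category. How such cases should be resolved is not specified here, so they would need manual review.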

This dataset is a revision of the following dataset:
Version 1: Lee, Jou; Schneider, Jodi: Keywords for manual field assignment for Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science. University of Illinois at Urbana-Champaign Data Bank.

Changes from Version 1 to Version 2:
- Added one author
- Added a stopwords file that was used in our data preprocessing.
- Thoroughly reviewed each of the 4 keywords lists. In particular, we added UTF-8 terminology, removed some non-content words and misclassified content words, and extensively reviewed non-English keywords.
Date made available: Sep 19 2023
Publisher: University of Illinois Urbana-Champaign

Keywords

  • physical science keywords
  • RISRS
  • stopwords
  • health science keywords
  • life science keywords
  • science of science
  • keywords
  • scientometrics
  • field
  • social science keywords
  • meta-science
