Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database

Vetle Ingvald Torvik, Sneha Agarwal

Research output: Contribution to conference › Paper

Abstract

We present a nearest neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities. The dominant ethnicity (or pair of ethnicities) is assigned as the class. The predictions are also used to upgrade Genni (Smith, Singh, and Torvik, 2013) to provide ethnicity-specific gender predictions for cases like Italian vs. English Andrea, Turkish vs. Korean Bora, Israeli vs. Nordic Eli, and Slavic vs. Japanese Renko. Ethnea and Genni 2.0 are available at http://abel.lis.illinois.edu
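The pipeline described in the abstract (look up a name's instances, attach countries of affiliation, map countries probabilistically to ethnicities, assign the dominant one or a pair) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the instance list, the country-to-ethnicity probabilities, the `pair_ratio` threshold for reporting a pair, and the use of `difflib` string similarity as the "most similar names" fallback are all assumptions made for the example.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical (name, country of affiliation) instances. The paper draws these
# from PubMed; the rows below are invented for illustration only.
INSTANCES = [
    ("rossi, andrea", "Italy"),
    ("rossi, andrea", "Italy"),
    ("rossi, andrea", "United States"),
    ("smith, andrea", "United Kingdom"),
    ("smith, andrea", "United States"),
]

# Hypothetical P(ethnicity | country) table; the real system uses 26 classes.
COUNTRY_TO_ETHNICITY = {
    "Italy": {"ITALIAN": 0.9, "ENGLISH": 0.1},
    "United Kingdom": {"ENGLISH": 0.9, "ITALIAN": 0.1},
    "United States": {"ENGLISH": 0.7, "ITALIAN": 0.3},
}


def classify(name, instances=INSTANCES, table=COUNTRY_TO_ETHNICITY,
             pair_ratio=0.5, cutoff=0.9):
    """Return the dominant ethnicity, a hyphenated pair, or None."""
    known = {n for n, _ in instances}
    # Use exact instances when they exist; otherwise fall back to the most
    # similar known names (string similarity is an assumed stand-in here).
    if name in known:
        neighbors = {name}
    else:
        neighbors = set(get_close_matches(name, sorted(known), n=5, cutoff=cutoff))

    # Count countries of affiliation over the neighboring instances.
    countries = Counter(c for n, c in instances if n in neighbors)
    total = sum(countries.values())
    if total == 0:
        return None

    # Mix the per-country ethnicity distributions, weighted by country share.
    scores = Counter()
    for country, count in countries.items():
        for ethnicity, p in table.get(country, {}).items():
            scores[ethnicity] += p * count / total
    if not scores:
        return None

    ranked = scores.most_common(2)
    top, top_score = ranked[0]
    # Report a pair when the runner-up is close enough to the top class.
    if len(ranked) == 2 and ranked[1][1] >= pair_ratio * top_score:
        return f"{top}-{ranked[1][0]}"
    return top
```

With this toy data, `classify("rossi, andrea")` mixes two Italian-affiliated instances with one US-affiliated instance and returns a single class, while lowering `pair_ratio` makes the function report a pair instead. The threshold rule is only one plausible way to decide between a single ethnicity and a pair.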
Original language: English (US)
Number of pages: 1
State: Published - Mar 2016
Event: International Symposium on Science of Science - Library of Congress, Washington DC, United States
Duration: Mar 22 2016 to Mar 23 2016

Conference

Conference: International Symposium on Science of Science
Country: United States
City: Washington DC
Period: 3/22/16 to 3/23/16

Keywords

  • bibliometrics
  • ethnicity classification
  • machine learning

Cite this

Torvik, V. I., & Agarwal, S. (2016). Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. Paper presented at International Symposium on Science of Science, Washington DC, United States.

@conference{9bab56997c20405b89486483478ecd11,
title = "Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database",
abstract = "We present a nearest neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities. The dominant ethnicity (or pair of ethnicities) is assigned as the class. The predictions are also used to upgrade Genni (Smith, Singh, and Torvik, 2013) to provide ethnicity-specific gender predictions for cases like Italian vs. English Andrea, Turkish vs. Korean Bora, Israeli vs. Nordic Eli, and Slavic vs. Japanese Renko. Ethnea and Genni 2.0 are available at http://abel.lis.illinois.edu",
keywords = "bibliometrics, ethnicity classification, machine learning",
author = "Torvik, {Vetle Ingvald} and Agarwal, Sneha",
year = "2016",
month = "3",
language = "English (US)",
note = "International Symposium on Science of Science ; Conference date: 22-03-2016 Through 23-03-2016",

}

TY - CONF

T1 - Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database

AU - Torvik, Vetle Ingvald

AU - Agarwal, Sneha

PY - 2016/3

Y1 - 2016/3

AB - We present a nearest neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities. The dominant ethnicity (or pair of ethnicities) is assigned as the class. The predictions are also used to upgrade Genni (Smith, Singh, and Torvik, 2013) to provide ethnicity-specific gender predictions for cases like Italian vs. English Andrea, Turkish vs. Korean Bora, Israeli vs. Nordic Eli, and Slavic vs. Japanese Renko. Ethnea and Genni 2.0 are available at http://abel.lis.illinois.edu

KW - bibliometrics

KW - ethnicity classification

KW - machine learning

UR - http://hdl.handle.net/2142/88927

M3 - Paper

ER -