A search engine approach to estimating temporal changes in gender orientation of first names

Brittany N. Smith, Mamta Singh, Vetle Ingvald Torvik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents an approach for predicting the gender orientation of any given first name over time based on a set of search engine queries with the name prefixed by masculine and feminine markers (e.g., "Uncle Taylor"). We hypothesize that these markers can capture the great majority of variability in gender orientation, including temporal changes. To test this hypothesis, we train a logistic regression model, with time-varying marker weights, using marker counts from Bing.com to predict male/female counts for 85,406 names in US Social Security Administration (SSA) data during 1880-2008. The model misclassifies 2.25% of the people in the SSA dataset (slightly worse than the 1.74% pure error rate) and provides accurate predictions for names beyond the SSA. The misclassification rate is higher in recent years (due to increasing name diversity), for general English words (e.g., Will), for names from certain countries (e.g., China), and for rare names. However, the model tends to err on the side of caution by predicting neutral/unknown rather than false positive. As hypothesized, the markers also capture temporal patterns of androgyny. For example, Daughter is a stronger female predictor for recent years while Grandfather is a stronger male predictor around the turn of the 20th century. The model has been implemented as a web-tool called Genni (available via http://abel.lis.illinois.edu/) that displays the predicted proportion of females vs. males over time for any given name. This should be a valuable resource for those who utilize names in order to discern gender on a large scale, e.g., to study the roles of gender and diversity in scholarly work based on digital libraries and bibliographic databases where the authors' names are listed.
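The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the marker list, the log transform of the counts, the weights, and the hit counts below are all hypothetical values chosen for illustration, and no search engine is actually queried. It only shows how marker-prefixed queries such as "Uncle Taylor" could be built and how a logistic function with year-specific marker weights turns marker counts into a predicted proportion of females.

```python
import math

# Hypothetical markers; the abstract names only Uncle, Daughter and Grandfather.
FEMALE_MARKERS = ("aunt", "daughter")
MALE_MARKERS = ("uncle", "grandfather")


def build_queries(name):
    """Return one quoted search phrase per marker, e.g. '"Uncle Taylor"'."""
    return {m: f'"{m.capitalize()} {name}"' for m in FEMALE_MARKERS + MALE_MARKERS}


def predict_female_proportion(counts, weights, bias=0.0):
    """Logistic model: sigmoid(bias + sum over markers of weight * log(1 + count)).

    The log(1 + count) transform is an assumption made for this sketch.
    """
    z = bias + sum(w * math.log1p(counts.get(m, 0)) for m, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical time-varying weights: positive pushes towards female, negative
# towards male. "daughter" is weighted more heavily in the later year and
# "grandfather" in the earlier one, mirroring the pattern the abstract describes.
WEIGHTS_BY_YEAR = {
    1900: {"aunt": 1.0, "daughter": 0.5, "uncle": -1.0, "grandfather": -1.5},
    2000: {"aunt": 1.0, "daughter": 1.5, "uncle": -1.0, "grandfather": -0.5},
}

# Made-up hit counts for one name; real counts would come from a search engine.
toy_counts = {"aunt": 120, "daughter": 300, "uncle": 150, "grandfather": 40}

for year, weights in sorted(WEIGHTS_BY_YEAR.items()):
    p = predict_female_proportion(toy_counts, weights)
    print(year, build_queries("Taylor")["uncle"], round(p, 3))
```

With the same counts, the predicted proportion differs by year only because the weights differ, which is the sense in which time-varying weights can express temporal shifts in a name's gender orientation. A name with no hits for any marker yields 0.5 under this sketch, loosely corresponding to a neutral/unknown prediction.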

Original language: English (US)
Title of host publication: JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries
Pages: 199-208
Number of pages: 10
DOI: https://doi.org/10.1145/2467696.2467720
State: Published - Aug 23 2013
Event: 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2013 - Indianapolis, IN, United States
Duration: Jul 22 2013 to Jul 26 2013

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
ISSN (Print): 1552-5996

Other

Other: 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2013
Country: United States
City: Indianapolis, IN
Period: 7/22/13 to 7/26/13

Fingerprint

Search engines
Digital libraries
Logistics

Keywords

  • Androgyny
  • Bibliometrics
  • Data mining
  • Search engine
  • Gender
  • Semantic orientation
  • Temporal prediction
  • Textual markers

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 199-208). (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). https://doi.org/10.1145/2467696.2467720

@inproceedings{9513f7424d6a4affa93c805a7858d848,
title = "A search engine approach to estimating temporal changes in gender orientation of first names",
abstract = "This paper presents an approach for predicting the gender orientation of any given first name over time based on a set of search engine queries with the name prefixed by masculine and feminine markers (e.g., {"}Uncle Taylor{"}). We hypothesize that these markers can capture the great majority of variability in gender orientation, including temporal changes. To test this hypothesis, we train a logistic regression model, with time-varying marker weights, using marker counts from Bing.com to predict male/female counts for 85,406 names in US Social Security Administration (SSA) data during 1880-2008. The model misclassifies 2.25{\%} of the people in the SSA dataset (slightly worse than the 1.74{\%} pure error rate) and provides accurate predictions for names beyond the SSA. The misclassification rate is higher in recent years (due to increasing name diversity), for general English words (e.g., Will), for names from certain countries (e.g., China), and for rare names. However, the model tends to err on the side of caution by predicting neutral/unknown rather than false positive. As hypothesized, the markers also capture temporal patterns of androgyny. For example, Daughter is a stronger female predictor for recent years while Grandfather is a stronger male predictor around the turn of the 20th century. The model has been implemented as a web-tool called Genni (available via http://abel.lis.illinois.edu/) that displays the predicted proportion of females vs. males over time for any given name. This should be a valuable resource for those who utilize names in order to discern gender on a large scale, e.g., to study the roles of gender and diversity in scholarly work based on digital libraries and bibliographic databases where the authors' names are listed.",
keywords = "Androgyny, Bibliometrics, Data mining, Search engine, Gender, Semantic orientation, Temporal prediction, Textual markers",
author = "Smith, {Brittany N.} and Singh, Mamta and Torvik, {Vetle Ingvald}",
year = "2013",
month = "8",
day = "23",
doi = "10.1145/2467696.2467720",
language = "English (US)",
isbn = "9781450320764",
series = "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries",
pages = "199--208",
booktitle = "JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries",

}

TY - GEN

T1 - A search engine approach to estimating temporal changes in gender orientation of first names

AU - Smith, Brittany N.

AU - Singh, Mamta

AU - Torvik, Vetle Ingvald

PY - 2013/8/23

Y1 - 2013/8/23

AB - This paper presents an approach for predicting the gender orientation of any given first name over time based on a set of search engine queries with the name prefixed by masculine and feminine markers (e.g., "Uncle Taylor"). We hypothesize that these markers can capture the great majority of variability in gender orientation, including temporal changes. To test this hypothesis, we train a logistic regression model, with time-varying marker weights, using marker counts from Bing.com to predict male/female counts for 85,406 names in US Social Security Administration (SSA) data during 1880-2008. The model misclassifies 2.25% of the people in the SSA dataset (slightly worse than the 1.74% pure error rate) and provides accurate predictions for names beyond the SSA. The misclassification rate is higher in recent years (due to increasing name diversity), for general English words (e.g., Will), for names from certain countries (e.g., China), and for rare names. However, the model tends to err on the side of caution by predicting neutral/unknown rather than false positive. As hypothesized, the markers also capture temporal patterns of androgyny. For example, Daughter is a stronger female predictor for recent years while Grandfather is a stronger male predictor around the turn of the 20th century. The model has been implemented as a web-tool called Genni (available via http://abel.lis.illinois.edu/) that displays the predicted proportion of females vs. males over time for any given name. This should be a valuable resource for those who utilize names in order to discern gender on a large scale, e.g., to study the roles of gender and diversity in scholarly work based on digital libraries and bibliographic databases where the authors' names are listed.

KW - Androgyny

KW - Bibliometrics

KW - Data mining

KW - Search engine

KW - Gender

KW - Semantic orientation

KW - Temporal prediction

KW - Textual markers

UR - http://www.scopus.com/inward/record.url?scp=84882261244&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84882261244&partnerID=8YFLogxK

U2 - 10.1145/2467696.2467720

DO - 10.1145/2467696.2467720

M3 - Conference contribution

AN - SCOPUS:84882261244

SN - 9781450320764

T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries

SP - 199

EP - 208

BT - JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries

ER -