Distribution of "Characteristic" Terms in MEDLINE Literatures

Neil R. Smalheiser, Wei Zhou, Vetle I. Torvik

Research output: Contribution to journal › Article

Abstract

Given the occurrence frequency of any term within any set of articles in MEDLINE, we define “characteristic” terms as words and phrases that occur in that literature more frequently than expected by chance (at p < 0.001 or better). In this report, we studied how the cut-off criterion varied as a function of literature size and term frequency in MEDLINE as a whole, and compared the distribution of characteristic terms within a number of journal-defined, affiliation-defined, and random literatures. We also investigated how the characteristic terms were distributed among MEDLINE titles, abstracts, and the last sentences of abstracts, including “regularized” terms that appear in both the title and abstract of the same paper for at least one paper in the literature. For a set of 10 disciplinary journals, the characteristic terms comprised 18% of the total terms on average. Characteristic terms are used in several of our web-based services (Anne O'Tate and Arrowsmith) and should be useful for a variety of other information-processing tasks designed to improve text mining in MEDLINE.
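The abstract does not say which statistical test sets the p < 0.001 cut-off. As a minimal sketch only, the following assumes a one-sided hypergeometric test: a literature of n articles is treated as a draw without replacement from N MEDLINE articles, K of which contain the term, and the term counts as characteristic when seeing it in k or more of the n articles has probability below the threshold. The function names, the choice of test, and the example counts are illustrative assumptions, not the authors' method.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric X (assumed test, see lead-in).

    N: articles in the background collection
    K: background articles containing the term
    n: articles in the literature under study
    k: literature articles containing the term
    """
    lo = max(k, n - (N - K), 0)   # smallest attainable count that is >= k
    hi = min(n, K)                # largest attainable count
    if lo > hi:
        return 0.0
    log_denom = log_comb(N, n)
    total = 0.0
    for x in range(lo, hi + 1):
        # Work in log space so large collections do not overflow.
        total += math.exp(log_comb(K, x) + log_comb(N - K, n - x) - log_denom)
    return min(total, 1.0)


def is_characteristic(k, n, K, N, alpha=0.001):
    """True if the term is over-represented at the given threshold."""
    return hypergeom_upper_tail(k, n, K, N) < alpha


# Hypothetical counts: a term found in 0.1% of a 1,000,000-article
# background, observed in 20 of the 1,000 articles of one literature.
print(is_characteristic(20, 1000, 1000, 1_000_000))
```

Under this assumed test, a fixed p-value threshold maps to a different minimum count k for each combination of literature size n and background frequency K/N, which is the dependence the abstract says the report examines.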
Original language: English (US)
Pages (from-to): 266-276
Number of pages: 11
Journal: Information (Switzerland)
Volume: 2
Issue number: 2
DOI: 10.3390/info2020266
State: Published - 2011

Cite this

Distribution of "Characteristic" Terms in MEDLINE Literatures. / Smalheiser, Neil R.; Zhou, Wei; Torvik, Vetle I.

In: Information (Switzerland), Vol. 2, No. 2, 2011, p. 266-276.

@article{7d747b92945e4263b93ce5f2ba822a37,
title = "Distribution of {"}Characteristic{"} Terms in MEDLINE Literatures",
author = "Smalheiser, {Neil R.} and Wei Zhou and Torvik, {Vetle I.}",
year = "2011",
doi = "10.3390/info2020266",
language = "English (US)",
volume = "2",
pages = "266--276",
journal = "Information (Switzerland)",
issn = "2078-2489",
publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",
number = "2",

}

TY - JOUR

T1 - Distribution of "Characteristic" Terms in MEDLINE Literatures

AU - Smalheiser, Neil R.

AU - Zhou, Wei

AU - Torvik, Vetle I.

PY - 2011

Y1 - 2011

AB - Given the occurrence frequency of any term within any set of articles within MEDLINE, we define “characteristic” terms as words and phrases that occur in that literature more frequently than expected by chance (at p <0.001 or better). In this report, we studied how the cut-off criterion varied as a function of literature size and term frequency in MEDLINE as a whole, and have compared the distribution of characteristic terms within a number of journal-defined, affiliation-defined and random literatures. We also investigated how the characteristic terms were distributed among MEDLINE titles, abstracts, and last sentence of abstracts, including “regularized” terms that appear both in the title and abstract of the same paper for at least one paper in the literature. For a set of 10 disciplinary journals, the characteristic terms comprised 18% of the total terms on average. Characteristic terms are utilized in several of our web-based services (Anne O'Tate and Arrowsmith), and should be useful for a variety of other information-processing tasks designed to improve text mining in MEDLINE.

U2 - 10.3390/info2020266

DO - 10.3390/info2020266

M3 - Article

VL - 2

SP - 266

EP - 276

JO - Information (Switzerland)

JF - Information (Switzerland)

SN - 2078-2489

IS - 2

ER -