TY - JOUR
T1 - Biological databases in the age of generative artificial intelligence
AU - Pop, Mihai
AU - Attwood, Teresa K.
AU - Blake, Judith A.
AU - Bourne, Philip E.
AU - Conesa, Ana
AU - Gaasterland, Terry
AU - Hunter, Lawrence
AU - Kingsford, Carl
AU - Kohlbacher, Oliver
AU - Lengauer, Thomas
AU - Markel, Scott
AU - Moreau, Yves
AU - Noble, William S.
AU - Orengo, Christine
AU - Ouellette, B. F. Francis
AU - Parida, Laxmi
AU - Przulj, Natasa
AU - Przytycka, Teresa M.
AU - Ranganathan, Shoba
AU - Schwartz, Russell
AU - Valencia, Alfonso
AU - Warnow, Tandy
N1 - The workshops where the ideas presented in this article were first discussed were funded by the International Society for Computational Biology (ISCB). N.P. was supported by the European Research Council (ERC) Consolidator [770827], the Spanish State Research Agency and the Ministry of Science and Innovation MCIN [PID2022-141920NB-I00/AEI/10.13039/501100011033/FEDER], UE, and the Department of Research and Universities of the Generalitat de Catalunya code 2021 [SGR 01536]. R.S. was supported by the National Human Genome Research Institute of the National Institutes of Health under award number [R01HG010589]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. T.P. was supported by the Intramural Research Program of the National Library of Medicine, NIH.
PY - 2025
Y1 - 2025
AB - Summary: Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases.
UR - http://www.scopus.com/inward/record.url?scp=105001937636&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105001937636&partnerID=8YFLogxK
U2 - 10.1093/bioadv/vbaf044
DO - 10.1093/bioadv/vbaf044
M3 - Article
C2 - 40177265
AN - SCOPUS:105001937636
SN - 2635-0041
VL - 5
JO - Bioinformatics Advances
JF - Bioinformatics Advances
IS - 1
M1 - vbaf044
ER -