Original language | English (US) |
---|---|
Pages (from-to) | 1353-1356 |
Number of pages | 4 |
Journal | Structure |
Volume | 15 |
Issue number | 11 |
DOIs | 10.1016/j.str.2007.10.003 |
State | Published - Nov 13 2007 |
ASJC Scopus subject areas
- Structural Biology
- Molecular Biology
In: Structure, Vol. 15, No. 11, 13.11.2007, p. 1353-1356.
Research output: Contribution to journal › Comment/debate › peer-review
TY - JOUR
T1 - A Protein Structure (or Function ?) Initiative
AU - Gerlt, John A.
N1 - Funding Information: As a mechanistic enzymologist, my perspective on the Protein Structure Initiative (PSI) may be somewhat different than those of many readers of Structure. I do not consider a high resolution structure as the end to a research problem but, instead, the beginning that allows the formulation of structure-based hypotheses for enzyme mechanisms and experimental strategies for elucidation of both the chemical and structural bases of those mechanisms. Indeed, for much of my career I have enjoyed productive collaborations with X-ray crystallographers that always provided interesting insights into structure-function relationships for a variety of enzyme-catalyzed reactions. I admit to having been a disconnected observer when PSI-1 was initiated. Those early stages of the PSI did little to impact the direction of my science, although I did wonder exactly how the influx of structures that eventually would emerge could impact my views on and approaches to enzymology. I was aware of and perhaps even sympathetic to criticisms that structures would be available for many highly divergent proteins of unknown function, but with no obvious way to put them to use. I was reminded of my early, short-sighted views of genome sequencing in which I questioned the wisdom of robotic sequencing of vast stretches of DNA. But, with the passage of a few years and the continued evolution of my own research interests, I have come to realize that the biological, even the enzymological, landscape has changed significantly. More than six hundred genomes later, unanticipated complexities in both biology and enzymology have become apparent. We all have come to appreciate that an unexpectedly small number of genes enables the complexities of human biology, challenging simple views of biological function.
We also have learned of the importance of lateral gene transfer, even between prokaryotes and eukaryotes, and that its role in the acquisition of adaptive advantage is far more widespread than we could have expected. And, even in my own niche in mechanistic enzymology, we have come to appreciate that the creation of “new” enzymatic functions by divergent evolution from ancestral proteins is exceedingly widespread. Point mutations and subsequent selective pressure can produce changes in substrate specificity while retaining a common chemical mechanism. The divergent members of such superfamilies (e.g., chymotrypsin, trypsin, elastase, and their homologs) can be termed “specificity diverse” (Gerlt and Babbitt, 2001). But, divergent evolution also can produce changes in the overall reactions that are catalyzed while retaining a partial chemical reaction (mechanistically diverse enzyme superfamilies). Perhaps even more surprising, divergent evolution from a common progenitor can produce enzymes that share neither substrate specificities nor chemical mechanisms (mechanistically distinct enzyme suprafamilies). In retrospect, the formation of enzyme superfamilies and suprafamilies reflects the structural adaptability of a relatively small number of folds to catalyze an amazingly diverse range of chemistry. For example, the (β/α)8-barrel fold is the most ubiquitous in the Protein Data Bank (PDB), with examples catalyzing reactions involving carbocations, carbanions, and radicals as intermediates. However, as the number of sequenced genomes increases, mechanistic enzymology, as well as the entire biological community, must come to grips with the limitations of the bounty of sequence information. With each deposited genome, sequences are identified that share no detectable sequence identity with previously sequenced proteins. Therefore, the functions of these “new” proteins are necessarily unknown.
They may be functional analogs of characterized enzymes, but without independent biochemical experiments their function is indeterminate. Perhaps 50% of the sequences that have been deposited in the databases have unknown or uncertain functions. Even for those that have been annotated, usually using computational methods, the functions of many are uncertain or incorrect, because automated methods may capture the function of a closest characterized homolog that is so divergent that the function cannot be accurately transferred. Or, even when the sequence identity is high, the function may not be conserved. That annotated functions may be uncertain, or even incorrect, may not be recognized by nonexperts and, therefore, may have even more (negative!) impact on the course of science than annotation as a “hypothetical protein”! Without reliable assignment of the functions of the proteins encoded by a genome, the biological capabilities and properties of the organism cannot be specified. But, isn't this supposed to be the goal of genomic biology? Even for exhaustively studied mechanistically diverse superfamilies for which many functions have been characterized and many structures have been determined, e.g., the enolase superfamily that has been my focus for nearly 20 years, new genomes often encode members that are sufficiently divergent in sequence that their functions cannot be specified. Thus, even when we would like to think that we understand the structural bases for the evolution of new enzymes, we are faced with the hard reality that we do not. For example, if proteins within a superfamily are to be (re)designed to catalyze potentially useful reactions with unnatural substrates, the process could be facilitated by knowing the complete set of functions that are contained in naturally evolved superfamilies, allowing Nature's own successful strategies for redesign to be applied. 
Thus, we now are faced with the problem that new advances not only in the study of enzyme superfamilies but also in biology require the assignment of function to the many unknowns present in all genomes. But, this task is far from trivial! I do not know whether this problem was fully appreciated when the guidelines for the PSI were developed. Certainly, from its inception, a major motivation of the PSI was to determine the structures of highly divergent proteins, ideally one from each fold class, so that homology modeling approaches could be used to predict the structures of all proteins discovered in genome projects, thereby facilitating functional assignment. But, the emerging realization is that few biologists have either tried or been able to take much advantage of the increasing number of structures of uncharacterized proteins that are being deposited in the PDB. With respect to the latter, structure-based predictions of function are difficult. And, for that and other reasons, the PSI is increasingly criticized both within and outside the structural biology community. Certainly, in this time of severe fiscal pressure on biomedical science, the resources that support the PSI (and other NIH Roadmap Initiatives) are viewed as threats to the investigator-initiated (R01) science that long has been the mainstay in American biomedical science. I do not agree with that view. Although I believe in “R01 science” (I do it myself), I am convinced that many (most?) important biological problems cannot be solved by a single investigator. Virtually all members of the mechanistic enzymology community recognize that collaborations with structural biologists are essential if we are to investigate the structural bases of catalysis. 
Many of us also recognize that collaborations with computational biologists are also essential if we are to recognize other critical aspects of function, including the dynamic features that are required for transition state stabilization, because structures of complexes with transition state analogs may fail to capture the dynamical processes that are an integral part of catalysis. In many ways, the problem of assigning functions to uncharacterized proteins is analogous to the problems routinely faced by mechanistic enzymologists, but functional assignment is just more complicated, at least with contemporary experimental approaches and computational algorithms: how can protein structure alone be used to generate testable hypotheses of biological function? As enzymologists, we have the luxury of knowing the enzymatic function as we analyze the structure and design experimental approaches to test mechanistic hypotheses. Now, when we do not know the function, how do we start to establish structure-function relationships? Clearly, we need expansive strategies for assigning functions to the unknown, uncharacterized proteins. Complaining about the resources that are currently devoted to the PSI and thereby discouraging strategic community-wide efforts to solve the functional assignment problem is not doing any favor to biology, even in a time when resources are limited. Instead, I suggest that as NIH/NIGMS evaluates how the PSI should be continued beyond the current “production phase,” the enzymological, structural, computational, and biological communities together should consider developing new strategies for using the advances in structural biology that have emerged from the PSI (including higher throughput and automation) to tackle the problem of devising integrated approaches for functional assignment. This opinion is based on experience. 
Confronted with the specific problem that perhaps 50% of the members of the mechanistically diverse enolase superfamily and an even larger fraction of the members of the mechanistically diverse amidohydrolase superfamily have unknown/uncertain functions, a group of scientists with diverse expertise (Patricia Babbitt [UCSF], bioinformatics; Andrej Sali [UCSF], homology modeling; Matthew Jacobson [UCSF], homology modeling and ligand docking; Brian Shoichet [UCSF], ligand docking; Steven Almo [Albert Einstein College of Medicine], X-ray crystallography; Frank Raushel [Texas A&M], amidohydrolase superfamily enzymology; and myself, enolase superfamily enzymology) decided to tackle the problem of functional assignment in these superfamilies that share the (β/α)8-barrel fold. We are participants in an NIGMS-funded Program Project (GM-71790) entitled “Deciphering Enzyme Specificity,” with its goal to define and implement an integrated structure-computation-function approach for facilitating functional assignments of uncharacterized members of both superfamilies. As this Program Project has evolved over the past three plus years, we have established previously elusive but now productive connections between functional and computational enzymology. As a result, we predicted the substrate specificity and enzymatic function of an unknown member of the amidohydrolase superfamily by in silico docking a library of high energy tetrahedral intermediates into a PSI-determined structure (Hermann et al., 2007). We also computationally predicted the substrate specificity and enzymatic function of an unknown member of the enolase superfamily using homology modeling of the uncharacterized protein's sequence to generate a structure into which a library of possible substrates was docked (Song et al., 2007).
In both cases, the computational efforts generated a “short list” of potential substrates that was experimentally tested; also in both cases, the actual substrates were at or near the top of the list. And, in both cases, subsequent structural analyses of liganded complexes confirmed that the structural bases for the prediction of specificity were correct. In neither case did we anticipate that the results of the computational efforts alone would be sufficient to establish the physiological function. Instead, we used the predictions to expedite functional assignment based on laboratory experiments. These were the first successful examples of the use of computational methods to predict enzymatic functions. What is the advantage of this integrated approach for assigning substrate specificity and function? The physical acquisition of a “complete” library of metabolites for functional screening is impossible. Many cannot be purchased. And, for those that cannot be purchased, the syntheses can be tedious and/or expensive, with comprehensive synthetic efforts poorly justified based on the low probability that a particular compound will be the substrate of the specific enzyme for which function is sought. However, reliable computational predictions of substrate specificity would allow focused efforts to identify and screen size-restricted physical libraries that would be likely to contain the physiological substrate. What are the disadvantages of this approach? First, databases, such as KEGG or BioCyc, do not contain all metabolites, because the complete metabolome has not been defined for any organism. Second, conformational changes often accompany ligand binding, so structures that are determined without ligands and homology models derived therefrom often will not be useful for predictions of substrate specificity. 
In the case of our successful prediction of the N-succinyl Arg/Lys racemase function in the enolase superfamily, the template for homology modeling was the liganded structure of a homolog (Song et al., 2007). Thus, the mobile loops that define the active site were properly positioned in the model, although flexible receptor docking algorithms for the modeled side chains were necessary for the correct prediction of substrate specificity. In the case of the S-adenosylhomocysteine deaminase in the amidohydrolase family, the deposited structure was in a closed conformation that allowed successful docking of the library of high energy intermediates (Hermann et al., 2007). In both cases, the in silico ligand library contained the real substrate. However, one or both of these favorable situations need not apply. Thus, our studies define areas for future attention in integrated computational-structural-functional approaches for function prediction. The advantages of studying functionally diverse superfamilies include the expectation that nontrivial functional assignments can be made using established structure-function relationships as a foundation, e.g., in the enolase superfamily the reactions involve enolization of a carboxylate substrate that is facilitated by a required divalent metal ion. While this partial reaction restricts the ligand libraries for in silico docking, it also suggests the structures of substrate/intermediate fragments that might be used as ligands in cocrystallization so that substrate-induced conformational changes can be realized, thereby enhancing the likelihood of productive library docking. The successful application of this approach would allow insights into the nature of the conformational changes that accompany ligand binding, thereby informing the development of computational approaches to solve the general problem of predicting conformational changes.
Obviously, this is a real challenge, but one that requires attention and resources. Also, within a superfamily, that the presently recognized metabolite library is not complete does not pose an insurmountable problem for in silico ligand docking. Although the acquisition of a complete physical library of potential substrates is impossible, the formulation of a complete in silico library of potential substrates is “easy.” In other words, working toward the goal of “perfecting” in silico ligand docking to predict the possible structures of unknown substrates will focus the synthetic efforts that are necessary to obtain the compounds that must be experimentally tested. So, what do I propose for the future of the PSI? Do I propose that it be abandoned? No! But, I do propose that the emphasis on the determination of structures of divergent proteins based solely on sequence identities be decreased. The discovery of new folds is important to an understanding of the boundaries of evolutionary processes and, also, provides necessary challenges to the segment of the computational community that is interested in de novo structural prediction. However, I propose that the major emphasis now should be placed on comprehensive approaches to functional prediction. From my experience, these efforts will require new areas of expertise that are beyond the funds that can be provided by the Program Project funding mechanism. Expertise is required in both synthetic chemistry to prepare even limited libraries of potential substrates and, also, in biological verification of functions “assigned” based on enzymatic assays. With respect to the latter, the current state-of-the-art is the measured value of kcat/Km (does it approach the diffusion controlled limit of 10^6–10^8 M^-1 sec^-1?).
But, when an operon context is available for the unknown, substrate specificity analyses of the other enzymes that constitute a metabolic pathway (likely previously unknown) can be used to confirm the functional assignment. Although operon context is applicable only to microbial unknowns, this highlights the need for physiological approaches to support functional assignments. Resources for screening unknown proteins for their ability to bind small molecules, instead of relying on measurements of enzymatic activity, also would be very useful. Enzymatic activity results from favorable disposition of reactive ligands with active site residues. However, geometrically less precise interactions may be sufficient to produce a closed conformation, which, when structurally characterized, can be used to enable successful in silico docking. These efforts would complement development of computational approaches for predicting substrate-induced conformational changes. However, significant resources for focused structural studies must remain. Our Program Project established a “community targets” collaboration with the New York Structural GenomiX Research Consortium (NYSGXRC), one of the four PSI-2 production centers, to determine structures of members of the enolase and amidohydrolase superfamilies for which the sequence identities to structurally characterized members can be >30%, not <30% as required for “targets” intended to populate fold space. Also, through this collaboration we are obtaining liganded structures when computational predictions and/or library screening provides suitable ligand candidates. These additional structural resources exceed those that can be supported by the Program Project funding mechanism, but they are essential as we define and refine our approaches for functional assignment. The implementation of an expanded program for functional assignment will require specific foci.
We are not yet ready for high throughput functional assignment, although that is the long-term goal. At this early stage, each component of the process (target selection, protein purification, structural determination, application and refinement of computational algorithms, experimental testing, and biological verification) must be performed carefully and deliberately. But, resources for such comprehensive programs are required if the community is to solve the problem of devising more effective approaches for facilitated functional assignment. So, my recommendation is that NIGMS morph the structure-centered efforts of the Protein Structure Initiative into fully integrated, multidisciplinary efforts of a Protein Function Initiative. Eventually, the structures and functions of all proteins need to be established, but current efforts to enable prediction of the structures of all proteins are premature. Instead, NIGMS should place its emphasis on the development of informed, but general, solutions to the problem of functional assignment. In the short term this necessarily will involve focus on limited collections of proteins (e.g., members of mechanistically diverse superfamilies or those encoded by pathogens); but, in the long term the resources that are invested can be expected to benefit the entire biomedical community.
PY - 2007/11/13
Y1 - 2007/11/13
UR - http://www.scopus.com/inward/record.url?scp=35848958144&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35848958144&partnerID=8YFLogxK
U2 - 10.1016/j.str.2007.10.003
DO - 10.1016/j.str.2007.10.003
M3 - Comment/debate
C2 - 17997960
AN - SCOPUS:35848958144
SN - 0969-2126
VL - 15
SP - 1353
EP - 1356
JO - Structure
JF - Structure
IS - 11
ER -