当前位置:首页 >> >>

Protein structure prediction force fields parametrization with quasi Newtonian dynamics, Pr

PROTEINS: Structure, Function, and Genetics 27:367–384 (1997)

Protein Structure Prediction Force Fields: Parametrization With Quasi-Newtonian Dynamics
Patrick Ulrich,1 Walter Scott,1 Wilfred F. van Gunsteren,1 and Andrew E. Torda2* 1Computational Chemistry (Physical Chemistry) ETH Zentrum, 8092 Zurich, Switzerland ¨ 2The Research School of Chemistry, The Australian National University, Canberra ACT 0200, Australia

ABSTRACT We present an unusual method for parametrizing low-resolution force ?elds of the type used for protein structure prediction. Force ?eld parameters were determined by assigning each a ?ctitious mass and using a quasi-molecular dynamics algorithm in parameter space. The quasi-energy term favored folded native structures and speci?cally penalized folded nonnative structures. The force ?eld was generated after optimizing less than 70 adjustable parameters, but shows a strong ability to discriminate between native structures and compact misfolded alternatives. The functional form of the force ?eld was chosen as in molecular mechanics and is not table-driven. It is continuous with continuous derivatives and is thus suitable for use with algorithms such as energy minimization or newtonian dynamics. Proteins 27:367–384, 1997.

r 1997 Wiley-Liss, Inc.
Key words: protein folding; force ?eld; structure prediction; molecular dynamics INTRODUCTION Ideally, one would like to be able to predict protein structures based only on the amino acid sequence and without recourse to experiment. This may not be realistic, but it may well be feasible to recognize when a new sequence will adopt a known fold1–4 or whether a proposed structure is totally misfolded.5 To do this, one needs some function of coordinates that yields a number re?ecting the ?t of sequence and structure. In molecular mechanics nomenclature, this is a potential energy, but it might also be referred to as a score6 or pro?le.7 The incentive to develop such useful potential energy functions continually increases as more protein sequences are determined and as the ?rst complete genomes are sequenced. To be practical, such potential energy functions are based on simple representations of proteins using only one or two interaction sites per residue. Beyond that, there is a multitude of approaches. For example, functions may be de?ned in discrete (usually lattice) coordinates8–12 for the sake of speed or in continuous space for detail. In this work, we concenr 1997 WILEY-LISS, INC.

trate only on continuous force ?elds. A less obvious, but conceptually important distinction can be made on the philosophy of the force ?eld. Some force ?elds can be seen as tightly connected to an underlying physical basis in that they should in some way be an average over physical terms.13–17 Alternatively, a force ?eld may be statistical in nature. Although it will be some kind of average of the underlying physical interactions, the functional forms may bear little relation to common terms such as electrostatic or Lennard-Jones potential energy functions. Because the functional forms are not physically based, they may only be useful for a narrow range of structures. The force ?eld in this work is of the simple statistical type, which immediately limits its ability to generalize. The parametrization procedure is based on native and other compact conformations so one would not expect it to extrapolate to unfolded conformations. One can also see that, although we use nomenclature from molecular mechanics, our energies bear no relation to any real potential energy. Given these caveats, the aims of this force ?eld parametrization can be stated. The calculations are based on a wide variety of proteins and should encapsulate properties common to proteins in general. The parameters will only be relevant to compact structures, but they should be useful across many sequences and folded conformations. The ability to generalize is further enhanced by using a small number (,70) of adjustable parameters so as to avoid ?tting to noise in the dataset.18 This relatively small number of parameters was brought about by using only four kinds of interaction and grouping interaction types (not amino acids) into classes. The method of parametrization in this work is quite unusual in the area of low-resolution force ?elds. The procedure began by taking the idea of Crippen and coworkers19,20 that the force ?eld should speci?cally penalize misfolded structures. This idea was then cast into the form of a continuous pseudoenergy function. The parameters were then given

*Correspondence to: Dr. Andrew E. Torda, The Research School of Chemistry, The Australian National University, Canberra, 0200, Australia. E-mail: andrew.torda@anu.edu.au. Received 8 March 1996; accepted 13 August 1996.

TABLE I. Relative Weights of Force Field Terms Terma Kn,n12 Kn,n13 Klong


Valueb 2.0 1.0 0.1
to Equation (1). units.

Interaction type Pseudoangle Pseudotorsion Long range


?ctitious masses, treated like particles, and subjected to quasi-newtonian dynamics. This dynamics simulation was solely used as an optimization process to improve initial estimates of parameters and was in no way a physical simulation. METHODS The following sections describe the protein model, interaction functions and parameters, and a methodology for optimizing the parameters. To make this interpretable, certain conventions are used. Superscripts are used to specify topological proximity, so En,n11 may refer to an interaction energy of two adjacent amino acids and sn,n13 would refer to a s parameter for two residues with two intervening residues. Lowercase subscripts refer to any pair of residues, so rab is the distance between residues a and b, regardless of their type. Force ?eld parameters depend on topological proximity and amino acid type, so one also needs a convention for specifying the type or class of amino acid. An uppercase subscript is used here, so eJ is a parameter for residues of type J, and slong is the s parameter for XY interactions between residues of types X and Y at a long topological distance. Protein Model and Interaction Function Energies were calculated as if they were true potential energies, so the nomenclature is taken from conventional molecular mechanics. Each amino acid was represented by a single point located at the Ca atom. These points then interacted in one of four ways depending on their topological proximity. The total potential energy of a protein was calculated by summing four terms: Etotal 5 K n,n11

The interaction between adjacent residues is analogous to a bond potential energy in a normal force ?eld, although there is no real bond there. This gives rise to the ?rst term, En,n11, in Equation (1). Similarly, one can de?ne an energy term that depends on the angle included by three adjacent residues. This is the second term in Equation (1). Residues separated by two intervening connected residues form a pseudotorsion angle and their interaction gives rise to the En,n13 term in Equation (1). All other amino acid pairs interact by a fourth kind of interaction, Elong. Each of the terms in Equation (1) had a different dependence on either parameters or amino acid types, and each is described below. Although some initial tests were performed using a simple harmonic form for the pseudobond term, this was replaced in ?nal implementations. We used the iterative algorithm, SHAKE21 to hold pseudobonds to 3.81 ?, the typical distance between Ca atoms in adjacent residues with trans peptide bonds.12,16 This concedes some error due to the small fraction of cis peptide bonds. The effect of this was that energy from a pseudobond term was redistributed into the other terms of the force ?eld. A Lennard-Jones-like term was used for residues separated by one intervening residue in the sequence. Considering three sequential residues i, j, and k of types I, J, and K, the interaction energy En,n12 depended on the type of central residue J and ik the distance rik between residues i and k. The interaction energy was calculated according to E n,n12 (rik, e n,n12, sn,n12) ik J J 5 en,n12 5 J


s n,n12 12 J rik




s n,n,12 10 j rik




pseudo bond


E n,n11 1 K n,n12

pseudo angle


E n,n12

1 K n,n13

pseudo torsion angles


E n,n13 1 K long

other pairs




Each of the four terms was weighted by an arbitrary constant, K, as given in Table I.

The energy has been written explicitly as a function of force-?eld parameters, as well as the distance, to highlight the fact that these were later treated as adjustable quantities. This has been deliberately cast so as to highlight the similarity with a LennardJones term in a molecular mechanics force ?eld.22 sn,n12 gives the distance of lowest energy and en,n12 J J gives the depth of the energy well at that distance. The single uppercase subscript J re?ects the fact that the parameters for the n, n 1 2 interaction depend only on the type of the central residue J. When calculating the energy of a protein, the contribution from Equation (2) was summed over every sequential triplet of residues not spanning a chain break. The same functional form was used for residues separated by two intervening residues, but with a different set of parameters and different dependence on parameters. Given four sequential residues i, j, k, and l of types I, J, K, and L, the energy En,n13 il depended on the types of the central pair of residues



J and K, and the distance ril between the outer residues. The interaction energy was calculated according to E n,n13(ril, e n,n13, s n,n13) il JK JK 5en,n13 5 JK


s n,n13 12 JK ril




s n,n,13 10 JK ril




This term’s contribution to the total energy of the protein was calculated by summing over all sets of four sequential residues that did not span a chain break. The last component of the potential energy, Elong accounted, in a mean manner, for the tendency of certain kinds of residues to adopt certain distances E long(rij, e long, slong) ij IJ IJ 5 1 11 elong IJ
Fig. 1. Total energy as a function of protein size. Energy is in arbitrary units. Nres refers to the number of residues. Each point represents a protein in the calibration set of 104 proteins. Energies were calculated using the initial force ?eld before parameter dynamics.

31 2

s long 12 IJ

2 12

1 24
s long 10 IJ rij



This term was summed over every pair of residues separated by more than three pseudobonds, but within a cutoff of 15 ?. This included residues within a protein, across subunits of polymeric proteins, and any bound small peptides. The choice of cutoff size was arbitrary, but ensured that, beyond a certain size, the total energy would have a linear dependence on the size of a protein. No term was used for disul?de bridges, since it was intended to use the force ?eld when the location of disul?de bridges would not be known. Their structural effect was only included in a mean manner along with all the other in?uences on parameters for Equation (1). Protein Set For Parametrization For force ?eld development, one requires a set of calibration proteins. This set should be large, but contain only reliable, well-de?ned structures without sequence or structural homology between members. The basis for the calibration set was provided by Hobohm and Sander23 and consisted of proteins with less than 35% sequence homology. Structures with crystallographic R factors greater than 0.25 or resolutions worse than 2.8 ? were removed. Proteins with missing internal Ca coordinates were deleted, but those missing up to 7 C- or N-terminal residues were included. Decisions about the resolution of NMR structures were avoided by omitting them entirely. Similarly, any proteins with large prosthetic groups or internal metal ions were omitted. This provided an initial set of 108 proteins. A further selection step was applied by calculating the energy of each protein using the initial crude parameter set described below. This suggested that

four proteins (1cid, 1omf, 1fai, and 1tnf) were of high energy (data not shown). The energy was then decomposed into the individual terms of Equation (1). In each case, the high energy could be attributed to an unusually short (#4.51 ?) n, n 1 2 Ca—Ca distance. These entries were removed from the reference protein set, even though they met the criteria of resolution and R factor. Figure 1 shows the potential energy for each protein in the set as a function of size. Energies were calculated using the unre?ned force ?eld and serve to show that there were no obvious outliers or incorrect structures in the calibration set. This resulted in the set of 104 proteins given in Table II. The choice of proteins was somewhat arbitrary, but generally erred on the side of reliable structures. This may have unfairly excluded highquality structures, but allowed such structures to be used for testing. Initial Parameter Estimation The optimization of s parameters for Equations (2) to (4) was a large calculation based on energy differences between native and misfolded structures. Initial s values, however, were calculated using a simple method based only on native structures and designed to minimize the energy after summing across all proteins.19 For example, for the pseudoangle term, Equation (2), the derivative of the energy with respect to sn,n12 was set to zero and solved for J for sn,n12, assuming en,n12 5 1. This yielded an J J expression.

s n,n12 5 J

3 4
NJi j k




r 2q ik

NJi j k


212 ik



TABLE II. Proteins Used for Force Field Calibration Protein codea 1aaj 1aak 1aap 1aep 1apa 1arb 1ayh 1baa 1bgh 1bn2 1bov 1bsa 1cau 1cd8 1cdt 1cew 1cmb 1col 1cse 1ctf 1dfn 1dri 1dsb 1eaf 1ede 1end 1fas 1fba 1gdh 1gmf 1hdd 1hle 1hsb 1hyp 1ifa 1ifc 1ipd 1lfb 1lis 1lts 1mpp 1nar 1ndk 1omp 1onc 1ppa 1ppb 1pgd 1pii 1plc 1poh 1pos 1rcb 1rtc 1rve 1sbp 1sgt 1shg 1shf Number of subunits Number of residues 105 150 112 161 266 268 214 243 85 354 345 321 365 114 120 108 208 408 337 68 60 271 376 243 310 137 61 1440 640 238 114 375 374 75 159 131 345 77 131 741 357 289 148 370 104 121 298 469 452 99 85 212 129 268 488 309 223 57 118 Residue coordinates unknownb Resolution (?) 1.8 2.4 1.5 2.7 2.3 1.2 2.0 2.8 1.8 2.8 2.2 2.0 2.3 2.6 2.5 2.0 1.8 2.4 1.2 1.7 1.9 1.7 2.0 2.3 1.9 1.6 1.8 1.9 2.4 2.4 2.8 2.0 1.9 1.8 2.6 1.2 2.2 2.8 1.9 2.0 2.0 1.8 2.2 1.8 1.7 2.0 1.9 2.5 2.0 1.3 2.0 2.6 2.2 2.3 2.5 1.7 1.7 1.8 1.9 R factorc 0.16 0.22 0.18 0.21 0.17 0.15 0.16 0.20 0.19 0.23 0.18 0.17 0.19 0.19 0.20 0.20 0.20 0.18 0.18 0.17 0.19 0.19 0.17 0.20 0.16 0.20 0.15 0.18 0.19 0.20 0.22 0.18 0.22 0.19 0.20 0.17 0.19 0.21 0.19 0.18 0.16 0.16 0.20 0.21 0.18 0.16 0.16 0.19 0.17 0.15 0.14 0.22 0.22 0.23 0.19 0.18 0.16 0.20 0.18


8NC 5N 5C


5 3 2 2 2 2 2 2 2


4 2 2 2 2 4








TABLE II. (Continued) Protein codea 1sim 1sry 1tbp 1tlk 1tml 1tpk 1ttb 1ula 1utg 1vaa 2alp 2cas 2cpl 2cro 2hpr 2liv 2pmg 2pol 2rn2 2sas 2scp 2sga 2snv 2tgi 2zta 3adk 3cd4 3chy 3dpa 3eca 3il8 3mon 3rp2 3sgb 3tgl 4blm 4enl 4gcr 4ins 4sgb 6taa 7icd 8ilb 9rnt 9wga

Number of subunits 2 2

Number of residues 381 842 360 103 286 264 254 289 70 381 196 548 164 65 87 344 1122 732 155 185 348 181 151 112 62 194 178 128 218 1308 68 376 448 235 265 512 436 174 102 236 478 414 146 104 342

Residue coordinates unknownb

Resolution (?) 2.0 2.5 2.6 2.8 1.8 2.4 1.7 2.8 1.3 2.3 1.7 3.0 1.6 2.4 2.0 2.4 2.7 2.5 1.5 2.4 2.0 1.5 2.8 1.8 1.8 2.1 2.2 1.7 2.5 2.4 2.0 2.8 1.9 1.8 1.9 2.0 1.9 1.5 1.5 2.1 2.1 2.4 2.4 1.5 1.8

R factorc 0.19 0.18 0.21 0.18 0.18 0.18 0.16 0.20 0.23 0.17 0.13 0.21 0.18 0.19 0.15 0.18 0.22 0.18 0.20 0.20 0.18 0.13 0.20 0.17 0.18 0.19 0.20 0.15 0.18 0.15 0.19 0.19 0.19 0.12 0.13 0.02 0.15 0.18 0.15 0.14 0.20 0.18 0.16 0.14 0.17

3 2


2 2



8 8 2 2 2

4 2



acquisition code. bNumber of residues at the N or C terminus for which no coordinates are available. cCrystallographic R factor.

where q 5 10 for the pseudoangle term of Equation (2). The summation runs over all NJijk triplets of sequential residues not spanning chain breaks and across the protein calibration set and where the central residue is of amino acid class J. The expression is still valid if class J contains more than one amino acid type. Analogous expressions with q 5 10 and q 5 1 were used for the sn,n13 and slong paramJK IJ

eters in the pseudotorsion and long-range interaction terms, respectively. The various e parameters were calculated using a method designed to weight those interactions which showed a preference for a certain distance. Again, one can use the example of the parameter for the pseudoangle term as an example. If a certain residue type X causes a very narrow range of n, n 1 2



distances, then the corresponding en,n12 should be X large. If residues surrounding residues of class Y occupy a wide range of distances, then the en,n12 Y parameter should be small. This statistical tendency n,n12 can be quanti?ed by ?rst setting e J 5 1 in Equation (2) for the amino acid type class J and then summing over all relevant pairs to obtain
NJi j k

en,n12 5 J


n,n,12 Jijk

NJi j k



As for the previous calculations, the summation runs over all triplets of adjacent residues not spanning a chain break and where the central residue is of amino acid class J. The same expression, summed over the appropriate pair energies was used for the n, n 1 3 and long-range interactions to generate en,n13 and elong parameters, respectively. The example in Equation (6) is based on the n, n 1 2 term, but a physical basis for the method can be seen by considering the example of the long-range interaction parameters. A pair of residues such as Asp and Arg are oppositely charged and may have a strong statistical preference for interacting at a long certain distance (close to sAsp-Arg ). This distance of minimum interaction energy will result in a maximum value for Equation (6). By contrast, Gly-Gly pairs show little tendency to prefer any particular distance. Energies calculated from Equation (4) will often be near zero, and Equation (6) will yield a value closer to zero. This interaction will have correspondingly little weight in the ?nal force ?eld. Parameter Classi?cation For the pseudoangle term, Equation (2), the choice of parameter depended on the type of the central residue, so there were 20 sn,n12 and 20 en,n12 parameters or, in this nomenclature, 20 amino acid classes, each having one member. No parameter reduction was applied to these terms. For the n, n 1 3 and long-range parameters, an algorithm was used to reduce the number of adjustable parameters and improve the method’s ability to generalize.18 Both of these potential energy functions depended on the type of two residues, so without some classi?cation, there would be (20 3 21)/ 2 5 210 pairwise interaction parameters. The aim of the algorithm was to produce a classi?cation that reduced the total energy summed across the reference protein set. Physically, this could be seen as looking for those interactions that were most similar and putting them together in a class. The method was applied separately for the n, n 1 3 and long-range interaction terms and required that for any group or class of pairwise interactions, one could use Equation (5) to calculate a s (distance of lowest energy) for that group or class. For example, a

class may consist of Asp-Cys and Glu-Leu interactions. Then, for all interactions of these types across the calibration set of 104 proteins, a s could be calculated. Given some classi?cation (indicated by Cn ), one can calculate a total energy over all interaction classes and all proteins. This is the energy E(Cn ) associated with some classi?cation, Cn. For initialization, one treats each interaction type separately and calculates a s value for each pairwise interaction type using Equation (5). This forms a list that can be sorted according to s value. The result will be that pairs that tend to be close to each other (small s) are at one end of the list and pairs that repel (in a statistical sense) will be at the other end of the list. The classi?cation procedure now operates by initially putting all interactions into a single class. So, all members of the list form class 1 (indicated by the lower index), C1, of the classi?cation comprising only 1 one class (indicated by the upper index, nc). When calculating the s value s1 for class C1, the summa1 1 tion in Equation (5) runs over all pairwise interaction types of class C1. At each step n ($2) of the 1 classi?cation procedure, one of the classes present is partitioned into two parts such that the decrease in the energy of classi?cation Cn, E(Cn ), with respect to the energy E(Cn21 ) is maximized. This can be illustrated by an example. Figure 2 shows a simple case with six s values, each corresponding to a different pairwise interaction type. The target in this example was arbitrarily set to 4 classes (nt 5 4). Initial estimates of s were calculated for each interaction type, sorted and put into one class (nc 5 1), as shown in the top row of the ?gure. The s value of this class, s1 depends on the summation over all six interaction 1 types in Equation (5). The second row shows np 5 5 possible places to divide the set, each marked by a light line. Each position is tested in turn and the one which produces the lowest energy selected. This results in two classes (third row) where the ?rst class consists of interactions types 1 to 4 and the second of interaction types 5 and 6. There are then np 5 4 possible ways to add a division (fourth row). Picking the position that results in lowest energy splits the ?rst class. The result is shown on the ?fth row with three classes of s values. At the next step, there are np 5 3 possible places to divide a class. Choosing one of these results in the ?nal row with four classes. Since the number of classes equals the target number, (nc 5 nt), the algorithm stops. The algorithm can now be described more formally. Let C be some set (class) of pairwise interaction types. Initially, there is one set, C1 containing all pairwise interaction types (210 in this calculation). By the end of the procedure there will be sets of interaction types C1 . . . Cnt forming a classi?cation Cnt. The classi?cation or set of interaction classes is denoted S. Because one needs to rank classi?cations in terms of energy, the algorithm also requires a set



Fig. 2. Algorithm for classi?cation of pairwise interaction types. Each number (1–6) represents a pairwise interaction. Subscripts indicate the class to which the s is assigned. Thin lines show possible divisions of the set. Thick lines show divisions introduced during the progress of the algorithm. Steps are described in text.

K where each member is an energy, calculated over the whole set of calibration proteins. Each energy, EDy will correspond to a trial classi?cation Cnc11, Dy which is obtained from classi?cation Cnc by partitioning class D (or Cnc ) of Cnc into two parts indicated by D y. So, one has EDy 5 E(Cnc11 ). For a pseudocode Dy description we use a generic SET data type:
variables CI . . . Cnt S K L, L1, L2 pseudocode C1 Z 5x 0 x is a pairwise interaction type6 sort C according to the s for each interaction type as calculated through Equation (5) S [ 5C16 nc [ 1 while (nc , nt) 5 K[B for (D [ each existing class C ) 5 for ( y [ every possible division of class D)5 EDy [ total energy associated with class D and division y K [ K < 5EDy6 6 6 sort set K according to EDy /* ?nd the division giving lowest total energy */ let L be the set associated with the lowest energy member of K split L into components L1 and L2 according to the division y referenced by EDy S[S2L S [ S < L1 S [ S < L2 nc [ nc 11 /* remove the set which is to be split */ /* Add ?rst component of new set */ /* Add second component of new set */ /* Add this energy to set K */ /* Initialize to empty set */ /* The ?rst member of the set of sets is C1 */ SET SET SET SET /* Each set is a class of interactions */ /*Classi?cation C nc, a set of C1 . . . Cnc */ /* Set of energies */ /* Set of interactions */

This scheme was used to reduce the number of n, n 1 3 interaction pair types from 210 to 36 (nt 5 36) and the long-range interaction pair types from 210 to 10 (nt 5 10). This was the same number of classes as in early work by Crippen and Snow,19 but the classes here were for interaction types rather than amino acid types. This means, for example, that a Trp-Tyr interaction may be in one class, while a Trp-Gly interaction could be in another. Generation and Initial Selection of Alternative Structures Compact misfolded alternative structures were generated for each of the 104 native structures in the calibration set. These alternative structures were constructed by taking each native sequence and threading it on to the Ca coordinates of every other protein with at least the same number of residues.24,25 If the native structure had Nnat residues, then the ?rst alternative structure used residues 1 to Nnat in the ?rst larger (template) structure. The second alternative structure came from the Ca coordinates of residues 2 to Nnat 1 1 in the template structure and so on. Thus, each template of Ntmpl residues added Ntmpl 2 Nnat 1 1 alternative conformations. Unlike in previous work, the number of alternative structures was then doubled by threading each sequence backward on template coordinates. Lastly, each native structure provided one more alternative structure by threading the sequence backward onto its own native coordinates. As tem-



plates, only unbroken protein subunits were used. Proteins with chain breaks were threaded on to larger template structures with a 1-residue gap at the chain break. Some alternative structures were then discarded. An alternative structure, which was structurally similar to a native, was not used. Ideally, redundancy in the alternative structures would be removed by comparing every alternative conformation with every other alternative conformation, but this would be a very expensive calculation for little gain, since the calibration set of 104 proteins was chosen so as to have little internal homology. A more practical approach was to compare each alternative conformation with the previously generated alternative conformation and only accept it if it was signi?cantly different. This procedure was very efficient, since successively generated alternative structures are most likely to be similar to each other. Structural similarity was de?ned by comparing distance matrices from structures. This measure has been called the distance matrix error (DME)26 or distance root mean square difference (Drmsd )27 and is given by

and so attempting to optimize them results in in?nite values. Only s values for the (n, n 1 2), (n,n 1 3) and long-range terms were subjected to dynamics. The potential energy function for parameters consisted of three terms. The ?rst two were based on the goals of the parametrization. First, native structures should be of low energy. Second, misfolded alternatives should be of higher energy. Before writing this formally, one should note that the number of alternative structures per protein differs within the protein calibration set and the proteins themselves differ in the number of interactions they contribute. This means that a large protein (few alternative structures) would dominate the native structure term, while a small protein would dominate the alternative structure term. The scheme here ?rst normalized to account for the number of alternative structures for a protein and then scaled the total parameter force due to each protein to be of the same magnitude. The parameter energy is of a different form to the pseudoenergies calculated for protein structures, so it is denoted Es. For a single protein and its alternative structures, Es is given by E s(r, e, s) 5 Etotal(r NAT, e, s) 2

Drmsd(A, B) 5




(r 2 1) o



2 rBij)





where Nres is the number of residues, and rAij and rBij are the distances between particles i and j in structures A and B, respectively. The threshold for structural similarity, D, has a cube root dependence on the number of amino acids. We used the value from Maiorov and Crippen27 D s 5 24.54 1 2.36 (Nres)1/3 (8)

o Nalt a51



Etotal(ra, e, s) 1

1 Nprot



Although their prescription was based on the root mean square difference of cartesian coordinates, the effect in this work was that too low a value for Dsig was used and the discrimination problem (below) was made more difficult. Parameter Dynamics and s Optimization The initial parameter estimates, described above, were based on native protein structures only. In the next step, parameters were optimized so as to simultaneously disfavor incorrectly folded alternative structures. This was done using an approximation to newtonian dynamics in parameter space. The word approximation is deliberately chosen, since the method did not conserve (parameter) energy, the parameter trajectories did not have continuous ?rst derivatives, and the scheme relied on a series of pragmatic decisions necessary to make the calculations computationally tractable. The e parameters [Eqs. (2) to (4)] were not optimized. In our formulation, they are scaling factors

where rNAT is the coordinate vector for the native structure and Nprot is the total number of proteins in the calibration set. The summation in the second term runs over the pseudoenergy of each of the Nalt alternative structures, with coordinates labeled ra. e and s are the parameter vectors as used in Equations (2) to (4). Erestr was a device to dissuade individual parameters s of the vector s from entering unlikely areas of parameter space and consisted of harmonic terms to enforce minimum and maximum s values (smin and smax, respectively), given by



5 5

krestr 2 krestr 2

(s 2 smin)2 (s 2 smax)

if s , smin (10) if s , smax

krestr was set to 200 energy units/?2 and smin to 3 ?. smax was set to two pseudobond lengths (2 3 3.8 5 7.6 ?) for the n, n 1 2 pseudoenergy term [Eq. (2)] and three pseudobond lengths (11.4 ?) for the n, n 1 3 pseudoenergy term [Eq. (3)]. An initial force, Fini, acting on the elements of s, due to each protein and its alternative structures was calculated by taking the opposite of the derivative of Equation (9) with respect to s. The ?nal force



vector, Ftotal was given by 1 Nprot

Ftotal(r, e, s) 5

o 10 F

I Fini(r, e, s I ini

(r, e, s 0



where F1 is the force due to protein I and its ini alternative structures and the summation runs over all Nprot proteins in the calibration set. Forcing the contribution of each protein to be the same does not lead to discontinuities in the parameter trajectories, but potentially violates energy conservation. Parameter dynamics were run with arbitrary units. Boltzmann’s constant was taken as 1, and, by trial and error, a time step of 5 3 1024 was chosen for integrating the equations of motion. Initial temperatures were not taken from a maxwellian distribution, as is common practice in molecular dynamics simulations. A less rigorous approach was used where parameters were allowed to develop velocities by moving in the ?eld of Equation (9) and the temperature maintained by coupling to a heat bath.28 Separate temperature baths were kept for slong parameters [Eq. (4)] and for the sn,n12 and sn,n13 parameters [Eqs. (2) and (3)]. The baths were set at 20 and 70 temperature units, respectively. Coupling to both baths was initially set to 5 3 1023 time units and this was increased by an exponential scheme to 7.1 3 1023 over 100 time steps. At each time step, the SHAKE algorithm21 was applied to all native structures and their alternatives to bring pseudobonds to regular lengths. The scheme described was then modi?ed to improve its ability to discriminate correctly from incorrectly folded structures and to speed the calculations. The initial parameters showed some capability for discrimination. That is, many misfolded structures were of higher energy than native conformations, and trial calculations suggested that they remain of high energy after parameter dynamics. Clearly, they contributed little of use to the force experienced by the parameters. This led to a screening, applied every step of parameter dynamics, to select misfolded alternative structures of low energy. Intuitively, one might choose alternative structures of energy lower than the corresponding native structure, but this would not be a strong enough condition for a useful force ?eld. One wants the difference between correct and incorrect structures to be as large as possible. To this end, a threshold energy (Ethresh ) was de?ned, which should be more positive than the native energy by some positive number D. We then de?ne
I E thresh 5 E I 1 DI nat

mation a will only be used in the force calculation if I its energy EI is below Ethres. From this point, one can a I should be gradually increased as the force see that D ?eld improves. This idea is shown in Figure 3. Each point in the ?gure represents an alternative structure. The y-axis gives the energy corresponding to each conformation. Points marked by dots are alternative structures of high energy and do not contribute to the force felt by parameters. Conformations marked by crosses are alternative structures of low energy, which should be used to calculate forces on parameters. Initially (Fig. 3a), there are three conformations below the energy threshold. After some steps of parameter dynamics the force ?eld has improved and the energy of two of the alternative structures rises. In Figure 3b, one of the three crossed points is above the energy threshold. The threshold should be raised as shown in Figure 3c, so as to select the most important (lowest energy) alternative structures. One still requires a method for determining the way in which the threshold difference DI is increased. To this end, weak coupling28 was used. First, one needs a switching function, to determine whether or not an alternative structure, a, is below the threshold relative to the native structure of protein I. This is de?ned by
I s(E nat’, E I , DI ) 5 a


0 1

if E I . E I 1 DI a nat if E I # E I 1 DI a nat



This switching function can then be used to de?ne the average energy difference between low-energy alternative structures and the threshold. If there are Nprot proteins in the calibration set and Dav is the averI age difference between threshold energies Ethresh and I the energies of alternative structures Ea, one can write Dav 5 1 Nprot
N prot




I N alt


o [(E

I nat

1 DI 2 E I )s(E I , E I , DI)] a nat a
N alt a51


s(E I , E I , DI ) nat a



To continue the analogy with weak coupling schemes,28 a target or reference value, D0 , was set, av and the goal was to maintain this at a constant value. A scaling factor, l, was calculated from l5 11



dt D0 av t Dav






where the superscript I, denotes protein I, so EI is nat I the energy of native structure I and Ethresh is the threshold energy for protein I; An alternative confor-

where dt is the time step of the integrator for the parameter dynamics, and t is the coupling constant that controls how rapidly changes are applied. Fi-



Fig. 3. Energy threshold for selection of alternative structures. r is some generalized coordinate representing conformational space. E denotes conformational energy. Enat is the energy of a native structure and Ethresh the corresponding threshold energy for alternative conformations. Crosses mark conformations of energy less than the threshold that contribute to the calculation of forces on the parameters. Dots mark the alternatives which are already of high energy and do not contribute to the force calculation.

nally, the coupling was applied by applying the scaling to the energy difference DI: DI(t) 5 lDI(t 2 dt) (16)

where DI (t) is the value of D at the current time step and DI (t 2 dt) is the value of D1 at the previous time step. The constants in Equation (15) were set by trial and error. The fraction dt/t was set to 0.1 and D0 to av 10.0. One should also note that DI (t) was different for each protein I. Furthermore, DI (t) is not de?ned at the ?rst time step, so we arbitrarily set DI (0) 5 0 EI 0, the absolute value of the energy of the native structure of protein I. In practice, this value was probably too large. During the short parameter dynamics calculation, the system did not equilibrate with the weak coupling bath, DI (t) tended to decrease, and the number of alternative structures contributing to Figure 3 the parameter forces decreased correspondingly. RESULTS The ?rst calculation was the classi?cation of parameters for the n, n 1 3 and long-range interactions, given in Tables III and IV, respectively. The classi?cations were based on the initial estimates of sn,n13 and slong before parameter dynamics and do not include the in?uence of misfolded alternative structures. The classes were ordered by the initial s value and show some trends. For example, the glycine column for the long-range interaction (Table IV) shows that the residue is usually in classes 1, 2, or 3, favoring the smallest distance of minimum interaction energy. Similarly, the bulky aliphatic

leucine residue is usually in classes 6–10 with a tendency to prefer interactions at larger distances. While one should not read too much physical signi?cance into this initial calculation, one property is clear. Classifying interaction types rather than amino acids allows more ?exibility in the functional form without increasing the number of adjustable parameters. For example, one would expect the oppositely charged Asp and Arg to be in different classes, but the scheme here allows Asp and Arg to fall into the same interaction class with respect to residues such as Trp or Gly. The 104 proteins in the calibration set resulted in 1.7 3 106 misfolded alternative structures. After applying similarity criteria and the initial selection based on energies, there remained 350,000 alternative structures (summed over all 104 proteins). This set formed the basis for calculating forces on parameters during dynamics calculations, although all 1.7 3 106 alternative structures were used for testing force ?eld quality below. The example dynamics calculation shown here ran for (coincidentally) 104 steps. The best results in terms of discriminating correct and misfolded structures were obtained after 92 steps and are described below. The results of the parameter dynamics calculations can be most clearly seen by considering the best and worst results. A good result would be one where the native structure is of low energy compared to all alternative structures. Using this measure, the protein giving the best result is a 60-residue toxin with the PDB acquisition code, 1cdt shown in Figure 4. The top panel shows the energy of alternative structures with respect to the energy of the native structure. Energies are divided by the number of residues



TABLE III. n,n 1 3 Interaction Pair Classes Identi?ed by Their Integer Class Number (1–36) Ala Arg Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val 14 14 19 19 11 15 18 20 15 1 14 16 19 22 19 22 10 16 14 13 14 10 11 17 13 14 15 29 9 14 12 9 18 12 29 23 14 22 19 8 Asn Asp Cys Gln 19 11 33 15 20 17 20 23 19 12 21 23 27 25 6 28 27 24 20 20 19 17 15 24 24 14 13 26 20 10 14 23 19 17 5 24 21 25 16 15 11 13 20 24 22 13 19 34 17 24 18 27 6 25 32 26 30 30 15 29 15 14 17 14 13 9 16 27 18 12 19 21 11 23 3 17 10 9 14 23 Glu 18 15 20 13 19 16 14 25 22 14 19 18 14 7 19 16 19 24 17 13 Gly 20 29 23 26 34 27 25 28 30 2 20 14 16 12 29 29 29 29 27 25 His Ile 15 9 19 20 17 18 22 30 25 25 20 21 28 13 31 23 22 10 9 14 1 14 12 10 24 12 14 2 25 12 16 9 19 14 19 21 24 22 13 14 Leu 14 12 21 14 18 19 19 20 20 16 18 19 11 9 25 17 21 9 19 16 Lys Met 16 9 23 23 27 21 18 14 21 9 19 12 21 20 28 18 22 12 13 13 19 18 27 19 6 11 14 16 28 19 11 21 10 20 22 20 10 17 27 16 Phe 22 12 25 17 25 23 7 12 13 14 9 20 20 12 30 29 9 8 15 11 Pro 19 29 6 5 32 3 19 29 31 19 25 28 22 30 35 6 14 13 28 22 Ser 22 23 28 24 26 17 16 29 23 21 17 18 20 29 6 24 22 29 10 25 Thr 10 14 27 21 30 10 19 29 22 24 21 22 10 9 14 22 27 4 4 20 Trp Tyr Val 16 22 24 25 30 9 24 29 10 22 9 12 17 8 13 29 4 36 7 8 14 19 20 16 15 14 17 27 9 13 19 13 27 15 28 10 4 7 11 22 13 8 20 15 29 23 13 25 14 14 16 13 16 11 22 25 20 8 22 8

TABLE IV. Long-Range Interaction Pair Classes Identi?ed by Their Integer Class Number (1–20) Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser The Trp Tyr Val Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val 5 7 8 5 4 7 6 2 9 4 6 7 3 6 4 2 5 6 5 6 7 5 5 4 4 8 5 2 5 5 8 9 8 8 4 4 3 5 4 7 8 5 4 7 7 8 6 2 8 8 9 3 8 5 3 6 7 8 5 8 5 4 7 8 10 8 9 2 4 9 9 3 6 4 7 6 4 8 9 9 4 4 7 10 1 8 9 2 9 4 6 8 3 2 4 7 3 3 3 6 7 8 8 8 8 4 8 2 6 8 9 8 8 6 6 3 3 7 8 7 6 5 6 9 9 8 6 4 8 7 10 3 7 7 9 5 5 9 7 8 2 2 2 2 2 2 4 1 3 3 3 3 3 2 2 2 2 2 2 3 9 5 8 4 9 6 8 3 3 8 9 3 6 4 7 7 6 2 6 9 4 5 8 9 4 8 7 3 8 5 6 7 4 7 8 7 7 6 4 4 6 8 9 9 6 9 10 3 9 6 8 9 6 7 8 7 9 6 6 6 7 9 3 3 8 8 3 3 3 7 9 3 9 7 3 4 3 3 3 8 3 8 8 6 3 8 7 3 6 4 6 9 9 5 3 7 7 8 7 7 6 8 5 4 2 6 7 2 4 7 7 7 5 3 5 5 4 6 8 4 4 4 3 7 4 6 9 2 7 8 8 3 3 5 2 2 5 2 5 6 2 4 6 6 7 3 5 2 7 7 7 4 7 5 2 3 3 3 3 4 5 3 7 4 3 3 5 2 6 7 9 3 7 4 5 3 2 4 4 4 6 5 8 8 3 7 9 2 2 6 6 3 8 6 2 3 4 2 2 6 5 4 5 9 3 8 7 2 6 4 6 3 7 8 5 3 4 2 5 5 6 7 8 9 6 7 8 3 9 4 6 8 7 4 6 4 4 6 5 4

as this allows an approximate comparison of proteins. It is encouraging that the alternative structures are clearly of higher energy and the progress of the dynamics calculation is shown in the middle panel. This gives the number of alternative structures below the energy threshold [Eq. (12)] at each time step. Although initially there are nearly 6,000 alternatives of low energy, the parameters are very quickly driven to favor the native structure. The bottom panel gives some insight into the path of the calculation, showing how the slong parameters change during the calculation. Using the same criterion for quality, the worst results came from the protein 1ppa and are shown in

Figure 5. The top panel again shows the energy of alternative structures relative to the energy of the native (after dividing by the number of residues). More than half the alternative conformations are of lower energy. There is no reason to suspect a problem with native coordinates, but the middle panel may give some clue as to why the parametrization has been so unsuccessful for this protein. From around step 45, the number of alternative structures below the threshold actually increases steadily. The most likely explanation is that the parameter trajectories are being driven most strongly by the energy of other proteins in the calibration set, unfortunately, at the expense of 1ppa. It is also worth examining the



Fig. 4. Progress of parametrization for protein 1cdt. Top: Histogram showing the energy distribution of alternative structures for 1cdt. Energies are relative to the native structure (Ealt 2 Enat) and are divided by the number of residues in the native protein. Energies were calculated with parameters after 92 steps of dynamics. Middle: Number of alternative structures of low energy (below the energy threshold) at each time step of the parameter dynamics calculation. Bottom Value of the ?rst 10 slong parameters (Table VII) at each time step in the parameter dynamics calculation.

Fig. 5. Progress of parametrization for protein 1ppa. Panels correspond to Figure 4.

bottom panel, again showing the slong parameters as a function of time. After 104 steps, they have not stopped ?uctuating. If one was to pursue parameters for this particular force ?eld parametrization, a longer calculation would be necessary. An overview of the results is given in Figure 6. This shows a comparison of the energies of alternative structures (relative to the native energy) for 1cdt, 1ppa, and the calibration set average. This last histogram is obtained by summing over all native and alternative structures and dividing the ordinate by the number of proteins (104) and always dividing the energy difference (Ealt 2 Enat ) by the number of residues in the native structure. In an approximate statistical sense, the simple force ?eld is quite powerful and usually able to discriminate correct (native) structures from plausible, but incorrect, structures. Unfortunately, it could not be deemed truly predictive. For an application such as threading (testing structures for a sequence), this particular parametrization would yield too many false positives to be useful.

The ?nal results of the parametrization calculations are given in Tables V, VI, and VII. When reading Tables V and VI it is important to remember the amino acids and pairs of amino acids are the central ones to each interaction. For example, the ?rst line of Table VI refers to Ala-Ile, so this parameter pertains to alanine and isoleucine as the j, k pair of an i, j, k, l quartet. Viewed from a historical perspective, this is similar to formalizing the tendency of amino acids to be in certain secondary structures and quantifying this geometrically. At the same time, interpretation in terms of secondary structure is not so straightforward. The n, n 1 2 and n, n 1 3 parameters are determined in the ?eld due to all the proteins, which, in turn, depends on all the parameters. This means the parameters re?ecting short-range interactions may be dominated by secondary structure, but would not be sufficient to predict it. Furthermore, there is a more subtle restriction on their use. The initial parameters re?ect average geometric considerations from Equation (5), but the ?nal parameters re?ect the in?uence of misfolded structures. For example, the sn,n12 was Val initially estimated to be 5.6 ? suggesting that when a valine residue is at the center of a triplet of residues there is a tendency for the outer two residues to be at a relatively large distance. After the parameter dynamics, this value becomes 2.7 ? suggesting that the ability to distinguish correct from



Fig. 6. Overall performance of parametrization. Each panel is a histogram showing the energy distribution of alternative structures relative to the native. Bars to the left of zero show alternative structures of energy lower than the corresponding native structure. Top: Results for 1cdt. Middle: Results for 1ppa. Bottom: Results summed over the 104 proteins of the calibration set and then divided by the number of proteins.

TABLE V. n,n 1 2 Interaction Parameters sn,n12 (?) Class 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Initial 5.470 5.501 5.535 5.524 5.679 5.461 5.480 5.581 5.589 5.569 5.546 5.495 5.513 5.590 5.259 5.516 5.626 5.558 5.612 5.633

Final 5.249 4.883 5.170 5.251 6.225 5.002 5.199 4.136 5.777 5.423 5.731 4.412 5.107 5.216 4.235 4.704 3.948 5.613 6.267 2.692

en,n12 (energy)a 1.460 1.332 1.454 1.492 1.194 1.412 1.464 1.256 1.306 1.250 1.406 1.342 1.304 1.200 0.960 1.240 1.220 1.348 1.224 1.248

Member Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

Noccurb 2480 1356 1465 1677 595 1098 1762 2385 505 1581 2354 1830 559 1141 1344 1811 1833 380 1125 1977

energy units. of occurrences of this interaction in the calibration protein set.


incorrect structures is enhanced by preferring a short distance for this interaction. The tables also contain information about the relative weights of the interactions since the e param-

eters from Equations (2)–(4) directly scale the energy contribution of each term. For example, the sn,n13 Gly-Ile parameter given in Table VI is one of the shorter distances, but sn,n13 is the smallest in the table Gly-Ile (0.091 energy units) meaning that it is also the least signi?cant of the n, n 1 3 interactions. One can again look for physical signi?cance in the results, bearing in mind that the in?uence of misfolded structures in the parameter dynamics tends to weaken the physical signi?cance of the parameters. Considering the long-range interaction parameters in Table VII, the Cys-Cys and Gly-Gly interaction pairs show a preference for interactions at the shortest distances (5.1 ?). This obviously re?ects disul?de bonds and the fact that glycine possesses the smallest side chain. Beyond that, one would only want to point to general trends such as the smaller side chains usually preferring short interaction distances and interactions with bulky side chains such as valine or leucine tending to prefer larger interaction distances. So far, we have only considered the force ?eld’s performance in recall. That is, how well the force ?eld works within the range of structures used to construct it. It is more interesting to look at the method’s ability to generalize. This is a measure of its predictive ability for proteins not used in the parametrization calculations. Ideally, one would perform a jackknife test, leaving each protein out of the calculation and reserving it for testing. This would be far too expensive computationally, but it is still possible to test the force ?eld outside of the calibration set. Choosing the proteins for calibration involved ranking the set according to quality (resolution and R factor). The 10 proteins immediately following the calibration set and some arbitrary examples were then of good quality and given the selection criteria, guaranteed to be nonhomologous with the proteins used for calibration. Table VIII gives the results for 10 structures under the columns labeled ‘‘Before minimization.’’ The immediate impression is that the parameters do not do much better than a set of random numbers, and the results range from excellent to bizarre. For 4pti, the discrimination is excellent with 99.8% of alternative structures being of higher energy. For 1abp, 99.9% of alternative structures appear to be of lower energy than the native. The reason for this is clear from the results of other workers.19,25 A structure may be very near in space to an energetic minimum, but of apparently high energy. The remedy for this is to compare structures after energy minimizing. This is shown by the two columns labeled ‘‘After minimization’’ in Table VIII. Native and alternative conformations were subjected to no more than 25 steps of conjugate gradients minimization. From these results, it is clear that the force ?eld does have a real ability to discriminate correct from incorrectly folded



TABLE VI. n,n 1 3 Interaction Parameters
sn,n13 (?) Class 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Initial 4.057 4.247 4.418 4.585 4.715 4.777 4.901 4.958 4.988 5.012 5.034 5.053 5.070 5.082 Final 4.094 4.228 4.403 4.530 4.703 4.601 4.748 4.816 4.626 4.858 4.914 4.786 4.681 4.731 en,n13 (energy)a 0.190 0.091 0.142 0.153 0.250 0.227 0.693 0.565 0.646 0.609 0.637 0.665 0.683 0.791 Nmemb 1 1 1 2 1 3 2 4 9 8 6 9 10 18 Noccurc 282 249 86 213 148 352 129 373 935 919 537 1154 1493 3163

Members Ala-Ile Gly-Ile Gln-Pro Thr-Trp, Thr-Tyr Asp-Pro Asn-Pro, Cys-Met, Pro-Ser Glu-Phe, Trp-Tyr Arg-Val, Phe-Trp, Trp-Val, Val-Val Arg-His, Arg-Lys, Gln-Gln, Gln-Trp, His-Tyr, IleLys, Leu-Phe, Leu-Trp, Phe-Thr Ala-Thr, Arg-Arg, Asp-Ile, Gln-Thr, His-Trp, MetMet, Met-Thr, Ser-Tyr Ala-Cys,, Arg-Asn, Gln-Met, Leu-Met, Phe-Val, Tyr-Tyr Arg-Leu, Arg-Phe, Asn-Ile, Gln-Ile, Gly-Phe, Ile-Ile, Lys-Lys, Lys-Trp, Phe-Phe Ala-Val, Arg-Cys, Asp-Glu, Cys-Gln, Glu-Val, HisPhe, Ile-Tyr, Lys-Tyr, Lys-Val, Pro-Trp Ala-Ala, Ala-Arg, Ala-Leu, Ala-Tyr, Arg-Gln, Arg-Ile, Arg-Thr, Asp-Gln, Asp-Leu, Gln-Tyr, Glu-Glu, Glu-Ile, Glu-Met, Gly-Lys, His-Val, Ile-Phe, IleVal, Pro-Thr Ala-Gln, Ala-His, Arg-Glu, Asn-Asp, Asp-Val, CysTyr, Phe-Tyr Ala-Lys, Ala-Trp, Asp-Tyr, Gln-Glu, Glu-Ser, GlyMet, Ile-Leu, Leu-Val, Met-Val Arg-Asp, Asn-Gln, Asp-Phe, Cys-His, Gln-Ser, GluTyr, Leu-Ser, Met-Trp Ala-Glu, Arg-Met, Cys-Leu, Gln-His, Glu-Lys, LeuLeu, Lys-Ser Ala-Asn, Ala-Asp, Ala-Met, Ala-Pro, Arg-Tyr, AsnHis, Asp-Met, Cys-Glu, Gln-Leu, Glu-Leu, GluPro, Glu-Thr, Ile-Met, Ile-Pro, Leu-Lys, Leu-Tyr Ala-Gly, Asn-Cys, Asn-Glu, Asn-Tyr, Asn-Val, AspHis, Gly-Leu, His-Leu, Lys-Phe, Met-Phe, MetSer, Thr-Val Asn-Leu, Asp-Thr, Gln-Lys, His-Lys, Ile-Ser, LeuThr, Lys-Met Ala-Phe, Ala-Ser, Arg-Trp, Cys-Cys, Glu-His, HisThr, Ile-Trp, Lys-Thr, Met-Pro, Pro-Val, Ser-Thr, Tyr-Val Arg-Ser, Asn-Gly, Asn-Lys, Asp-Lys, Gln-Phe, GlnVal, His-Ser Asn-Trp, Asp-Asp, Asp-Cys, Asp-Ser, Cys-Ile, GluTrp, Ile-Thr, Ser-Ser Asn-Phe, Asp-Trp, Cys-Phe, Glu-Gly, Gly-Val, HisHis, His-Ile, Leu-Pro, Ser-Val Asp-Gly, Cys-Ser Asn-Met, Asn-Thr, Cys-Lys, Gln-Gly, Gly-Tyr, MetTyr, Thr-Thr Asn-Ser, Gly-Gly, His-Met, Lys-Pro, Pro-Tyr Arg-Gly, Arg-Pro, Cys-Val, Gly-Pro, Gly-Ser, GlyThr, Gly-Trp, Phe-Ser, Ser-Trp Cys-Thr, Cys-Trp, Gly-His, Phe-Pro His-Pro Cys-Pro Asn-Asn Cys-Gly Pro-Pro Trp-Trp

15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
aEnergy bNumber

5.095 5.108 5.120 5.131 5.146 5.161 5.172 5.192 5.209 5.226 5.247 5.270 5.291 5.342 5.369 5.427 5.493 5.651 5.727 5.899 5.985 8.001

4.898 5.028 5.435 4.965 5.031 5.687 6.123 6.057 5.907 6.010 5.428 5.957 6.321 6.010 6.382 6.861 5.883 5.823 6.873 6.639 6.819 8.669

0.780 0.787 0.723 0.991 0.789 0.548 0.678 0.604 0.704 0.635 0.498 0.511 0.480 0.437 0.437 0.403 0.491 0.164 0.733 0.164 0.414 0.854

7 9 8 7 16 12 7 12 7 8 9 2 7 5 9 4 1 1 1 1 1 1

977 1503 1040 1082 2625 1925 1211 1447 1017 852 1260 391 747 627 1548 265 63 68 82 127 48 8

in arbitrary units. of interaction types forming a class. cNumber of times the interactions occur in the calibration set of proteins.



TABLE VII. Long-range Interaction Parameters slong (?) Class 1 2 Initial 5.128 5.719 Final 5.081 5.225 elong (energy)a 0.053 0.054

Nmemb 2 21

Noccurc 53 846 801 072

Members Cys-Cys, Gly-Gly Ala-Gly, Ala-Ser, Arg-Gly, Asn-Gly, Asp-Gly, CysGly, Cys-Phe, Gln-Gly, Gly-Phe, Gly-Pro, GlySer, Gly-Thr, Gly-Trp, Gly-Tyr, His-Trp, ProPro, Pro-Ser, Pro-Trp, Thr-Thr, Trp-Trp, Trp-Tyr Ala-Met, Arg-Thr, Asn-Lys, Asn-Pro, Asp-Lys, Cys-Met, Cys-Thr, Cys-Trp, Cys-Tyr, Gln-Ser, Gln-Thr, Glu-Lys, Gly-His, Gly-Ile, Gly-Leu, Gly-Lys, Gly-Met, Gly-Val, His-His, His-Lys, Lys-Lys, Lys-Pro, Lys-Thr, Lys-Trp, Lys-Tyr, Met-Pro, Phe-Phe, Ser-Ser, Ser-Thr, Ser-Trp, Ser-Tyr Ala-Cys, Ala-Ile, Ala-Pro, Arg-Asp, Arg-Cys, ArgPro, Arg-Ser, Arg-Tyr, Asn-Asn, Asp-His, AspPhe, Asp-Thr, Cys-Ile, Cys-Pro, Gln-Gln, GluGly, His-Phe, Ile-Met, Ile-Tyr, Ile-Val, Lys-Ser, Phe-Thr, Phe-Val, Ser-Val, Thr-Trp, Thr-Tyr, Thr-Val, Val-Val Ala-Ala, Ala-Asp, Ala-Thr, Ala-Tyr, Arg-Arg, ArgAsn, Arg-Glu, Arg-His, Arg-Ile, Arg-Trp, AsnPhe, Asn-Tyr, Glu-Ser, Glu-Thr, Ile-Ile, MetPhe, Phe-Pro, Phe-Ser, Pro-Thr, Pro-Tyr, TyrTyr, Tyr-Val Ala-Glu, Ala-Leu, Ala-Phe, Ala-Trp, Ala-Val, AsnGlu, Asn-Ser, Asp-Met, Asp-Ser, Cys-Leu, CysVal, Gln-His, Gln-Phe, Gln-Pro, Glu-Glu, HisMet, His-Thr, His-Tyr, Ile-Leu, Ile-Trp, LeuMet, Leu-Trp, Leu-Tyr, Leu-Val, Phe-Trp, ProVal, Trp-Val Ala-Arg, Ala-Gln, Ala-Lys, Arg-Val, Asn-Asp, Asn-Cys, Asn-Thr, Asp-Pro, Cys-Ser, Gln-Trp, Gln-Val, Glu-Ile, Glu-Met, Glu-Phe, Glu-Tyr, His-Pro, His-Ser, Ile-Lys, Ile-Phe, Ile-Ser, IleThr, Leu-Phe, Leu-Ser, Lys-Phe, Met-Ser, MetThr, Met-Tyr, Met-Val Ala-Asn, Arg-Gln, Arg-Leu, Arg-Met, Arg-Phe, Asn-Gln, Asn-His, Asn-Ile, Asn-Met, Asn-Trp, Asn-Val, Asp-Asp, Asp-Gln, Asp-Trp, Cys-Gln, Cys-Lys, Gln-Glu, Gln-Ile, Gln-Lys, Gln-Met, Gln-Tyr, Glu-His, Glu-Val, His-Ile, Ile-Pro, Leu-Leu, Leu-Pro, Lys-Val, Met-Trp, Phe-Tyr Ala-His, Arg-Lys, Asn-Leu, Asp-Glu, Asp-Ile, Asp-Leu, Asp-Tyr, Asp-Val, Cys-Glu, Cys-His, Gln-Leu, Glu-Pro, Glu-Trp, His-Leu, His-Val, Leu-Lys, Leu-Thr, Lys-Met, Met-Met Asp-Cys, Glu-Leu






982 810






910 175






739 394






889 141






933 553






831 759






665 929

aEnergy bNumber





81 302

in arbitrary units. of interaction types forming a class. cNumber of times the interactions occur in the calibration set of proteins.

structures. The worse results are obtained for 1ctx, a 71 residue, snake toxin whose conformation is dominated by ?ve disul?de bridges, rather than regular secondary structure. The other weak results are for 1pcy, the apo form of plastocyanin, which normally has an integral copper atom in situ. It is also clear that the nearest local minima really are near to the

native structure. The last column of Table VIII shows that no structure moves by more than 0.2 ? (distance matrix difference) upon minimizing. To the extent that 10 proteins represent a signi?cant test, the force ?eld really is re?ecting general properties of protein folds, since all the test proteins were not homologous to proteins used in the parametrization.



TABLE VIII. Testing of Force Field Generalization Low-energy alternatives Before minimization Protein 4pti 1sn3 1ctx 1pcy 1lyz 4fxn 2sns 1rhd 1abp 5cpa 3tln
aNumber bNumber

After minimization Nc 1 1 1 489 498 0 0 25 0 0 0 0 %d 0.0 0.0 4.3 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Nresa 58 65 71 99 129 138 141 293 306 307 316

Nalternativesb 37 878 36 024 34 501 27 998 22 558 21 130 20 662 5 286 4 571 4 516 4 047

Nc 57 3 172 11 699 3 137 1 275 8 967 14 062 5 249 4 565 0 43

%d 0.2 8.8 33.9 11.2 5.7 42.4 68.1 99.3 99.9 0.0 1.1

rms shift e (?) 0.07 0.13 0.15 0.12 0.16 0.13 0.14 0.15 0.17 0.18 0.17

of residues. of alternative conformations generated. cNumber of alternatives with energy lower than that of the native structure. dPercentage of alternatives with energy lower than that of the native structure. eRoot-mean-square distance matrix difference of the native structure before and after minimization.

These results are disappointing in that they suggest that a force ?eld of this type would be slow for practical threading applications as it would only be useful after energy minimization. It also suggests that the inverse twelfth power repulsive terms in Equations (2)–(4) increase too steeply. In some cases, the native structure had a huge Elong contribution mostly due to just one or two pairs of residues at a short distance (data not shown). This result is not a surprise, given this force ?eld, but it is not clear why similar results are not seen with table-driven force ?elds2,31 based on a simple Boltzmann relation or perhaps whether they do sometimes occur. Amino acid pairs are sometimes present at unusually short distances in native structures. Statistically, these should be poorly represented and should give rise to correspondingly high energies in table-based force ?elds. DISCUSSION The results show the feasibility of an unusual method for force ?eld parametrization, although this implementation was certainly not ideal. Little attempt was made to optimize technical details such as the dynamics temperature or temperature coupling. Unfortunately, it is almost impossible to know ideal values for constants that de?ne the dynamic behavior in parameter space. One can compare this with the ?eld of protein molecular dynamics (MD) simulations that is relatively mature, but where analogous quantities are often chosen from experience rather than rational forethought. A simulation in parameter space should be based on the shape of the parameter energy hypersurface, but, to continue the comparison with protein simulations, this surface is still not well understood in the ?eld of protein MD.29 The weakest point of this calculation is probably not

whether or not the some control value was optimal, but rather the short length of the calculation; 100 time steps is not even a cursory peek in parameter space, let alone the thorough search one would like. It is certainly not enough to see convergence of the parameters. One strength of the optimization procedure here was the selection of alternative structures for calculating forces. From one point of view, this is an approximation to make the calculations faster; only alternative structures of low energy contributed to the force calculation. At the same time, there is a more subtle bene?t. Every alternative structure contributes local minima and maxima to the total parameter energy surface. Removing less important alternative structures must smooth the energy surface and make searching easier. Obviously, this is an observation of principle rather than a quantitative statement. There are also some aspects of the parameter optimization that have a systematic effect on parameters. For example, a scheme without any scaling of contributions from different proteins would be in?uenced most by the native structures of larger proteins (which contain the most interactions) and the alternative structures of smaller proteins (which have have the most misfolded alternatives). The scheme used here has replaced this by a different bias [Eq. (11)]. The parametrization scheme and use of misfolded structures may also limit the application areas of the force ?eld. By de?nition, a conventional molecular mechanics force ?eld has its minima located at positions of minimum energy. This force ?eld differs in that parameters are not chosen solely on the basis of native structures. Minima are, of course, located at pseudoenergy minima, but these are positioned so as to optimize discrimination ability, rather than



reproduce native structures. This means that mathematically, one could use the force ?eld for newtonian simulations, but the structures of low pseudoenergy may not be physically realistic con?gurations. This also means that the force ?eld may not be useful for looking for errors in structures, an application area possible with other protein scoring functions.5,30 This type of force ?eld may be useful for grossly wrong structures such as mistraced crystallographic density, since that kind of error will produce compact misfolded structures similar to the parametrization data. It may not be so useful detecting smaller errors in structures. The effect of other implementation choices is less clear. During the parameter dynamics, forces were calculated based on crystallographic coordinates of native structures. As shown in the Results section, energies of structures are only useful after energy minimizing. This is clearly a disadvantage when compared to table-based force ?elds, which produce good results directly from crystallographic coordinates.2,31 While one can speculate on the effects of the optimization method, a more fundamental question is the suitability of the form of interaction functions. Clearly, the set of functional forms has limitations, but it is not clear what the weakest aspects are and where it could be most pro?tably enhanced. As described in the Results section, the twelfth power repulsive term increases too quickly. Some weaknesses are highlighted by other workers. For example, our force ?eld uses isotropic interactions between Ca atoms, but this is only a poor representation of the anisotropic interactions of real amino acids.31 The single-well interaction functions [Eqs. (2)– (4)] are not an ideal ?t to the data when one considers published distributions of amino acid pair distributions,2,32 and the exponents in the potential energy terms are also quite arbitrary. One could well argue for 8–6 exponents14,25 or 12–10 exponents.19 There are even more fundamental questions about the most appropriate formulation. For example, we have an n, n 1 3 interaction where the choice of parameters depends on the central (n 1 1 and n 1 2) amino acids.19,25 This recognizes that amino acids have a statistical preference for the middle of speci?c secondary structures. By contrast, other workers have chosen parameters for the n, n 1 3 interaction based on the outer (n and n 1 3) residues.2,31 It might appear that this is an irreconcilable difference, but the best answer probably lies between the approaches. Considering the example of the n, n 1 3 term, there must be a contribution from both central and outer residues to the interaction. Given a tractable methodology for generating force ?elds, we hope to resolve points like this in a quantitative manner. Possibly, the biggest omission is the lack of an explicit solvation potential energy term. In fact, there is evidence that this kind of term alone may be

successful in recognizing correct folds.2,33,34 Solvation effects are included in our force ?eld in the same way as all structural in?uences affect the ?nal parameters. The weakness is that our LennardJones-like interactions may not be well suited to include the in?uence of solvent. If one wanted to justify our nonoptimal functional form, it is interesting to note that an explicit solvent term based on a property such as solvent-exposed surface area may not be necessary, and some workers have achieved remarkable results modeling the hydrophobic effect with very few parameters.35,36,37 Viewing the force ?eld generation as an exercise in model ?tting, the effect of a badly chosen functional form is clear. The ?tting has been conducted on a small set of proteins and reproduces the data well within that range. The worse the functional form, the less likely the ?tting is to be useful outside of that range. The functional form also places limitations on the areas of application, as does the parametrization methodology, discussed above. With interaction sites at only Ca positions, this is a low-resolution force ?eld and not necessarily sensitive to small changes or errors in structure. The force ?eld is also achiral, so it will never be able to distinguish between mirror images such as right- or left-handed helices. This is obviously not a problem if it is only challenged with alternate structures generated from other native folds. The force ?eld parameters are also a re?ection of the choice of proteins in the calibration set and the experimental conditions used in structure determination. This means that the results are in?uenced by a range of pH, salt concentrations, and so on. Nevertheless, the simple force ?eld is rather adept at recognizing correct structures, regardless of their origin or experimental conditions. Presumably, the force ?eld is biased toward soluble globular proteins under conditions most often used by crystallographers. It may well produce wrong answers outside of this set. The success of a simple force ?eld is the more surprising, considering the physical factors that go toward protein folding. If one could state that all the protein structures were at free-energy minima, then one could say that the force ?eld is a ?tting to underlying physical principles. If one believes that some protein structures are kinetically trapped local minima, then the force ?eld is an ill-de?ned mixture of energetic principles and the kinetics of protein folding. One can continue reasoning in this vein and note other in?uences that are neglected in the force ?eld. We assume there are 20 amino acids interacting without any interference, but this is probably not the case. The set of proteins probably includes some conformations that are only viable due to bound metal ions, and there may be sets of acidic residues that are unusually close to each other due to coordinated anions. Many proteins include poorly de?ned residues whose coordinates are simply a re?ection of the crystallographic re?nement programme. Finally,


P. ULRICH ET AL. tions of globular proteins: How good are they? J. Comput. Chem. 14:1194–1202, 1993. Levitt, M., Warshel, A. Computer simulation of protein folding. Nature 253:694–698, 1975. Levitt, M. A simpli?ed representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104:59–107, 1976. Tanaka, S., Scheraga, H.A. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9:945–950, 1976. Kuntz, I.D., Crippen, G.M., Kollman, P.A., Kimelman, D. Calculation of protein tertiary structure. J. Mol. Biol. 106:983–994, 1976. Miyazawa, S., Jernigan, R.L. Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules 18:534– 552, 1985. Berger, J.O. ‘‘Statistical Decision Theory and Bayesian Analysis.’’ New York: Springer-Verlag, 1985. Crippen, G.M., Snow, M.E. A 1.8 ? resolution potential function for protein folding. Biopolymers 29:1479–1489, 1990. Maiorov, V.N., Crippen, G.M. Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 227:876–888, 1992. Ryckaert, J.P., Ciccotti, G., Berendsen, H.J.C. Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J. Comput. Phys. 23:327–341, 1977. Allen, M.P., Tildesley, D.J. ‘‘Computer Simulation of Liquids.’’ Oxford: Clarendon Press, 1987. Hobohm, U., Sander, C. Enlarged representative set of protein structures. Prot. Sci. 3:522–524, 1994. Crippen, G.M. Prediction of protein folding from amino acid sequence over discrete conformation spaces. Biochemistry 30:4232–4237, 1991. Seetharamulu, P., Crippen, G.M. A potential function for protein folding. J. Math. Chem. 6:91–110, 1991. Havel, T.F. The sampling properties of some distance geometry algorithms applied to unconstrained polypeptide chains: A study of 1830 independently computed conformations. Biopolymers 29:1565–1585, 1990. Maiorov, V.N., Crippen, G.M. Signi?cance of root-meansquare deviation in comparing three-dimensional structures of globular proteins. J. Mol. Biol. 235:625–634, 1994. Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., DiNola, A., Haak, J.R. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81:3684–3670, 1984. Bryngelson, J.D., Onuchic, J.N., Socci, N.D., Wolynes, P.G. Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins 21:167–195, 1995. Luthy, R., Bowie, J.U., Eisenberg, D. Assessment of protein ¨ models with three-dimensional pro?les. Nature 356:83–85, 1992. Kocher, J.-P.A., Rooman, M.J., Wodak, S.J. Factors in?uencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235:1598– 1613. Wilson, C., Doniach, S. A computer model to dynamically simulate protein folding: Studies with crambin. Proteins 6:193–209, 1989. Abagyan, R., Frishman, D., Argos, P. Recognition of distantly related proteins through energy calculations. Proteins 19:132–140, 1994. Bowie, J.U., Clarke, N.D., Pabo, C.O., Sauer, R.T. Identi?cation of protein folds: Matching hydrophobicity patterns of sequence sets with solvent accessibility patterns of known structures. Proteins 7:257–264, 1990. Sun, S., Thomas, P.D., Dill, K.A. A simple protein folding algorithm using a binary code and secondary structure constraints. Prot. Eng. 8:769–778, 1995. Srinivasan, R., Rose, G.D. LINUS: A hierarchic procedure to predict the fold of a protein. Proteins 22:81–99, 1995. Huang, E.S., Subbiah, S., Levitt, M. Recognizing native folds by the arrangement of hydrophobic and polar residues. J. Mol. Biol. 252:709–720, 1995.

one should note that, even though we applied criteria based on resolution and R factors, a set of about 100 structures almost certainly includes misinterpreted electron density and maybe even mistakes in amino acid sequences. CONCLUSIONS This work has not produced a ?nal force ?eld for protein structure recognition. The tests on proteins outside of the calibration set show too many false positives (low energy structures) to be of direct practical use. Energy minimization is required to obtain signi?cant discrimination between native and misfolded structures. The work does, however, contain a number of improvements over other approaches to determining force ?elds. The use of quasi-newtonian dynamics is interesting because there is an implicit assumption that there is a volume of parameter space that will provide an acceptable force ?eld and that one only needs to ?nd some point within that volume. The overall form of the force ?eld is not innovative, but some aspects are a distinct improvement over other approaches. Classifying interaction types rather than residues allows more ?exibility in the ?tting operation without increasing the number of parameters and the parameter classi?cation algorithm is a rational method for decreasing the number of adjustable parameters. A truly systematic exploration of force ?eld functional forms will probably be beyond the scope of practical computations for some time. However, the framework developed here should allow a reliable and reproducible scheme for building and testing simple force ?elds for protein structure prediction. REFERENCES
1. Jones, D., Thornton, J. Protein fold recognition. J. CompAided Mol. Design 7:439–456, 1993. 2. Sippl, M.J. Boltzmann’s principle, knowledge-based mean ?elds and protein folding: An approach to the computational determination of protein structures. J. Comp.-Aided Mol. Design 7:473–501, 1993. 3. Wodak, S.J., Rooman, M.J. Generating and testing protein folds. Curr. Opin. Struct. Biol. 3:247–259, 1993. 4. Bryant, S.H., Altschul, S.F. Statistics of sequence-structure threading. Curr. Opin. Struct. Biol. 5:236–244, 1995. 5. Sippl, M.J. Recognition of Errors in three-dimensional structures of proteins. Proteins 17:355–362, 1993. 6. Ouzounis, C., Sander, C., Scharf, M., Schneider, R. Prediction of protein structure by evaluation of sequencestructure ?tness: Aligning sequences to contact pro?les derived from 3D structures. J. Mol. Biol. 232:805–825, 1993. 7. Bowie, J.U., Luthy, R., Eisenberg, D. A method to identify ¨ protein sequences that fold into a known three-dimensional structure. Science 253:164–170, 1991. 8. Covell, D.G., Jernigan, R.L. Conformations of folded proteins in restricted spaces. Biochemistry 29:3287–3294, 1990. 9. Skolnick, J., Kolinski, A. Simulations of the folding of a globular protein. Science 250:1121–1125, 1990. 10. Dill, K.A. Folding proteins: Finding a needle in a haystack. Curr. Opin. Struct. Biol. 3:99–103, 1993. 11. Kolinski, A., Skolnick, J. Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins 18:338–352 (1994). 12. Godzik, A., Kolinski, A., Skolnick, J. Lattice representa-

13. 14. 15.

16. 17.

18. 19. 20. 21.

22. 23. 24. 25. 26.

27. 28. 29. 30. 31.

32. 33. 34.

35. 36. 37.



All rights reserved Powered by 甜梦文库 9512.net

copyright ©right 2010-2021。