PONDR® makes its predictions from single
sequences. PONDR® predictors are typically feedforward neural networks that use sequence
attributes taken over windows of 9 to 21 amino acids. These attributes, such as
the fractional composition of particular amino acids, hydropathy, or sequence
complexity, are averaged over these windows and the values are used to train the
neural network during predictor construction. The same values are used as inputs
to make predictions.
The neural network predictors (NNPs) were trained on carefully chosen,
non-redundant sets of ordered and disordered sequences that help to ensure
modest predictor biases and to enable the predictors to generalize to new
sequences.
When making predictions, NNP outputs are between 0 and 1 and are then smoothed
over a sliding window of 9 amino acids. If a residue value exceeds or matches a
threshold of 0.5 the residue is considered disordered.
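The smoothing and thresholding step can be sketched as follows. This is a minimal illustration, not the PONDR® implementation: the function names are hypothetical, and leaving positions without a full window at their raw values is an assumption about how sequence ends are handled.

```python
def smooth(scores, window=9):
    """Replace each score by the mean of a centred sliding window.

    Positions closer than window // 2 to either end keep their raw
    value (an assumption; other end treatments are possible).
    """
    half = window // 2
    smoothed = list(scores)
    for i in range(half, len(scores) - half):
        smoothed[i] = sum(scores[i - half:i + half + 1]) / window
    return smoothed


def call_disorder(scores, threshold=0.5):
    """Mark a residue as disordered when its score is >= threshold."""
    return [s >= threshold for s in scores]


raw = [0.0] * 4 + [0.9] + [0.0] * 4   # made-up raw network outputs
labels = call_disorder(smooth(raw))
```

A single high raw value surrounded by low values is averaged away by the window, so isolated spikes do not produce a disorder call.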
The extensions added to PONDR® describe the training data of a particular predictor. The first letter refers to the method of characterization:
- X for x-ray
- N for NMR
- C for circular dichroism
- V for various
The second letter refers to the length or location of the disordered region:
- S for short (8 - 9 residues)
- M for medium (20 - 39 residues)
- L for long (40 or more residues)
- N for amino terminal (5 or more residues)
- C for carboxyl terminal (5 or more residues)
- T for residues at either terminus (5 or more residues)
When training data is from a particular
protein family, an abbreviation for the protein family is used, such as
CaN for calcineurin.
Thus, PONDR® VL-XT refers to the merger of three
predictors, one trained on Variously characterized Long
disordered regions and two trained on X-ray characterized Terminal
disordered regions.
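The naming scheme can be written down as a small lookup. This is only an illustrative sketch: the function names are hypothetical, and stripping a trailing version digit (as in VL1) is an assumption. Names that do not follow the two-letter scheme, such as protein family abbreviations, are returned as `None`.

```python
# First letter: method of characterization.
METHOD = {"X": "x-ray", "N": "NMR", "C": "circular dichroism", "V": "various"}

# Second letter: length or location of the disordered region.
REGION = {
    "S": "short",
    "M": "medium",
    "L": "long",
    "N": "amino terminal",
    "C": "carboxyl terminal",
    "T": "either terminus",
}


def decode_component(code):
    """Decode a two-letter extension such as 'VL' or 'XT'.

    A trailing digit (e.g. 'VL1') is ignored, which is an assumption.
    Returns None for anything outside the two-letter scheme.
    """
    letters = code.rstrip("0123456789")
    if len(letters) == 2 and letters[0] in METHOD and letters[1] in REGION:
        return METHOD[letters[0]], REGION[letters[1]]
    return None


def decode_name(name):
    """Decode each hyphen-separated part of a predictor name."""
    return [decode_component(part) for part in name.split("-")]


print(decode_name("VL-XT"))
```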
| PONDR® | Input Attributes | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| VL1 | Coordination number | Net charge | W F Y | W | Y | F | D | E | K | R |
| XN | Coordination number | V | V I Y F W | M | N | H | D | P E V K | | |
| XC | Coordination number | Hydropathy | V I Y F W | M | T | H | P E V K | R | | |
Attributes used in the different neural networks that make up the VL-XT PONDR® algorithm. The single characters and strings of characters are the one-letter codes of the amino acids whose compositions make up a particular attribute. Individual attributes are defined by the table's cell borders. VL1 uses 10 attributes, while both XN and XC use 8 attributes.
Output for the VL1
predictor starts and ends 11 amino acids from the termini. The outputs of the XT
predictors start at the first or last sequence position and continue for 14
residues inward from the termini. A simple average is taken for the overlapping
predictions from 11 to 14. A sliding window of nine amino acids is used to
smooth the prediction values along the length of the sequence. Unsmoothed
prediction values from the XT predictors are used for the first and last four
sequence positions.
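The merging of the three raw outputs can be illustrated with a short sketch. It assumes each predictor's output is stored as a full-length list with `None` where that predictor gives no value, and that "11 amino acids from the termini" means 1-based positions 11 through n-10; both are assumptions made for illustration, and `merge_tracks` is a hypothetical name.

```python
def merge_tracks(*tracks):
    """Average, position by position, the tracks that have a value.

    Each track is a list of floats with None where the predictor
    produces no output. Positions covered by no track stay None.
    """
    merged = []
    for values in zip(*tracks):
        present = [v for v in values if v is not None]
        merged.append(sum(present) / len(present) if present else None)
    return merged


n = 40  # made-up sequence length
vl1 = [None] * 10 + [0.2] * (n - 20) + [None] * 10   # positions 11 .. n-10
xn = [0.8] * 14 + [None] * (n - 14)                  # positions 1 .. 14
xc = [None] * (n - 14) + [0.6] * 14                  # positions n-13 .. n

merged = merge_tracks(vl1, xn, xc)
```

In this example positions 11 to 14 receive the mean of the VL1 and XN values, and the corresponding four positions at the other end receive the mean of the VL1 and XC values. Smoothing would then be applied to `merged`, with the first and last four positions left unsmoothed.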
VL-XT outputs are real numbers between 1 and 0, where 1 is the ideal prediction
of disorder and 0 is the ideal prediction of order. VL-XT outputs are typically
not ideal and a threshold is applied with disorder assigned to values greater
than or equal to 0.5.
The XL1 predictor is
a feedforward neural network optimized to predict regions of disorder greater
than 39 amino acids (Romero et al., 1997).
It was trained on 7 of the 8 disordered regions identified from missing electron
density that were used to train the VL1 predictor. The attributes used by this
predictor are listed in the table, below (taken from Romero et al., 2001).
This predictor uses a sliding window of 9 amino acids to smooth the prediction
values along the length of the sequence, so predictions are only provided
starting and ending 15 amino acids from the termini.
It is combined with the XT predictor, as described
above.
| PONDR® | Input Attributes | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| XL1 | Flexibility | Hydropathy | C | W | Y | H | D | E | K | S |
Attributes used in the different neural networks that make up the XL1 PONDR® algorithm.
The CaN predictor is
a feedforward neural network that was trained on regions of 13 calcineurin
proteins that were identified by sequence homology with the known disordered
region of human calcineurin (Romero et al., 1997). The attributes used by this
predictor are listed in the table, below.
This predictor shows poor out-of-sample accuracy, but in some cases the contrast
of its output with other predictors provides insight into binding regions of
disordered sequences (Garner et al., 1999).
It is combined with the XT predictor, as described
above.
| PONDR® | Input Attributes | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| CaN | beta-moment | V | F | W | Y | H | C | E | S | R |
Attributes used in the different neural networks that make up the CaN-XT PONDR® algorithm.
The VL3-BA predictor is a
feedforward neural network that was trained on 152 long regions of
disorder that were characterized by various methods. The set of ordered proteins
consisted of 290 PDB-Select-25 chains having no disordered residues.
This predictor is based on 20 attributes (18 amino acid frequencies, average
flexibility and sequence
complexity) in an input window of length 41. The raw predictions are averaged
over an output window of length 31 to obtain the final prediction for a given
position.
The putative boundaries between order and disorder were corrected using the
order/disorder boundary predictor. The closest maximum prediction from the
boundary predictor (above 0.8) became the new boundary between the ordered and
disordered regions.
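One way to read the boundary correction is sketched below: the putative boundary is moved to the nearest local maximum of the boundary predictor's output that exceeds 0.8. The function name, the definition of a local maximum, and the choice to leave the boundary unchanged when no such maximum exists are all assumptions for illustration.

```python
def correct_boundary(putative, boundary_scores, min_score=0.8):
    """Snap a putative order/disorder boundary to the closest peak.

    A peak is a position whose boundary-predictor score exceeds
    min_score and is not lower than either neighbour. If there is no
    peak, the putative position is returned unchanged (assumption).
    """
    last = len(boundary_scores) - 1
    peaks = [
        i for i, s in enumerate(boundary_scores)
        if s > min_score
        and (i == 0 or s >= boundary_scores[i - 1])
        and (i == last or s >= boundary_scores[i + 1])
    ]
    if not peaks:
        return putative
    return min(peaks, key=lambda i: abs(i - putative))


# Made-up boundary-predictor output with peaks at indices 2 and 7.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.85, 0.95, 0.4]
new_boundary = correct_boundary(4, scores)
```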
The VSL1 predictor combines two predictors
optimized for long (>30 residues) and short (<=30 residues) disordered regions,
respectively, using weights generated by a third meta-predictor.
The training data are 1,335 non-redundant protein sequences, containing 230 long
disordered regions with 25,958 residues, 983 short disordered regions with 9,632
residues, and 354,169 ordered residues.
The attributes used include amino acid frequencies, sequence complexity, ratio
of net charge / hydrophobicity, averaged flexibility, and averaged PSI-BLAST*
profiles calculated over symmetric input windows.
The version of this predictor presented at CASP6 also used secondary structure
predictions; however, those attributes were later removed because the accuracy
gain was minimal and those methods are computationally expensive.
All three component predictors are logistic regression models built on balanced
training sets. Attribute selection and window length optimization were performed
independently for the three component predictors to maximize prediction
accuracy.
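The combination of the long and short predictors can be sketched as a per-residue weighted average. The exact form of the weighting is not given here, so the convex combination below (weight `w` for the long predictor and `1 - w` for the short one) is an assumption, as are the function names.

```python
def combine(long_score, short_score, weight_long):
    """Blend two predictor outputs for one residue.

    weight_long is the meta-predictor output in [0, 1]; treating it as
    the weight of the long-region predictor is an assumption.
    """
    return weight_long * long_score + (1.0 - weight_long) * short_score


def combine_tracks(long_scores, short_scores, weights):
    """Apply combine() residue by residue along a sequence."""
    return [
        combine(l, s, w)
        for l, s, w in zip(long_scores, short_scores, weights)
    ]


# Made-up outputs for three residues.
final = combine_tracks([0.9, 0.8, 0.1], [0.2, 0.6, 0.3], [1.0, 0.5, 0.0])
```

With a weight of 1 the long-region predictor decides alone, with a weight of 0 the short-region predictor decides alone, and intermediate weights interpolate between the two.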
* Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25, 3389-3402.
CDF analysis is applied here to
PONDR® VL-XT predictions to identify fully disordered or fully ordered proteins. The
Cumulative Distribution Function (CDF) applied to PONDR® VL-XT predictions is a
cumulative histogram of the PONDR® scores for a given protein. In this analysis,
discrimination is made by means of a boundary. This boundary can be viewed as a
measure of the proportion of residues with high and low PONDR® scores.
Proteins that have CDF curves above the boundary (i.e., a high proportion of low
scores) are predicted to be ordered, proteins with curves below the boundary
(i.e., a high proportion of high scores) are predicted to be disordered, and
proteins with curves that cross the boundary are predicted to be mixtures of
order and disorder.
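The CDF classification can be sketched as follows. The boundary points used by the actual analysis are not given in this text, so `HYPOTHETICAL_BOUNDARY` is a placeholder chosen only to make the example run, and the function names are likewise illustrative.

```python
def cdf_at(scores, cutoff):
    """Fraction of residues whose score is <= cutoff."""
    return sum(s <= cutoff for s in scores) / len(scores)


def classify_by_cdf(scores, boundary):
    """Compare a protein's CDF curve with a boundary.

    boundary is a list of (score_cutoff, cdf_value) points. A curve
    above every point is called ordered, a curve above none of them
    disordered, and anything else a mixture.
    """
    above = [cdf_at(scores, x) > y for x, y in boundary]
    if all(above):
        return "ordered"
    if not any(above):
        return "disordered"
    return "mixed"


# Placeholder boundary, NOT the one derived in the analysis.
HYPOTHETICAL_BOUNDARY = [(0.2, 0.2), (0.4, 0.4), (0.6, 0.6), (0.8, 0.8)]

print(classify_by_cdf([0.1] * 10, HYPOTHETICAL_BOUNDARY))
print(classify_by_cdf([0.9] * 10, HYPOTHETICAL_BOUNDARY))
print(classify_by_cdf([0.1] * 3 + [0.9] * 7, HYPOTHETICAL_BOUNDARY))
```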
Three sets of proteins were used for this analysis: completely disordered
proteins, completely ordered proteins, and ordered proteins that contain
disordered residues. The sets of completely disordered proteins and completely
ordered proteins are intended to serve as model sets, so that the relative
characteristics of each could be determined. The set of completely disordered
proteins was derived from a set provided by Keith Dunker and a set provided by
Vladimir Uversky. The set of completely ordered proteins was derived from
proteins in PDB. The set of ordered proteins with regions of disorder was
derived from structures that contained a single chain and a unit cell with a
primitive space group from PDB.
- Completely ordered proteins: 105 proteins/22,829 residues.
- Completely disordered proteins: 54 proteins/10,782 residues.
- Proteins with disordered regions: 64 proteins/23,785 residues, 4,074 of which are disordered.
A method of analysis that
distinguishes ordered and disordered proteins based only on net charge and
hydropathy was introduced by Uversky et al. (2000). These charge-hydropathy plots
compare the absolute mean net charge (neglecting histidine) with the mean
scaled Kyte-Doolittle hydropathy. The hydropathy measure is scaled between 0 and
1. Ordered and disordered proteins plotted in this charge-hydropathy space can be
mostly separated by a linear boundary. Here, this method is re-derived and
evaluated for the classification of proteins. Datasets used in this section are
the same as described above for CDF.
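The two coordinates of a charge-hydropathy plot can be computed as in the sketch below. It assumes that lysine and arginine count as +1, aspartate and glutamate as -1, and that the Kyte-Doolittle values are scaled linearly from the range -4.5 to 4.5 onto 0 to 1; the function names are hypothetical, and no boundary is applied here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_net_charge(sequence):
    """Absolute mean net charge per residue, neglecting histidine."""
    positive = sequence.count("K") + sequence.count("R")
    negative = sequence.count("D") + sequence.count("E")
    return abs(positive - negative) / len(sequence)


def mean_scaled_hydropathy(sequence):
    """Mean Kyte-Doolittle hydropathy, scaled linearly to [0, 1].

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    scaled = [(KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in sequence]
    return sum(scaled) / len(scaled)


seq = "MKKDEEILV"  # made-up sequence
point = (mean_scaled_hydropathy(seq), mean_net_charge(seq))
```

Each protein becomes one point in this two-dimensional space, which is the input to the boundary derivation.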
The boundary between the ordered and disordered proteins was determined using a
linear discriminant function, assuming normal distributions and equal
covariance matrices. Note that the assumptions used in this analysis were not
justified. However, experimentation with methods that do not require normally
distributed data (e.g., neural networks) and methods that do not require equal
covariance matrices (e.g., quadratic discrimination) yielded results equivalent
to those obtained using linear discrimination. An estimation of the accuracy of
this boundary for the discrimination of ordered and disordered proteins was made
using the jack-knifing procedure described above. This gave an estimated
classification accuracy of 83% (76% for disordered proteins and 91% for ordered
proteins).
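A linear discriminant in two dimensions, with a pooled covariance matrix, can be sketched in plain Python. Equal prior probabilities for the two classes and a leave-one-out form of jack-knifing are assumptions made for this example, the function names are hypothetical, and the data points are made up.

```python
def mean2(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


def fit_lda(a, b):
    """Fit a 2-D linear discriminant with a pooled covariance matrix.

    Returns (w0, w1, c); a point (x, y) is assigned to class a when
    w0 * x + w1 * y > c. Equal class priors are assumed.
    """
    ma, mb = mean2(a), mean2(b)
    sxx = sxy = syy = 0.0
    for pts, m in ((a, ma), (b, mb)):
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(a) + len(b) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # w = inverse(pooled covariance) * (mean_a - mean_b)
    w0 = (syy * dx - sxy * dy) / det
    w1 = (sxx * dy - sxy * dx) / det
    c = 0.5 * (w0 * (ma[0] + mb[0]) + w1 * (ma[1] + mb[1]))
    return w0, w1, c


def leave_one_out_accuracy(a, b):
    """Refit without each point in turn and test on the held-out point."""
    correct = 0
    for i, (x, y) in enumerate(a):
        w0, w1, c = fit_lda(a[:i] + a[i + 1:], b)
        correct += w0 * x + w1 * y > c
    for i, (x, y) in enumerate(b):
        w0, w1, c = fit_lda(a, b[:i] + b[i + 1:])
        correct += w0 * x + w1 * y <= c
    return correct / (len(a) + len(b))


class_a = [(1, 1), (1, 2), (2, 1), (2, 2)]   # made-up points
class_b = [(4, 4), (4, 5), (5, 4), (5, 5)]
w0, w1, c = fit_lda(class_a, class_b)
accuracy = leave_one_out_accuracy(class_a, class_b)
```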
Garner E, Romero P, Dunker AK, Brown C, and Obradovic Z. (1999) Predicting binding regions within disordered proteins, Genome Informatics, 10, 41-50.
Li X, Romero P, Rani M, Dunker AK, and Obradovic Z. (1999) Predicting protein disorder for N-, C-, and internal regions, Genome Informatics, 10, 30-40.
Romero P, Obradovic Z, and Dunker AK. (1997) Sequence data analysis for long disordered regions prediction in the calcineurin family, Genome Informatics, 8, 110-124.
Uversky V, Gillespie J, and Fink A. (2000) Why are "natively unfolded" proteins unstructured under physiological conditions? Proteins: Struct. Funct. Gen., 41(3): 415-427.
Romero P, Obradovic Z, Li X, Garner E, Brown C, and Dunker AK. (2001) Sequence complexity of disordered protein, Proteins: Struct. Funct. Gen., 42, 38-48.
Vucetic S, Brown CJ, Dunker AK, Obradovic Z. (2003) Flavors of protein disorder. Proteins 52, 573-584.
Radivojac P, Obradovic Z, Brown CJ, Dunker AK. (2003) Prediction of boundaries between intrinsically ordered and disordered protein regions. Pac. Symp. Biocomput. 8, 216-227.
Obradovic Z, Peng K, Vucetic S, Radivojac P, Brown CJ, and Dunker AK. (2003) Predicting intrinsic disorder from amino acid sequence, Proteins: Struct. Funct. Gen., 53, 566-572.