Predictor Of Naturally Disordered Regions

PONDR® Algorithms

Predictor Construction

PONDR® predicts upon single sequences. PONDRs are typically feedforward neural networks that use sequence attributes taken over windows of 9 to 21 amino acids. These attributes, such as the fractional composition of particular amino acids, hydropathy, or sequence complexity, are averaged over these windows and the values are used to train the neural network during predictor construction. The same values are used as inputs to make predictions.
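The window-averaged composition attributes described above can be illustrated with a minimal sketch. The function name `windowed_composition`, the default window of 21, and the choice to emit one row per full window are assumptions made for illustration; this is not the PONDR® implementation.

```python
from typing import List, Sequence


def windowed_composition(seq: str,
                         residue_groups: Sequence[str],
                         window: int = 21) -> List[List[float]]:
    """Fractional composition of each residue group in every full window.

    Each entry of `residue_groups` is a string of one-letter codes that
    together form one attribute (for example "WFY"). The name, the default
    window, and the handling of sequence ends are illustrative assumptions.
    """
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    rows = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        # One value per attribute: count of the group's residues / window size.
        rows.append([sum(segment.count(aa) for aa in group) / window
                     for group in residue_groups])
    return rows
```

Each returned row corresponds to one window position and could serve as a network input vector, alongside other attributes such as hydropathy or sequence complexity.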

The neural network predictors (NNPs) were trained on carefully chosen, non-redundant sets of ordered and disordered sequences that help to ensure modest predictor biases and to enable the predictors to generalize to new sequences.

When making predictions, NNP outputs lie between 0 and 1 and are then smoothed over a sliding window of 9 amino acids. A residue whose value is greater than or equal to the threshold of 0.5 is considered disordered.
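The smoothing and thresholding step can be sketched as follows. The function names are hypothetical, and truncating the window at the sequence ends is an assumption; the text above does not say how the ends are handled in general.

```python
from typing import List


def smooth(scores: List[float], window: int = 9) -> List[float]:
    """Moving average over `window` residues centered on each position.

    Assumption: near the ends, the window is truncated to the residues
    that exist rather than padded.
    """
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def call_disorder(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """True where a score meets or exceeds the threshold (disordered)."""
    return [s >= threshold for s in scores]
```

For example, `call_disorder(smooth(raw_scores))` would yield one order/disorder call per residue for a hypothetical list `raw_scores` of network outputs.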

PONDR® Nomenclature

The extensions added to PONDR® describe the training data of a particular predictor. The first letter refers to the method of characterization:
• X for x-ray
• N for NMR
• C for circular dichroism
• V for various
The second letter refers to the length or location of the disordered region:
• S for short (8 - 9 residues)
• M for medium (20 - 39 residues)
• L for long (40 or more residues)
• N for amino terminal (5 or more residues)
• C for carboxyl terminal (5 or more residues)
• T for residues at either terminus (5 or more residues)
When training data is from a particular protein family, an abbreviation for the protein family is used, such as CaN for calcineurin.

Thus, PONDR® VL-XT refers to the merger of three predictors, one trained on Variously characterized Long disordered regions and two trained on X-ray characterized Terminal disordered regions.

 

PONDR® VL-XT

The VL-XT predictor integrates three feedforward neural networks: the VL1 predictor (Romero et al. 1997), the N-terminus predictor (XN), and the C-terminus predictor (XC) (both from Li et al. 1999). VL1 was trained using 8 long disordered regions identified from missing electron density in x-ray crystallographic studies, and 7 long disordered regions characterized by NMR. The XN and XC predictors, together called XT, were also trained using x-ray crystallographic data, where the terminal disordered regions were 5 or more amino acids in length.

The attributes used by these three predictors are shown in the table, below.

PONDR® Input Attributes
VL1 Coordination number Net charge W F Y W Y F D E K R
XN Coordination number V V I Y F W M N H D P E V K    
XC Coordination number Hydropathy V I Y F W M T H P E V K R    

Attributes used in the different neural networks that make up the VL-XT PONDR® algorithm. The single characters and strings of characters are the one letter codes of the amino acid compositions that make up a particular attribute. Individual attributes are defined by the table's cell borders. VL1 uses 10 attributes, while both XN and XC use 8 attributes.

Output for the VL1 predictor starts and ends 11 amino acids from the termini. The output for the XT predictors starts at the first or last sequence position and continues for 14 residues inward from the termini. A simple average is taken for the overlapping predictions at positions 11 to 14 from each terminus. A sliding window of nine amino acids is used to smooth the prediction values along the length of the sequence. Unsmoothed prediction values from the XT predictors are used for the first and last four sequence positions.
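The merging rule described above can be sketched in a few lines. The function name `merge_vlxt` and the representation of missing predictions as `None` are assumptions for illustration; averaging whatever predictions are present at a position is also an assumption for very short sequences, where the N- and C-terminal outputs could overlap.

```python
from typing import List, Optional


def merge_vlxt(vl1: List[Optional[float]],
               xn: List[Optional[float]],
               xc: List[Optional[float]],
               window: int = 9) -> List[float]:
    """Merge per-residue VL1, XN and XC outputs, then smooth.

    Each list has one entry per residue, with None where that predictor
    gives no output. Overlapping positions are averaged; the first and
    last window // 2 positions keep their unsmoothed values.
    """
    if not (len(vl1) == len(xn) == len(xc)):
        raise ValueError("all three score lists must have the same length")
    raw = []
    for values in zip(vl1, xn, xc):
        present = [v for v in values if v is not None]
        if not present:
            raise ValueError("every position needs at least one prediction")
        raw.append(sum(present) / len(present))
    half = window // 2
    merged = list(raw)
    # Smooth only positions that have a full window on both sides.
    for i in range(half, len(raw) - half):
        merged[i] = sum(raw[i - half:i + half + 1]) / window
    return merged
```

With the default window of nine, the four positions at each end are left unsmoothed, which matches the rule stated in the text.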

VL-XT outputs are real numbers between 0 and 1, where 1 is the ideal prediction of disorder and 0 is the ideal prediction of order. Actual outputs typically fall between these extremes, so a threshold is applied: disorder is assigned to values greater than or equal to 0.5.

 

PONDR® XL1

The XL1 predictor is a feedforward neural network optimized to predict regions of disorder greater than 39 amino acids (Romero et al., 1997).

It was trained on 7 of the 8 disordered regions identified from missing electron density that were used to train the VL1 predictor. The attributes used by this predictor are listed in the table, below (taken from Romero et al., 2001).

This predictor uses a sliding window of 9 amino acids to smooth the prediction values along the length of the sequence, so predictions are only provided starting and ending 15 amino acids from the termini.

It is combined with the XT predictor, as described above.

PONDR® Input Attributes
XL1 Flexibility Hydropathy C W Y H D E K S

Attributes used in the different neural networks that make up the XL1 PONDR® algorithm.

 

PONDR® CaN-XT

The CaN predictor is a feedforward neural network that was trained on regions of 13 calcineurin proteins that were identified by sequence homology with the known disordered region of human calcineurin (Romero et al., 1997). The attributes used by this predictor are listed in the table, below.

This predictor shows poor out of sample accuracy, but in some cases the contrast of its output with other predictors provides insight into binding regions of disordered sequences (Garner et al., 1999).

It is combined with the XT predictor, as described above.

PONDR® Input Attributes
CaN beta-moment V F W Y H C E S R

Attributes used in the different neural networks that make up the CaN-XT PONDR® algorithm.

 

PONDR® VL3-BA

The VL3-BA predictor is a feedforward neural network that was trained on 152 long regions of disorder that were characterized by various methods. The set of ordered proteins consisted of 290 PDB-Select-25 chains having no disordered residues.

This predictor is based on 20 attributes (18 amino acid frequencies, average flexibility, and sequence complexity) in an input window of length 41. The raw predictions are averaged over an output window of length 31 to obtain the final prediction for a given position.

The putative boundaries between order and disorder were corrected using the order/disorder boundary predictor. The closest maximum prediction from the boundary predictor (above 0.8) became the new boundary between the ordered and disordered regions.
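One way to read the boundary correction rule is sketched below. The function name, the definition of a "maximum" as a local peak in the boundary predictor's per-residue scores, and the fallback of keeping the putative boundary when no peak exceeds 0.8 are all assumptions; the text does not specify these details.

```python
from typing import List


def correct_boundary(putative: int,
                     boundary_scores: List[float],
                     min_peak: float = 0.8) -> int:
    """Move a putative order/disorder boundary to the closest strong peak.

    `boundary_scores` holds one hypothetical boundary-predictor score per
    residue. A peak is a position scoring above `min_peak` that is not
    lower than either neighbor. If there is no such peak, the putative
    boundary is returned unchanged (an assumption).
    """
    peaks = [
        i for i in range(1, len(boundary_scores) - 1)
        if boundary_scores[i] > min_peak
        and boundary_scores[i] >= boundary_scores[i - 1]
        and boundary_scores[i] >= boundary_scores[i + 1]
    ]
    if not peaks:
        return putative
    # Ties in distance resolve to the peak with the lower index.
    return min(peaks, key=lambda i: abs(i - putative))
```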

 

PONDR® VSL1

The VSL1 predictor combines two predictors optimized for long (>30 residues) and short (<=30 residues) disordered regions, respectively, using weights generated by a third meta-predictor.

The training data are 1,335 non-redundant protein sequences, containing 230 long disordered regions with 25,958 residues, 983 short disordered regions with 9,632 residues, and 354,169 ordered residues.

The attributes used include amino acid frequencies, sequence complexity, ratio of net charge / hydrophobicity, averaged flexibility, and averaged PSI-BLAST* profiles calculated over symmetric input windows.

The version of this predictor presented at CASP6 also used secondary structure predictions; however, those attributes were later removed because the accuracy increase was minimal and those methods have high computational needs.

All three component predictors are logistic regression models built on balanced training sets. Attribute selection and window length optimization were performed independently for the three component predictors to maximize prediction accuracy.
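The way a meta-predictor can weight two logistic regression outputs is sketched below. The convex combination in `combine` is an assumption about how the weights are applied, and the function names, feature vectors, and coefficients are hypothetical; the trained VSL1 parameters are not given here.

```python
import math
from typing import Sequence


def logistic(features: Sequence[float],
             weights: Sequence[float],
             bias: float) -> float:
    """Logistic regression output in (0, 1) for one residue."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def combine(long_score: float, short_score: float, meta_weight: float) -> float:
    """Blend the long- and short-region predictions.

    Assumption: `meta_weight` in [0, 1] is the meta-predictor output and
    is used as the weight of the long-region predictor.
    """
    return meta_weight * long_score + (1.0 - meta_weight) * short_score
```

Under this reading, a residue that the meta-predictor places in a long disordered region (weight near 1) takes its score almost entirely from the long-region predictor.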

* Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402

 

DEPP (Disorder Enhanced Phosphorylation Predictor)

DEPP was trained on over 1500 known (i.e., experimentally confirmed) protein phosphorylation sites.

The observation that amino acid composition, sequence complexity, hydrophobicity, charge and other sequence attributes of regions adjacent to phosphorylation sites are very similar to those of intrinsically disordered protein regions suggests that disorder in and around the potential phosphorylation target site is an important prerequisite for phosphorylation.

Thus, DEPP uses disorder information to improve the discrimination between phosphorylation and non-phosphorylation sites. The accuracy of DEPP reaches 76.0 +/- 0.3%, 81.3 +/- 0.3%, and 83.3 +/- 0.3% for serine, threonine, and tyrosine, respectively.

Only residues with a prediction >0.5 are considered to be phosphorylated. The score generally approximates the probability that the residue is phosphorylated.

 

CDF (Cumulative Distribution Function)

CDF is used here on the PONDR® VL-XT to predict fully disordered or fully ordered proteins. The Cumulative Distribution Function (CDF) applied to PONDR® VL-XT predictions is a cumulative histogram of the PONDR® scores for a given protein. In this analysis, discrimination was made by means of a boundary. This boundary can be viewed as a measure of proportion of residues with high and low PONDR® scores.

Proteins with CDF curves above the boundary (i.e., a high proportion of low scores) are predicted to be ordered, proteins with curves below the boundary (i.e., a high proportion of high scores) are predicted to be disordered, and proteins with curves that cross the boundary are predicted to be mixtures of order and disorder.
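The CDF classification can be sketched as follows. The function names and the evaluation of the curve at nine evenly spaced score thresholds (0.1 to 0.9) are assumptions, and the boundary values must be supplied by the caller; the actual boundary used in the analysis is not given in this text.

```python
from typing import List


def score_cdf(scores: List[float], bins: int = 10) -> List[float]:
    """Cumulative fraction of residues with score <= k / bins, k = 1..bins-1.

    The final point (score <= 1.0) is omitted because it is always 1.
    """
    n = len(scores)
    return [sum(s <= k / bins for s in scores) / n for k in range(1, bins)]


def classify_by_cdf(cdf_values: List[float], boundary: List[float]) -> str:
    """Compare a CDF curve with a boundary given at the same thresholds."""
    if len(cdf_values) != len(boundary):
        raise ValueError("curve and boundary must have the same length")
    if all(c > b for c, b in zip(cdf_values, boundary)):
        return "ordered"      # curve entirely above the boundary
    if all(c < b for c, b in zip(cdf_values, boundary)):
        return "disordered"   # curve entirely below the boundary
    return "mixed"            # curve touches or crosses the boundary
```

A call such as `classify_by_cdf(score_cdf(vlxt_scores), boundary)` would label one protein, where `vlxt_scores` and `boundary` are hypothetical inputs.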

Three sets of proteins were used for this analysis: completely disordered proteins, completely ordered proteins, and ordered proteins that contain disordered residues. The sets of completely disordered proteins and completely ordered proteins are intended to serve as model sets, so that the relative characteristics of each could be determined. The set of completely disordered proteins was derived from a set provided by Keith Dunker and a set provided by Vladimir Uversky. The set of completely ordered proteins was derived from proteins in PDB. The set of ordered proteins with regions of disorder was derived from structures that contained a single chain and a unit cell with a primitive space group from PDB.

• Completely ordered proteins: 105 proteins/22,829 residues.
• Completely disordered proteins: 54 proteins/10,782 residues.
• Proteins with disordered regions: 64 proteins/23,785 residues, 4,074 of which are disordered.

 

Charge-Hydropathy analysis

A method of analysis that distinguishes ordered and disordered proteins based only on net charge and hydropathy was introduced by Uversky et al. (2000). These charge-hydropathy plots compare the absolute mean net charge (neglecting histidine) with the mean scaled Kyte-Doolittle hydropathy, where the hydropathy measure is scaled between 0 and 1. Ordered and disordered proteins plotted in this charge-hydropathy space can be mostly separated by a linear boundary. Here, this method is re-derived and evaluated for the classification of proteins. The datasets used in this section are the same as those described above for CDF.
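The two plot coordinates can be computed with a short sketch. The function name is hypothetical, and two details are assumptions: charges of +1 for K and R and -1 for D and E, and a linear rescaling of the Kyte-Doolittle scale from its range of -4.5 to 4.5 onto 0 to 1. The re-derived boundary itself is not reproduced here.

```python
from typing import Tuple

# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def charge_hydropathy(seq: str) -> Tuple[float, float]:
    """Return (absolute mean net charge, mean scaled hydropathy).

    Histidine is treated as neutral, as stated in the text. The min-max
    rescaling (value + 4.5) / 9 is an assumption about how the scale is
    mapped onto 0..1.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    mean_net_charge = abs(positive - negative) / n
    mean_hydropathy = sum((KD[aa] + 4.5) / 9.0 for aa in seq) / n
    return mean_net_charge, mean_hydropathy
```

Each protein then becomes a single point in charge-hydropathy space, and a linear boundary can be fitted to labeled points.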

The boundary between the ordered and disordered proteins was determined using a linear discriminant function, assuming normal distributions and equal covariance matrices. Note that the assumptions used in this analysis were not justified. However, experimentation with methods that do not require normally distributed data (e.g., neural networks) and methods that do not require equal covariance matrices (e.g., quadratic discrimination) yielded results equivalent to those obtained using linear discrimination. An estimate of the accuracy of this boundary for the discrimination of ordered and disordered proteins was made using the jack-knifing procedure described above. This gave an estimated overall classification accuracy of 83%, with 76% for disordered proteins and 91% for ordered proteins.

 

Bibliography

Garner E, Romero P, Dunker AK, Brown C, and Obradovic Z. (1999) Predicting binding regions within disordered proteins, Genome Informatics, 10, 41-50.

Li X, Romero P, Rani M, Dunker AK, and Obradovic Z. (1999) Predicting protein disorder for N-, C-, and internal regions, Genome Informatics, 10, 30-40.

Romero P, Obradovic Z, and Dunker AK. (1997) Sequence data analysis for long disordered regions prediction in the calcineurin family, Genome Informatics, 8, 110-124.

Uversky V, Gillespie J, and Fink A. (2000) Why are "natively unfolded" proteins unstructured under physiological conditions? Proteins: Struct. Funct. Gen., 41(3): 415-427.

Romero P, Obradovic Z, Li X, Garner E, Brown C, and Dunker AK. (2001) Sequence complexity of disordered protein, Proteins: Struct. Funct. Gen., 42, 38-48.

Vucetic S, Brown CJ, Dunker AK, and Obradovic Z. (2003) Flavors of protein disorder, Proteins, 52, 573-584.

Radivojac P, Obradovic Z, Brown CJ, and Dunker AK. (2003) Prediction of boundaries between intrinsically ordered and disordered protein regions, Pac. Symp. Biocomput., 8, 216-227.

Obradovic Z, Peng K, Vucetic S, Radivojac P, Brown CJ, and Dunker AK. (2003) Predicting intrinsic disorder from amino acid sequence, Proteins: Struct. Funct. Gen., 53, 566-572.