Molecular and Cellular Biology, November 1999, p. 7357-7368, Vol. 19, No. 11
Cold Spring Harbor Laboratory, Cold Spring
Harbor, New York 11724; Department of
Biological Chemistry, University of California, Irvine, California
92717; and Proteome, Inc., Beverly,
Massachusetts 01915
Received 15 June 1999/Returned for modification 16 July
1999/Accepted 28 July 1999
In this study, we examined yeast proteins by two-dimensional (2D)
gel electrophoresis and gathered quantitative information from about
1,400 spots. We found that there is an enormous range of protein
abundance and, for identified spots, a good correlation between protein
abundance, mRNA abundance, and codon bias. For each molecule of
well-translated mRNA, there were about 4,000 molecules of protein. The
relative abundance of proteins was measured in glucose and ethanol
media. Protein turnover was examined and found to be insignificant for
abundant proteins. Some phosphoproteins were identified. The behavior
of proteins in differential centrifugation experiments was examined.
Such experiments with 2D gels can give a global view of the yeast proteome.
The sequence of the yeast genome has
been determined (9). More recently, the number of mRNA
molecules for each expressed gene has been measured (27,
30). The next logical level of analysis is that of the expressed
set of proteins. We have begun to analyze the yeast proteome by using
two-dimensional (2D) gels.
2D gel electrophoresis separates proteins according to isoelectric
point in one dimension and molecular weight in the other dimension
(21), allowing resolution of thousands of proteins on a
single gel. Although modern imaging and computing techniques can
extract quantitative data for each of the spots in a 2D gel, there are
only a few cases in which quantitative data have been gathered from 2D
gels. 2D gel electrophoresis is almost unique in its ability to examine
biological responses over thousands of proteins simultaneously and
should therefore allow us a relatively comprehensive view of cellular metabolism.
We and others have worked toward assembling a yeast protein database
consisting of a collection of identified spots in 2D gels and of data
on each of these spots under various conditions (2, 7, 8, 10, 23,
25). These data could then be used in analyzing a protein or a
metabolic process. Saccharomyces cerevisiae is a good
organism for this approach since it has a well-understood physiology as
well as a large number of mutants, and its genome has been sequenced.
Given the sequence and the relative lack of introns in S. cerevisiae, it is easy to predict the sequence of the primary
protein product of most genes. This aids tremendously in identifying
these proteins on 2D gels.
There are three pillars on which such a database rests: (i)
visualization of many protein spots simultaneously, (ii) quantification of the protein in each spot, and (iii) identification of the gene product for each spot. Our first efforts at visualization and identification for S. cerevisiae have been described
elsewhere (7, 8). Here we describe quantitative data for
these proteins under a variety of experimental conditions.
Strains and media. S. cerevisiae W303
(MATa ade2-1 his3-11,15 leu2-3,112 trp1-1 ura3-1
can1-100) was used (26).

Isotopic labeling of yeast and preparation of cell extracts.
Yeast strains were labeled and proteins were extracted as described by
Garrels et al. (7, 8). Briefly, cells were grown to 5 × 10^6 cells per ml at 30°C; 1 ml of culture was
transferred to a fresh tube, and 0.3 mCi of
[35S]methionine (e.g., Express protein labeling mix; New
England Nuclear) was added to this 1-ml culture. The cells were
incubated for a further 10 to 15 min and then transferred to a 1.5-ml
microcentrifuge tube, chilled on ice, and harvested by centrifugation.
The supernatant was removed, and the cell pellet was resuspended in 100 µl of lysis buffer (20 mM Tris-HCl [pH 7.6], 10 mM NaF, 10 mM
sodium pyrophosphate, 0.5 mM EDTA, 0.1% deoxycholate; just before use, phenylmethylsulfonyl fluoride was added to 1 mM, leupeptin was added to
1 µg/ml, pepstatin was added to 1 µg/ml, tosylsulfonyl phenylalanyl
chloromethyl ketone was added to 10 µg/ml, and soybean trypsin
inhibitor was added to 10 µg/ml).
0270-7306/99/$04.00+0
Copyright © 1999, American Society for Microbiology. All rights reserved.
A Sampling of the Yeast Proteome
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
−Met YNB (yeast nitrogen base) medium was 1.7 g of YNB (Difco) per liter, 5 g of
ammonium sulfate per liter, and adenine, uracil, and all amino acids
except methionine; −Met −Cys YNB medium was the same but without
methionine or cysteine. Medium was supplemented with 2% glucose (for
most experiments) or with 2% ethanol (for ethanol experiments).
Low-phosphate YEPD was described by Warner (28).
70°C. About 5 µl of this
supernatant was used for each 2D gel.
2D polyacrylamide gels. 2D gels were made and run as described elsewhere (6-8).
Image analysis of the gels. The Quest II software system was used for quantitative image analysis (20, 22). Two techniques were used to collect quantitative data for analysis by Quest II software. First, before the advent of phosphorimagers, gels were dried and fluorographed. Each gel was exposed to film for three different times (typically 1 day, 2 weeks, and 6 weeks) to increase the dynamic range of the data. The films were scanned along with calibration strips to relate film optical density to disintegrations per minute in the gels and analyzed by the software to obtain a linear relationship between disintegrations per minute in the spots and optical densities of the film images. The quantitative data are expressed as parts per million of the total cellular protein. This value is calculated from the disintegrations per minute of the sample loaded onto the gel and by comparing the film density of each data spot with density of the film over the calibration strips of known radioactivity exposed to the same film. This yields the disintegrations per minute per millimeter for each spot on the gel and thence its parts-per-million value.
After the advent of phosphorimaging, gels bearing 35S-labeled proteins were exposed to phosphorimager screens and scanned by a Fuji phosphorimager, typically for two exposures per gel. Calibration strips of known radioactivity were exposed simultaneously. Scan data from the phosphorimager were assimilated by Quest II software, and quantitative data were recorded for the spots on the gels.

Measurements of protein turnover. Cells in exponential phase were pulse-labeled with [35S]methionine, excess cold Met and Cys were added, and samples of equal volume were taken from the culture at intervals up to 90 min (in one experiment) or up to 160 min (in a second experiment). Incorporation of 35S into protein was essentially 100% by the first sample (10 min). Extracts were made, and equal fractions of the samples were loaded on 2D gels (i.e., the different samples had different amounts of protein but equal amounts of 35S). Spots were quantitated with a phosphorimager and Quest software.
The software was queried for spots whose radioactivity decreased through the time course. The algorithm examined all data points for all spots, drew a best-fit line through the data points, and looked for spots where this line had a statistically significant negative slope. In one of the experiments, there was one such spot. To the eye, this was a minor, unidentified spot seen only in the first two samples (10 and 20 min). In the other experiment, the Quest software found no spots meeting the criteria. Therefore, we concluded that all of the identified spots (and all but one of the visible spots) represented proteins with long half-lives.

Centrifugal fractionation. Cells were labeled, harvested, and broken with glass beads by the standard method described above except that no detergent (i.e., no deoxycholate) was present in the lysis buffer. The crude lysate was cleared of unbroken cells and large debris by centrifugation at 300 × g for 30 s. The supernatant of this centrifugation was then spun at 16,000 × g for 10 min to give the pellet used for Fig. 6B. The supernatant of the 16,000 × g, 10-min spin was then spun at 100,000 × g for 30 min to give the supernatant used for Fig. 6A.
Protein abundance calculations. A haploid yeast cell contains about 4 × 10^-12 g of protein (1, 15). Assuming a mean protein mass of 50 kDa, there are about 50 × 10^6 molecules of protein per cell. There are about 1.8 methionines per 10 kDa of protein mass, which implies 4.5 × 10^8 molecules of methionine per cell (neglecting the small pool of free Met). We measured (i) the counts per minute in each spot on the 2D gels, (ii) the total number of counts on each gel (by integrating counts over the entire gel), and (iii) the total number of counts loaded on the gel (by scintillation counting of the original sample). Thus, we know what fraction of the total incorporated radioactivity is present in each spot. After correcting for the methionine (and cysteine [see below]) content of each protein, we calculated an absolute number of protein molecules based on the fraction of radioactivity in each spot and on 50 × 10^6 total molecules per cell.
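The arithmetic in this paragraph can be checked with a short sketch. The constants come from the text; the function names and the helper `molecules_per_cell` are hypothetical, and the sketch corrects for methionine only (it ignores cysteine), so it illustrates the reasoning rather than reproducing the actual calculation.

```python
AVOGADRO = 6.022e23             # molecules per mole
PROTEIN_PER_CELL_G = 4e-12      # grams of protein per haploid cell (from the text)
MEAN_PROTEIN_MASS_DA = 50_000   # assumed mean protein mass, 50 kDa (from the text)
MET_PER_10_KDA = 1.8            # methionines per 10 kDa of protein (from the text)


def total_protein_molecules():
    """Protein molecules per cell from protein mass per cell and mean mass."""
    return PROTEIN_PER_CELL_G / MEAN_PROTEIN_MASS_DA * AVOGADRO


def total_methionines(n_proteins=50e6):
    """Methionine residues per cell, using the rounded 50 x 10^6 proteins."""
    return n_proteins * (MEAN_PROTEIN_MASS_DA / 10_000) * MET_PER_10_KDA


def molecules_per_cell(spot_counts, total_counts, met_in_protein, n_proteins=50e6):
    """Estimate copies per cell of one protein from its share of the label.

    Hypothetical helper: the spot's fraction of incorporated 35S is treated as
    its fraction of all methionine residues, then divided by the number of
    methionines in that protein. Cysteine labeling is ignored here.
    """
    fraction = spot_counts / total_counts
    return fraction * total_methionines(n_proteins) / met_in_protein


print(f"{total_protein_molecules():.2e} protein molecules per cell")
print(f"{total_methionines():.2e} methionines per cell")
```

The first figure comes out near 4.8 × 10^7, which the text rounds to 50 × 10^6. A spot holding 0.1% of the counts from a protein with nine methionines would correspond to about 50,000 molecules per cell under these assumptions.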
mRNA abundance calculations. For estimation of mRNA abundance, we used SAGE (serial analysis of gene expression) data (27) and Affymetrix chip hybridization data (29a, 30). The mRNA column in Table 1 shows mRNA abundance calculated from SAGE data alone. However, the SAGE data came from cells growing in YEPD medium, whereas our protein measurements were from cells growing in YNB medium. In addition, SAGE data for low-abundance mRNAs suffers from statistical variation. Therefore, we also used chip hybridization data (29a, 30) for mRNA from cells grown in YNB. These hybridization data also had disadvantages. First, the amounts of high-abundance mRNAs were systematically underestimated, probably because of saturation in the hybridizations, which used 10 µg of cRNA. For example, the abundance of ADH1 mRNA was 197 copies per cell by SAGE but only 32 copies per cell by hybridization, and the abundance of ENO2 mRNA was 248 copies per cell by SAGE but only 41 by hybridization. When the amount of cRNA used in the hybridization was reduced to 1 µg, the apparent amounts of mRNA were similar to the amounts determined by SAGE (29a, 29b). However, experiments using 1 µg of cRNA have been done for only some genes (29a). Because amounts of mRNA were normalized to 15,000 per cell, and because the amounts of abundant mRNAs were underestimated, there is a 2.2-fold overestimate of the abundance of nonabundant mRNAs. We calculated this factor of 2.2 by adding together the number of mRNA molecules from a large number of genes expressed at a low level for both SAGE data and hybridization data. The sum for the same genes from hybridization data is 2.2-fold greater than that from SAGE data.
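The 2.2-fold factor described above is a ratio of sums over genes expressed at a low level. A minimal sketch follows; the function name, the cutoff parameter, and the gene names and copy numbers in the example are all made up for illustration and are not taken from either data set.

```python
def hybridization_to_sage_factor(sage, hybridization, low_abundance_cutoff=2.0):
    """Ratio of summed hybridization copies to summed SAGE copies.

    `sage` and `hybridization` map gene name -> mRNA copies per cell.
    `low_abundance_cutoff` (copies per cell by SAGE) is a hypothetical
    threshold for what counts as a low-abundance mRNA.
    """
    genes = [g for g in sage
             if g in hybridization and 0 < sage[g] <= low_abundance_cutoff]
    if not genes:
        raise ValueError("no low-abundance genes shared by both data sets")
    return sum(hybridization[g] for g in genes) / sum(sage[g] for g in genes)


# Illustrative numbers only; "geneC" is excluded because it is abundant.
sage_copies = {"geneA": 1.0, "geneB": 2.0, "geneC": 200.0}
chip_copies = {"geneA": 2.5, "geneB": 4.1, "geneC": 35.0}
factor = hybridization_to_sage_factor(sage_copies, chip_copies)
print(round(factor, 2))
```

Dividing the hybridization values for low-abundance mRNAs by such a factor puts them on the same scale as the SAGE values.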
To take into account these difficulties, we compiled a list of "adjusted" mRNA abundance as follows. For all high-abundance mRNAs of our identified proteins, we used SAGE data. For all of these particular mRNAs, chip hybridization suggested that mRNA abundance was the same in YEPD and YNB media. For medium-abundance mRNAs, SAGE data were used, but when hybridization data showed a significant difference between YEPD and YNB, then the SAGE data were adjusted by the appropriate factor. Finally, for low-abundance mRNAs, we used data from chip hybridizations from YNB medium but divided by 2.2 to normalize to the SAGE results. These calculations were completed without reference to protein abundance.

CAI. The codon adaptation index (CAI) was taken from the yeast proteome database (YPD) (13), for which calculations were made according to Sharp and Li (24). Briefly, the index uses a reference set of highly expressed genes to assign a value to each codon, and then a score for a gene is calculated from the frequency of use of the various codons in that gene (24).
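The CAI calculation can be sketched as follows. In this formulation each codon's value w is its count in the reference set divided by the count of the most-used synonymous codon, and the gene's score is the geometric mean of w over its codons. The function names, the two-amino-acid codon table, the reference counts, and the 0.5 floor for unseen codons are assumptions for illustration, not values from YPD.

```python
import math


def relative_adaptiveness(reference_counts, synonymous_families):
    """Assign w to each codon from counts in a reference set of genes."""
    w = {}
    for codons in synonymous_families.values():
        top = max(reference_counts.get(c, 0) for c in codons)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in codons:
            # Assumption: unseen codons get a floor of 0.5 counts so log(w) is defined.
            w[c] = max(reference_counts.get(c, 0), 0.5) / top
    return w


def cai(sequence, w):
    """Geometric mean of w over the codons of an in-frame coding sequence."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy example: lysine (AAA, AAG) and glutamate (GAA, GAG) only.
families = {"K": ["AAA", "AAG"], "E": ["GAA", "GAG"]}
reference = {"AAA": 10, "AAG": 40, "GAA": 30, "GAG": 10}  # made-up counts
weights = relative_adaptiveness(reference, families)
print(cai("AAGGAA", weights), cai("AAAAAG", weights))
```

A gene built only from the preferred codons scores 1.0; each use of a less-preferred codon pulls the score down. Codons with no synonyms (Met, Trp) are left out of the table and so do not contribute.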
Statistical analysis. The JMP program was used with the aid of T. Tully. The JMP program showed that neither mRNA nor protein abundances were normally distributed; therefore, Spearman rank correlation coefficients (rs) were calculated. The mRNA (adjusted and unadjusted) and protein data were also transformed so that Pearson product-moment correlation coefficients (rp) could be calculated. First, this was done by a Box-Cox transformation of log-transformed data. This transformation produced normal distributions, and an rp of 0.76 was achieved. However, because the Box-Cox transformation is complex, we also did a simpler logarithmic transformation. This produced a normal distribution for the protein data. However, the distribution for the mRNA and adjusted mRNA data was close to, but not quite, normal. Nevertheless, we calculated the rp and found that it was 0.76, identical to the coefficient from the Box-Cox transformed data. We therefore believe that this correlation coefficient is not misleading, despite the fact that the log(mRNA) distribution is not quite normal.
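The two correlation measures used here can be sketched in plain Python: Spearman's coefficient is the Pearson coefficient of the ranks, and the simpler parametric approach is a Pearson coefficient on log-transformed values. This is an illustrative sketch, not the JMP analysis; the Box-Cox step is omitted, and the function names and example numbers are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pearson_log(x, y):
    """Pearson correlation after a log10 transformation of both variables."""
    return pearson([math.log10(v) for v in x], [math.log10(v) for v in y])


# Made-up abundances (copies per cell), for illustration only.
mrna = [1, 3, 10, 40, 200]
protein = [9000, 5000, 60000, 150000, 900000]
print(spearman(mrna, protein), pearson_log(mrna, protein))
```

Because the rank-based coefficient depends only on ordering, it is unaffected by the strongly skewed distributions of the raw abundances.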
RESULTS
Visualization of 1,400 spots on three gel systems. Yeast proteins have isoelectric points ranging from 3.1 to 12.8, and masses ranging from less than 10 kDa to 470 kDa. It is difficult to examine all proteins on a single kind of gel, because a gel with the needed range in pI and mass would give poor resolution of the thousands of spots in the central region of the gel. Therefore, we have used three gel systems: (i) pH "4 to 8" with 10% polyacrylamide; (ii) pH "3 to 10" with 10% polyacrylamide; and (iii) nonequilibrium with 15% polyacrylamide (7, 8). Each gel system allows good resolution of a subset of yeast proteins.
Figure 1 shows a pH 4-8, 10% polyacrylamide gel. The pH at the basic end of the isoelectric focusing gel cannot be maintained throughout focusing, and so the proteins resolved on such gels have isoelectric points between pH 4 and pH 6.7. For these pH 4-8 gels, we see 600 to 900 spots on the best gels after multiple exposures.
Spot identification. The identification of various spots has been described elsewhere (7, 8). At present, 169 different spots representing 148 proteins have been identified. Many of these spots have been independently identified (2, 10, 23, 25). The main methods used in spot identification have been analysis of amino acid composition, gene overexpression, peptide sequencing, and mass spectrometry.
Pulse-chase experiments and protein turnover. Pulse-chase experiments were done to measure protein half-lives (Materials and Methods). Cells were labeled with [35S]methionine for 10 min, and then an excess of unlabeled methionine was added. Samples were taken at 0, 10, 20, 30, 60, and 90 min after the beginning of the chase. Equal amounts of 35S were loaded from each sample; 2D gels were run, and spots were quantitated. Surprisingly, almost every spot was nearly constant in amount of radioactivity over the entire time course (not shown). A few spots shifted from one position to another because of posttranslational modifications (e.g., phosphorylation of Rpa0 and Efb1). Thus, the proteins being visualized are all or nearly all very stable proteins, with half-lives of more than 90 min. Gygi et al. (10) have come to a similar conclusion by using the N-end rule to predict protein half-lives. This result does not imply that all yeast proteins are stable. The proteins being visualized are abundant proteins; this is partly because they are stable proteins.
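To make the link between a chase time course and a half-life concrete, the sketch below fits a straight line to the logarithm of a spot's radioactivity against time and converts the slope to a half-life. It assumes simple first-order decay and does not test the slope for statistical significance, so it is not the Quest procedure; the function names and the example counts are hypothetical.

```python
import math


def decay_slope(times_min, counts):
    """Least-squares slope of ln(counts) against time, per minute."""
    logs = [math.log(c) for c in counts]
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times_min, logs))
    return sxy / sxx


def half_life_min(times_min, counts):
    """Half-life in minutes; infinite when the fitted slope is not negative."""
    slope = decay_slope(times_min, counts)
    if slope >= 0:
        return math.inf
    return -math.log(2) / slope


chase_times = [0, 10, 20, 30, 60, 90]  # sampling times from the text
unstable = [1000 * 0.5 ** (t / 30) for t in chase_times]  # made-up 30-min half-life
print(half_life_min(chase_times, unstable))
```

A spot whose counts stay flat over the 90-min chase gives a slope near zero, and hence a fitted half-life far longer than the chase itself.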
Protein quantitation.
Because all of the proteins seen had
effectively the same half-life, the abundance of each protein was
directly proportional to the amount of radioactivity incorporated
during labeling. Thus, after taking into account the total number of
protein molecules per cell, the average content of methionine and
cysteine, and the methionine and cysteine content of each identified
protein, we could calculate the abundance of each identified protein
(Tables 1 and
2; Materials and Methods). About 1,000 unidentified proteins were also quantified, assuming an average content
of Met and Cys.
Correlation of protein abundance with mRNA abundance. Estimates of mRNA abundance for each gene have been made by SAGE (27) and by hybridization of cRNA to oligonucleotide arrays (30). These two methods give broadly similar results, yet each method has strengths and weaknesses (Materials and Methods). Table 1 lists the number of molecules of mRNA per cell for each gene studied. One measurement (mRNA) uses data from SAGE analysis alone (27); a second incorporates data from both SAGE and hybridization (30) (adjusted mRNA) (Table 1; Materials and Methods). We correlated protein abundance with mRNA abundance (Fig. 2). For adjusted mRNA versus protein, the Spearman rank correlation coefficient, rs, was 0.74 (P < 0.0001), and the Pearson correlation coefficient, rp, on log transformed data (Materials and Methods) was 0.76 (P < 0.00001). We obtained similar correlations for mRNA versus protein and also for other data transformations (Materials and Methods). Thus, several statistical methods show a strong and significant correlation between mRNA abundance and protein abundance. Of course, the correlation is far from perfect; for mRNAs of a given abundance, there is at least a 10-fold range of protein abundance (Fig. 2). Some of this scatter is probably due to posttranscriptional regulation, and some is due to errors in the mRNA or protein data. For example, the protein Yef3 runs poorly on our gels, giving multiple smeared spots. Its abundance has probably been underestimated, partly explaining the low protein/mRNA ratio of Yef3. It is the most extreme outlier in Fig. 2.
Correlation of codon bias with protein abundance.
The mRNAs
for highly expressed proteins preferentially use some codons rather
than others specifying the same amino acid (14). This
preference is called codon bias. The codons preferred are those for
which the tRNAs are present in the greatest amounts. Use of these
codons may make translation faster or more efficient and may decrease
misincorporation. These effects are most important for the cell for
abundant proteins, and so codon bias is most extreme for abundant
proteins. The effect can be dramatic: highly biased mRNAs may use only
25 of the 61 codons.
Changes in protein abundance in glucose and ethanol. A comparison of cells grown in glucose (Fig. 1A) with cells grown in ethanol (Fig. 1B) is shown in Table 1. As is well known, some proteins are induced tremendously during growth on ethanol. Two striking examples are the peroxisomal enzymes Icl1 (isocitrate lyase) and Cit2 (citrate synthase), which are induced in ethanol by more than 100- and 12-fold, respectively (Fig. 1; Table 1). These enzymes are key components of the glyoxylate shunt, which diverts some acetyl coenzyme A (acetyl-CoA) from the tricarboxylic acid cycle to gluconeogenesis. S. cerevisiae requires large amounts of carbohydrate for its cell wall; in ethanol medium, this carbohydrate comes from gluconeogenesis, which depends on the glyoxylate shunt and on the glycolytic pathway running in reverse. The need for gluconeogenesis also explains why glycolytic enzymes are abundant even in ethanol medium. Thus, 2D gel analysis shows the prominence of the glycolytic and glyoxylate shunt enzymes in cells grown on ethanol, emphasizing that gluconeogenesis, presumably largely for production of the cell wall, is a major metabolic activity under these conditions.
During gluconeogenesis, substrate-product relationships are reversed for the glycolytic enzymes. One might expect that not all glycolytic enzymes would be well adapted to the reverse reaction. Indeed, 2D gels show that in ethanol, Adh2 (alcohol dehydrogenase 2) is strongly induced (16), while its isozyme Adh1 is not greatly affected. Adh1 and Adh2 each interconvert acetaldehyde and ethanol. Adh1 has a relatively high Km for ethanol (17 mM), while Adh2 has a lower Km (0.8 mM) (5). Thus, it is thought that Adh1 is specialized for glycolysis (acetaldehyde to ethanol), while Adh2 is specialized for respiration (ethanol to acetaldehyde) (5, 29). Similarly, Eno1 (enolase 1) is induced in ethanol, while its isozyme Eno2 (enolase 2) decreases in abundance (Table 1) (4, 19). Eno1 is inhibited by 2-phosphoglycerate (the glycolytic substrate), while Eno2 is inhibited by phosphoenolpyruvate (the gluconeogenic substrate) (4). Perhaps Eno1 has a lower Km for phosphoenolpyruvate than does Eno2, though to our knowledge this has not been tested. Thus, the 2D gels distinguish isozymes specialized for growth on glucose (Adh1 and Eno2) from isozymes specialized for ethanol (Adh2 and Eno1).

Many heat shock proteins (e.g., Hsp60, Hsp82, Hsp104, and Kar2) were about twofold more abundant in ethanol medium than in glucose medium. This is consistent with the increased heat resistance of cells grown in ethanol (3). Enzymes involved in protein synthesis (Eft1, Rpa0, and Tif1) were about twice as abundant in glucose medium as in ethanol medium. This may reflect the higher growth rate of the cells in glucose.

Phosphorylation of proteins. To examine protein phosphorylation, we labeled cells with 32P and ran 2D gels to examine phosphoproteins. About 300 distinct spots, probably representing 150 to 200 proteins, could be seen on pH 4-8 gels (Fig. 5B). We then aligned autoradiograms of three gels, each with a different kind of labeled protein (32P only [Fig. 5B], 32P plus 35S [Fig. 5A], and 35S only [not shown, but see Fig. 1 for example]). In this way, we made provisional identification of some of the 32P-labeled spots as particular 35S-labeled spots. All such identifications are somewhat uncertain, since precise alignments are difficult, and of course multiple spots may exactly comigrate. Nevertheless, we believe that most of the provisional identifications are probably correct. Among the major 32P-labeled proteins are the hexokinases Hxk1 and Hxk2, the acidic ribosome-associated protein Rpa0, the translation factors Yef3 and Efb1, and probably Hsp70 heat shock proteins of the Ssa and Ssb families. Rpa0 and Efb1 are quantitatively monophosphorylated.
Many yeast proteins resolve into multiple spots on these 2D gels (7). Yef3 has five or more spots, at least four of which comigrate with 32P. Tpi1 has a major spot showing no 32P labeling and a minor, more acidic spot which overlaps with some 32P label. Tif1 has at least seven spots (7); two of these overlap with some 32P label, but five do not (Fig. 5). Eft1 has at least three spots (7), and none of these overlap with 32P, although there are three nearby, unidentified 32P-labeled spots (a, c, and d in Fig. 5). Spots that seem to be extra forms of Met6, Pdc1, Eno2, and Fba1 can be seen in Fig. 6A, but there is little 32P at these positions in Fig. 5. Thus, phosphorylation explains some but not all of the different protein isoforms seen.
α-factor, in cells synchronized in G1
by depletion of G1 cyclins, and in cells synchronized in M
phase with nocodazole. Only very minor differences were seen, and these
were difficult to reproduce. The cell cycle proteins regulated by
phosphorylation may not be abundant enough for this technique to be
applied easily.
Centrifugal fractionation. We fractionated 35S-labeled extracts by centrifugation (Materials and Methods). Figure 6A shows the proteins in the supernatant of a high-speed (100,000 × g, 30 min) centrifugation, while Fig. 6B shows the proteins in the pellet of a low-speed (16,000 × g, 10 min) centrifugation. Many proteins are tremendously enriched in one fraction or the other, while others are present in both. Most glycolytic enzymes (e.g., Tdh2, Tdh3, Eno2, Pdc1, Adh1, and Fba1) are enriched in the supernatant fraction. The only exception is Pfk1 (not indicated), which is found in both pellet and supernatant fractions. Many proteins involved in protein synthesis (Eft1, Yef3, Prt1, Tif1, and Rpa0) are in the pellet, possibly because of the association of ribosomes with the endoplasmic reticulum. However, Efb1 is in the supernatant, as is a substantial portion of the Eft1. Perhaps surprisingly, several mitochondrial proteins (Atp2 [not shown] and Ilv5) are largely in the supernatant. Perhaps glass bead breakage of cells releases mitochondrial proteins. The nuclear protein Gsp1 is in the pellet fraction. The enrichment produced by centrifugation makes it possible to see minor spots which are otherwise poorly resolved from surrounding proteins. Figure 6B shows that the previously identified Tif1 spot is surrounded by as many as six other spots that cofractionate. We observed six identical or very similar additional spots when we overexpressed Tif1 from a high-copy-number plasmid (not shown). Signal overlaps only one or two of these spots in 32P-labeling experiments (Fig. 5), and so the different forms are not mainly due to different phosphorylation states.
DISCUSSION
Our experience with developing a 2D gel protein database for S. cerevisiae is summarized here. With current technology, we can see the most abundant 1,200 proteins, which is about one-third to one-quarter of the proteins expressed. The remaining proteins will be difficult to see and study with the methods that we have used, not because of a lack of sensitivity but because weak spots are covered by nearby strong spots.
Of the 1,200 proteins seen, we have identified 148, with a bias toward the most abundant proteins. Steady application of the methods already used would allow identification of most of the remaining proteins. Gene overexpression will be particularly useful, since it is not affected by the lower abundance of the remaining visible proteins.
2D gels of the kind that we have used are not suitable for visualization of rare proteins. However, it will be possible to study on a global basis metabolic processes involving relatively abundant proteins, such as protein synthesis, glycolysis, gluconeogenesis, amino acid synthesis, cell wall synthesis, nucleotide synthesis, lipid metabolism, and the heat shock response.
Gygi et al. (10) have recently completed a study similar to ours. Despite generating broadly similar data, Gygi et al. reached markedly different conclusions. We believe that both mRNA abundance and codon bias are useful predictors of protein abundance. However, Gygi et al. feel that mRNA abundance is a poor predictor of protein abundance and that "codon bias is not a predictor of either protein or mRNA levels" (10). These different conclusions are partly a matter of viewpoint. Gygi et al. focus on the fact that the correlations of mRNA and codon bias with protein abundance are far from perfect, while we focus on the fact that, considering the wide range of mRNA and protein abundance and the undoubted presence of other mechanisms affecting protein abundance, the correlations are quite good.
However, the different conclusions are also partly due to different methods of statistical analysis and to real differences in data. With respect to statistics, Gygi et al. used the Pearson product-moment correlation coefficient (rp) to measure the covariance of mRNA and protein abundance. Depending on the subset of data included, their rp values ranged from 0.1 to 0.94. Because of the low rp values with some subsets of the data, Gygi et al. concluded that the correlation of mRNA to protein was poor. However, the rp correlation is a parametric statistic and so requires variates following a bivariate normal distribution; that is, it would be valid only if both mRNA and protein abundances were normally distributed. In fact, both distributions are very far from normal (data not shown), and so a calculation of rp is inappropriate. There was no statistical backing for the assertion that codon bias fails to predict protein abundance.
We have taken two statistical approaches. First, we have used the Spearman rank correlation coefficient (rs). Since this statistic is nonparametric, there is no requirement for the data to be normally distributed. Using the rs, we find that mRNA abundance is well correlated with protein abundance (rs = 0.74), and the CAI is also well correlated with protein abundance (rs = 0.80) (and also with mRNA abundance [data not shown]). For the data of Gygi et al. (10), we obtained similar results, though with their data the correlation is not as good; rs = 0.59 for the mRNA-to-protein correlation, and rs = 0.59 for the codon bias-to-protein correlation.
In a second approach, we transformed the mRNA and protein data to forms where they were normally distributed, to allow calculation of an rp (Materials and Methods). Two transformations, Box-Cox and logarithmic, were used; both gave good correlations with our data [e.g., rp = 0.76 for log(adjusted RNA) to log(protein)]. We were not able to transform the data of Gygi et al. to a normal distribution.
Finally, there are also some differences in data between the two studies. These may be partly due to the different measurement techniques used: Gygi et al. measured protein abundance by cutting spots out of gels and measuring the radioactivity in each spot by scintillation counting, whereas we used phosphorimaging of intact gels coupled to image analysis. We compared our data to theirs for the proteins common between the studies (but excluding proteins whose mRNAs are known to differ between rich and minimal media, and excluding Tif1, which was anomalous in differing by 100-fold between the two data sets). The rs between the two protein data sets was 0.88 (P < 0.0001). Although this is a strong correlation, the fact that it is less than 1.0 suggests that there may have been errors in measuring protein abundance in one or both studies. After normalizing the two data sets to assume the same amount of protein per cell, we found a systematic tendency for the protein abundance data of Gygi et al. to be slightly higher than ours for the highest-abundance proteins and also for the lowest-abundance proteins but slightly lower than ours for the middle-abundance proteins. These systematic differences suggest some systematic errors in protein measurement. Although we do not know what the errors are, we suggest the following as a reasonable speculation. For the highest-abundance proteins, we may have underestimated the amount of protein because of a slightly nonlinear response of the phosphorimager screens. For the lowest-abundance proteins, Gygi et al. may have overestimated the amount of protein because of difficulties in accurately cutting very small spots out of the gel and because of difficulties in background subtraction for these small, weak spots. The difference in the middle abundance proteins may be a consequence of normalization, given the two errors above.
The low-abundance proteins in the data set of Gygi et al. have a poor correlation with mRNA abundance. We calculate that the rs is 0.74 for the top 54 proteins of Gygi et al. but only 0.22 for the bottom 53 proteins, a statistically significant difference. However, with our data set, the rs is 0.62 for the top 33 proteins and 0.56 (not significantly different) for the bottom 33 proteins (which are comparable in abundance to the bottom 53 proteins of Gygi et al.). Thus, our data set maintains a good correlation between mRNA and protein abundance even at low protein abundance. This is consistent with our speculation that protein quantification by phosphorimaging and image analysis may be more accurate for small, weak spots than is cutting out spots followed by scintillation counting. Our relatively good correlations even for nonabundant proteins may also reflect the fact that we used both SAGE data and RNA hybridization data, which is most helpful for the least abundant mRNAs. In summary, we feel that the poor correlation of protein to mRNA for the nonabundant proteins of Gygi et al. may reflect difficulty in accurately measuring these nonabundant proteins and mRNAs, rather than indicating a truly poor correlation in vivo. It is not surprising that observed correlations would be poorer with less-abundant proteins and mRNAs, simply because the accuracy of measurement would be worse.
How well can mRNA abundance predict protein abundance? With rp = 0.76 for logarithmically transformed mRNA and protein data, the coefficient of determination, (rp)^2, is 0.58. This means that more than half (in log space) of the variation in protein abundance is explained by variation in mRNA abundance. When converted back to arithmetic values, protein abundances vary over about 200-fold (Table 1), and (rp)^2 = 0.58 for the log data means that of this 200-fold variation, about 20-fold is explained by variation in the abundance of mRNA and about 10-fold is unexplained (but could be due partly to measurement errors). For proteins much less abundant than those considered here, we imagine the in vivo correlation between mRNA and protein abundance will be worse, and other regulatory mechanisms such as protein turnover will be more important.
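The arithmetic behind these fold values can be reproduced with a short sketch. It assumes, as an interpretation rather than a method stated in the text, that the 200-fold range is apportioned on a log scale in proportion to the coefficient of determination; the function name is hypothetical.

```python
import math


def split_fold_variation(total_fold, r):
    """Split a fold range into explained and unexplained factors.

    Assumption: the fraction r**2 of the log-scale range is attributed
    to the predictor (mRNA abundance) and the remainder is unexplained.
    """
    r_squared = r * r
    log_total = math.log10(total_fold)
    explained = 10 ** (r_squared * log_total)
    unexplained = 10 ** ((1.0 - r_squared) * log_total)
    return r_squared, explained, unexplained


# Values taken from the text: rp = 0.76 and an approximately 200-fold range.
r_squared, explained, unexplained = split_fold_variation(200.0, 0.76)
print(f"coefficient of determination: {r_squared:.2f}")
print(f"explained fold variation:     {explained:.1f}")
print(f"unexplained fold variation:   {unexplained:.1f}")
```

Because the split is made in log space, the two factors multiply back to the full 200-fold range, which is consistent with the rounded figures of about 20-fold and about 10-fold.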
Some important conclusions can be drawn from this sampling of the proteome. First, there is an enormous range of protein abundance, from nearly 2,000,000 molecules per cell for some glycolytic enzymes to about 100 per cell for some cell cycle proteins (26a). Second, about half of all cellular protein is found in fewer than 100 different gene products, which are mostly involved in carbohydrate metabolism or protein synthesis. Third, the correlation between protein abundance and CAI is log linear as far as we can see, which is from about 10,000 protein molecules per cell to about 1,000,000. This is somewhat surprising, because it implies that selective forces for codon bias are significant even at moderate expression levels. It also means that codon bias is a useful predictor of protein abundance even for moderately low bias proteins. Fourth, there is a good correlation between protein abundance and mRNA abundance for the proteins that we have studied. This validates the use of mRNA abundance as a rough predictor of protein abundance, at least for relatively abundant proteins. Fifth, for these abundant proteins, there are about 4,000 molecules of protein for each molecule of mRNA. This last conclusion raises questions as to how the levels of nonabundant proteins are regulated and suggests that protein instability, regulated translation, suboptimal rates of translation, and other mechanisms in addition to transcriptional control may be very important for these proteins.
ACKNOWLEDGMENTS
We thank Neena Sareen and Nick Bizios (CSHL 2D gel laboratory) for production of 2D gels, Tom Volpe for help with some experiments, Corine Driessens for help with calculations and statistics, and Herman Wijnen and Nick Edgington for comments on the manuscript. We especially thank Tim Tully for in-depth statistical analysis and for insightful discussions on statistical interpretations.
This work was supported by grant P41-RR02188 from the NIH Biomedical Research Technology Program, Division of Research Resources, to J.I.G., by Small Business Innovation Research grant R44 GM54110 to Proteome, Inc., by grant DAMD17-94-J4050 from the Army Breast Cancer Program to B.F., and by NIH grant RO1 GM45410 to B.F.
FOOTNOTES
* Corresponding author. Mailing address: Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724. Phone: (516) 367-8828. Fax: (516) 367-8369. E-mail: futcher@cshl.org.
REFERENCES
| | |
|---|---|
| 1. | Baroni, M. D., E. Martegani, P. Monti, and L. Alberghina. 1989. Cell size modulation by CDC25 and RAS2 genes in Saccharomyces cerevisiae. Mol. Cell. Biol. 9:2715-2723. |
| 2. | Boucherie, H., F. Sagliocco, R. Joubert, I. Maillet, J. Labarre, and M. Perrot. 1996. Two-dimensional gel protein database of Saccharomyces cerevisiae. Electrophoresis 17:1683-1699. |
| 3. | Elliott, B., and B. Futcher. 1993. Stress resistance of yeast cells is largely independent of cell cycle phase. Yeast 9:33-42. |
| 4. | Entian, K. D., B. Meurer, H. Kohler, K. H. Mann, and D. Mecke. 1987. Studies on the regulation of enolases and compartmentation of cytosolic enzymes in Saccharomyces cerevisiae. Biochim. Biophys. Acta 923:214-221. |
| 5. | Ganzhorn, A. J., D. W. Green, A. D. Hershey, R. M. Gould, and B. V. Plapp. 1987. Kinetic characterization of yeast alcohol dehydrogenases. Amino acid residue 294 and substrate specificity. J. Biol. Chem. 262:3754-3761. |
| 6. | Garrels, J. I. 1989. The Quest system for quantitative analysis of two-dimensional gels. J. Biol. Chem. 264:5269-5282. |
| 7. | Garrels, J. I., B. Futcher, R. Kobayashi, G. I. Latter, B. Schwender, T. Volpe, J. R. Warner, and C. S. McLaughlin. 1994. Protein identifications for a Saccharomyces cerevisiae protein database. Electrophoresis 15:1466-1486. |
| 8. | Garrels, J. I., C. S. McLaughlin, J. R. Warner, B. Futcher, G. I. Latter, R. Kobayashi, B. Schwender, T. Volpe, D. S. Anderson, R. Mesquita-Fuentes, and W. E. Payne. 1997. Proteome studies of S. cerevisiae: identification and characterization of abundant proteins. Electrophoresis 18:1347-1360. |
| 9. | Goffeau, A., B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver. 1996. Life with 6000 genes. Science 274:563-567. |
| 10. | Gygi, S. P., Y. Rochon, B. R. Franza, and R. Aebersold. 1999. Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19:1720-1730. |
| 11. | Hereford, L. M., and M. Rosbash. 1977. Number and distribution of polyadenylated RNA sequences in yeast. Cell 10:453-462. |
| 12. | Herrick, D., R. Parker, and A. Jacobson. 1990. Identification and comparison of stable and unstable mRNAs in Saccharomyces cerevisiae. Mol. Cell. Biol. 10:2269-2284. |
| 13. | Hodges, P. E., A. H. McKee, B. P. Davis, W. E. Payne, and J. I. Garrels. 1999. The Yeast Proteome Database (YPD): a model for the organization of genome-wide functional data. Nucleic Acids Res. 27:69-73. |
| 14. | Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13-34. |
| 15. | Johnston, G. C., J. R. Pringle, and L. H. Hartwell. 1977. Coordination of growth with cell division in the yeast S. cerevisiae. Exp. Cell Res. 105:79-98. |
| 16. | Johnston, M., and M. Carlson. 1992. Regulation of carbon and phosphate utilization, p. 193-281. In E. Jones, J. Pringle, and J. Broach (ed.), The molecular and cellular biology of the yeast Saccharomyces. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. |
| 17. | Kornblatt, M. J., and A. Klugerman. 1989. Characterization of the enolase isozymes of rabbit brain: kinetic differences between mammalian and yeast enolases. Biochem. Cell. Biol. 67:103-107. |
| 17a. | Latter, G., and B. Futcher. Unpublished data. |
| 18. | Mathews, M. B., N. Sonenberg, and J. W. B. Hershey. 1996. Origins and targets of translational control, p. 1-29. In J. W. B. Hershey, M. B. Mathews, and N. Sonenberg (ed.), Translational control. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. |
| 19. | McAlister, L., and M. J. Holland. 1982. Targeted deletion of a yeast enolase structural gene. Identification and isolation of yeast enolase isozymes. J. Biol. Chem. 257:7181-7188. |
| 20. | Monardo, P. J., T. Boutell, J. I. Garrels, and G. I. Latter. 1994. A distributed system for two-dimensional gel analysis. Comput. Appl. Biosci. 10:137-143. |
| 21. | O'Farrell, P. H. 1975. High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 250:4007-4021. |
| 22. | Patterson, S. D., and G. I. Latter. 1993. Evaluation of storage phosphor imaging for quantitative analysis of 2-D gels using the Quest II system. BioTechniques 15:1076-1083. |
| 23. | Sagliocco, F., J. C. Guillemot, C. Monribot, J. Capdevielle, M. Perrot, E. Ferran, P. Ferrara, and H. Boucherie. 1996. Identification of proteins of the yeast protein map using genetically manipulated strains and peptide-mass fingerprinting. Yeast 12:1519-1533. |
| 24. | Sharp, P. M., and W. H. Li. 1987. The Codon Adaptation Index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281-1295. |
| 25. | Shevchenko, A., O. N. Jensen, A. V. Podtelejnikov, F. Sagliocco, M. Wilm, O. Vorm, P. Mortensen, A. Shevchenko, H. Boucherie, and M. Mann. 1996. Linking genome and proteome by mass spectrometry: large-scale identification of yeast proteins from two dimensional gels. Proc. Natl. Acad. Sci. USA 93:14440-14445. |
| 26. | Thomas, B. J., and R. Rothstein. 1989. Elevated recombination rates in transcriptionally active DNA. Cell 56:619-630. |
| 26a. | Tyers, M., and B. Futcher. Unpublished data. |
| 27. | Velculescu, V. E., L. Zhang, W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett, Jr., P. Hieter, B. Vogelstein, and K. W. Kinzler. 1997. Characterization of the yeast transcriptome. Cell 88:243-251. |
| 28. | Warner, J. 1991. Labeling of RNA and phosphoproteins in S. cerevisiae. Methods Enzymol. 194:423-428. |
| 29. | Wills, C. 1976. Production of yeast alcohol dehydrogenase isoenzymes by selection. Nature 261:26-29. |
| 29a. | Wodicka, L. Personal communication. |
| 29b. | Wodicka, L. Unpublished data. |
| 30. | Wodicka, L., H. Dong, M. Mittmann, M.-H. Ho, and D. J. Lockhart. 1997. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat. Biotechnol. 15:1359-1367. |