This method of identification is much more reliable than using
fingerprints based on PAGE migration patterns or HPLC retention
times. However, peptide mass fingerprinting is limited to the
identification of proteins for which sequences are already known;
it is not a method of structural elucidation.
An enzyme of low specificity, which digests proteins to a mixture
of very short peptides, is not a good choice, because almost any given three- or four-residue
peptide will be found in many database entries. The longer the peptide, the
greater the specificity. A further consideration
for MALDI analysis is that the low mass region, below ~500 Da,
is obscured by the presence of matrix peaks.
In general, it is best to use enzymes of specificity equal
to or greater than that of trypsin.
Setting the number of allowed missed cleavage sites to zero simulates a limit digest.
If you are confident that your digest is perfect, with no partial fragments present,
this will give maximum discrimination and the highest score.
If experience shows that your digest mixtures usually include some partials,
that is, peptides with missed cleavage sites, you should choose a setting of 1, or
maybe 2 missed cleavage sites. Don't specify a higher number without good reason,
because each additional level of missed cleavages increases the number
of calculated peptide masses to be matched against the experimental data.
If the actual digest does not contain extended
partials, this simply increases the number of random matches, and so reduces discrimination.
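To illustrate why each extra level of missed cleavages enlarges the list of calculated peptides, here is a minimal sketch of an in silico tryptic digest. It assumes the simplified rule "cleave after K or R unless the next residue is P"; the function name and the example sequence are hypothetical and are not taken from Mascot.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides for a simplified trypsin rule (cut after K/R, not before P)."""
    if not sequence:
        return []
    # Positions at which the chain is cut, including both termini.
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(sequence))

    peptides = []
    for start in range(len(cuts) - 1):
        # extra = number of cleavage sites skipped inside this peptide
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end < len(cuts):
                peptides.append(sequence[cuts[start]:cuts[end]])
    return peptides


# Hypothetical sequence, for illustration only.
seq = "MAGKSTLRVNEKPWYRGG"
for allowed in (0, 1, 2):
    print(allowed, len(tryptic_digest(seq, allowed)))  # 4, 7 and 9 peptides
```

Even for this short sequence the candidate list more than doubles between 0 and 2 missed cleavages, and every added candidate is another opportunity for a random match.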
Select experimental mass values that are large enough to offer
good discrimination, yet not so large as to be likely to be extended
partials. A good mass range for trypsin is 1000 to 3500 Da.
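As a rough sketch of this selection step, a peak list can be trimmed to the suggested range before it is submitted. The function name and the peak values below are hypothetical; the default limits are the 1000 to 3500 Da range quoted above for trypsin.

```python
def select_masses(masses, low=1000.0, high=3500.0):
    """Keep only mass values inside the inclusive range [low, high], sorted ascending."""
    return sorted(m for m in masses if low <= m <= high)


# Hypothetical peak list, for illustration only.
peaks = [450.2, 1045.6, 2211.1, 4100.9]
print(select_masses(peaks))  # [1045.6, 2211.1]
```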
If you have misgivings about an experimental mass value, then
it is best to leave it out. An example would be a peak which is
broader than the others, indicating that it may be an unresolved doublet.
Imagine a tryptic digest of a 20 kDa protein. We would expect something around
20 perfect-cleavage peptides. If the digest was incomplete,
or there was a non-quantitative modification, we might expect to double the number of peptides.
If 100 peaks are taken from the mass spectrum of this digest and submitted to Mascot, then
either 60 to 80 peaks are noise or there are extensive non-quantitative modifications.
Either possibility is bad news for search specificity.
Autolytic Peptide Masses
For low level digests, it can be useful to screen the experimental
data for enzyme autolysis fragments.
Be generous in setting the peptide
mass tolerance. If an experimental mass falls just outside
the allowed window, then it contributes nothing towards the score.
However, remember that the number of spurious matches, and the
search time, increase with the size of the error window.
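The trade-off can be sketched with a simple window match. This is not Mascot's scoring, only a count of experimental masses that fall within the tolerance of at least one calculated mass; the function name, the `unit` parameter and all mass values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_matches(experimental, calculated, tolerance, unit="Da"):
    """Count experimental masses lying within +/- tolerance of any calculated mass."""
    calculated = sorted(calculated)
    matched = 0
    for mass in experimental:
        # A ppm tolerance scales with the mass; a Da tolerance is absolute.
        window = tolerance if unit == "Da" else mass * tolerance * 1e-6
        lo = bisect_left(calculated, mass - window)
        hi = bisect_right(calculated, mass + window)
        if hi > lo:  # at least one calculated mass inside the window
            matched += 1
    return matched


# Hypothetical values, for illustration only.
experimental = [1000.50, 1500.80, 2000.10]
calculated = [1000.45, 1500.00, 2000.60, 2500.00]
for tol in (0.1, 0.6, 1.0):
    print(tol, count_matches(experimental, calculated, tol))  # 1, 2 and 3 matches
```

A window that is too narrow loses the genuine match entirely, while each widening step admits more candidates, some of which will be coincidental.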
With Mascot 2.2 and later, if intensity information is supplied,
Mascot will attempt to use this to discriminate against noise peaks.
However, this is not a substitute for having a high
quality peak list.
Supplying a protein molecular weight to some search engines can be risky, because many of
the sequence database entries are for the least processed form of a protein.
For example, the SwissProt entry for bovine insulin, INS_BOVIN, is actually
the sequence of the precursor protein including signal and connecting
peptides. This adds up to a molecular weight of 11,394 Da, so
that a search based too tightly around an experimental measurement
of the molecular weight of this protein (5734 Da) would fail to
find a correct match.
This is not a problem with Mascot, because the protein molecular
weight is applied as a sliding window. That is, for each database
entry, Mascot looks for the highest scoring set of peptide matches
which are within a contiguous stretch of sequence less than or
equal to the specified protein molecular weight.
This will often be less than
the mass of the entire sequence entry (unless the data set happens to include
both the N-terminal and C-terminal peptides).
Consequently, if you specify a value for the protein molecular weight,
this acts only as a ceiling. Not only will you see smaller proteins on
the hit list, you will also see larger ones, but all of the reported matches will
be within a stretch of sequence less than or equal to the specified mass.
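The sliding-window idea can be sketched as follows. This is a simplified illustration, not Mascot's algorithm: it scores a window by the number of matched peptides it contains rather than by a probability-based score, it takes per-residue masses as an input, and the example uses a hypothetical uniform 110 Da per residue. The function name and all data are assumptions.

```python
from itertools import accumulate

WATER = 18.015  # approximate average mass of H2O, added once per free chain


def best_window(matches, residue_masses, mass_ceiling):
    """Find the contiguous stretch, no heavier than mass_ceiling, holding most matches.

    matches: (start, end) half-open residue ranges of matched peptides.
    Returns (match_count, (start, end)), or (0, None) if no stretch fits.
    """
    prefix = [0.0] + list(accumulate(residue_masses))
    starts = sorted({s for s, _ in matches})
    ends = sorted({e for _, e in matches})
    best = (0, None)
    for s in starts:
        for e in ends:
            if e <= s:
                continue
            if prefix[e] - prefix[s] + WATER > mass_ceiling:
                break  # ends are ascending, so later stretches are heavier still
            count = sum(1 for ms, me in matches if ms >= s and me <= e)
            if count > best[0]:
                best = (count, (s, e))
    return best


# Hypothetical 1000-residue entry (about 110 kDa) with five matched peptides.
masses = [110.0] * 1000
matches = [(10, 20), (30, 45), (50, 62), (400, 410), (900, 915)]
print(best_window(matches, masses, 10_000.0))   # (3, (10, 62))
print(best_window(matches, masses, 200_000.0))  # (5, (10, 915))
```

With a 10 kDa ceiling the large entry still appears, but only the three matches clustered within one short stretch count towards it.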
Confidence in a peptide mass fingerprint result may come from having independent supporting
evidence. For example, if the analyte originated from
a spot at approximately 40 kDa on a 2D gel separation of yeast proteins,
then the anticipated result of a peptide mass fingerprint is a 40 kDa yeast protein. If the
top scoring protein fits this expectation, the search is deemed "successful".
If the top scoring match is a 200 kDa protein from a different species, the
initial reaction is likely to be that the search has "failed".
While this is a reasonable approach, Mascot provides additional guidance in the
form of a significance level. By default, the significance threshold is set at 5%. That is,
if the score for a particular match exceeds the threshold score, there is less than
a 1 in 20 chance that the observed match is a random event.
If the score is substantially above the threshold, look carefully
before dismissing the result as spurious. Conversely,
if the score is below the threshold, examine the match sceptically.
In most cases, there is prior knowledge of the origin of a sample, so
it is only natural to look for matches to proteins from a particular species
or kingdom. While a Mascot search can be restricted to a particular species,
the taxonomy filter should be used with care:
- Many sequence databases do not provide species information in a systematic
and rigorous form.
- Contaminants can never be ruled out, and could come from any species, e.g.
BSA or keratins.
- Unless the genome of the species of interest is completely sequenced, there
is no guarantee that the true sequence of the analyte protein is actually present in
the database. If it is missing, then high scoring matches from other species
are of interest because they are likely to be homologous to the unknown.
It is the uncertainty in the mass of the intact protein which is the Achilles
heel of a peptide mass fingerprint. This uncertainty is unavoidable, even when
an accurate experimental mass for the intact protein is available, because it is
unlikely that the mass of the expressed and processed protein will be exactly the
same as that of the sequence entry in the protein database. A peptide mass fingerprint
can only provide the statistically most probable identification. This is a great step
forward over simply counting peptide mass matches, which can only work
when a ceiling is placed on the intact protein mass. Otherwise, the mega-proteins
always come out top of the list due to random matches. Unfortunately, even
with an ideal scoring algorithm, there may be insufficient matching
mass values for a confident identification without making assumptions about
the intact protein mass or the species.
One method of improving the specificity of a peptide mass fingerprint was
first proposed by Peter James [James, 1994]. Simply do
additional digests using different proteases. Seeing the same protein with
a high score in two independent digests provides a similar degree of confidence
to seeing multiple peptide matches in an MS/MS ions search.
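A minimal sketch of this cross-check: given the hit lists from two searches, keep only the proteins that score above the threshold in both. The function name, accession strings, scores and threshold below are all hypothetical; real thresholds depend on the search and the database.

```python
def confirmed_in_both(hits_a, hits_b, threshold):
    """Return accessions whose score exceeds threshold in both digests' hit lists."""
    above_a = {acc for acc, score in hits_a.items() if score > threshold}
    above_b = {acc for acc, score in hits_b.items() if score > threshold}
    return sorted(above_a & above_b)


# Hypothetical results from searches of two digests with different proteases.
digest_1 = {"PROT_A": 95, "PROT_B": 40, "PROT_C": 72}
digest_2 = {"PROT_A": 81, "PROT_C": 35, "PROT_D": 90}
print(confirmed_in_both(digest_1, digest_2, threshold=60))  # ['PROT_A']
```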