Target–decoy searching

Target–decoy searching is the most popular way to validate peptide–spectrum matches (PSMs) identified by liquid chromatography–tandem mass spectrometry (LC–MS/MS). It’s not the only approach, but it has been the most widely accepted one for the past several years. Let’s start with a little background.

Mass spectrometry–based proteomics emerged in the early 1990s, and the first search algorithm, Sequest, showed up around that time. (In fact, you could argue that Sequest drove the early development of proteomics rather than the other way around, but I digress.) Sequest scores PSMs based on the correlation between experimental and theoretical tandem mass spectra. Protein sequence databases are digested in silico to yield peptides, for which theoretical product ions are calculated and matched to those observed. The main Sequest score is called XCorr, where a higher value indicates a better match. But how high does the XCorr need to be before the PSM should be accepted? Since the XCorr has no probabilistic interpretation (i.e., you can’t set a p-value threshold of 0.01 for a 1% error rate, and even if you could, it might not be that reliable), manual validation is required.

Validating PSMs by hand leads to a number of problems. First, it’s time-consuming. Thousands of tandem mass spectra can now be acquired per hour by LC–MS/MS, and verifying all or even a significant subset of them manually would be arduous. Second, it’s inconsistent. People will invariably have different standards for what constitutes a good match.

To solve these problems, somebody came up with the idea of searching a decoy database, i.e., a database of sequences known to be incorrect. You might wonder, why would you possibly want to search against wrong answers? Because it tells you what scores for wrong answers look like. This is incredibly powerful information.

Decoy PSMs serve as a surrogate for incorrect target PSMs, which are what we really care about. This allows us to calculate the false discovery rate (FDR): the number of false positives over total positives (where total positives is just true positives plus false positives). Note that this is different from the more-familiar-sounding false positive rate, which is false positives over total negatives (where total negatives is the sum of false positives and true negatives).
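To see how far apart the two rates can be, here’s a minimal sketch in Python. The function names and the counts are made up purely for illustration; they don’t come from any real search.

```python
def false_discovery_rate(false_pos: int, true_pos: int) -> float:
    """False positives over total positives (true + false positives)."""
    return false_pos / (false_pos + true_pos)


def false_positive_rate(false_pos: int, true_neg: int) -> float:
    """False positives over total negatives (false positives + true negatives)."""
    return false_pos / (false_pos + true_neg)


# Hypothetical counts, chosen only to illustrate the arithmetic.
true_pos, false_pos, true_neg = 90, 10, 900

fdr = false_discovery_rate(false_pos, true_pos)  # 10 / 100
fpr = false_positive_rate(false_pos, true_neg)   # 10 / 910
print(f"FDR = {fdr:.3f}, FPR = {fpr:.3f}")
```

Same ten false positives, but the FDR asks “how much of what I accepted is wrong?” while the false positive rate asks “how many of the wrong answers did I let through?”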

In its most basic form, after getting your search results, you simply pick a PSM score threshold and count the number of target and decoy matches that exceed it. The underlying assumption is that an incorrect match is equally likely to land on a target or a decoy sequence, so for every decoy PSM (which is definitely incorrect) you expect, on average, one incorrect target PSM at a similar score. So to estimate the FDR, you simply divide the number of decoy PSMs (standing in for the incorrect target PSMs) by the number of target PSMs. If you want to get a certain FDR, say 1%, you just adjust the score threshold until you get as close to it as possible.
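Here’s a rough sketch of that calculation in Python. It assumes each PSM is just a (score, is_decoy) pair where higher scores are better, counts matches at or above the threshold, and reads “as close as possible” as “the most permissive threshold that doesn’t go over the target FDR.” The names and the toy scores are hypothetical, not taken from any particular search engine.

```python
from typing import List, Optional, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy); hypothetical minimal representation


def fdr_at_threshold(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR as decoy PSMs / target PSMs among matches scoring >= threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        # No target PSMs pass: the ratio is undefined if any decoys did pass.
        return float("inf") if decoys else 0.0
    return decoys / targets


def lowest_threshold_for_fdr(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Return the lowest observed score usable as a threshold with FDR <= max_fdr."""
    for candidate in sorted({score for score, _ in psms}):
        if fdr_at_threshold(psms, candidate) <= max_fdr:
            return candidate
    return None  # no threshold meets the requested FDR


# Toy search results, invented for illustration only.
psms = [(5.0, False), (4.5, False), (4.0, False), (3.5, True),
        (3.0, False), (2.5, True), (2.0, False)]

print(fdr_at_threshold(psms, 3.0))           # 1 decoy / 4 targets
print(lowest_threshold_for_fdr(psms, 0.25))
```

Note that the estimated FDR doesn’t have to fall smoothly as the threshold rises, which is why the sketch checks every observed score instead of stopping at the first failure.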

It turns out there is an easy way to generate decoy databases: simply reverse the protein sequences. Reversal also yields virtually the same number of target and decoy peptides, which means the FDR calculation doesn’t need to be corrected for bias. There is some overlap between target and decoy sequences, but luckily these are mostly short peptides that are hard to identify by mass spectrometry anyway, so it’s not a major concern. You can either search separate target and decoy databases, or a single concatenated database. You get pretty much the same results, but the concatenated approach is a lot easier, so that’s what most people do.
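Building a concatenated database by reversal takes only a few lines. This sketch assumes the proteins are already in a dictionary mapping an identifier to its sequence; the `DECOY_` prefix, the function names, and the example entry are all hypothetical choices for illustration, and real tools have their own naming conventions.

```python
from typing import Dict


def reverse_sequence(sequence: str) -> str:
    """Make a decoy by reversing the full protein sequence."""
    return sequence[::-1]


def concatenate_target_decoy(
    targets: Dict[str, str], decoy_prefix: str = "DECOY_"
) -> Dict[str, str]:
    """Return one database holding every target plus its reversed decoy."""
    database = dict(targets)  # copy so the input is left untouched
    for protein_id, sequence in targets.items():
        # The prefix lets you tell decoy hits from target hits after the search.
        database[decoy_prefix + protein_id] = reverse_sequence(sequence)
    return database


# Made-up protein entry, for illustration only.
targets = {"P1": "MKRPAK"}
print(concatenate_target_decoy(targets))
```

After searching the combined database, counting hits whose identifier starts with the prefix gives you the decoy count needed for the FDR estimate.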