A Proteomics Search Algorithm Specifically Designed for High-Resolution MS/MS

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Mon Jun 24, 2013 9:53 am

Hi Amit,

I just released a new version (available at https://sourceforge.net/projects/morpheus-ms/files/) which is more tolerant of mzML files that are missing information that is normally present. I tested it and it worked on an MGF file converted to mzML with ProteoWizard msconvert.

I may add native MGF support some time in the future. In the meantime sorry for the awkward workaround.

-Craig

aky
Albumin Member
Posts: 91
Joined: Sat Sep 10, 2011 2:33 pm

Postby aky » Fri Jun 28, 2013 2:01 am

Hi Craig,

Thanks for the fix. I was able to run it. How do I select PSMs at, say, 5% FDR? Does it always assume that a concatenated target-decoy search is what we want?

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Fri Jun 28, 2013 8:29 am

Hi Amit,

There is a "maximum FDR" box which defaults to 1% but you can change it to 5% (note this only affects the ID numbers in the log output; all PSMs are always reported in the PSMs TSV output and can be filtered by q-value ≤ 5% in Excel). The program is designed assuming most people will want to do concatenated target–decoy searches, but you can do separate searches as well. Just make sure the "create target–decoy database on the fly" box is unchecked.
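
For anyone who would rather script the q-value filter than do it in Excel, here is a minimal sketch. It is not part of Morpheus; the column names ("Peptide", "Morpheus Score", "Q-Value (%)") and the rows are assumptions for illustration, so check the header of your own PSMs TSV file and adjust `q_column` accordingly.

```python
import csv
import io


def filter_psms(tsv_text, max_q_percent=5.0, q_column="Q-Value (%)"):
    """Return the PSM rows whose q-value (in percent) is <= max_q_percent.

    q_column is an assumed column name; change it to match your file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[q_column]) <= max_q_percent]


# Tiny made-up table standing in for a real PSMs TSV file.
example = (
    "Peptide\tMorpheus Score\tQ-Value (%)\n"
    "PEPTIDEK\t18.4\t0.0\n"
    "SAMPLER\t9.2\t4.8\n"
    "NOISEK\t5.1\t12.5\n"
)

kept = filter_psms(example, max_q_percent=5.0)
print([row["Peptide"] for row in kept])
```

For a real file, read its contents with `open(path, newline="").read()` and pass the text in. If the file also flags decoy hits, you would probably want to drop those rows as well before counting identifications.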

-Craig

Zhang
Angiotensin Member
Posts: 27
Joined: Fri Aug 24, 2012 8:08 am

Postby Zhang » Tue Jul 02, 2013 5:07 am

I was shocked the first time I ran Morpheus. It's really simple and fast!!
It took less than 10 minutes to search a single 3.5 GB Q Exactive raw file of human LC-MS/MS data, and it identified more than 5000 protein groups at 1% FDR! I am curious how it can be that fast, even faster than copying my raw files over Ethernet. Could it be integrated on the server end for real-time identification?

I would be happy to see support for zipped mzML files, a setting to control the minimum peptide length, and a user manual.

Anyway, thank you Craig, great work!

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Tue Jul 02, 2013 9:01 pm

Hi Zhang,

Thanks for your kind words. 5000 human protein groups at 1% FDR in a single run is really impressive!

The speed of Morpheus is a result of two things: (1) the simplicity of the entire program and (2) multithreading of the search (and, in the latest version, of spectral preprocessing) across all the available processor cores. Your question about real-time identification is right on target: a paper describing instant sequence confirmation (inSeq), which uses a variant of the Morpheus scoring, was published last year in PNAS. This enables a variety of really interesting workflows.

Your three suggestions are excellent; I will add them to my list.

-Craig

odedk
Phosphoserine Member
Posts: 10
Joined: Tue Feb 05, 2013 12:59 pm

Postby odedk » Sun Aug 18, 2013 11:18 pm

Hi Craig,
I ran into an odd problem with Morpheus (btw, thanks again, I use it a lot).
When I try to search against an Arabidopsis thaliana FASTA file, the option "Create a Target-Decoy Database on the Fly" is automatically unchecked, and if I enable it manually and start a run, I get an error message saying something like:
"System.ArgumentException E:\dbase\ARATH.fasta contains decoy proteins; database should not contain decoy proteins when "create target-decoy database on the fly" option is enabled at Morpheus.ProteinFastaReader.CountProteins(FileStream proteinFastaDatabase, Boolean onTheFlyDecoys, Int32& targetProteins, Int32& decoyProteins, Int32& onTheFlyDecoyProteins) at Morpheus.DatabaseSearcher.Search()"

I tried it with FASTA files from UniProt and also from TAIR (arabidopsis.org) and still got a similar error.
Is there any quick solution?
Thanks,
Oded

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Mon Aug 19, 2013 7:17 am

Hi Oded,

This means your database has "DECOY" in at least one of the protein descriptions already. Could you search the FASTAs with a text editor and confirm if this is the case and if so, what protein(s)? If you have generated a concatenated target–decoy database already you will want to keep the "on the fly" option unchecked, but it doesn't sound like that is the case.
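
If a text editor struggles with a large FASTA file, the same check can be scripted. This is only a sketch of the idea, not what Morpheus does internally; the function name, the case-sensitive substring match and the example entries are all assumptions made for illustration.

```python
def find_headers_with(fasta_lines, marker="DECOY"):
    """Return the FASTA header lines (starting with '>') that contain marker.

    The match is a plain, case-sensitive substring test (an assumption).
    """
    hits = []
    for line in fasta_lines:
        if line.startswith(">") and marker in line:
            hits.append(line.rstrip("\r\n"))
    return hits


# Made-up entries standing in for a real database.
example = [
    ">example1 ordinary protein (made-up entry)\n",
    "MKTAYIAK\n",
    ">example2 protein DECOY (made-up entry)\n",
    "MSLLNK\n",
]

print(find_headers_with(example))

# For a real file, iterate over the open handle instead:
# with open("ARATH.fasta") as handle:
#     print(find_headers_with(handle))
```

Only header lines are inspected, so a sequence that happens to spell out the marker would not be reported.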

-Craig

odedk
Phosphoserine Member
Posts: 10
Joined: Tue Feb 05, 2013 12:59 pm

Postby odedk » Mon Aug 19, 2013 6:50 pm

Hi Craig,
Yep. Apparently there are 2 protein sequences in the Arabidopsis FASTA whose genes are called "DECOY".
Thanks,
Oded

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Tue Aug 20, 2013 7:35 am

Morpheus has been modified to use "DECOY_" as the decoy identifier rather than "DECOY" to account for the Arabidopsis gene. You can get the latest versions at http://sourceforge.net/projects/morpheus-ms/files/.

odedk
Phosphoserine Member
Posts: 10
Joined: Tue Feb 05, 2013 12:59 pm

Postby odedk » Tue Aug 20, 2013 5:25 pm

Thanks, that works well. We now have another problem (actually, we have seen it before too): when we use a large database like the entire UniProt or NCBI, we get a System.AggregateException error (see attached Morpheus_error.PNG).
Is there a way to overcome it?
Thanks again.

aky
Albumin Member
Posts: 91
Joined: Sat Sep 10, 2011 2:33 pm

Postby aky » Tue Aug 20, 2013 7:57 pm

Hi Craig,

I have a question about the scan title. I converted MGF files to mzML using msconvert. Morpheus returns the scan identity as integers like scan 1, scan 2, and so on. Although I could map the scan titles using the combination of m/z, RT and charge state information, that seems like overkill for a simple job. Is it possible to include the scan titles in the TSV output?

Zhang
Angiotensin Member
Posts: 27
Joined: Fri Aug 24, 2012 8:08 am

Postby Zhang » Wed Aug 21, 2013 8:12 am

I also ran into a problem with conversion. It concerns the RT, or "scan time".
In the MGF format, the parameter "RTINSECONDS=" is given in seconds. But when the file is converted to mzML and searched by Morpheus, the value is treated as minutes, so the retention time written to the pep.xml result is 60x too large.

I suppose it is a bug in your source code, in Exporters.cs:
output.WriteAttributeString("retention_time_sec", TimeSpan.FromMinutes(psm.Spectrum.RetentionTime).TotalSeconds.ToString());
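
The unit issue can be illustrated outside Morpheus. In mzML the scan start time is stored together with an explicit unit (typically "minute" or "second"), so a reader can convert based on that unit instead of assuming minutes. The sketch below is Python rather than the C# of Morpheus, and the function name is hypothetical; it is not the fix that was applied in Exporters.cs.

```python
def retention_time_seconds(value, unit):
    """Convert a retention time to seconds using its declared unit.

    unit is the unit name read alongside the value, e.g. "minute" or "second".
    """
    if unit == "minute":
        return value * 60.0
    if unit == "second":
        return value
    raise ValueError(f"unexpected retention time unit: {unit!r}")


# A value that came from RTINSECONDS=120.0 keeps its meaning when the
# unit is respected ...
print(retention_time_seconds(120.0, "second"))
# ... and is inflated 60x if it is wrongly treated as minutes.
print(retention_time_seconds(120.0, "minute"))
```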

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Wed Aug 21, 2013 9:39 am

Hi Zhang,

Thanks for noticing that. I had assumed retention times in mzML were always in minutes. This has been fixed: http://sourceforge.net/projects/morpheus-ms/files/.

-Craig

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Tue Aug 27, 2013 9:28 pm

odedk wrote: Thanks, that works well. We now have another problem (actually, we have seen it before too): when we use a large database like the entire UniProt or NCBI, we get a System.AggregateException error (see attached).
Is there a way to overcome it?
Thanks again.


This was caused by empty protein sequences. I added some logic to deal with these cases.
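
If you want to check your own database for such entries, a small script can list the headers that have no residues after them. This is a sketch under my own assumptions (the function name and example entries are made up) and says nothing about how Morpheus now handles these cases.

```python
def find_empty_entries(fasta_lines):
    """Return the headers of FASTA entries that contain no sequence residues."""
    empty = []
    header = None
    length = 0
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header closes the previous entry; record it if it was empty.
            if header is not None and length == 0:
                empty.append(header)
            header = line
            length = 0
        elif header is not None:
            length += len(line)
    # The last entry has no following header, so check it separately.
    if header is not None and length == 0:
        empty.append(header)
    return empty


# Made-up entries: b, c and e have no sequence.
example = [">a", "MK", ">b", "", ">c", ">d", "AA", ">e"]
print(find_empty_entries(example))
```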

Dmitry
Proton Member
Posts: 1
Joined: Thu Oct 17, 2013 1:35 pm

Corrupted modification.tsv file

Postby Dmitry » Thu Oct 17, 2013 1:46 pm

Hi Craig!

I cannot use Morpheus. Upon initialization the program gives the message "Your modifications file ... is likely corrupt. Please correct it." I have Morpheus installed on Windows 7 x64 (the x32 copy is also installed). The problem seems to be in the lines describing modifications: the program starts normally with a file containing only the header. What can be wrong with the descriptions?

Best regards,
Dmitry

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Thu Oct 17, 2013 4:33 pm

Hi Dmitry,

Can you send me your modifications.tsv file?

-Craig

moushutic
Proton Member
Posts: 1
Joined: Tue Oct 22, 2013 6:05 am

Postby moushutic » Tue Oct 22, 2013 6:26 am

Dear Craig,
We have the same problem with the TSV file as Dmitry. We recently downloaded Morpheus and Thermo MSFileReader from http://morpheus-ms.sourceforge.net/. There is also no opportunity to insert the modifications into the search form.
Best regards, Maria

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Wed Nov 06, 2013 9:57 am

For anybody who has been getting the error message "your modifications file is likely corrupt" and has their computer set to use commas instead of points as the decimal mark: the new version of Morpheus (http://sourceforge.net/projects/morpheus-ms/files/) resolves this.

AngelPDiz
Glycine Member
Posts: 5
Joined: Mon Feb 24, 2014 3:53 am

Postby AngelPDiz » Mon Feb 24, 2014 8:01 am

Congratulations to you and your colleague on this fantastic work, and thank you for making it freely available.

I have a couple of queries.

1. Is it possible to get p-values (and set a threshold, e.g. p<0.05) instead of q-values (i.e., without applying the multiple-testing correction) for PSMs and protein groups?

2. Could you provide a clear definition of PSM? I understand it as the number of spectra that match one (or several?) peptides; is that correct? How exactly is the number of PSMs calculated in the protein groups output file? I would appreciate it if you could explain this point by analogy with the Mascot output, i.e. "Matches: total (significant)". I guess "distinct peptides (sign)" is equivalent to unique peptides in Morpheus.

Sorry for the basic questions. I am not an expert in MS analysis, but I am very keen to fully understand what is going on in this type of analysis so that I can play around with my own data.

Cheers,

Angel

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Mon Feb 24, 2014 4:39 pm

Hi Angel,

1. Currently this isn't possible. I have considered adding a simple p-value calculation similar to the one in Andromeda but can't say if and when that will be done. Is there a reason you need p-values? I personally consider target/decoy-calculated q-values to be more accurate because they are empirical.

2. PSM is a peptide–spectrum match, i.e. one MS/MS spectrum that is identified as a particular peptide. Unique peptides are then those PSMs reduced to only the best-scoring PSM per unique peptide sequence. Note that in some places "PSMs" means all PSMs, whereas in other places it means only the PSMs that pass your FDR threshold, e.g. in the protein groups output. I believe your interpretation is correct that total matches in Mascot is equivalent to PSMs in Morpheus.
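
The reduction from PSMs to unique peptides described in point 2 can be sketched in a few lines. This is an illustration with made-up sequences and scores, not the Morpheus code; the function name is hypothetical.

```python
def best_psm_per_peptide(psms):
    """Keep only the best-scoring PSM for each peptide sequence.

    psms is an iterable of (peptide_sequence, score) pairs.
    Returns a dict mapping each sequence to its best score.
    """
    best = {}
    for sequence, score in psms:
        if sequence not in best or score > best[sequence]:
            best[sequence] = score
    return best


# Five made-up PSMs that collapse to three unique peptides.
psms = [
    ("PEPTIDEK", 12.3),
    ("PEPTIDEK", 18.4),
    ("SAMPLER", 9.2),
    ("SAMPLER", 7.0),
    ("ANOTHERK", 11.1),
]

unique = best_psm_per_peptide(psms)
print(len(psms), "PSMs ->", len(unique), "unique peptides")
```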

-Craig

AngelPDiz
Glycine Member
Posts: 5
Joined: Mon Feb 24, 2014 3:53 am

Postby AngelPDiz » Tue Feb 25, 2014 4:19 am

Thanks Craig for your prompt and useful answer.

I am not yet clear on this issue (related to protein identification), but I was wondering whether applying different multiple hypothesis testing methods to the list of p-values could produce different results in terms of protein identifications (e.g., the same protein being covered by more PSMs and/or unique peptides, or a newly matched protein appearing in the protein group list). In other words, using a less stringent (more powerful) correction method might increase the number or quality of protein identifications (this might be desirable, but only under some circumstances). For example, when you are analysing MS/MS data from one protein spot excised from a 2-DE gel (I know this is old-fashioned, but it still works and a lot of people use this technique), the situation might be quite different from when you try to infer protein groups from an initial mix of tens of thousands of peptides from thousands of proteins (i.e. a shotgun proteomics experiment). Of course, to be less stringent in Morpheus, an alternative is to use a higher q-value cutoff.

Generally speaking, I like the decoy approach for the reasons you give, but I have also read some criticisms suggesting that this approach is not suitable under certain circumstances. I still need to study this issue in more detail and play around with some data under different scenarios in order to reach proper conclusions.

I have found it quite useful to know the equivalence between a score threshold (e.g. as with Mascot scores) and the p-value threshold (i.e. the type I error) set by the user. The score threshold for a fixed type I error (e.g. 0.05) seems to depend on the size of the database searched, among other parameters, as far as I know. I do not think this equivalence can be obtained with the Morpheus score (please correct me if I am wrong). I do not know how useful this information might be to other people; maybe not very, for most.

Cheers,

Angel

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Tue Feb 25, 2014 2:37 pm

Hi Angel,

Just to clarify, q-values in proteomics don't exactly correspond to the q-values you see elsewhere because there is no explicit multiple hypothesis testing correction. They are just empirical FDRs. There are some problems with target/decoy FDR, but overall I consider it more reliable than trying to calculate accurate p-values for PSMs. To be less stringent I would suggest just using a higher FDR, as you mentioned.

-Craig

AngelPDiz
Glycine Member
Posts: 5
Joined: Mon Feb 24, 2014 3:53 am

Postby AngelPDiz » Wed Feb 26, 2014 2:26 am

Thanks for the clarification, Craig. Actually, I had been trying to figure out how it is possible to calculate q-values in Morpheus without a list of p-values, since q-value calculation takes into account the distribution of p-values from all tests. This might be confusing to people. Anyway, thanks for the feedback and, as you say, we can always choose a less stringent (higher) FDR value if needed.
Cheers,
Angel

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Postby Craig » Wed Feb 26, 2014 9:12 am

Hi Angel,

Q-values in proteomics are usually calculated with the target/decoy database strategy. It is an empirical way of determining FDR that need not start from p-values. There are good reviews about it by Elias/Gygi here and Käll/Storey/MacCoss/Noble here and here. Last but not least we have a SharedProteomics tutorial about it here!
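
To make the "no p-values needed" point concrete, here is a minimal sketch of one common way to turn a concatenated target/decoy search into q-values: sort PSMs by score, estimate the FDR at each rank as cumulative decoys divided by cumulative targets, then take the lowest FDR found at that rank or any lower-scoring rank. The exact estimator Morpheus uses is not stated in this thread, so treat the formula, the function name and the scores below as assumptions.

```python
def target_decoy_q_values(psms):
    """Compute empirical q-values from (score, is_decoy) pairs.

    Returns a list of (score, is_decoy, q_value), sorted by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)

    # FDR estimate at each rank: cumulative decoys / cumulative targets.
    targets = 0
    decoys = 0
    fdrs = []
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / targets if targets else 1.0)

    # q-value: the smallest FDR at this rank or any lower-scoring rank.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


# Made-up scores; True marks a decoy hit.
example = [(20, False), (18, False), (15, True), (14, False), (12, False), (10, True)]
for score, is_decoy, q in target_decoy_q_values(example):
    print(score, "decoy" if is_decoy else "target", q)
```

Filtering at, say, 1% FDR then means keeping the target PSMs whose q-value is at most 0.01.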

-Craig

