Interview – Mike MacCoss (part 2)

Click here to read part 1 of the interview

SP – The accuracy and reliability of mass spectrometry are still debated because of the technique’s dependence on the technical prowess of the mass spectrometrist. Is this overcome in SRM, or do we need a highly skilled mass spectrometrist for that as well? Or can we approach the level of automation in experimental protocols seen in DNA or RNA sequencing?


Mike – I think the cultural aspects of proteomics need to change. Our efforts have been focused on the identification of peptides, modifications, etc. — basically the generation of lists. In most other applications of mass spectrometry, those fields have focused on the generation and validation of assays, not lists. It is hard to transfer an assay from one lab to another if the figures of merit are things like peptide IDs. I believe that things will get better once we start to focus more on the classical figures of merit like LOD, LOQ, CV, chromatographic reproducibility, etc. There are groups working towards this, but we need more.
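To make those classical figures of merit concrete, here is a minimal Python sketch (the replicate areas and calibration points are invented for illustration only) that computes the CV of replicate peak areas and estimates LOD and LOQ from a calibration curve using the common 3.3σ/slope and 10σ/slope conventions.

```python
# Minimal sketch of classical assay figures of merit (CV, LOD, LOQ).
# All numbers below are invented for illustration only.
import numpy as np

# Replicate peak areas for one peptide transition (hypothetical values)
replicate_areas = np.array([1.02e6, 9.7e5, 1.05e6, 9.9e5, 1.01e6])
cv_percent = 100.0 * replicate_areas.std(ddof=1) / replicate_areas.mean()

# Calibration curve: spiked amount (fmol on column) vs. measured peak area
conc = np.array([0.5, 1, 5, 10, 50, 100])                     # hypothetical
area = np.array([4.8e4, 1.1e5, 5.2e5, 1.0e6, 5.1e6, 9.9e6])   # hypothetical

slope, intercept = np.polyfit(conc, area, 1)
residuals = area - (slope * conc + intercept)
sigma = residuals.std(ddof=2)          # residual standard deviation of the fit

# Common ICH-style estimates: LOD = 3.3*sigma/slope, LOQ = 10*sigma/slope
lod = 3.3 * sigma / slope
loq = 10.0 * sigma / slope

print(f"CV  = {cv_percent:.1f}%")
print(f"LOD ≈ {lod:.2f} fmol, LOQ ≈ {loq:.2f} fmol")
```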

There are labs that have absolutely incredible amounts of automation and throughput. I just visited LabCorp and saw a room full of mass spectrometers, each running 2,000 samples per instrument per day! Interestingly, when we switch to making peptide measurements by SRM, our challenges are rarely because of the mass spectrometry. Instead, they are almost exclusively due to the chromatography. The quality of our chromatography is, and will continue to be, the biggest challenge to making quantitative proteomics truly high throughput and robust. It is possible to get identifications by data-dependent acquisition with poor chromatography, but you will not be able to get good quantitative data by SRM if you do not have good chromatographic resolution, peak shape, and reproducibility.


SP – One problem that has plagued the proteomics community is the lack of a well-organized and easily queryable repository for proteomic data sets. You are part of an initiative to develop a data sharing system. Can you tell us about your vision? How will it be funded? Will it be free to deposit data there? What is the plan to deal with different and ever-changing raw data types? And will it be queryable? For example, can a user pull up all phosphopeptide analyses of HeLa cells?


Mike – Years ago, Chris Wu and I would discuss what we felt were the biggest challenges holding our field back. We both still believe a major challenge is our ability to store, share, visualize, and analyze our data. If you look at the big research labs, they have all built significant infrastructure to address these problems. Unfortunately, all of these labs have built this infrastructure independently. We now have 100s of proteomics pipelines worldwide, yet the small research labs that don’t have the resources of larger labs are placed at a disadvantage. Furthermore, instead of everyone building off of one another, we have a situation where each person is starting from scratch — this is incredibly wasteful and unproductive.

When Nate Yates left Merck Research Labs for the University of Pittsburgh, he told me that his research focus was going to be building computational capabilities like those they had at Merck (which were probably second to none) and making them broadly available to the entire community. I told him about Chris’s and my ideas for building a single infrastructure that every lab could use and build upon. We then decided to combine our efforts and work together.

Shortly after that, we realized that Andrey Bondarenko, who managed the development of the proteomics software when he was at Merck, had a similar vision. Andrey had some unique ideas for how to store mass spectrometry data more efficiently than the vendor RAW files do and how to make the random retrieval of chromatograms or spectra from 1,000s of files almost instantaneous. This meant we could store data cheaply because it would take less space, we could display it faster, and if we put these tools on a common platform we could make the sharing and dissemination of data seamless. There are now four of us who are “all in,” and we feel strongly that once we get things going there will be others.

We are calling the platform the CHORUS Project, and you will be able to find more information shortly at http://chorusproject.org. We are modeling things off of a Google Docs type of interface. Each user has an account and can upload datafiles. The datafiles can be kept completely private, shared within a research group, shared with collaborators, shared with journal editors, or made completely public. There are capabilities (Apps), similar to a “Qual Browser,” for people to display data. Users will be able to search for datafiles available to them that meet certain criteria. We also plan to host a basic MS/MS search pipeline based initially on Comet and Percolator, with many more tools to follow. We have spent a lot of time thinking about the security of the data, both in the context of ensuring that no datafiles can get lost or corrupted and of protecting users’ privacy.

Just like Google Docs, each user will get a certain amount of storage and processing for free. Beyond that, users will have to pay for it — if they want to. This fee will be cheaper than what most people could manage if they did it themselves. We will be able to reduce costs by consolidating resources, providing innovative solutions for the storage of data, and taking advantage of economies of scale. Users will not have to worry about backups or making data available to collaborators, and there will be no IT support costs. Furthermore, their data will be on a platform that is continually being upgraded with new data processing tools and visualization capabilities. Also, we strongly want to encourage file sharing and making data public, so data made fully public and accessible to everyone will always be free. Just like Gmail and Dropbox, we fully anticipate that many people will choose to use just the free services. However, if we can give users value at equal or lesser cost than what they are currently paying, then we believe they will see value in it and use it for all of their storage and analyses.

It is very important to note that CHORUS is not intended to make money, only to cover its operating costs. Money that is raised to support CHORUS will be used to expand its capabilities and further reduce the costs to users.

We already have a number of partners in this project, and a number of others have unofficially voiced strong interest. At ASMS there will be an evening workshop on CHORUS. During the workshop we hope to provide a demonstration of the capabilities that will be immediately available, discuss our plans for the future, and get feedback on what the community feels are the essential features of such a community platform. The plan is to build an API so that developers can build software tools to analyze data stored within CHORUS.
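The CHORUS API had not been published at the time of this interview, so the sketch below is purely hypothetical: the endpoint, parameters, token, and response shape are all invented. It only illustrates the kind of programmatic query such an API could eventually enable (something like the “all phosphopeptide analyses of HeLa cells” example from the question above), using Python’s requests library.

```python
# Hypothetical sketch only: the CHORUS API described in the interview was not yet
# published, so the endpoint, parameters, token, and response shape are invented.
import requests

API_ROOT = "https://chorusproject.org/api"   # assumed base URL, not documented here
TOKEN = "YOUR-API-TOKEN"                     # placeholder credential

def search_public_files(organism, keyword=None):
    """Query public datafiles matching some metadata criteria (illustrative only)."""
    params = {"access": "public", "organism": organism}
    if keyword:
        params["keyword"] = keyword
    resp = requests.get(f"{API_ROOT}/files",
                        params=params,
                        headers={"Authorization": f"Bearer {TOKEN}"},
                        timeout=30)
    resp.raise_for_status()
    return resp.json()   # assumed to be a list of file-metadata dicts

# e.g. "all phosphopeptide analyses of HeLa cells", as in the question above
for f in search_public_files(organism="human", keyword="HeLa phospho"):
    print(f["name"], f["uploaded"], f["size_bytes"])   # invented field names
```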


SP – Given your interest in data storage, what are your thoughts on data formats (e.g. raw vendor formats, mzXML/mzData/mzML, mz5, SQLite-based YAFMS, etc.)?


Mike – The datafiles that are most important are the RAW datafiles. Those files are sacred and maintain the data exactly as the vendor intended. Through the ProteoWizard project (http://proteowizard.sourceforge.net/), it is possible to access the data from a RAW file just as easily as through the “standard” datafile formats. The other formats are largely just “transfer” formats. These formats enable us to use the data with a specific software tool on an operating system other than Windows. My preference, then, is whichever format works with the tool we are interested in using. Fortunately, the MSConvert program from the ProteoWizard project handles the conversion between most major file formats, making the specific format required largely a non-issue. If developers work with a format that is not supported by the ProteoWizard data access layer, then I would encourage them to work with the ProteoWizard team to add it.
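As a small illustration of treating the standard formats as transfer formats, the sketch below converts a vendor RAW file with ProteoWizard’s msconvert and then reads the resulting mzML in Python. The filenames are placeholders, and the third-party pyteomics library is an assumed dependency, not something mentioned in the interview.

```python
# Sketch: convert a vendor RAW file with ProteoWizard's msconvert, then read the
# resulting mzML. Filenames are placeholders; pyteomics is an assumed third-party
# dependency (pip install pyteomics), not something endorsed in the interview.
import subprocess
from pyteomics import mzml

raw_file = "sample01.raw"   # placeholder vendor file

# msconvert ships with ProteoWizard; --mzML writes an mzML copy into the -o directory
subprocess.run(["msconvert", raw_file, "--mzML", "-o", "converted"], check=True)

# Iterate over spectra in the converted file and inspect the first MS1 scan
with mzml.read("converted/sample01.mzML") as reader:
    for spectrum in reader:
        if spectrum.get("ms level") == 1:
            mz = spectrum["m/z array"]
            intensity = spectrum["intensity array"]
            print(spectrum["id"], len(mz), intensity.max())
            break
```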

When you think about a project like CHORUS, none of these formats are appropriate. If you need to be able to extract an MS1 spectrum at a specific retention time from a specific datafile, among thousands of datafiles, the existing formats just won’t handle it. Likewise, what if you wanted to see the XIC or base peak chromatogram from 100s of files from all data collected on HeLa lysates in the last 5 years? While the formats you mention are great for storing and transferring parts of a datafile, they won’t scale when you want to do analyses of 1,000s of datafiles. For a reasonable amount of data, one could imagine putting the data into a relational database that had a clever way of storing and indexing it. However, what if you had 100s of petabytes or even exabytes distributed among many computers? Fortunately, Google has solved this problem with the development of MapReduce. Andrey Bondarenko and his team have developed an approach to “chunk” and index datafiles like you would with a relational database but store them in a MapReduce data structure.
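CHORUS’s actual storage layer is not described in detail here, but the toy Python sketch below illustrates the general “chunk and index” idea: spectrum points are routed into per-file retention-time buckets so that an XIC over a narrow RT window can be pulled from many files without scanning each one end to end. The data, key scheme, and in-memory store are invented stand-ins for a distributed, MapReduce-style backend.

```python
# Toy illustration of the "chunk and index" idea described above.
# The data, key scheme, and in-memory "store" are invented; a real system such as
# CHORUS would distribute these chunks across many machines.
from collections import defaultdict
from bisect import insort

CHUNK_MINUTES = 1.0          # width of each retention-time bucket

store = defaultdict(list)    # (file_id, rt_bucket) -> sorted list of (rt, mz, intensity)

def index_point(file_id, rt, mz, intensity):
    """Map step: route each spectrum point into its (file, RT-bucket) chunk."""
    bucket = int(rt // CHUNK_MINUTES)
    insort(store[(file_id, bucket)], (rt, mz, intensity))

def xic(file_ids, target_mz, tol, rt_start, rt_end):
    """Reduce step: pull only the buckets covering [rt_start, rt_end] from each file."""
    chromatograms = defaultdict(list)
    for fid in file_ids:
        first, last = int(rt_start // CHUNK_MINUTES), int(rt_end // CHUNK_MINUTES)
        for bucket in range(first, last + 1):
            for rt, mz, inten in store.get((fid, bucket), []):
                if rt_start <= rt <= rt_end and abs(mz - target_mz) <= tol:
                    chromatograms[fid].append((rt, inten))
    return chromatograms

# Invented example data: two "files", a handful of points each
index_point("run_A", 12.3, 500.25, 1.2e5)
index_point("run_A", 12.8, 500.26, 2.4e5)
index_point("run_B", 12.5, 500.24, 0.9e5)

print(xic(["run_A", "run_B"], target_mz=500.25, tol=0.02, rt_start=12.0, rt_end=13.0))
```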


SP – What exactly is a ‘trail ultra marathon’ and why would you subject yourself to it?


Mike – Hahahaha. A trail ultra marathon is simply a trail run that is longer than a marathon, or 26.2 miles. Common distances are 50k, 50 miles, 100k, or 100 miles. I have only completed endurance events up to 100k, but I would like to do a 100 miler at some point. I enjoy doing events like this because it is important to have a goal that may seem out of reach but is obtainable with hard work and preparation. We are all capable of more than we think. Running these sorts of distances requires that you make a lifestyle change; it is unlikely that you can simply wake up and decide to run 50 miles. It requires making exercise a priority. While I find it is sometimes hard to get out the door, especially when it is dark and rainy here in Seattle, I never regret having gone for a run once I do. I also find exercise and fresh air very therapeutic — most bad days can be resolved by a long run in the mountains.

Discuss this interview with other SharedProteomics members here.


Michael MacCoss
Associate Professor
University of Washington
Department of Genome Sciences


Bio

Mike is a leader in the field of quantitative proteomics. The focus of his lab is the development of high-throughput quantitative proteomic methods and their application to model organisms. During his post-doc he developed RelEx, one of the first tools to quantify proteins from stable isotope labeling experiments. His lab at the University of Washington has developed several widely used tools for quantitative proteomic analysis including Skyline, a free software package for the design and interpretation of targeted proteomics experiments.

Selected Publications

MacLean B, Tomazela DM, Shulman N, Chambers M, Finney GL, Frewen B, Kern R, Tabb DL, Liebler DC, MacCoss MJ. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26(7), 966-968 (2010).


MacCoss MJ, Wu CC, Liu H, Sadygov R, Yates JR III. A correlation algorithm for the automated analysis of quantitative shotgun proteomics data. Anal. Chem. 75, 6912-6921 (2003).


Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923-925 (2007).

Fun Facts

Preferred brand of running shoe?
I have tried pretty much every pair of running shoes imaginable. Currently I’m getting over some injuries and have been running in Hoka One One Stinson Evos. They are an unusual shoe but are pretty good at minimizing impact while also promoting a normal running gait. I also like the Inov-8 Roclite 295 and the Montrail Rogue Racer.


Favorite Seattle restaurant?
There are too many to name. It depends on what you like to eat. I like oysters, and Elliott’s Oyster House is hard to beat.


How do you drink your coffee?
I always drink my coffee black, straight up. My wife bought a Nespresso maker and it is now my new favorite instrument. I haven’t figured out a safe way to take it IV yet.