SP – Your education includes both a B.S. and an M.S. in electrical engineering from the University of Washington. How did you end up in John Yates’ lab doing computational proteomics? And was this transition a difficult or a natural one?
Jimmy – When I finished my master’s degree, I was just looking for a job and ended up applying to a position I saw advertised in the Seattle Times classified ads. I remember it listed requirements such as C programming, knowledge of databases (which turned out to be sequence databases, not relational databases), Unix experience, etc. The posting was from the Yates lab, which had recently moved up from Caltech as part of the new Molecular Biotechnology Department at the UW. I had a lot to learn, so the transition wasn’t natural, but I wouldn’t classify it as difficult either. It was a matter of getting educated on proteomics basics and terminology and going from there. That process was made easier because the lab members were both smart and helpful.
SP – In the early nineties you wrote Sequest, the first major database search algorithm for identifying peptides from protein databases based on MS/MS data. How did the idea to write Sequest come about? Can you give us some insight into what was going on in the field around that time?
Jimmy – The idea was John’s, which he formulated long before I first heard the word “peptide”. The position he posted, and that I was hired for, was to implement his idea as an automated search tool. The proteomics community was obviously much smaller at the time, and the computational proteomics community was even smaller. So anyone with the insight and resources to hire a computational person for their lab really had the opportunity to make significant contributions to the field. The ability to identify peptides, automate the visualization of protein sequence coverage, display the tandem mass spectrum with identified fragment ions, and see all of this from any web browser was unique in the early-to-mid 90s.
SP – What was the first version of Sequest like (e.g. language written in, speed, quality of results, etc.)?
Jimmy – Sequest was written in C on a DEC Ultrix system, which was the same operating system used for the data system running the Finnigan TSQ 70 at that time. The data landscape back in 1993 was a lot different than it is now, or even than it was 10 years ago. Automated data-dependent acquisition hadn’t been invented yet. The lab was acquiring just a handful (maybe 5 to 10) of identifiable tandem mass spectra at a time; I remember writing down all peptide IDs in my lab notebook as we went along. So search speed wasn’t really an issue we concerned ourselves with in the early days. The search results were actually really good for the quality of the data. There’s a night-and-day difference between the tandem mass spectra we analyze now and those from back then (in terms of mass accuracy and noise) due to the rapid advancement in instrumentation through the years.
SP – What other major software have you developed? Any other peptide identification algorithms?
Jimmy – There’s really no other significant software that I can attribute just to myself. When the ICAT strategy was being developed in the early 2000s, I wrote a tool named XPRESS to automate the quantification of that data. This tool is still in use today as part of the Trans-Proteomic Pipeline (TPP) suite of open source tools. The mzXML format was a pet project of mine and I was fortunate to work with smart people on getting that off the ground. A few folks were tangibly involved, including our collaborators at Insilicos. But in particular, Patrick Pedrioli, a grad student then, took it upon himself to build the format and support tools to make it successful. The TPP from the Aebersold group came about by contributions from lots of researchers. In the early days, there were just a few of us who strived to put it all together (in particular Andy Keller did a ton of the programming to tie everything together). As for other peptide identification algorithms, the closest thing that exists today would be the k-score plug-in to X!Tandem. The whole scoring plug-in framework was implemented by Brendan MacLean of Skyline fame when we both were part of the McIntosh group at the Fred Hutchinson Cancer Research Center. I had the fortune of working with Brendan on this and the k-score plug-in algorithm is just another variant of a correlation or dot product score of mine that the Hutch was using before we converted their pipeline over to X!Tandem.
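For readers unfamiliar with the family of scores mentioned here, a generic dot-product spectral similarity can be sketched as follows. This is an illustrative simplification under assumed conventions (fixed-width m/z binning, normalized intensities), not the actual k-score or Sequest cross-correlation; the function and parameter names are hypothetical.

```python
import math

def dot_product_score(observed, theoretical, bin_width=1.0005):
    """Generic dot-product similarity between two peak lists of
    (m/z, intensity) tuples: bin peaks onto a shared m/z axis,
    then take the normalized dot product of the intensity vectors."""
    def binned(peaks):
        vec = {}
        for mz, intensity in peaks:
            b = int(mz / bin_width)  # assign each peak to an m/z bin
            vec[b] = vec.get(b, 0.0) + intensity
        return vec

    a, b = binned(observed), binned(theoretical)
    numerator = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(x * x for x in a.values())) *
            math.sqrt(sum(x * x for x in b.values())))
    return numerator / norm if norm else 0.0
```

With this normalization the score runs from 0 (no shared peaks) to 1 (identical binned spectra); real scoring functions layer considerable preprocessing and weighting on top of this core idea.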
SP – What other search algorithms have you used? Tell us your general impression of those you have tried.
Jimmy – I have used Mascot, X!Tandem, and OMSSA on more than just a casual basis. There are tools like MS-Tag in ProteinProspector, InsPecT, Crux/Tide, etc. that I’ve played with through the years, but I’m primarily a Sequest user followed by X!Tandem and Mascot in that order. I’ve only fairly recently interacted with Mascot from an entire software standpoint (such as installing databases, administering users, etc.) and I do appreciate what a nice, well-done piece of software it is. I like tools that are easy to use and just work, and Mascot definitely falls into that vein. For data processing, I end up using X!Tandem more because it fits into my software pipeline more easily, since I tend to do everything on a Unix command line. It’s just a little more work for me to launch Mascot searches and then retrieve results. I can appreciate that most researchers would probably have the opposite preference, interacting through a web interface rather than using the command line. As for my impressions of each tool beyond what I stated above, I’ll just say that every tool has its own pros and cons, and these pros/cons are likely different for each person. I won’t name which ones, but there are a couple of tools above that seem to perform consistently poorly for me, say when measured at a given false discovery rate in a target-decoy search. I’m sure that these tools can perform better when the searches are performed differently or the FDR analysis is processed differently than how I do things.
SP – What is the current focus of your work at the University of Washington?
Jimmy – I work at the UW’s Proteomics Resource (UWPR). The UWPR primarily provides access to instrumentation to the local proteomics community. I’m a computational resource for UWPR collaborators and the various proteomics groups on campus. I still do work on Sequest here, adding functionality and features as requested by researchers. I also interact with computational researchers who have written their own database search tools. I collaborate on research projects with grad students and post docs. This includes assisting with data analysis, helping to implement novel analysis tools, etc. And every now and then a technical or systems administration issue comes up that I lend a hand with. We have quite a few proteomics groups at the UW so there’s always a wide range of interesting research being performed here.
SP – Using target-decoy databases is a common method for estimating FDR in proteomics experiments. In the past few years this approach has been combined with machine learning methods, allowing for better differentiation of true and false positives, often by combining multiple scores. What do you think of this strategy? In your experience what are the best ways to estimate FDR? And are there problems we should be wary of?
Jimmy – I always use a reverse database concatenated to the target database and simply sort results by a single score such as the expectation score. I do this because it’s straightforward and because I personally don’t think it’s worth a lot of my own mental effort to try and optimize the heck out of the FDR calculation to squeeze as many IDs out as possible. Striving to go from one large list of peptides or proteins to an even larger list of peptides/proteins just doesn’t seem that interesting to me. Some (many?) machine learning methods make use of the same target/decoy results for both the learning step and the evaluation of results. They keep evaluating some search/machine learning space to maximize target hits. The fact that the training data is the same as the final test data bothers me a bit. There really is no independent test data as the algorithms train until some maximum peptide count is reached at a given FDR.
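The simple procedure described here can be sketched in a few lines: concatenate reversed-sequence decoys to the target database, sort PSMs by a single score, and accept targets while the running decoy/target ratio stays under the chosen FDR. This is a minimal illustration under assumed conventions (lower score = better, e.g. an expectation value); the function names are hypothetical and not from any actual search tool.

```python
def reverse_decoy(sequence):
    """Build a decoy entry by reversing a target protein sequence,
    for concatenation onto the target database before searching."""
    return sequence[::-1]

def accept_at_fdr(psms, fdr_cutoff=0.01):
    """psms: list of (score, is_decoy) pairs, lower score = better.
    Walk down the ranked list, estimating FDR as #decoys / #targets,
    and return the target PSMs accepted at the given cutoff."""
    ranked = sorted(psms, key=lambda p: p[0])
    targets = decoys = 0
    cut = 0  # index just past the last PSM with acceptable running FDR
    for i, (score, is_decoy) in enumerate(ranked):
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets <= fdr_cutoff:
            cut = i + 1
    return [p for p in ranked[:cut] if not p[1]]
```

The appeal of the approach is exactly what is stated above: a single sort on a single score, with no training step, so the decoys serve purely as an independent yardstick rather than doubling as training data.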
SP – MS/MS data are typically searched against protein databases constructed from genome sequences. Advances in genomic sequencing technologies have permitted the sequencing of hundreds and thousands of individuals. How can this data effectively be incorporated into protein databases and searched? Will this require that we rethink how databases are organized and searched?
Jimmy – Just managing the explosion in spectral data, as instruments continually become more efficient at acquiring data, is an issue in itself. And having to consider an explosion in the sequence databases that this data needs to be searched against is definitely daunting. If we are going to be faced with searching against the sequences of many individuals, there’s no question the format of databases and the current implementation of search tools will need to be revisited to perform this analysis as efficiently as possible. Luckily there are many talented computational researchers in proteomics now, so I’m confident a few of those smart people will solve this problem elegantly when the need arises.
SP – There has been a recent push to produce peptides for every protein in the human (and other) proteome in order to facilitate high throughput targeted quantitative proteomics experiments. Once this is complete do you foresee a huge shift towards targeted proteomic experiments (e.g. selected reaction monitoring) and a diminished role for database searching?
Jimmy – I do see targeted proteomics experiments growing if for no other reason than this community tends to do what leaders in the field do. Quite a few high profile groups are proponents of SRM and it’s only a matter of time before the followers start to perform the same types of experiments themselves on their own respective systems. That being said, it will be a long time before the need for sequence database searching starts to wane.
SP – How do you see the role of searching spectrum libraries changing over the next few years?
Jimmy – This is a little surprising to me; I’m not sure why spectral library searching is still the niche tool that it is. For spectral library searching to be adopted more fully, it needs to be integrated into a sequence database search pipeline. It will be a long time before folks give up sequence database searching, as habits are hard to change, so the best way to ramp up adoption and acceptance of spectral library searching is to have it come for free and add value to the database search results.
SP – What do you see as some of the biggest hurdles we have yet to overcome in proteomics? And where do you think we are headed?
Jimmy – These are hard questions to answer and I’m in no real position to answer them. I work in my own little world and at some level am oblivious to the breadth of proteomics research being performed in the community. At this point in time, we can all identify and quantify peptides and proteins. Whether we all do that well or not is another question, but instruments are ubiquitous and easy to use, and software exists for everyone to do proteomics experiments. Being experienced or educated enough to ask the right questions and do the right experiments (and analyze the data correctly) is a challenge because proteomics analysis is more accessible now than ever to a wide range of expertise. Big changes in any field occur when something disruptive comes about. Otherwise we just end up with incremental improvements. The biggest opportunity for a disruptive change is some novel instrument that is capable of doing things that I can’t even conceive of now. It’s only a matter of time before this instrument is developed, and everything changes at that time.
Jimmy K. Eng
Senior Software Engineer
University of Washington
Jimmy received both a BS and an MS in electrical engineering from the University of Washington in Seattle. He has been working as a software engineer in Seattle for nearly twenty years and has held positions at the University of Washington (twice), the Institute for Systems Biology, and the Fred Hutchinson Cancer Research Center. He is most well known for writing Sequest, the first and probably most widely used computer algorithm for identifying peptides from mass spectrometry-based proteomic analyses. He has also played a major role in advancing open source proteomic tools and implementing standardized proteomic file formats.