mzXML parsing

aky
Albumin Member
Posts: 89
Joined: Sat Sep 10, 2011 2:33 pm

Post by aky » Wed Feb 25, 2015 4:24 am

Hi friends,

I am writing a parser (in Perl) to read an mzXML file, and I am stuck on the Base64-encoded peaks.
Conceptually this seems very simple: read the peaks as a string, decode it using MIME::Base64, and that's it.

But this is not working for me because there is zlib compression along with the Base64 encoding. When I convert a RAW file (Thermo) to mzXML using msconvert, the result has Base64 encoding with zlib compression. But when I convert the same file from mgf to mzXML (again using msconvert), it may not have compression in default mode.

There is an old mzXML 2.0 and 2.1 tutorial which has a code example, but it doesn't work. It uses some arbitrary @data variable without any explanation or declaration (the peaks are in some other variable). There is no proper tutorial that shows how to do this properly. Searching the older spctools forums gives a few hints, but that code only reads peaks (not their intensities).

As of now I do not understand host order and the ways of compression much, so please be kind and explain a little. I'll be grateful.

Thanks. Looking forward to the wisdom of this group.

Craig
E. Coli Lysate Member
Posts: 220
Joined: Sun Jun 26, 2011 6:49 pm

Post by Craig » Wed Feb 25, 2015 6:41 am

Have you tried turning off zlib compression in msconvert and making sure you can read the base64-encoded data then? If compression is the only issue, I'd definitely recommend using a library to handle this. It looks like there are several options out there for doing this in Perl. Not sure if this will help but I have C# code for reading base64-encoded data here. Also, have you considered supporting mzML instead, as it's a replacement for mzXML (and mzData)?

jke000
Phosphoserine Member
Posts: 20
Joined: Wed Aug 17, 2011 10:05 pm

Post by jke000 » Wed Feb 25, 2015 11:59 am

Not sure if helpful or not but ...

In mzXML, the encoded data array is first converted to network byte order (big endian), possibly zlib compressed, and then base64 encoded. To read the data, you need to base64 decode, possibly zlib uncompress, and then convert the results from network byte order back to your host order (likely little endian). At that point, you should have a valid peak list array that you can try to read. You'll know if you have to deal with zlib compression/decompression by the presence of the 'compressionType="zlib"' and 'compressedLen="XX"' attributes in the peaks element.
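
To make the three steps above concrete, here is a minimal Perl sketch. It is only an illustration under stated assumptions: decode_peaks is a hypothetical helper (not part of any library), the peaks are assumed to be 64-bit floats (precision="64"), and the caller is assumed to have already read compressionType from the peaks element. The sample data is built in the script itself so the round trip can be checked without an mzXML file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical helper: $is_zlib should be true when the peaks element
# carries compressionType="zlib". Assumes precision="64".
sub decode_peaks {
    my ( $encoded, $is_zlib ) = @_;

    my $bytes = decode_base64($encoded);    # step 1: Base64 decode
    if ($is_zlib) {
        $bytes = uncompress($bytes);        # step 2: zlib uncompress
        die "zlib uncompress failed\n" unless defined $bytes;
    }

    # Step 3: big-endian 64-bit floats to native numbers.
    # The ">" modifier needs Perl 5.10 or later.
    my @values = unpack( "d>*", $bytes );

    # Values alternate m/z, intensity; group them into pairs.
    my @peaks;
    while ( my ( $mz, $intensity ) = splice( @values, 0, 2 ) ) {
        push @peaks, [ $mz, $intensity ];
    }
    return @peaks;
}

# Build a small test string the same way a writer would:
# big-endian doubles, zlib compressed, then Base64 encoded.
my $raw     = pack( "d>*", 100.5, 2000, 200.25, 0 );
my $encoded = encode_base64( compress($raw), "" );

for my $peak ( decode_peaks( $encoded, 1 ) ) {
    print "$peak->[0] $peak->[1]\n";
}
```

For uncompressed files, the same helper can be called with a false second argument, which skips the uncompress step.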

For some background on the whole host order or network byte order thing, read this: https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Data/endian.html
Host order refers to the endian-ness of your local host machine, likely little-endian. Network byte order is big endian.
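
If the byte order part is still abstract, a small Perl sketch may help. It packs the number 1.0 as a 64-bit float in both byte orders and prints the bytes as hex, then shows what happens when big-endian bytes are read with the wrong order. The variable names are just for illustration, and the ">" and "<" modifiers assume Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $big    = pack( "d>", 1.0 );    # network (big-endian) byte order
my $little = pack( "d<", 1.0 );    # little-endian byte order
my $native = pack( "d",  1.0 );    # whatever this machine uses

printf "big:    %s\n", unpack( "H*", $big );       # 3ff0000000000000
printf "little: %s\n", unpack( "H*", $little );    # 000000000000f03f
print "host is ", ( $native eq $little ? "little" : "big" ), "-endian\n";

# Same eight bytes, read with the right and the wrong byte order
my $right = unpack( "d>", $big );
my $wrong = unpack( "d<", $big );
print "right order: $right\n";
print "wrong order: $wrong\n";
```

The wrong-order value is a meaningless number, which is typically what you see when mzXML peaks are unpacked as native floats on a little-endian machine.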

Not to steer you away from this forum, but if you're still stuck, consider posting questions to the spctools-discuss Google group, as there should be developers on that list who would be able to answer specific Perl parsing questions (I think!). Good luck.

Post by aky » Wed Feb 25, 2015 7:53 pm

Thanks, Craig and Jimmy, for the help.

Craig, I am able to read non-compressed Base64-encoded data easily in Perl. For both zlib and Base64, I am already using standard CPAN libraries.
I do have an mzML parser, but I want to convert a few older files which are in mzXML format, for writing a pipeline parser.

JKE, thanks for the help on byte order. Let me try this and get back.

Post by Craig » Thu Feb 26, 2015 6:21 am

If it's just for a few older files and you already have an mzML parser, have you considered using msconvert to convert from mzXML to mzML?

Post by aky » Mon Apr 06, 2015 11:51 pm

Sorry for the late reply.

I solved the problem with help from David Trudgian's post on the SPC forums. The code below also shows how to separate the m/z and intensity values of each pair:

#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib qw(uncompress);
use MIME::Base64 qw(decode_base64);

# Data is Base64 encoded and compressed with zlib
my $data = 'ENCODED DATA STRING';    # keep encoded data here or read from file

# Decode and uncompress the data; uncompress returns undef on invalid input
my $base64decoded = uncompress( decode_base64($data) );
die "zlib uncompress failed (is the data really zlib compressed?)\n"
    unless defined $base64decoded;

# Data is in 64-bit floats in network order
# Unpack as 64-bit network order quad ints (needs a Perl with 64-bit integers)
my @hostOrder = unpack( "Q>*", $base64decoded );
my $flag = 0;    # 0 means the next value is an m/z, 1 means an intensity
foreach my $i (@hostOrder) {
    # Pack into a native quad then unpack into the correct 64-bit float
    my $val = unpack( "d", pack( "Q", $i ) );

    if ( $flag == 0 ) {
        # m/z value: print it and stay on the same line
        printf "%.12g ", $val;
        $flag = 1;
    }
    else {
        # Intensity: print zero as "0.0", then end the line
        my $intensity = $val == 0 ? "0.0" : sprintf( "%.12g", $val );
        print "$intensity\n";
        $flag = 0;
    }
}

