### Simple math question

Posted: **Fri Apr 10, 2015 6:28 pm**

by **msaeed**

Hi folks,

I want to calculate how often a defined stretch of 8 amino acids should be expected to appear in the human proteome. For example, assuming that there are 20,000 proteins in a human cell and each protein is 300 aa long, how often would one expect to see the 'RSPMRSSM' stretch? Any help is highly appreciated. Thanks.

Posted: **Mon Apr 13, 2015 4:39 am**

by **Rolando**

I am not an expert in math/probabilities, but here is one approach:

The peptide is 8 aa long. If every amino acid is equally likely at each position, there are 21^8 (= 37,822,859,361) different possible peptides.

A protein of 300 aa offers 293 positions where the peptide of interest could start (300 - 8 + 1). With 20,000 proteins, that gives 20,000 × 293 = 5,860,000 chances to find the peptide.

Now we divide the number of chances across all proteins by the total number of possible 8 aa peptides: 5,860,000 / 37,822,859,361 ≈ 1.55E-04. That is the expected number of occurrences of the peptide in the proteome, so the chance of seeing it at all is roughly 1.55E-04.
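For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation. It uses the numbers from the post (21 equally likely residues, 20,000 proteins of 300 aa). The last step, turning the expected count into a chance of at least one hit, uses a Poisson approximation, which is my own assumption and not part of the calculation above.

```python
import math

ALPHABET_SIZE = 21   # number of residue types used in the post above
PEPTIDE_LEN = 8
PROTEIN_LEN = 300
N_PROTEINS = 20_000

# Number of distinct 8-mers if every residue is equally likely
n_peptides = ALPHABET_SIZE ** PEPTIDE_LEN

# Overlapping start positions per protein, then across the whole proteome
windows_per_protein = PROTEIN_LEN - PEPTIDE_LEN + 1
total_windows = N_PROTEINS * windows_per_protein

# Expected number of occurrences of one specific 8-mer
expected_hits = total_windows / n_peptides

# Assumption: hits are rare and roughly independent, so the chance of at
# least one hit is about 1 - exp(-expected_hits) (Poisson approximation)
p_at_least_one = -math.expm1(-expected_hits)

print(f"possible peptides: {n_peptides:,}")
print(f"windows:           {total_windows:,}")
print(f"expected hits:     {expected_hits:.3e}")
print(f"P(at least one):   {p_at_least_one:.3e}")
```

Because the expected count is so small, the probability of at least one hit is almost identical to it.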

Of course, real proteins do not use the whole sequence space evenly (a lot of proteins contain conserved domains with similar peptide stretches).

What do you think of my calculation? Any mistakes?

Posted: **Mon Apr 13, 2015 6:12 am**

by **Infinity**

Just one correction: the probability of each aa is not the same, so you have to account for the "natural abundance" of each aa. You can calculate the frequency of each aa from a human proteome database.
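A rough sketch of how that could look in Python, assuming each residue is independent of its neighbours (a simplification, since real sequences are not). The function names and the two toy sequences are made up for illustration; real frequencies would have to be counted from an actual proteome file.

```python
from collections import Counter
from math import prod


def residue_frequencies(sequences):
    """Relative frequency of each residue across all given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def peptide_probability(peptide, freqs):
    """Probability of the peptide at one fixed start position,
    assuming residues are drawn independently with the given frequencies."""
    return prod(freqs.get(aa, 0.0) for aa in peptide)


# Hypothetical toy "proteome", only here to make the sketch runnable
toy_proteome = ["MRSPSSMR", "RRSSPPMM"]
freqs = residue_frequencies(toy_proteome)
p = peptide_probability("RSPMRSSM", freqs)
print(freqs)
print(p)
```

Multiplying `p` by the number of overlapping windows in the proteome then gives the expected number of occurrences, in place of dividing by 21^8.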

Posted: **Mon Apr 13, 2015 11:11 am**

by **Christopher**

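You can also count how many times a given sequence actually appears rather easily, if you are interested. In R just use:

```r
library(Biostrings)

# Read the translated human ORFs as amino acid sequences
proteome <- readAAStringSet('./human_orf_trans.fasta', format = 'fasta', use.names = TRUE)

# Count exact matches in each sequence, then sum over all sequences
sum(vcountPattern('RSPMRSSM', proteome, fixed = TRUE))
```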
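If R or Biostrings is not at hand, the same kind of count can be sketched in plain Python. This is not a port of the Biostrings internals, just a simple FASTA reader plus a substring search that includes overlapping matches. The helper names are my own, and the file path in the usage comment is the one from the R snippet.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_overlapping(sequence, pattern):
    """Count exact occurrences of pattern in sequence, overlaps included."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        start = sequence.find(pattern, start + 1)
    return count


def count_in_fasta(path, pattern):
    """Total number of exact matches over every record in the file."""
    return sum(count_overlapping(seq, pattern) for _, seq in read_fasta(path))


# Usage (path as in the R example):
# print(count_in_fasta("./human_orf_trans.fasta", "RSPMRSSM"))
```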

Posted: **Tue Apr 14, 2015 11:31 am**

by **msaeed**

Hi Rolando, thanks for taking the time to figure out the answer. I could not get my head around how you came up with the number 293. How about tackling this problem with simple maths:

1. Total number of possible 8-mer peptides = 20^8 = 2.56 × 10^10

2. All else being equal, my peptide of choice should appear once in 2.56 × 10^10 peptides, or once every 2.05 × 10^11 amino acids (2.56 × 10^10 × 8).

3. Since there are 20,000 proteins each containing 300 amino acids, the total number of amino acids is 6,000,000.

4. 6,000,000 / (2.05 × 10^11) ≈ 2.9 × 10^-5, so apparently the chance of my peptide of choice appearing anywhere in the whole proteome is almost negligible.

Let me know if this makes sense to you. Thanks again.
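A small Python sketch of the four steps above, using only the numbers from this thread. Dividing the total residue count by 8 is the same as cutting the proteome into non-overlapping 8-residue blocks, so for comparison the sketch also counts every possible start position (the 293 per protein from the earlier reply). Variable names are illustrative.

```python
N_PROTEINS = 20_000
PROTEIN_LEN = 300
K = 8                       # peptide length
n_peptides = 20 ** K        # step 1: 2.56e10 possible 8-mers

total_residues = N_PROTEINS * PROTEIN_LEN   # step 3: 6,000,000

# Steps 2 and 4: one peptide per 8 residues, i.e. non-overlapping blocks
blocks = total_residues / K                 # 750,000 blocks
expected_from_blocks = blocks / n_peptides

# Alternative: the peptide may start at any position in a protein,
# which gives 300 - 8 + 1 = 293 overlapping windows per protein
windows = N_PROTEINS * (PROTEIN_LEN - K + 1)
expected_from_windows = windows / n_peptides

print(f"non-overlapping blocks: {expected_from_blocks:.2e}")
print(f"overlapping windows:    {expected_from_windows:.2e}")
```

Either way the expected count stays far below 1, but the two ways of counting differ by a factor of almost 8.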

Posted: **Tue Apr 14, 2015 11:33 am**

by **msaeed**

Hi Christopher, thanks for the R script. I will give it a shot and let you know if it works.

Posted: **Wed Apr 15, 2015 6:36 am**

by **woa**

293 is the number of OVERLAPPING 8-amino-acid peptides one can get from a 300-amino-acid protein sequence (300 - 8 + 1).
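To make the overlapping count concrete, here is a short Python sketch that lists every 8-residue window of a sequence. The 10-residue toy sequence and the function name are made up for illustration.

```python
def overlapping_kmers(sequence, k):
    """Return every length-k window of the sequence, one per start position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 10-residue sequence: 10 - 8 + 1 = 3 windows
toy = "ARSPMRSSMK"
for window in overlapping_kmers(toy, 8):
    print(window)

# A 300-residue sequence gives 300 - 8 + 1 windows
print(len(overlapping_kmers("A" * 300, 8)))
```

Each window starts one residue after the previous one, so neighbouring windows share 7 of their 8 residues.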

Posted: **Wed Apr 15, 2015 11:39 am**

by **msaeed**

I got it. Thank you.