## Simple math question

If it doesn't fit into any other category post it here.
msaeed
Carbon Member Posts: 4
Joined: Fri Apr 10, 2015 6:12 pm

### Simple math question

Hi folks,
I want to calculate the frequency with which a defined stretch of 8 amino acids should appear in the human proteome. For example, assuming that there are 20,000 proteins in a human cell and each protein is 300 aa long, how often one would see the 'RSPMRSSM' stretch? Any help is highly appreciated. Thanks.

Rolando
Carbon Member Posts: 4
Joined: Thu Jan 23, 2014 1:17 am
msaeed wrote: Hi folks,
I want to calculate the frequency with which a defined stretch of 8 amino acids should appear in the human proteome. For example, assuming that there are 20,000 proteins in a human cell and each protein is 300 aa long, how often one would see the 'RSPMRSSM' stretch? Any help is highly appreciated. Thanks.

I am not an expert in math/probabilities, but here is an approach:

The peptide is 8 aa long. If every amino acid is equally likely at each position, there are 21^8 (= 37,822,859,361) different peptides possible.

Per protein of 300 aa we have 293 chances to find the peptide of interest (300-8+1). Since there are 20,000 proteins, we have 20,000 times 293 chances to find the peptide (5,860,000 chances).

Now we divide the chances we have from all proteins by the total number of possibilities for an 8 aa peptide: 5,860,000/37,822,859,361 = 1.55E-04. This is the expected number of times the peptide appears in the whole proteome, so the chance of seeing it at all is about 1.55E-04.
Of course the proteins do not use the whole sequence space (a lot of proteins contain conserved domains with similar peptide stretches).
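The arithmetic above can be reproduced in a few lines of Python. This is only a sketch under the post's own assumptions (21 equally likely residues, 20,000 proteins of 300 aa each); the variable names are made up for illustration.

```python
# Assumptions taken from the post above: 21 equally likely residues,
# 20,000 proteins, each 300 aa long, and an 8 aa peptide.
ALPHABET_SIZE = 21
PEPTIDE_LENGTH = 8
PROTEIN_LENGTH = 300
N_PROTEINS = 20_000

# Number of distinct 8-mers under a uniform model.
possible_peptides = ALPHABET_SIZE ** PEPTIDE_LENGTH

# Overlapping start positions per protein: 300 - 8 + 1 = 293.
windows_per_protein = PROTEIN_LENGTH - PEPTIDE_LENGTH + 1
total_windows = N_PROTEINS * windows_per_protein

# Expected number of occurrences of one specific peptide.
expected_hits = total_windows / possible_peptides

print(possible_peptides)        # 37822859361
print(total_windows)            # 5860000
print(f"{expected_hits:.2e}")   # 1.55e-04
```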

What do you think of my calculation? Any mistakes?

Infinity
E. Coli Lysate Member Posts: 107
Joined: Wed Dec 21, 2011 8:22 pm
Just one correction. The probability of each aa is not the same; you have to account for the "natural abundance" of each aa. You can calculate the frequency of each aa from a human proteome database.
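A rough sketch of that correction in Python: estimate each residue's frequency from a set of sequences, then multiply the frequencies along the peptide. The two sequences below are invented toy data, not real human proteins, and the function names are hypothetical. The model also assumes residues are independent of their neighbours, which is a simplification.

```python
from collections import Counter
from math import prod

def residue_frequencies(sequences):
    """Return the fraction of each residue across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def peptide_probability(peptide, freqs):
    """Probability of the peptide at one fixed position, assuming
    independent residues; residues never observed get probability 0."""
    return prod(freqs.get(aa, 0.0) for aa in peptide)

# Hypothetical toy sequences, NOT real human proteins.
toy_proteome = ["MRSPMRSSMA", "MSSRPPMRSA"]
freqs = residue_frequencies(toy_proteome)
p = peptide_probability("RSPMRSSM", freqs)

print(freqs)
print(p)
```

With real data, `toy_proteome` would be replaced by the sequences read from a proteome file, and `p` would be multiplied by the number of overlapping windows to get an expected count.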

Christopher
Angiotensin Member Posts: 42
Joined: Thu Dec 27, 2012 12:26 pm
You can also count how many times a given sequence actually appears rather easily, if you are interested. In R just use:

require(Biostrings)
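For readers without R, the same idea (counting real occurrences instead of estimating them) can be sketched in plain Python. This does not use Biostrings; the function name and the toy sequences are hypothetical, and reading an actual proteome file is left out.

```python
def count_overlapping(sequence, peptide):
    """Count occurrences of peptide in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(peptide)
    while start != -1:
        count += 1
        # Restart one position later so overlapping hits are counted.
        start = sequence.find(peptide, start + 1)
    return count

# Hypothetical toy sequences standing in for a proteome file.
toy_proteome = {
    "protA": "AAARSPMRSSMAAARSPMRSSM",
    "protB": "MKKLLPTA",
}
hits = {name: count_overlapping(seq, "RSPMRSSM")
        for name, seq in toy_proteome.items()}
print(hits)  # {'protA': 2, 'protB': 0}
```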

msaeed
Carbon Member Posts: 4
Joined: Fri Apr 10, 2015 6:12 pm
Hi Rolando, thanks for taking the time to figure out the answer. I could not get my head around how you came up with the number 293. How about tackling this problem with simple maths:
1. Total number of possible 8-mer peptides = 20^8 = 2.56 X 10^10
2. If every other factor is unchanged, my peptide of choice will appear once in 2.56 X 10^10 peptides, or once every 2 X 10^11 amino acids (2.56 X 10^10 * 8).
3. Since there are 20,000 proteins each containing 300 amino acids, the total number of amino acids is 6,000,000.
4. 6,000,000 / 2 X 10^11 = 2.9 X 10^-5, so apparently the chance of my peptide of choice appearing in the whole proteome is almost negligible.
Let me know if this makes sense to you. Thanks again.
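The four steps above can be checked numerically. The sketch below also computes, purely for comparison, the overlapping-window count from Rolando's post with the same 20-letter alphabet; variable names are illustrative.

```python
possible_peptides = 20 ** 8             # step 1: 25,600,000,000
aa_per_hit = possible_peptides * 8      # step 2: 204,800,000,000
total_aa = 20_000 * 300                 # step 3: 6,000,000
estimate = total_aa / aa_per_hit        # step 4

# For comparison: overlapping 8-mer windows instead of disjoint blocks.
windows = 20_000 * (300 - 8 + 1)
estimate_overlapping = windows / possible_peptides

print(f"{estimate:.1e}")               # 2.9e-05
print(f"{estimate_overlapping:.1e}")   # 2.3e-04
```

The two estimates differ by roughly a factor of 8, because dividing by 8 treats the proteome as disjoint 8 aa blocks rather than letting a peptide start at any position.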

msaeed
Carbon Member Posts: 4
Joined: Fri Apr 10, 2015 6:12 pm
Hi Christopher, thanks for the R script. I will give it a shot and let you know if it works.

woa
Ubiquitin Member Posts: 63
Joined: Sun Sep 25, 2011 6:29 am
293 is the number of OVERLAPPING 8 amino acid peptides one can get from a 300 amino acid long protein sequence
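A quick way to see where 293 comes from is to list the overlapping windows of a short sequence and then count them for a 300-residue one. The sequences here are made up for illustration.

```python
def overlapping_windows(sequence, k):
    """Return every length-k substring, one per start position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Hypothetical 10-residue sequence: 10 - 8 + 1 = 3 windows.
print(overlapping_windows("MRSPMRSSMA", 8))
# ['MRSPMRSS', 'RSPMRSSM', 'SPMRSSMA']

# A 300-residue sequence gives 300 - 8 + 1 = 293 windows.
print(len(overlapping_windows("A" * 300, 8)))  # 293
```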

msaeed
Carbon Member Posts: 4
Joined: Fri Apr 10, 2015 6:12 pm
woa wrote: 293 is the number of OVERLAPPING 8 amino acid peptides one can get from a 300 amino acid long protein sequence

I got it. Thank you.