Information Contents Of Hypothetical DNA Sequences

by

Huen Y.K.

CAHRC, P.O.Box 1003, Singapore 911101
http://web.singnet.com.sg/~huens/
email: huens@mbox3.singnet.com.sg

(A short communication - 1st released: 27/6/97.)


Abstract

DNA sequences are the repositories of genetic information. Since there are only four distinct nucleotides or base-pairs in the {ATCG} symbolic set, representations of information are unusually lengthy. This can be appreciated if one is asked to tell one's life story with an alphabet of four letters. DNA sequences are just number sequences expressed in base four. Using the method developed for finding information contents of number theoretic functions, information contents of hypothetical finite DNA strands are investigated, starting with those with highly repetitive patterns. Measurements are based on a chosen standard reference represented by infinite repeats of the integer set {1357} or the alphabetic set {ATGC}. Unlike number sequences, DNA sequences are always in unnormalised forms and one is seldom able to find closed formulations. The information content unit is based on the "Chdog", which is defined as equivalent to 126 ascii characters or symbols in Maple programming syntax (Chdog = "Chaitin + Godel" reversed; pronounced "shdog"). Infinite information content could occur in a perfectly random, infinitely long sequence, but it is conjectured that the information content of DNA sequences is finite, especially when confined to the expression of a single gene. Predicting information content purely from a DNA sequence could be simplistic, as the sequence probably interacts with environmental factors. But it is difficult to quantify external factors not encoded in the sequence. So in this study, information contents are calibrated on the basis of sequence algebraic formulae only.


1. Introduction

DNA and RNA are chainlike molecules composed of subunits called nucleotides. DNA is the genetic material which is the basis of life. The Watson-Crick model of DNA structure shows that the DNA molecule is double-helical and that the bases pair in a specific way: adenine (A) with thymine (T), and guanine (G) with cytosine (C). Imagine a twisted ladder whose two rails are the sugar-phosphate chains of the double helix, joined across by rungs of base-pairs. When DNA replicates, the parental strands separate; each then serves as the template for making a new, complementary strand. Genes of all true organisms are made of DNA; certain viruses and all viroids have genes made of RNA. A gene is a repository of information, i.e., it acts as a blueprint for making proteins [1].

Figure 1 shows a short strand of double helix. There are four possible base-pairs, i.e., AT, TA, GC and CG. Chemically, A is complementary to T and C to G. DNA sequences are thus built upon only four letters, viz., A, T, C, and G in single strands, or AT, TA, GC, and CG as base-pairs in double helices. Using sequence algebraic notation, the separate strands from Figure 1 can be represented by S35(z) and S53(z) in equation (1). The powers of z in the denominators indicate the position of each base in these strands.

                S35
    Helix:      3'====A=====T=====G=====C====5'
    Base-pairs:       |     |     |     |
    Helix:      5'====T=====A=====C=====G====3'
                S53

    Fig. 1 - A Short Segment Of DNA Sequence.
    The two strands are antiparallel and are identified
    by numbering such as 3'5' or 5'3'. The 5'-end bears
    a free 5'-phosphate group and the 3'-end bears a
    3'-hydroxyl group. For pairing of bases, these two
    sequences have to be antiparallel from spatial
    considerations.


    S35(z) := A/z + T/z^2 + G/z^3 + C/z^4;
    S53(z) := T/z + A/z^2 + C/z^3 + G/z^4;          (1).

When these two strands are combined we get a double-stranded DNA sequence or double helix or duplex as shown in equation (2).

    C(z) := Combine(S35(z),S53(z)) := AT/z + TA/z^2 + GC/z^3 + CG/z^4          (2).
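The construction of equations (1) and (2) can be sketched in Python. This is only an illustrative sketch and not part of the Maple work reported here; the helper names complement_strand, combine and series are hypothetical, and bases are paired position by position exactly as drawn in Figure 1.

```python
# Sketch of equations (1) and (2): build the complementary strand and
# pair the two strands position by position. All names are illustrative.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand):
    """Return the base-by-base complement (A<->T, G<->C) of a strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def combine(s35, s53):
    """Pair bases at equal positions, giving the base-pairs of the duplex."""
    return [a + b for a, b in zip(s35, s53)]


def series(symbols):
    """Write symbols as a sum of terms symbol/z^k, with k starting at 1."""
    return " + ".join(f"{s}/z^{k}" for k, s in enumerate(symbols, start=1))


s35 = "ATGC"
s53 = complement_strand(s35)
print("S35(z) :=", series(s35))
print("S53(z) :=", series(s53))
print("C(z)   :=", series(combine(s35, s53)))
```

The sketch writes the first term as A/z^1 rather than A/z; the two are the same expression.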

For page economy, single strands will be represented in set notations shown as follows:

    DNA(z) := {ATTAUGGC.........}

Double stranded DNA sequences or double helices can be represented as follows:

    DNA(z) := {AT,TA,GC,CG,AT,AT,......}

Since the symbolic set contains only four ascii characters, we can use either the alphabet set of {ATGC} or the numeral set of {1357} neither of which is superior to the other.
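As a small illustration of the equivalence of the two symbol sets, the following Python sketch translates between them. The pairing A=1, T=3, G=5, C=7 is the one implied by equations (3) and (5); the function names are hypothetical.

```python
# Translate between the alphabetic set {ATGC} and the numeral set {1357}.
# The pairing A=1, T=3, G=5, C=7 follows equations (3) and (5).
TO_NUMERIC = str.maketrans("ATGC", "1357")
TO_ALPHABETIC = str.maketrans("1357", "ATGC")


def to_numeric(sequence):
    """Rewrite a sequence over {ATGC} using the numerals {1357}."""
    return sequence.translate(TO_NUMERIC)


def to_alphabetic(sequence):
    """Rewrite a sequence over {1357} using the letters {ATGC}."""
    return sequence.translate(TO_ALPHABETIC)


print(to_numeric("ATGCATGC"))
print(to_alphabetic("13571357"))
```

Since the translation is one-to-one, any regularity of repeats in one representation is carried over unchanged to the other.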


2. The Factorisability Of DNA Sequences

Factorisability is tested using the Maple intrinsic function called Factor( ). Factorisability of a sequence is dependent only on the regularity of repeats and is not dependent on the choice of the symbol set of {ATGC} or {1357}. As soon as one base deviates from the regular pattern, factorisability is lost completely. Since there is no difference whether one chooses ATGC or 1357 for the symbolic set, the former will be adopted for familiarity. Then sequence algebraic analyses of single-stranded or double-stranded sequences are identical at least for primary structures.

(i) Using Alphabetic Symbols:

    DNA(z) := A/z + T/z^2 + G/z^3 + C/z^4 + A/z^5 + T/z^6 + G/z^7 + C/z^8
              + A/z^9 + T/z^10 + G/z^11 + C/z^12          (3).

    factorised := (z^2 + z + 1)*(z^2 - z + 1)*(z^4 - z^2 + 1)
                  *(A*z^3 + T*z^2 + G*z + C)/z^12          (4).

(ii) Using Numerical Symbols:

    DNA(z) := 1/z + 3/z^2 + 5/z^3 + 7/z^4 + 1/z^5 + 3/z^6 + 5/z^7 + 7/z^8
              + 1/z^9 + 3/z^10 + 5/z^11 + 7/z^12          (5).

    factorised := (z^2 + z + 1)*(z^2 - z + 1)*(z^3 + 3*z^2 + 5*z + 7)
                  *(z^4 - z^2 + 1)/z^12          (6).
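The factorisation in equation (6) can be checked independently by multiplying the factors back together. The Python sketch below does this with plain coefficient lists (highest power first); it is a verification aid written for this purpose, not the Maple procedure itself, and poly_mul is a hypothetical helper.

```python
# Multiply the four factors of equation (6) and compare with the numerator
# of equation (5) placed over z^12. Coefficients run from highest power down.
def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


factors = [
    [1, 1, 1],         # z^2 + z + 1
    [1, -1, 1],        # z^2 - z + 1
    [1, 3, 5, 7],      # z^3 + 3 z^2 + 5 z + 7  (one repeat block)
    [1, 0, -1, 0, 1],  # z^4 - z^2 + 1
]

product = [1]
for factor in factors:
    product = poly_mul(product, factor)

# Numerator of equation (5): z^11 + 3 z^10 + 5 z^9 + 7 z^8 + ... + 7
numerator = [1, 3, 5, 7] * 3
print(product == numerator)
```

The three factors that do not contain the block multiply to z^8 + z^4 + 1, which is the polynomial that places three copies of the four-base block side by side.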

As soon as one errant base appears, factorisability is totally lost. Factorisability is unlikely to be an important factor in DNA sequence representations. It is hard to factor a very long sequence and it is very sensitive to errant bases which interrupt the regularities of patterns.

(i)

    DNA(z) := T/z + T/z^2 + G/z^3 + C/z^4 + A/z^5 + T/z^6 + G/z^7 + C/z^8
              + A/z^9 + T/z^10 + G/z^11 + C/z^12          (7).

    factorised := (T*z^11 + T*z^10 + G*z^9 + C*z^8 + A*z^7 + T*z^6 + G*z^5
                  + C*z^4 + A*z^3 + T*z^2 + G*z + C)/z^12          (8).

(ii)

    DNA(z) := 2/z + 3/z^2 + 5/z^3 + 7/z^4 + 1/z^5 + 3/z^6 + 5/z^7 + 7/z^8
              + 1/z^9 + 3/z^10 + 5/z^11 + 7/z^12          (9).

    factorised := (2*z^11 + 3*z^10 + 5*z^9 + 7*z^8 + z^7 + 3*z^6 + 5*z^5
                  + 7*z^4 + z^3 + 3*z^2 + 5*z + 7)/z^12          (10).
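The effect of a single errant base can be illustrated with polynomial long division. In the hedged Python sketch below, the numerator of a regular twelve-base repeat divides exactly by z^8 + z^4 + 1, while the numerator of equation (10), whose first base is 2 instead of 1, leaves a non-zero remainder. The sketch only tests divisibility by this one repeat polynomial; it does not attempt a full factorisation, and poly_divmod is a hypothetical helper.

```python
# Long division by a monic polynomial; coefficients run from highest power down.
def poly_divmod(numerator, divisor):
    """Return (quotient, remainder); the divisor must have leading coefficient 1."""
    work = list(numerator)
    quotient = []
    for i in range(len(work) - len(divisor) + 1):
        c = work[i]
        quotient.append(c)
        for j, d in enumerate(divisor):
            work[i + j] -= c * d
    remainder = work[len(work) - len(divisor) + 1:]
    return quotient, remainder


repeat = [1, 0, 0, 0, 1, 0, 0, 0, 1]                # z^8 + z^4 + 1
regular = [1, 3, 5, 7, 1, 3, 5, 7, 1, 3, 5, 7]      # numerator of equation (5)
errant = [2, 3, 5, 7, 1, 3, 5, 7, 1, 3, 5, 7]       # numerator of equation (10)

q_regular, r_regular = poly_divmod(regular, repeat)
q_errant, r_errant = poly_divmod(errant, repeat)
print(q_regular, r_regular)
print(q_errant, r_errant)
```

For the regular sequence the quotient is the four-base block and the remainder vanishes; for the errant sequence the remainder is -z^7 - z^3, so the repeat structure can no longer be factored out.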

3. Information Content Measurements

Infinite repeats of the (ATGC) set will be adopted as the standard measure for information. Two sequence algebraic formulations for this sequence are developed, as given by equations (11) and (12). The open formulation is adopted since it yields a lower information content than the closed form. Unlike in number theory, derivations of closed forms are rarely possible, so open forms will be adopted as the norm. In counting the number of ascii characters in an expression, each intrinsic function is counted as a single ascii character and the semicolon at the end of the expression is not counted. Counting is done only to the right of the assignment sign (:=). Units are measured in Godch, where 1 Godch is equivalent to seven ascii characters [5].
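The counting rules can be sketched in Python as follows. This is an assumption-laden sketch rather than the procedure actually used: the list of names treated as single symbols (sum, expand, series, infinity) and the decision to ignore white space are guesses, so hand counts quoted in this paper may differ from the sketch by a character or two. The function names are hypothetical.

```python
import re

# Names counted as one symbol each. This list is an assumption.
INTRINSICS = re.compile(r"\b(?:infinity|expand|series|sum)\b")

ASCII_PER_GODCH = 7
ASCII_PER_CHDOG = 126


def count_symbols(expression):
    """Count symbols to the right of ':=' with intrinsic names counted once."""
    rhs = expression.split(":=", 1)[-1]       # only the right-hand side counts
    rhs = re.sub(r"\s+", "", rhs).rstrip(";")  # drop spaces and the final semicolon
    rhs = INTRINSICS.sub("#", rhs)             # each intrinsic becomes one symbol
    return len(rhs)


def to_godch(symbols):
    """Convert a symbol count to Godch (1 Godch = 7 ascii characters)."""
    return symbols / ASCII_PER_GODCH


def to_chdog(symbols):
    """Convert a symbol count to Chdog (1 Chdog = 126 ascii characters)."""
    return symbols / ASCII_PER_CHDOG


n = count_symbols("sum(A/z^i,i=1..ub);")
print(n, to_godch(n))
```

For the pure-A sequence sum(A/z^i,i=1..ub) the sketch returns 16 symbols, in agreement with the count given for equation (13).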

(i) Open form:

DNA(z):=sum(A/z^(4*i)+T/z^(4*i+1)+G/z^(4*i+2)+C/z^(4*i+3),i=0..ub);

Information content = 55 ascii characters = 7.8571428 Godch.

    DNA(z) := A + T/z + G/z^2 + C/z^3 + A/z^4 + T/z^5 + G/z^6 + C/z^7
              + A/z^8 + T/z^9 + G/z^10 + C/z^11          (11).

(ii) Closed form:

DNA(z):=expand(z^4*series(A/(z*(z^4-1))+T/(z^2*(z^4-1))+G/(z^3*(z^4-1))+C/(z^4*(z^4-1)),z=infinity,ub));

Information content = 78 ascii characters = 11.142857 Godch.
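The closed form rests on a geometric series in 1/z^4. The following worked derivation (valid for |z| > 1, with the infinite upper bound taken as an assumption) shows why each term of the closed form generates one base of the repeat, and gives the compact rational form of the whole series.

```latex
\begin{aligned}
z^{4}\,\frac{A}{z\,(z^{4}-1)}
  &= \frac{A}{z}\cdot\frac{1}{1-z^{-4}}
   = \sum_{i=0}^{\infty}\frac{A}{z^{4i+1}}, \\[4pt]
z^{4}\left[\frac{A}{z\,(z^{4}-1)}+\frac{T}{z^{2}(z^{4}-1)}
          +\frac{G}{z^{3}(z^{4}-1)}+\frac{C}{z^{4}(z^{4}-1)}\right]
  &= \sum_{i=0}^{\infty}\left(\frac{A}{z^{4i+1}}+\frac{T}{z^{4i+2}}
          +\frac{G}{z^{4i+3}}+\frac{C}{z^{4i+4}}\right) \\
  &= \frac{A z^{3}+T z^{2}+G z+C}{z^{4}-1}.
\end{aligned}
```

Note that this closed form starts the repeat at A/z, as in equation (3), whereas the open form above starts at the constant term A.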

    DNA(z) := A/z + T/z^2 + G/z^3 + C/z^4 + A/z^5 + T/z^6 + G/z^7 + C/z^8
              + A/z^9 + T/z^10 + G/z^11 + C/z^12 + z^4*O(1/z^17)          (12).

Based on the above definition of the Godch, a pure nucleotide sequence containing only A will have less information than the standard reference. The test in equation (13) shows that its information content is about 29% of that of the standard reference given by equation (11). The scaling seems reasonable.

sum(A/z^i,i=1..ub);

Information content = 16 ascii characters = 2.2857142 Godch.

    A/z + A/z^2 + A/z^3 + A/z^4 + A/z^5 + A/z^6 + A/z^7 + A/z^8
        + A/z^9 + A/z^10 + A/z^11 + A/z^12          (13).

Next, we test whether a palindromic sequence has an information content different from that of the standard reference. Palindromic sticky ends are used in the docking of the free ends of cut plasmids in recombinant DNA experiments [1,2]. Here ub is the upper bound, which is usually taken to be infinity.

DNA(z):=sum(A/z^(4*i)+T/z^(4*i+1)+G/z^(4*i+2)+C/z^(4*i+3)+C/z^(4*i+4)+G/z^(4*i+5)+A/z^(4*i+6)+T/z^(4*i+7),i=0..ub);

Information content = 103 ascii characters = 14.714285 Godch, which is 1.873 times that of the standard reference. The scaling is again quite reasonable.

    DNA(z) := A + T/z + G/z^2 + C/z^3 + C/z^4 + G/z^5 + A/z^6 + T/z^7
              + A/z^4 + T/z^5 + G/z^6 + C/z^7 + C/z^8 + G/z^9 + A/z^10 + T/z^11
              + A/z^8 + T/z^9 + G/z^10 + C/z^11 + C/z^12 + G/z^13 + A/z^14
              + T/z^15          (14).
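The expansion in equation (14) can be reproduced term by term. The Python sketch below expands the sum for an eight-base block and reports which powers of z receive more than one base. With the index step 4*i used above, consecutive blocks share four exponents, which is why exponents repeat in equation (14). A step of 8*i would give every base its own position without changing the character count; that alternative is an assumption about the intent and is not stated in the text. The function names are hypothetical.

```python
# Expand sum(block[k]/z^(step*i + k), i = 0..ub) into (symbol, exponent) pairs.
def expand_sum(block, step, ub):
    """List the terms of the sum in the order they are generated."""
    return [(sym, step * i + k) for i in range(ub + 1) for k, sym in enumerate(block)]


def clashes(terms):
    """Map each exponent that receives more than one symbol to those symbols."""
    by_exponent = {}
    for sym, exponent in terms:
        by_exponent.setdefault(exponent, []).append(sym)
    return {e: syms for e, syms in by_exponent.items() if len(syms) > 1}


block = "ATGCCGAT"
as_written = expand_sum(block, step=4, ub=2)   # step used in the sum above
non_overlapping = expand_sum(block, step=8, ub=2)  # assumed alternative step

print(sorted(clashes(as_written)))
print(clashes(non_overlapping))
print("".join(sym for sym, _ in non_overlapping))
```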

Now we test the recognition site of the restriction enzyme Not1 (NotI), which recognises and cuts the specific sequence GC^GGCCGC [1]. The ^-mark is where this enzyme will cut. In sequence algebra, we introduce a 0 numerator where this cut mark is located so that the site is recognised algebraically in manipulations. It is inconceivable that Nature would not have some tagging method by which the restriction enzyme could recognise this site.

Site(z):=G+C/z^1+0/z^2+G/z^3+G/z^4+C/z^5+C/z^6+G/z^7+C/z^8;          (15).

Information content = 49 ascii characters = 7 Godch.
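The construction of the site expression can be sketched in Python. The sketch places the 0 numerator exactly where the ^-mark falls in the recognition sequence GC^GGCCGC and then counts the characters of the result; site_expression is a hypothetical helper and the output is only a string in Maple-like notation.

```python
# Build the sequence algebraic expression of a restriction site, writing the
# cut mark as a 0 numerator at its position in the recognition sequence.
def site_expression(recognition, cut_mark="^"):
    """Return e.g. 'G+C/z^1+0/z^2+...' for a site such as 'GC^GGCCGC'."""
    symbols = ["0" if ch == cut_mark else ch for ch in recognition]
    terms = [symbols[0]]
    terms += [f"{sym}/z^{k}" for k, sym in enumerate(symbols[1:], start=1)]
    return "+".join(terms)


expression = site_expression("GC^GGCCGC")
characters = len(expression)
print(expression)
print(characters, "ascii characters =", characters / 7, "Godch")
```

The character count does not depend on where the cut mark sits within the site, since every term after the first has the same length.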

Now if this site is buried in a parent sequence, then the information content will increase drastically making site recognition quite a feat. Imagine you are asked to find a knot in a messed up woollen ball after rough treatments by your kitten.


4. Information Of Protein Sequences

Remember that each amino acid of a protein is encoded by a codon made up of three bases, i.e., a trinucleotide [1]. Codons specify some 20 different types of amino acids, and a protein is defined by its amino-acid sequence. Protein sequences therefore have a base of 20 instead of 4, and the information content of a protein sequence increases dramatically. The symbol set for the 20 amino acids will be formed by the numeral/alphabetic set given below:

    {12345 6789A BCDEF GHJKL}          (16).

0 and I are not used because the former will not appear in a Maple sequence and the latter could be mixed up with the numeral 1. It is probably more convenient to define another reference standard for protein sequences instead of breaking each amino acid down into 3 nucleotides, although the two measures are almost linearly proportional for long sequences.


Protein Sequence Standard

Protein(z):=1+2/z+3/z^3+4/z^4+5/z^5+6/z^6+7/z^7+8/z^8+9/z^9+A/z^10+B/z^11+C/z^12+D/z^13+E/z^14+F/z^15+G/z^16+H/z^17+J/z^18+K/z^19+L/z^20;

Information content = 126 ascii characters. Now 1 Godch = 7 ascii characters. Therefore we define a new unit for the measurement of protein information as 126/7 = 18 Godch = 1 Chdog (Chaitin + Godel in reverse). The polite pronunciation of Godch is "gosh" and that of Chdog is "shdog". Thus one could use this conversion to measure protein information in either Godch or Chdog. The scaling is also about right, but the count cannot be an exact multiple of 20 in view of Maple's programming syntax.

    Protein(z) := 1 + 2/z + 3/z^3 + 4/z^4 + 5/z^5 + 6/z^6 + 7/z^7 + 8/z^8
                  + 9/z^9 + A/z^10 + B/z^11 + C/z^12 + D/z^13 + E/z^14
                  + F/z^15 + G/z^16 + H/z^17 + J/z^18 + K/z^19 + L/z^20          (17).

5. What Is The Point Of All This?

A good question. Probably only molecular geneticists could answer it. The author thinks that geneticists are more interested in segments of a sequence with high information content. It is like reading a newspaper: an eye-catching headline helps circulation. So all DNA and protein sequences should be recorded with indices of information content in Chdogs, with segments of high information content highlighted. What can be done with these is another matter. Presumably, a stretch of intron is more informative than a long stretch of repetitive DNA sequence. Therefore software could be developed to search for the stretch with the maximum information content. Conceptually this is done as follows:

                           max. information content
                           found here!!
    {xxxxxxxxxxxxxxxxxxxxxxATGCCGUGUGUxxxxxxxxxxxxxxxxxxxxx}          (18).
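Such a search could be sketched as a sliding window over the sequence. The Python sketch below is only a conceptual illustration: instead of the character count of a sequence algebraic formula, it uses a much cruder stand-in, namely the length of the smallest repeating unit of each window (a short unit means the window can be written as a short sum). This proxy, the window width and the function names are all assumptions made for the sketch.

```python
# Sliding-window search using the smallest repeat period as a crude proxy
# for information content. The proxy is an assumption, not the Chdog measure.
def smallest_period(window):
    """Smallest p with window[i] == window[i - p] for every i >= p."""
    n = len(window)
    for p in range(1, n + 1):
        if all(window[i] == window[i - p] for i in range(p, n)):
            return p
    return n


def most_informative_window(sequence, width):
    """Return (start, score) of the first window with the largest proxy score."""
    best_start, best_score = 0, 0
    for start in range(len(sequence) - width + 1):
        score = smallest_period(sequence[start:start + width])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


sequence = "A" * 12 + "TGCCGT" + "A" * 12
start, score = most_informative_window(sequence, width=6)
print(start, score, sequence[start:start + 6])
```

In this example every window lying wholly inside a run of A has period 1, and the first window to reach the maximum score is the one in which the run is first interrupted. A practical tool would need a better measure than this proxy and a rule for choosing the window width.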

This is only a conjecture. Presumably, DNA sequences from two closely related species should have about the same information content. For example, scientists at present find that orang-utans are genetically more remote from Homo sapiens than chimpanzees are. Will computations of information content be able to confirm these findings?


6. Conclusions

This paper shows that DNA and protein sequences are just like number sequences and can be analysed using sequence algebra. The information contents of these sequences can be computed and calibrated against agreed standards. The two standards recommended are:

DNA sequences: 1 Godch = 7 ascii characters.

Protein sequences: 1 Chdog = 126 ascii characters = 18 Godch.

All calibrations should be done using the same algebraic syntax. For example, in this paper we use Maple programming syntax to represent the sequence algebraic formulations. This is not mandatory, and other symbolic packages will probably give about the same results.

The author is of the opinion that there is no point in attempting to find closed formulations for DNA or protein sequences, although if these could be found, they could result in savings in printing costs. Currently some very unwieldy volumes of genetic sequences are on display in libraries, and these get out of date fast. If closed forms could be found, then all we would need is to email the new formulations to scientists all over the world. Is this a pipedream? Well, dreams are the stuff which humans are made of.

(Postscript: The reference section contains all current papers in sequence algebra some of which may be remotely connected with DNA sequences. Most of these can be downloaded free from this URL site.)


7. References

1. Weaver R.F. and Hedrick P.W. : Basic Genetics, Second edition (1995), Wm.C.Brown Publishers pp 131 to 177.

2. Patrick DeGeest's Palindrome Page : Comprehensive source on computational palindromes can be found in DeGeest's URL site: http://www.ping.be/~ping6758/index.htm.

3. Pickover C.A. (editor) : Visualizing Biological Information, please read "Representation of Biological Sequences Using Posint Geometry Analysis" by Huen Y.K. pp 165 to 182. World Scientific. 1995.

4. Huen Y.K.: A Simple Introduction To Sequence Algebra, URL site: http://web.singnet.com.sg/~huens/

5. Huen Y.K.: The Canonical Generating Function or CGF(z) - a Swiss-knife function. URL site: http://web.singnet.com.sg/~huens/ .

6. Huen Y.K.: Information Contents Of Number Theoretic Functions. URL site: http://web.singnet.com.sg/~huens/ .

7. Huen Y.K.: In Search Of Exotic Arithmetic Operators, URL site: http://web.singnet.com.sg/~huens/ .

8. Huen Y.K.: Visual Solutions Of Number Theoretic Functions in Multidimensional Sequence Space, URL site: http://web.singnet.com.sg/~huens/ .

9. Huen Y.K.: Final Value Theorems Applied To Number Sequences -- its strengths and weaknesses, URL site: http://web.singnet.com.sg/~huens/ .

10. Huen Y.K.: Unsolved Problems In Sequence Algebra, URL site: http://web.singnet.com.sg/~huens/ .

11. Huen Y.K.: Explicit Formulation For Modular Arithmetic In Sequence Algebra, URL site: http://web.singnet.com.sg/~huens/ .

12. Huen Y.K.: Cyclic Generating Functions In Sequence Algebra, URL site: http://web.singnet.com.sg/~huens/ .

13. Huen Y.K.: Methods Of Developing Sequence Algebraic Formulations For Comp(z) and Prime(z). URL site: http://web.singnet.com.sg/~huens/ .

14. Huen Y.K.: A Matrix Map for Prime and Non-prime Numbers, INT. J. Math. Educ. Sci. Technol., 1994, VOL. 25, NO.6, pp 913-920.

15. Huen Y.K.: Some Interesting Properties Of The Natural Number System, Int. J. Math. Educ. Sci. Technol., 1996, VOL.27, NO. 5, 685-691.

16. Huen Y.K.: Visual algebra and its applications, INT. J. Math. Educ. Sci. Technol.,1996, VOL.??, NO.?, ???-??? (In the press as proof paper mes 100421).

17. Huen Y.K.: Twin primes revisited: INT. J. Math. Educ. Sci. Technol., 1997, VOL.??,NO.?, ???-???. (In the press as proof paper mes 100488).

=====================END OF PAPER ======================