1. Introduction
DNA and RNA are chainlike molecules composed of subunits called nucleotides. DNA is the
genetic material that is the basis of life. The Watson-Crick model of DNA structure shows that
the DNA molecule is double-helical and that the bases pair in a specific way: adenine (A) with
thymine (T), and guanine (G) with cytosine (C). Imagine a twisted ladder whose two rails are the
sugar-phosphate chains of the double helix, joined across by rungs of base-pairs. When DNA
replicates, the parental strands separate; each then serves as the template for making a new,
complementary strand. Genes of all true organisms are made of DNA; certain viruses and all
viroids have genes made of RNA. A gene is a repository of information, i.e., it acts as a
blueprint for making proteins [1].
Figure 1 shows a short strand of double helix. There are four possible base-pairs, i.e., AT, TA,
GC and CG. Chemically, A is complementary to T and C to G. DNA sequences
are thus built from only four letters, viz., A, T, C, and G in single strands, or AT, TA, GC, and
CG as base-pairs in double helices. Using sequence-algebraic notation, the separate strands
in figure 1 can be represented by S35(z) and S53(z) in equation (1). The powers of z in
the denominators indicate the order of the bases in these strands.
S35
Helix:       3'====A=====T=====G=====C====5'
Base-pairs:        |     |     |     |
Helix:       5'====T=====A=====C=====G====3'
S53
Fig. 1 - A short segment of a DNA sequence. The two strands are antiparallel and are identified
by numbering such as 3'5' or 5'3'. The 5'-end bears a free 5'-phosphate group and the 3'-end
bears a 3'-hydroxyl group. For the bases to pair, these two sequences have to be antiparallel
because of spatial considerations.
     S35(z) := A/z + T/z^2 + G/z^3 + C/z^4;
     S53(z) := T/z + A/z^2 + C/z^3 + G/z^4;        (1)
When these two strands are combined we get a double-stranded DNA sequence or double helix
or duplex as shown in equation (2).
     C(z) := Combine(S35(z), S53(z)) = AT/z + TA/z^2 + GC/z^3 + CG/z^4        (2)
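As a minimal sketch of equations (1) and (2), the Python below builds the complementary strand
from the base-pairing rule, writes a strand as a sum of z-terms, and pairs two strands position
by position. The names complement, series_terms and combine are illustrative only; combine
merely mimics what Combine( ) appears to do in equation (2) and is not a Maple function.

```python
# Watson-Crick pairing rule: A<->T, G<->C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complementary strand."""
    return "".join(PAIR[base] for base in strand)


def series_terms(symbols):
    """Write symbols as a sum of terms symbol/z^k, k = 1, 2, ..."""
    return " + ".join(f"{s}/z^{k}" for k, s in enumerate(symbols, start=1))


def combine(s35, s53):
    """Pair two strands position by position into base-pairs (hypothetical helper)."""
    if len(s35) != len(s53):
        raise ValueError("strands must have equal length")
    return [a + b for a, b in zip(s35, s53)]


s35 = "ATGC"
s53 = complement(s35)          # "TACG"
duplex = combine(s35, s53)     # ["AT", "TA", "GC", "CG"]
print(series_terms(s35))       # A/z^1 + T/z^2 + G/z^3 + C/z^4
print(series_terms(duplex))    # AT/z^1 + TA/z^2 + GC/z^3 + CG/z^4
```

The sketch writes the first term as A/z^1 rather than A/z; the two are the same expression.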
To save space, single strands will be represented in set notation as follows:
     DNA(z) := {ATTAUGGC...}
Double-stranded DNA sequences or double helices can be represented as follows:
     DNA(z) := {AT,TA,GC,CG,AT,AT,...}
Since the symbol set contains only four ascii characters, we can use either the letter set
{ATGC} or the numeral set {1357}; neither is superior to the other.
2. The Factorisability Of DNA Sequences
Factorisability is tested using the Maple intrinsic function factor( ). The factorisability of a
sequence depends only on the regularity of its repeats and does not depend on the choice of
the symbol set {ATGC} or {1357}. As soon as one base deviates from the regular pattern,
factorisability is lost completely. Since it makes no difference whether one chooses ATGC or
1357 as the symbol set, the former will be adopted for familiarity. Sequence-algebraic
analyses of single-stranded and double-stranded sequences are then identical, at least for
primary structures.
(i) Using Alphabetic Symbols:
     DNA(z) := A/z + T/z^2 + G/z^3 + C/z^4 + A/z^5 + T/z^6 + G/z^7 + C/z^8
               + A/z^9 + T/z^10 + G/z^11 + C/z^12        (3)
     factorised := (z^2 + z + 1)*(z^2 - z + 1)*(z^4 - z^2 + 1)*(A*z^3 + T*z^2 + G*z + C)/z^12        (4)
(ii) Using Numerical Symbols:
     DNA(z) := 1/z + 3/z^2 + 5/z^3 + 7/z^4 + 1/z^5 + 3/z^6 + 5/z^7 + 7/z^8
               + 1/z^9 + 3/z^10 + 5/z^11 + 7/z^12        (5)
     factorised := (z^2 + z + 1)*(z^2 - z + 1)*(z^3 + 3*z^2 + 5*z + 7)*(z^4 - z^2 + 1)/z^12        (6)
As soon as one errant base appears, factorisability is totally lost, as the two examples
below show: in (7) the first base A is replaced by T, and in (9) the first numeral 1 is
replaced by 2. Factorisability is therefore unlikely to be an important factor in DNA
sequence representations: a very long sequence is hard to factor, and the result is very
sensitive to errant bases that interrupt the regularity of the pattern.
(i)
     DNA(z) := T/z + T/z^2 + G/z^3 + C/z^4 + A/z^5 + T/z^6 + G/z^7 + C/z^8
               + A/z^9 + T/z^10 + G/z^11 + C/z^12        (7)
     factorised := (T*z^11 + T*z^10 + G*z^9 + C*z^8 + A*z^7 + T*z^6 + G*z^5 + C*z^4
                    + A*z^3 + T*z^2 + G*z + C)/z^12        (8)
(ii)
     DNA(z) := 2/z + 3/z^2 + 5/z^3 + 7/z^4 + 1/z^5 + 3/z^6 + 5/z^7 + 7/z^8
               + 1/z^9 + 3/z^10 + 5/z^11 + 7/z^12        (9)
     factorised := (2*z^11 + 3*z^10 + 5*z^9 + 7*z^8 + z^7 + 3*z^6 + 5*z^5 + 7*z^4
                    + z^3 + 3*z^2 + 5*z + 7)/z^12        (10)
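One way to read the results of this section: a sequence made of k copies of a block of length p
can be written as the block polynomial times 1 + z^p + ... + z^(p(k-1)). For p = 4 and k = 3 that
second factor is z^8 + z^4 + 1 = (z^2 + z + 1)(z^2 - z + 1)(z^4 - z^2 + 1), i.e. the three
symbol-free factors in (4) and (6). The Python sketch below checks this identity and finds the
smallest repeat period of a sequence. The helpers smallest_period and poly_mul are illustrative
names, and the period test is only a stand-in for, not a reimplementation of, Maple's
factorisation.

```python
def smallest_period(seq):
    """Smallest p such that seq is a whole number of copies of seq[:p]."""
    n = len(seq)
    for p in range(1, n + 1):
        if n % p == 0 and seq[:p] * (n // p) == seq:
            return p
    return n  # only reached for an empty sequence


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists, lowest power first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


# (z^2 + z + 1)(z^2 - z + 1)(z^4 - z^2 + 1) should equal z^8 + z^4 + 1.
product = poly_mul(poly_mul([1, 1, 1], [1, -1, 1]), [1, 0, -1, 0, 1])
print(product)                                          # [1, 0, 0, 0, 1, 0, 0, 0, 1]

regular = "ATGC" * 3
errant = "T" + regular[1:]                   # first base changed, as in (7)
to_numerals = str.maketrans("ATGC", "1357")  # the alternative symbol set

print(smallest_period(regular))                         # 4
print(smallest_period(regular.translate(to_numerals)))  # 4
print(smallest_period(errant))                          # 12
```

The period is the same for either symbol set, and a single errant base pushes it up to the full
length of the sequence, which matches the loss of the repeat factors seen in (8) and (10).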
3. Information Content Measurements
Infinite repeats of the (ATGC) set will be adopted as the standard measure of information. Two
sequence-algebraic formulations of this sequence are developed, as given by equations (11) and
(12). The open formulation is adopted since it yields a lower information content than the
closed form. Unlike in number theory, derivations of closed forms are rarely possible, so open
forms will be adopted as the norm. In counting the number of ascii characters in an expression,
each intrinsic function name is counted as a single ascii character and the semicolon at the end
of the expression is not counted. Counting is done only to the right of the := sign.
Information is measured in Godch, where 1 Godch is equivalent to seven ascii characters [5].
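The counting rule just stated can be sketched in Python as follows. This is one reading of the
rule, not the author's own program: the list INTRINSICS of names treated as single characters,
and the decision to count every remaining character (spaces included), are assumptions, so hand
counts quoted in the text may differ by a character or so depending on how an expression is
typed.

```python
import re

# Names treated as one ascii character each (assumed list of Maple intrinsics).
INTRINSICS = ("expand", "series", "infinity", "sum")
_INTRINSIC_RE = re.compile(r"\b(?:" + "|".join(INTRINSICS) + r")\b")


def ascii_count(expression):
    """Count characters to the right of ':=', ignoring the final semicolon."""
    rhs = expression.split(":=", 1)[-1].strip()
    if rhs.endswith(";"):
        rhs = rhs[:-1]
    # Collapse each intrinsic name to a single placeholder character.
    rhs = _INTRINSIC_RE.sub("#", rhs)
    return len(rhs)


def godch(n_chars):
    """Convert an ascii character count to Godch (1 Godch = 7 characters)."""
    return n_chars / 7


only_a = "sum(A/z^i,i=1..ub);"
closed = ("DNA(z):=expand(z^4*series(A/(z*(z^4-1))+T/(z^2*(z^4-1))"
          "+G/(z^3*(z^4-1))+C/(z^4*(z^4-1)),z=infinity,ub));")

print(ascii_count(only_a), round(godch(ascii_count(only_a)), 4))   # 16 2.2857
print(ascii_count(closed), round(godch(ascii_count(closed)), 4))   # 78 11.1429
```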
(i) Open form:
DNA(z):=sum(A/z^(4*i)+T/z^(4*i+1)+G/z^(4*i+2)+C/z^(4*i+3),i=0..ub);
Information content = 55 ascii characters = 7.8571428 Godch.
     DNA(z) := A + T/z + G/z^2 + C/z^3 + A/z^4 + T/z^5 + G/z^6 + C/z^7
               + A/z^8 + T/z^9 + G/z^10 + C/z^11        (11)
(ii) Closed form:
DNA(z):=expand(z^4*series(A/(z*(z^4-1))+T/(z^2*(z^4-1))+G/(z^3*(z^4-1))+C/(z^4*(z^4-1)),z=infinity,ub));
Information content = 78 ascii characters = 11.142857 Godch.
     DNA(z) := A/z + T/z^2 + G/z^3 + C/z^4 + A/z^5 + T/z^6 + G/z^7 + C/z^8
               + A/z^9 + T/z^10 + G/z^11 + C/z^12 + z^4*O(1/z^17)        (12)
With Godch defined as above, a pure nucleotide sequence containing only A should carry less
information than the standard reference. The test in equation (13) shows that its
information content is about 29% (16/55) of that of the standard reference given by equation
(11). The scaling seems reasonable.
sum(A/z^i,i=1..ub);
Information content = 16 ascii characters = 2.2857142 Godch.
     A/z + A/z^2 + A/z^3 + A/z^4 + A/z^5 + A/z^6 + A/z^7 + A/z^8
     + A/z^9 + A/z^10 + A/z^11 + A/z^12        (13)
Next, we test whether a palindromic sequence has an information content different from that of
the standard reference. Palindromic sticky ends are used in the docking of the free ends of cut
plasmids in recombinant DNA experiments [1,2]. ub is the upper bound, which is usually
taken as infinity.
DNA(z):=sum(A/z^(4*i)+T/z^(4*i+1)+G/z^(4*i+2)+C/z^(4*i+3)+C/z^(4*i+4)+G/z^(4*i+5)+A/z^(4*i+6)+T/z^(4*i+7),i=0..ub);
Information content = 103 ascii characters = 14.714285 Godch, which is 1.873 times that of the
standard reference. The scaling is again quite reasonable.
     DNA(z) := A + T/z + G/z^2 + C/z^3 + C/z^4 + G/z^5 + A/z^6 + T/z^7
               + A/z^4 + T/z^5 + G/z^6 + C/z^7 + C/z^8 + G/z^9 + A/z^10 + T/z^11
               + A/z^8 + T/z^9 + G/z^10 + C/z^11 + C/z^12 + G/z^13 + A/z^14 + T/z^15        (14)
Now we test a restriction site. The restriction enzyme NotI recognises the sequence GCGGCCGC
and cuts it at GC^GGCCGC [1], where the ^-mark shows where the enzyme cuts. In sequence
algebra, we introduce a 0 numerator where this cut mark is located so that the site can be
recognised algebraically in manipulations. It is inconceivable that Nature would not have some
tagging method by which the restriction enzyme could recognise this site.
     Site(z):=G+C/z^1+0/z^2+G/z^3+G/z^4+C/z^5+C/z^6+G/z^7+C/z^8;        (15)
Information content = 49 ascii characters = 7 Godch.
If this site is buried in a parent sequence, the information content increases drastically,
which makes site recognition quite a feat. Imagine being asked to find a knot in a tangled
ball of wool after rough treatment by your kitten.
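To illustrate the 0-marker idea on a site buried in a parent sequence, the Python sketch below
scans a parent strand for the recognition sequence and inserts a 0 at the cut position
GC^GGCCGC quoted above. The function name mark_cut_sites and the example parent strand are
made up for illustration; only the top strand is scanned.

```python
SITE = "GCGGCCGC"   # recognition sequence quoted in the text
CUT_OFFSET = 2      # GC^GGCCGC: the cut falls after the second base


def mark_cut_sites(parent, site=SITE, cut_offset=CUT_OFFSET):
    """Return parent with a '0' inserted at the cut position of every site."""
    marked = []
    i = 0
    while i < len(parent):
        if parent.startswith(site, i):
            # Copy the site with the 0 marker placed where the enzyme cuts.
            marked.append(site[:cut_offset] + "0" + site[cut_offset:])
            i += len(site)
        else:
            marked.append(parent[i])
            i += 1
    return "".join(marked)


print(mark_cut_sites("ATGCGGCCGCTA"))   # ATGC0GGCCGCTA
print(mark_cut_sites("ATATAT"))         # ATATAT (no site, nothing marked)
```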
4. Information Of Protein Sequences
Remember that each amino acid of a protein is encoded by a codon made up of three bases, i.e.,
a trinucleotide [1]. Codons specify some 20 different types of amino acid, and a protein is
specified by its amino-acid sequence. Protein sequences therefore have a base of 20 instead of
4, and the information content of a protein sequence increases dramatically. The symbol set for
the 20 amino acids will be the numeral/alphabetic set given below:
     {12345 6789A BCDEF GHJKL}        (16)
0 and I are not used because the former will not appear in a Maple sequence and the latter could
be mixed up with the numeral 1. It is probably more convenient to define another reference
standard for protein sequences instead of breaking each one down into 3 nucleotides, although
the two are almost linearly proportional for long sequences.
Protein Sequence Standard
Protein(z):=1+2/z+3/z^3+4/z^4+5/z^5+ 6/z^6+7/z^7+8/z^8+9/z^9+A/z^10+
B/z^11+C/z^12+D/z^13+E/z^14+F/z^15+ G/z^16+H/z^17+J/z^18+K/z^19+L/z^20;
Information content = 126 ascii characters. Now 1 Godch = 7 ascii characters. Therefore we
define a new unit for the measurement of protein information as 126/7 = 18 Godch = 1 Chdog
(Chaitin + Godel in reverse). The polite pronunciation of Godch is gosh and that of Chdog is
shdog. One can thus use this conversion to measure protein information in either Godch or
Chdog. The scaling is also about right, but it cannot be an exact multiple of 20 in view of
Maple's programming syntax.
     Protein(z) := 1 + 2/z + 3/z^3 + 4/z^4 + 5/z^5 + 6/z^6 + 7/z^7 + 8/z^8 + 9/z^9
                   + A/z^10 + B/z^11 + C/z^12 + D/z^13 + E/z^14 + F/z^15 + G/z^16
                   + H/z^17 + J/z^18 + K/z^19 + L/z^20        (17)
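The unit conversions defined in this section amount to two divisions. The Python sketch below
spells them out and also checks that the symbol set (16) really has 20 distinct symbols; the
function names to_godch and to_chdog are illustrative.

```python
ASCII_PER_GODCH = 7      # 1 Godch = 7 ascii characters
ASCII_PER_CHDOG = 126    # 1 Chdog = 126 ascii characters = 18 Godch

# Symbol set (16): digits 1-9 and letters A-L, without 0 and I.
AMINO_SYMBOLS = "123456789ABCDEFGHJKL"


def to_godch(n_chars):
    """Convert an ascii character count to Godch."""
    return n_chars / ASCII_PER_GODCH


def to_chdog(n_chars):
    """Convert an ascii character count to Chdog."""
    return n_chars / ASCII_PER_CHDOG


print(len(set(AMINO_SYMBOLS)))   # 20
print(to_godch(126))             # 18.0
print(to_chdog(126))             # 1.0
print(round(to_chdog(55), 4))    # 0.4365 (the 55-character DNA standard, in Chdog)
```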
5. What is the point of all this?
A good question. Probably only molecular geneticists could answer it. The author thinks that
geneticists are more interested in segments of a sequence with high information content. It is
like reading a newspaper: an eye-catching headline helps circulation. So all DNA and
protein sequences should be recorded with indices of information content in Chdog, with
segments of high information content highlighted. What can be done with these is another
matter. Presumably, a stretch of intron is more informative than a long stretch of repetitive
DNA sequence. Software could therefore be developed to search for the stretch with the maximum
information content. Conceptually this is done as follows:
                       max info content
                       found here!!
{xxxxxxxxxxxxxxxxxxxxxxATGCCGUGUGUxxxxxxxxxxxxxxxxxxxxx}        (18)
This is only a conjecture. Presumably, DNA sequences from two closely related species should
have about the same information content. For example, scientists currently find that
orangutans are genetically more remote from Homo sapiens than chimpanzees are. Will
computations of information content be able to confirm such findings?
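The search for the most informative stretch pictured in (18) can be sketched as a sliding-window
scan. Because the Godch count of an arbitrary stretch requires finding its shortest formulation,
the Python below swaps in a much cruder stand-in score, the number of distinct 3-base
substrings in each window. This score, the window width and the test sequence are all
assumptions made for illustration, not the paper's measure.

```python
def distinct_kmers(window, k=3):
    """Number of different length-k substrings in the window (stand-in score)."""
    return len({window[i:i + k] for i in range(len(window) - k + 1)})


def richest_window(seq, width, k=3):
    """Return (start, score) of the first window with the highest score."""
    best_start, best_score = 0, -1
    for start in range(len(seq) - width + 1):
        score = distinct_kmers(seq[start:start + width], k)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# A varied stretch buried between two monotonous runs of A.
seq = "A" * 10 + "ATGCCGTGAC" + "A" * 10
start, score = richest_window(seq, width=10)
print(start, score, seq[start:start + 10])
```

Ties are resolved in favour of the earliest window, so the reported start can sit a base or two
before the varied stretch when the neighbouring bases add no repeated substrings.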
6. Conclusions
This paper shows that DNA and protein sequences are just like number sequences and can be
analysed using sequence algebra. The information content of these sequences can be
computed and calibrated against agreed standards. The two standards recommended are:
DNA sequences: 1 Godch = 7 ascii characters.
Protein sequences: 1 Chdog = 126 ascii characters = 18 Godch.
All calibrations should be done using the same algebraic syntax. For example, in this
paper we use Maple programming syntax to represent the sequence-algebraic formulations. It
is not mandatory, but other symbolic packages will probably give about the same results.
The author is of the opinion that there is no point in attempting to find closed formulations
for DNA or protein sequences, although if these could be found, it could result in savings in
printing cost. Currently some very unwieldy volumes of genetic sequences are on display in
libraries, and they get out of date fast. If closed forms can be found, then all we need to do
is email the new formulations to scientists all over the world. Is this a pipedream? Well,
dreams are the stuff that humans are made of.
(Postscript: The reference section contains all current papers in sequence algebra some of which may be
remotely connected with DNA sequences. Most of these can be downloaded free from this URL
site.)
7. References
1. Weaver R.F. and Hedrick P.W. : Basic Genetics, Second edition (1995), Wm.C.Brown
Publishers pp 131 to 177.
2. Patrick DeGeest's Palindrome Page
3. Pickover C.A. (editor): Visualizing Biological Information, please read "Representation of
Biological Sequences Using Posint Geometry Analysis" by Huen Y.K. pp 165 to 182. World Scientific. 1995.
4. Huen Y.K.
5. Huen Y.K.
6. Huen Y.K.
7. Huen Y.K.
8. Huen Y.K.
9. Huen Y.K.
10. Huen Y.K.
11. Huen Y.K.
12. Huen Y.K.
13. Huen Y.K.
14. Huen Y.K.: A Matrix Map for Prime and Non-prime Numbers, INT. J. Math. Educ. Sci.
Technol., 1994, VOL. 25, NO.6, pp 913-920.
15. Huen Y.K.: Some Interesting Properties Of The Natural Number System, Int. J. Math. Educ.
Sci. Technol., 1996, VOL.27, NO. 5, 685-691.
16. Huen Y.K.: Visual algebra and its applications, INT. J. Math. Educ. Sci. Technol.,1996,
VOL.??, NO.?, ???-??? (In the press as proof paper mes 100421).
17. Huen Y.K.: Twin primes revisited: INT. J. Math. Educ. Sci. Technol., 1997, VOL.??,NO.?,
???-???. (In the press as proof paper mes 100488).