Empirical Curve Fitting Of DNA Sequences

by

Huen Y.K.

CAHRC, P.O.Box 1003, Singapore 911101
http://web.singnet.com.sg/~activweb/
Related URL-sites: http://web.singnet.com.sg/~huens/
email: huens@mbox3.singnet.com.sg

(A short communication - 1st released: 14/12/97 )


Abstract

Sequence algebra has proved useful in developing generating functions for predicting number sequences [1,2,3]. But there are sequences outside number theory which are not so amenable to mathematical analysis. A glaring example is that of DNA sequences. In spite of much useful information gathered from experimental observations and the theoretical explanations that followed, the mathematical basis of DNA sequences has so far eluded scientists. In this paper the author suggests how short DNA sequences can be curve-fitted by generating functions. Empirical curve-fitting enables one to write down closed-form generating functions for short DNA sequences. From this exercise the author suggests a canonical form for the DNA-generating function. At the moment this is still a trial-and-error procedure, as an algebraic method which can deliver the whole generating function without manual intervention has not been discovered.


1. Introduction

The paper begins by describing the 4-alphabet language problem. A real example is that of the four distinct nucleotides C, G, A and T, which are the building blocks of DNA sequences. This language is much easier to handle than the English language with its 26 letters, but the principle developed here can be extended to any other alphabet-based language, even hanyu pinyin Chinese, provided one is willing to accept the increased complexity. We liken DNA sequences to linguistic strings because genes are made of DNA, and each gene contains information for three functions: replication, production of proteins, and the accumulation of mutations [5].

The 4-alphabet language problem is based on the following premise:

"Any DNA sequence can be curve-fitted by the summation of four periodic sequences based on the order variables A,C,G, and T plus a fudging expression which may or may not be periodic."

The proposed canonical generating function for DNA-sequences is given by equation (1).

$$\mathrm{DNA} := \frac{1}{\mathit{AA}^{p}-1} + \frac{1}{\mathit{CC}^{q}-1} + \frac{1}{\mathit{GG}^{r}-1} + \frac{1}{\mathit{TT}^{s}-1} + (\text{Fudged Terms or Sequence Expression}) \tag{1}$$

The indices p, q, r, and s are prime integers. Equation (1), when expanded as a series, is expected to generate the required DNA-sequence, curve-fitted by the four periodic nucleotide sequences plus the fudged expression.
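As a worked step, each periodic term of equation (1) can be expanded as a geometric series in inverse powers of its order variable. The symbol x below is only an illustrative placeholder for any one of AA, CC, GG, or TT, and the expansion is meant formally (or for |x| > 1):

```latex
\frac{1}{x^{p}-1}
  = \frac{x^{-p}}{1-x^{-p}}
  = \sum_{k=1}^{\infty} \frac{1}{x^{kp}}
  = \frac{1}{x^{p}} + \frac{1}{x^{2p}} + \frac{1}{x^{3p}} + \cdots
```

Read this way, a term with index p contributes one symbol at every p-th position of the string.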

To solve this problem, it is necessary to establish a theorem based on periodicities between two sequences which is stated as follows:

Theorem On Periodic Sequences: If two periodic sequences have uniform intervals of p and q units, where p and q are distinct primes, then the first overlap between the two sequences occurs at the (p*q)th term.

Proof: The two periodic sequences are written as F1(z) and F2(z) in equation (2), where p and q are distinct primes, and F3(z) collects the terms they share. No term position below p*q is divisible by both p and q, so the two sequences do not overlap there. At the (p*q)th term, however, both sequences complete an exact number of intervals: F1(z) has repeated q times and F2(z) has repeated p times, so the two sequences overlap on that term. Further overlaps occur at multiples of p*q, but the range beyond p*q is not useful in curve-fitting.

F1(z) = 1 + 1/z^p + 1/z^(2*p) + 1/z^(3*p) + ...
F2(z) = 1 + 1/z^q + 1/z^(2*q) + 1/z^(3*q) + ...
F3(z) = 1 + 1/z^(p*q) + 1/z^(2*p*q) + 1/z^(3*p*q) + ...    (2)

Q.E.D.
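The theorem can be checked numerically for small cases. The following is a minimal sketch, not part of the original paper: the function name first_overlap and the search limit are assumptions made for illustration, and the prime pairs are taken from the example in section 2.

```python
from math import gcd
from typing import Optional


def first_overlap(p: int, q: int, limit: int) -> Optional[int]:
    """Smallest position <= limit that is a multiple of both p and q, else None."""
    multiples_of_p = set(range(p, limit + 1, p))
    for n in range(q, limit + 1, q):
        if n in multiples_of_p:
            return n
    return None


# Pairs of distinct primes, so each pair is coprime.
for p, q in [(11, 13), (13, 17), (17, 19)]:
    assert gcd(p, q) == 1
    overlap = first_overlap(p, q, limit=2 * p * q)
    print(f"p={p}, q={q}: first overlap at {overlap}, p*q={p * q}")
```

For each pair the first shared position found by brute force coincides with p*q, and a search that stops below p*q finds no overlap at all.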


2. Method Of Curve-Fitting

The implication of the above theorem for a 4-alphabet string is that the longest string which can be curve-fitted by the sum of nonoverlapping periodic sequences must be shorter than p*q, where p and q are the prime intervals of the two sequences with the largest intervals. The general procedure for curve-fitting a DNA-sequence is outlined as follows:

Step (1): Do a population count of the nucleotides A, C, G, and T.

Step (2): Assign suitably large prime intervals p and q to the nucleotides with the largest and second largest populations respectively. This determines the limit of the (p*q)th term.

Step (3): For the remaining two nucleotide sequences, choose prime intervals r and s with decreasing magnitudes. This lowers the limit on the usable length of the string to r*s. This is why in step (2) we must choose p and q suitably large to compensate for this reduction.

Step (4): Assume the four nucleotides in order of population sizes to be A, C, G, and T. Assume the prime intervals chosen to be 19, 17, 13, and 11. Lay out A with successive prime intervals of 19. Do the same for C, G, and T, making sure that the upper limit of 11*7 is not exceeded, as shown in figure 1. Suppose the real DNA sequence to be curve-fitted is (TGCATGTTCAGTCGT) but the sequence generating function in equation (2) generates (TGCATGTCAGTCGT); then you have to fudge the 5th term in equation (1) by adding one more T-term, which you can insert somewhere between T and C in figure 1 (zone underlined by circumflexes).

===================A==================A===================A===
=================C=================C=================C========
============G============G============G===============G=======
==========T==========T==========T==========T=============T====
.........................................................................^^^

Fig. 1 - Layout of the proposed periodic sequences A, C, G, and T.
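Steps (1) and (4) can be sketched in a few lines of code. This is a minimal illustration and not the author's software: the names INTERVALS, LIMIT, layout, and read_sequence are assumptions, and LIMIT = 55 is chosen only so that the layout reproduces the 14-symbol sequence quoted in step (4).

```python
from collections import Counter

# Prime intervals from step (4). LIMIT is an assumption of this sketch.
INTERVALS = {"A": 19, "C": 17, "G": 13, "T": 11}
LIMIT = 55


def layout(intervals, limit):
    """Place each nucleotide at every multiple of its interval up to limit."""
    placed = {}
    for nucleotide, step in intervals.items():
        for position in range(step, limit + 1, step):
            if position in placed:
                # Two periodic sequences claim the same position.
                raise ValueError(f"overlap at position {position}")
            placed[position] = nucleotide
    return placed


def read_sequence(placed):
    """Concatenate the placed nucleotides in order of position."""
    return "".join(placed[pos] for pos in sorted(placed))


target = "TGCATGTTCAGTCGT"
print(Counter(target))          # step (1): population counts
placed = layout(INTERVALS, LIMIT)
print(read_sequence(placed))    # periodic layout before any fudging
```

With these intervals the layout reads TGCATGTCAGTCGT, one T short of the target, which is the gap the fudged term is meant to close.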

Therefore the generating function for the DNA-sequence is:

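$$\mathrm{DNA} := \frac{1}{\mathit{AA}^{19}-1} + \frac{1}{\mathit{CC}^{17}-1} + \frac{1}{\mathit{GG}^{13}-1} + \frac{1}{\mathit{TT}^{11}-1} - \frac{1}{\mathit{TT}^{34}} \tag{3}$$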

When expanded this will yield the required DNA-sequence:

$$A := \frac{1}{\mathit{AA}^{19}} + \frac{1}{\mathit{AA}^{38}} + \frac{1}{\mathit{AA}^{57}} + O\!\left(\frac{1}{\mathit{AA}^{76}}\right) \tag{4}$$

$$C := \frac{1}{\mathit{CC}^{17}} + \frac{1}{\mathit{CC}^{34}} + \frac{1}{\mathit{CC}^{51}} + \cdots \tag{5}$$

$$G := \frac{1}{\mathit{GG}^{13}} + \frac{1}{\mathit{GG}^{26}} + \frac{1}{\mathit{GG}^{39}} + \frac{1}{\mathit{GG}^{52}} + \cdots \tag{6}$$

$$T := \frac{1}{\mathit{TT}^{11}} + \frac{1}{\mathit{TT}^{22}} + \frac{1}{\mathit{TT}^{33}} + \frac{1}{\mathit{TT}^{44}} + \frac{1}{\mathit{TT}^{55}} + \cdots \tag{7}$$

When summed as in equation (3), these expansions correctly yield the required DNA-sequence:

$$\begin{aligned}
& O\!\left(\frac{1}{\mathit{TT}^{77}}\right) + O\!\left(\frac{1}{\mathit{AA}^{76}}\right) + O\!\left(\frac{1}{\mathit{CC}^{85}}\right) + O\!\left(\frac{1}{\mathit{GG}^{78}}\right) \\
& + \frac{1}{\mathit{TT}^{11}} + \frac{1}{\mathit{GG}^{13}} + \frac{1}{\mathit{CC}^{17}} + \frac{1}{\mathit{AA}^{19}} + \frac{1}{\mathit{TT}^{22}} + \frac{1}{\mathit{GG}^{26}} + \frac{1}{\mathit{TT}^{33}} + \frac{1}{\mathit{CC}^{34}} + \frac{1}{\mathit{AA}^{38}} \\
& + \frac{1}{\mathit{GG}^{39}} + \frac{1}{\mathit{TT}^{44}} + \frac{1}{\mathit{CC}^{51}} + \frac{1}{\mathit{GG}^{52}} + \frac{1}{\mathit{TT}^{55}} + \cdots
\end{aligned} \tag{8}$$
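The summation above, together with a fudged term, can be mimicked by bookkeeping on the exponents. The sketch below is only an illustration under stated assumptions: the names expand and read_off are hypothetical, single letters stand in for the order variables AA, CC, GG, TT, the cutoff of 55 matches the terms written out above, and the correction used here (one extra T term at the unused exponent 30) is a choice made for this sketch, not the correction term written in equation (3).

```python
from collections import defaultdict


def expand(intervals, limit, fudge=()):
    """Coefficient of each 1/X^n term with n <= limit.

    intervals: {symbol: p}, one entry per periodic term 1/(X^p - 1)
    fudge: iterable of (coefficient, symbol, exponent) correction terms
    """
    coeffs = defaultdict(int)
    for symbol, p in intervals.items():
        for n in range(p, limit + 1, p):
            coeffs[(n, symbol)] += 1
    for coefficient, symbol, n in fudge:
        coeffs[(n, symbol)] += coefficient
    # Terms whose coefficient cancels to zero are dropped.
    return {key: c for key, c in coeffs.items() if c != 0}


def read_off(coeffs):
    """List the surviving symbols by increasing exponent."""
    ordered = sorted(coeffs.items())
    return "".join(symbol * c for (_, symbol), c in ordered if c > 0)


intervals = {"A": 19, "C": 17, "G": 13, "T": 11}
plain = expand(intervals, limit=55)
fudged = expand(intervals, limit=55, fudge=[(1, "T", 30)])  # assumed fudge term
print(read_off(plain))
print(read_off(fudged))
```

Any unused exponent between 26 and 33 would place the extra T in the same slot of the string, so the value 30 carries no special meaning.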

3. Summary

Theoretically, you can curve-fit any given DNA-sequence, provided you are aware of the following problems:

(1) DNA-sequences are unlike number sequences, which are perfectly periodic. Even pi was found by the author to be periodic [3]. Because of these imperfections, you have to use fudging factors most of the time, and these may increase with the lengths of the DNA-sequences. Most of the DNA-sequences for the generation of proteins are fairly short. The long ones might have subdomains which can be curve-fitted separately.

(2) The larger the intervals chosen, the greater the flexibility in arranging the individual component sequences for the four nucleotides. There may be some merit in fixing a large length to be used for most curve fittings. This sets a standard reference and may make it easier to develop software for the expansion of the generating functions.

(3) The use of fudging factors might reflect a lack of understanding of the mathematical fundamentals of DNA-sequences. Since there are four order variables in A, C, G, and T, in sequence algebra we should have called this a 4-dimensional sequence, but it has been force-fitted into a 1-dimensional sequence in line with conventional practice in molecular biology. In this process of simplification, we might have overlooked some important properties.

(4) Whilst the absolute orders of order variables are very important in number theory, this is not so in DNA-sequences. The concatenation of the A, C, G, and T symbols does not reflect the 3-dimensional structures of these molecules. Fortunately, symbolic software does not display zero terms, so one can still preserve the DNA sequence order after modifications and expansions. But one must keep in mind that this is not the way we treat orders in number theory.

(5) On the question of whether such formulations are useful to molecular biologists, that has to be left to time. One immediate benefit is in data compression of DNA-sequence information. Imagine sending your colleague the closed-form generating function instead of the actual sequence and leaving it to him to expand at his own premises. Imagine how the hefty volumes of DNA-sequences could be greatly reduced in the number of printed pages. Any amendments to DNA-sequences can be effected by simply publishing a new generating function.
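The fudging factors mentioned in item (1) are found by trial and error in this paper. One way to list them mechanically is to compare the periodic layout with the real sequence using a standard diff. The sketch below uses Python's difflib for that purpose; this is a swapped-in technique and not the author's method, the name fudge_edits is hypothetical, and difflib does not guarantee the smallest possible set of edits.

```python
from difflib import SequenceMatcher


def fudge_edits(generated, target):
    """List (operation, index_in_generated, symbols) edits turning generated into target."""
    edits = []
    matcher = SequenceMatcher(None, generated, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            edits.append(("delete", i1, generated[i1:i2]))
        if tag in ("insert", "replace"):
            edits.append(("insert", i1, target[j1:j2]))
    return edits


# Sequences from step (4) of section 2.
edits = fudge_edits("TGCATGTCAGTCGT", "TGCATGTTCAGTCGT")
print(edits)
print(len(edits), "fudge term(s) needed")
```

The number of edits gives a rough count of the fudged terms a given choice of intervals would require, which could be used to compare candidate intervals.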


4. References

[Comments: Papers included in this section are relevant for background reading, but not all are directly referenced in the main text.]

1. Huen Y.K.: A matrix map for primes and nonprimes, Int. J. Math. Educ. Sci. Technol., 1994, Vol.25, No.6, pp 913-920.

2. Huen Y.K.: Visual algebra and its applications, Int. J. Math. Educ. Sci. Technol., 1997, Vol.28, No.3, pp 333-344.

3. Huen Y.K.: Is Pie Periodic?, Int. J. Math. Educ. Sci. Technol., 199?, Vol.??, No.?, pp ???-??? (in the press).

4. Editor: Clifford A. Pickover: Representation of Biological Sequences Using Point Geometry Analysis, by Huen Y.K., pp 165-182, 1995, World Scientific Publishing, Singapore.

5. Weaver R.F. & Hedrick P.W.: Basic Genetics (second edition), WCB Publishers, 1995, Dubuque, pp 14.

6. A Simple Introduction To Sequence Algebra - by Huen Y.K. (date released: 15.3.97) (38 KBytes, 11*A4 pages).

7. The Canonical Generating Function or CGF(z) ... - by Huen Y.K. (date released: 27.5.97) (24 KBytes, 7*A4s).

8. Information Contents Of Number Theoretic Functions - by Huen Y.K. (date released: 29.5.97) (21.5 KBytes, 7*A4s).

9. Visual Solutions Of Number Theoretic Problems ..... - by Huen Y.K. (date released: 3.6.97) (38.3 KBytes, 10*A4s).

10. Information Contents Of Hypothetical DNA Sequences - by Huen Y.K. (date released: 27.6.97) (26.0 KBytes, 8*A4s).

11. Generating Functions - Closed Forms vs Open Forms - by Huen Y.K. (date released: 1.10.97) (21 KBytes).

12. Generating Large Odd Composite With Two Prime Factors - by Huen Y.K. (date released: 3.10.97) (13.5 KBytes).

13. Spontaneous Generation Of Number Sequences In A Primeval Number Soup - by Huen Y.K. (date released: 23.11.97) (bios.htm 1 K, bios.alx 5 K).

14. A Sketch Of Test-Tube Evolution In A Primeval Number Soup - by Huen Y.K. (date released: 25.11.97) (paper35.htm 1 K).