Empirical Curve Fitting Of DNA Sequences
by
Huen Y.K.
CAHRC, P.O.Box 1003, Singapore 911101
http://web.singnet.com.sg/~activweb/
Related URL-sites: http://web.singnet.com.sg/~huens/
email: huens@mbox3.singnet.com.sg
(A short communication - 1st released: 14/12/97 )
Abstract
Sequence algebra has proved useful in developing generating functions for
predicting number sequences [1,2,3]. But there are sequences outside number theory that are
not so amenable to mathematical analysis. A glaring example is that of DNA sequences.
Despite the wealth of useful information gathered from experimental observations and the
theoretical explanations that followed, the mathematical basis of DNA sequences has so far
eluded scientists. In this paper the author suggests how short DNA sequences can be
curve-fitted by generating functions. Empirical curve-fitting enables one to write down
closed-form generating functions for short DNA sequences. From this exercise the author
suggests a canonical form for the DNA-generating function. At the moment this is still a
trial-and-error procedure, as an algebraic method that can deliver the whole generating
function without manual intervention has not been discovered.
1. Introduction
The paper begins by describing the 4-alphabet language problem. A real example is that
of the four distinct nucleotides C, G, A and T, which are the building blocks of DNA
sequences. This language is much easier to handle than English with its 26
letters, but the principle developed here can be extended to any other alphabet-based
language, even hanyu pinyin Chinese, provided one is willing to accept the increased
complexity. We liken DNA sequences to linguistic strings because genes are made of DNA,
and each gene contains information for three functions: replication,
production of proteins, and the accumulation of mutations [5].
The 4-alphabet language problem is based on the following premise:
"Any DNA sequence can be curve-fitted by the summation of four periodic sequences based
on the order variables A,C,G, and T plus a fudging expression which may or may not be
periodic."
The proposed canonical generating function for DNA-sequences is given by
equation (1).
$$\mathrm{DNA} := \frac{1}{\mathrm{AA}^{p}-1} + \frac{1}{\mathrm{CC}^{q}-1} + \frac{1}{\mathrm{GG}^{r}-1} + \frac{1}{\mathrm{TT}^{s}-1} + (\text{fudged terms or sequence expression}) \tag{1}$$
The indices p, q, r, and s are prime integers. When expanded as a series, equation (1) is expected to generate the required
DNA-sequence, curve-fitted by the four periodic nucleotide sequences plus the fudged expression.
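To make the series expansion of one periodic term concrete, a worked expansion may help. It is a sketch that assumes the expansion is taken formally in inverse powers of the order variable; the summation index k is introduced here for illustration and does not appear in the original.

```latex
\frac{1}{\mathrm{AA}^{p}-1}
  = \frac{\mathrm{AA}^{-p}}{1-\mathrm{AA}^{-p}}
  = \sum_{k=1}^{\infty} \frac{1}{\mathrm{AA}^{kp}}
  = \frac{1}{\mathrm{AA}^{p}} + \frac{1}{\mathrm{AA}^{2p}} + \frac{1}{\mathrm{AA}^{3p}} + \cdots
```

Under this reading, the A-term places an A at every p-th position, and the C, G, and T terms do the same with intervals q, r, and s.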
To solve this problem, it is necessary to establish a theorem on the periodicities of
two sequences, stated as follows:
Theorem On Periodic Sequences: If two periodic sequences have uniform intervals of p
and q units, where p and q are distinct primes, then the first overlap between these two
sequences after the leading term occurs at the (p*q)th term.
Proof: The two periodic sequences are shown in equation (2) as F1(z) and F2(z), where p and q
are distinct primes, and F3(z) collects the terms they share. A positive exponent appears in
both sequences only if it is divisible by both p and q. Since p and q are distinct primes, the
smallest such exponent is p*q, so no term before the (p*q)th term overlaps. At the (p*q)th term
both sequences are in exact multiples: F1(z) will have q repeats whilst F2(z) will have p
repeats, and so the two sequences overlap on that term. There are further overlaps at
multiples of p*q, but the range beyond p*q is not useful in curve-fitting.
$$F_1(z) = 1 + \frac{1}{z^{p}} + \frac{1}{z^{2p}} + \frac{1}{z^{3p}} + \cdots$$
$$F_2(z) = 1 + \frac{1}{z^{q}} + \frac{1}{z^{2q}} + \frac{1}{z^{3q}} + \cdots$$
$$F_3(z) = 1 + \frac{1}{z^{pq}} + \frac{1}{z^{2pq}} + \frac{1}{z^{3pq}} + \cdots \tag{2}$$
Q.E.D.
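The theorem can be checked numerically for small cases. The sketch below is only an illustration by direct search, not part of the original proof; the function name first_overlap is hypothetical.

```python
def first_overlap(p, q):
    """Return the smallest positive exponent shared by two periodic
    sequences with intervals p and q, found by direct search."""
    multiples_of_p = set(range(p, p * q + 1, p))
    for exponent in range(q, p * q + 1, q):
        if exponent in multiples_of_p:
            return exponent  # p*q is always shared, so the loop always returns


# Distinct primes: the first overlap falls exactly on p*q.
for p, q in [(11, 13), (13, 17), (17, 19)]:
    assert first_overlap(p, q) == p * q

# Non-prime intervals can overlap earlier, which is why primes are required.
print(first_overlap(4, 6))  # prints 12, well below 4*6 = 24
```

The last line shows why the intervals are restricted to primes: intervals sharing a common factor collide before p*q.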
2. Method Of Curve-Fitting
The implication of the above theorem, when applied to a 4-alphabet string, is that the longest
string which can be curve-fitted by the sum of non-overlapping periodic sequences must be
shorter than p*q, where p and q are the prime intervals of the two sequences with the largest
intervals. The general procedure for curve-fitting a DNA-sequence is outlined as follows:
Step (1): Count the population of each of the nucleotides A, C, G, and T.
Step (2): Assign suitably large prime intervals p and q to the two nucleotides with the
largest and second-largest populations respectively. This determines the limit of the (p*q)th
term.
Step (3): For the remaining two nucleotide sequences, choose prime intervals r and s
of decreasing magnitude. This further reduces the usable length of the
string to r*s, which is why p and q in step (2) must be chosen suitably large to compensate
for this reduction.
Step (4): Assume the four nucleotides in order of population size to be A, C, G, and T.
Assume the prime intervals chosen to be 19, 17, 13, and 11. Lay out A at successive prime
intervals of 19. Do likewise for C, G, and T, making sure that the upper limit of 11*7 is not
exceeded, as shown in figure 1. Suppose the real DNA sequence to be curve-fitted is
(TGCATGTTCAGTCGT) but the four periodic terms of equation (1) generate
(TGCATGTCAGTCGT). You then have to fudge the 5th term in equation (1) by adding one
more T-term, which you can insert somewhere between T and C in figure 1 (zone underlined by
carets).
==================A==================A==================A
================C================C================C======
============G============G============G============G=====
==========T==========T==========T==========T==========T==
................................^^
Fig. 1 - Layout of the proposed periodic sequences A, C, G, and T.
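Steps (1) to (3) can be sketched in a few lines. This is a minimal illustration, assuming the example sequence and the prime intervals 19, 17, 13, and 11 quoted in step (4); the function names are hypothetical, and the choice of "suitably large" primes remains a manual judgement that the code does not automate.

```python
from collections import Counter


def population_counts(sequence):
    """Step (1): count each nucleotide, most frequent first."""
    counts = Counter(sequence)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def usable_length(primes):
    """Steps (2)-(3): the two smallest distinct prime intervals are the
    first pair to collide, so their product bounds the overlap-free length."""
    smallest, second = sorted(primes)[:2]
    return smallest * second


target = "TGCATGTTCAGTCGT"           # example sequence from step (4)
ranking = population_counts(target)
limit = usable_length([19, 17, 13, 11])

print(ranking)  # [('T', 6), ('G', 4), ('C', 3), ('A', 2)]
print(limit)    # 143
```

For these intervals the first collision between any two component sequences would occur at 11*13 = 143, far beyond the positions used in figure 1.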
Therefore the generating function for the DNA-sequence is:
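$$\mathrm{DNA} := \frac{1}{\mathrm{AA}^{19}-1} + \frac{1}{\mathrm{CC}^{17}-1} + \frac{1}{\mathrm{GG}^{13}-1} + \frac{1}{\mathrm{TT}^{11}-1} - \frac{1}{\mathrm{TT}^{34}} \tag{3}$$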
When expanded, the four periodic terms give the component sequences:
$$A := \frac{1}{\mathrm{AA}^{19}} + \frac{1}{\mathrm{AA}^{38}} + \frac{1}{\mathrm{AA}^{57}} + O\!\left(\frac{1}{\mathrm{AA}^{76}}\right) \tag{4}$$
$$C := \frac{1}{\mathrm{CC}^{17}} + \frac{1}{\mathrm{CC}^{34}} + \frac{1}{\mathrm{CC}^{51}} + \cdots \tag{5}$$
$$G := \frac{1}{\mathrm{GG}^{13}} + \frac{1}{\mathrm{GG}^{26}} + \frac{1}{\mathrm{GG}^{39}} + \frac{1}{\mathrm{GG}^{52}} + \cdots \tag{6}$$
$$T := \frac{1}{\mathrm{TT}^{11}} + \frac{1}{\mathrm{TT}^{22}} + \frac{1}{\mathrm{TT}^{33}} + \frac{1}{\mathrm{TT}^{44}} + \frac{1}{\mathrm{TT}^{55}} + \cdots \tag{7}$$
When summed according to equation (3), these expansions yield the required DNA-sequence:
$$O\!\left(\frac{1}{\mathrm{TT}^{77}}\right) + O\!\left(\frac{1}{\mathrm{AA}^{76}}\right) + O\!\left(\frac{1}{\mathrm{CC}^{85}}\right) + O\!\left(\frac{1}{\mathrm{GG}^{78}}\right) + \frac{1}{\mathrm{TT}^{11}} + \frac{1}{\mathrm{GG}^{13}} + \frac{1}{\mathrm{CC}^{17}} + \frac{1}{\mathrm{AA}^{19}} + \frac{1}{\mathrm{TT}^{22}} + \frac{1}{\mathrm{GG}^{26}} + \frac{1}{\mathrm{TT}^{33}} + \frac{1}{\mathrm{CC}^{34}} + \frac{1}{\mathrm{AA}^{38}}$$
$$+ \frac{1}{\mathrm{GG}^{39}} + \frac{1}{\mathrm{TT}^{44}} + \frac{1}{\mathrm{CC}^{51}} + \frac{1}{\mathrm{GG}^{52}} + \frac{1}{\mathrm{TT}^{55}} + \cdots \tag{8}$$
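The summation can be reproduced by merging the four periodic layouts and reading the letters in order of increasing exponent. The sketch below is an illustration only: the function names and the cut-off at exponent 55 are assumptions, and the fudge is applied here as a plain string insertion rather than as an algebraic term.

```python
def periodic_layout(intervals, limit):
    """Map each occupied position up to `limit` to its nucleotide."""
    layout = {}
    for letter, interval in intervals.items():
        for position in range(interval, limit + 1, interval):
            if position in layout:
                raise ValueError(f"overlap at position {position}")
            layout[position] = letter
    return layout


def read_sequence(layout):
    """Read the letters in order of increasing position; gaps are skipped."""
    return "".join(layout[position] for position in sorted(layout))


intervals = {"A": 19, "C": 17, "G": 13, "T": 11}
periodic = read_sequence(periodic_layout(intervals, 55))
print(periodic)  # TGCATGTCAGTCGT

# Fudge step (manual): one extra T after the 7th letter, i.e. between
# the T at exponent 33 and the C at exponent 34.
fudged = periodic[:7] + "T" + periodic[7:]
print(fudged)    # TGCATGTTCAGTCGT
```

The unfudged output matches the 14-letter sequence quoted in step (4), and the single inserted T recovers the 15-letter target.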
3. Summary
Theoretically, you can curve-fit any given DNA-sequence provided you are aware of the
following problems:
(1) DNA-sequences, unlike number sequences, are not perfectly periodic. (Even pi was
found by the author to be periodic [3].) Because of these imperfections, you have to use
fudging factors most of the time, and their number may increase with the length of the
DNA-sequence. Most of the DNA-sequences for the generation of proteins are fairly short. The
long ones might have subdomains which can be curve-fitted separately.
(2) The larger the intervals chosen, the greater the flexibility in arranging the individual
component sequences for the four nucleotides. There may be some merit in fixing a large
standard length for most curve-fittings. This sets a standard reference and may make it
easier to develop software for expanding the generating functions.
(3) The use of fudging factors might reflect a lack of understanding of the mathematical
fundamentals of DNA-sequences. Since there are four order variables, A, C, G, and T,
sequence algebra would call this a 4-dimensional sequence, but it has been
force-fitted into a 1-dimensional sequence in line with conventional practice in molecular
biology. In this process of simplification, we might have overlooked some important
properties.
(4) Whilst the absolute orders of the order variables are very important in number
theory, this is not so in DNA-sequences. The concatenation of A, C, G, and T symbols
does not reflect the 3-dimensional structures of these molecules. Fortunately,
symbolic software does not display zero terms, so one can still preserve the
DNA sequence order after modifications and expansions. But one must keep in
mind that this is not the way orders are treated in number theory.
(5) Whether such formulations are useful to molecular biologists is a question
that has to be left to time. One immediate benefit is data
compression of DNA-sequence information. Imagine sending a colleague the
closed-form generating function instead of the actual sequence and leaving it to them
to expand it at their end. Imagine how the hefty volumes of DNA-sequences
could be greatly reduced in the number of printed pages. Any amendment
to a DNA-sequence can be effected by simply publishing a new generating function.
4. References
[Comments: Papers included in this section are relevant for background reading,
but not all are directly referenced in the main text.]
========================================================
1. Huen Y.K.:
A matrix map for primes and nonprimes, Int. J. Math. Educ. Sci. Technol., 1994,
Vol.25, No.6, 913-920.
=======================================================
2. Huen Y.K.:
Visual algebra and its applications, Int. J. Math. Educ. Sci. Technol., 1997,
Vol.28, No.3, 333-344.
=======================================================
3. Huen Y.K.:
Is Pie Periodic?, Int. J. Math. Educ. Sci. Technol., 199?,
Vol.??, No.?, ???-??? (in the press).
=======================================================
4. Editor: Clifford A. Pickover:
Representation of Biological Sequences Using Point Geometry Analysis,
by Huen Y.K., pp 165-182, 1995, World Scientific Publishing, Singapore.
=======================================================
5. Weaver R.F. & Hedrick P.W.:
Basic Genetics, (second edition), WCB Publishers, 1995, Dubuque, pp 14.
=======================================================
6. A Simple Introduction To Sequence
Algebra - by Huen Y.K.
(date release: 15.3.97) (38 KBytes, 11*A4 pages).
========================================================
7. The Canonical Generating Function
or CGF(z) ... - by Huen Y.K.
(date released : 27.5.97) (24 KBytes, 7*A4s).
========================================================
8. Information Contents Of Number
Theoretic Functions - by Huen Y.K. (date released : 29.5.97) (21.5 KBytes, 7*A4s).
========================================================
9. Visual Solutions Of Number Theoretic
Problems ..... - by Huen Y.K. (date released : 3.6.97) (38.3 KBytes, 10*A4s).
========================================================
10. Information Contents Of
Hypothetical DNA Sequences - by Huen Y.K. (date released : 27.6.97) (26.0 KBytes, 8*A4s).
========================================================
11. Generating Functions -
Closed Forms vs Open Forms
- by Huen Y.K. (date released : 1.10.97 ) (21 Kbytes).
========================================================
12. Generating Large
Odd Composite With Two Prime Factors
- by Huen Y.K. (date released : 3.10.97 ) (13.5 Kbytes).
========================================================
13. Spontaneous
Generation Of Number Sequences In A Primeval Number Soup - by Huen Y.K.
(date released : 23.11.97) (bios.htm 1 K, bios.alx 5 K).
========================================================
14. A Sketch Of Test-Tube
Evolution In A Primeval Number Soup - by Huen Y.K.
(date released : 25.11.97) (paper35.htm 1 K).
======================== END OF PAPER =========