[Bioclusters] A representation of gene expression rate data over spaced time intervals

MyungHo Kim bioclusters@bioinformatics.org
Tue, 27 May 2003 06:36:10 -0400


DNA microarray experiments are now used extensively in genomic research,
and analyzing the resulting data is a challenging problem. Here we suggest
a method for representing the expression data of a single gene over a
certain period. This is a summary of the paper, whose full text is
available at www.biofront.biz or http://arxiv.org/abs/cs.CC/0305008 .

Note: DNA microarray techniques convert expression rates into densities of
stained images, which may be recorded as a series of numbers. Experiments
and observed phenomena indicate that these numbers fluctuate over time.

Step 1. Choosing a functional representation that fits the data

Since we need a periodic-looking, fluctuating function, it is natural to
start with sin(t) or cos(t), while for increasing and decreasing effects
the exponential function is a feasible choice. Consequently, a possible
function for representing the changes of the expression rate would be of
the form exp(kt)sin(mt) or exp(kt)cos(mt), where k and m are real
constants. Exponential functions appear throughout science, especially in
modeling, so it is not surprising that they appear here. Familiar as the
function is, one might still ask: why do exponential functions appear so
often? Although the answer is not obvious, I would like to justify my
choice of the exponential function here. A clue lies in a profound
experimental fact: radioactive decay is measured in terms of half-life,
the number of years required for half of the atoms in a sample of
radioactive material to decay.

Mathematically this is expressed as y' = ky

Here y represents the mass and k is the rate constant (negative for
decay). The general solution is y = Cexp(kt), where t represents time and
C is a constant. This might be extended to describe some sort of life
expectancy of a certain phenomenon or behavior.
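
To make the link between y' = ky and the half-life concrete, here is a
minimal Python sketch. The function names and the rate constant k = -0.1
are illustrative assumptions, not taken from the paper. For k < 0 the
solution y = Cexp(kt) halves every ln(2)/(-k) time units, and multiplying
the exponential by sin(mt) gives the fluctuating candidate form above.

```python
import math

def decay(t, C, k):
    """Solution of y' = k*y with y(0) = C."""
    return C * math.exp(k * t)

def half_life(k):
    """Time needed for y to halve; only meaningful for decay (k < 0)."""
    if k >= 0:
        raise ValueError("half-life is defined only for k < 0")
    return math.log(2) / -k

def damped_sine(t, C, k, m):
    """Candidate model from Step 1: C * exp(k*t) * sin(m*t)."""
    return C * math.exp(k * t) * math.sin(m * t)

k = -0.1                       # hypothetical rate constant
T = half_life(k)               # ln(2) / 0.1
y_at_T = decay(T, 8.0, k)      # close to 4.0, i.e. half of the initial 8.0
print(T, y_at_T)
```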

Step 2. Determining coefficients of the functional representation

Once we fix a candidate function, it remains to determine the coefficients,
the C's and k's etc., for each set of data. This may be achieved by the
least-squares principle, with the required accuracy set to our own
standard. Commercial software such as SAS and SPSS is available for such
calculations and reports goodness-of-fit measures such as R-squared. The
least-squares method, the most popular one for fitting a curve or function
to experimental data, finds the coefficients of a function of a given type
by minimizing the sum of the squares of the errors, or deviations. More
precisely, given a set of data points (x1, y1), (x2, y2), ..., (xn, yn)
and a candidate function f with undetermined coefficients, the unknowns in
f are determined so that the sum of the squares of the differences between
f(xi) and yi is minimized. Note that two is the smallest exponent that is
good for further manipulation: unlike the absolute value function | |, the
square is differentiable, so we can use many tools from calculus.
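
The least-squares idea can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the data are synthetic,
the grids for k and m are arbitrary, and the coarse grid search stands in
for the nonlinear optimizers that packages such as SAS or SPSS actually
use. For fixed k and m the model is linear in C, so setting the derivative
of the squared-error sum with respect to C to zero gives C in closed form.

```python
import math

def model(t, C, k, m):
    """Candidate function y = C * exp(k*t) * sin(m*t)."""
    return C * math.exp(k * t) * math.sin(m * t)

def sse(ts, ys, C, k, m):
    """Sum of squared errors between the model and the data."""
    return sum((model(t, C, k, m) - y) ** 2 for t, y in zip(ts, ys))

def best_C(ts, ys, k, m):
    """Closed-form least-squares C for fixed k and m (from d(SSE)/dC = 0)."""
    g = [math.exp(k * t) * math.sin(m * t) for t in ts]
    denom = sum(gi * gi for gi in g)
    if denom == 0.0:
        return 0.0
    return sum(gi * yi for gi, yi in zip(g, ys)) / denom

def fit(ts, ys, k_grid, m_grid):
    """Try every (k, m) on the grid; return (sse, C, k, m) with smallest sse."""
    best = None
    for k in k_grid:
        for m in m_grid:
            C = best_C(ts, ys, k, m)
            err = sse(ts, ys, C, k, m)
            if best is None or err < best[0]:
                best = (err, C, k, m)
    return best

def r_squared(ys, err):
    """R-squared = 1 - SSE / total sum of squares about the mean."""
    mean = sum(ys) / len(ys)
    total = sum((y - mean) ** 2 for y in ys)
    return 1.0 - err / total

# Synthetic, noise-free data generated from known (hypothetical) coefficients.
ts = [0.5 * i for i in range(20)]
ys = [model(t, 2.0, -0.2, 1.5) for t in ts]

k_grid = [-i / 20 for i in range(11)]      # 0.0, -0.05, ..., -0.5
m_grid = [i / 10 for i in range(5, 26)]    # 0.5, 0.6, ..., 2.5

err, C, k, m = fit(ts, ys, k_grid, m_grid)
print(C, k, m, r_squared(ys, err))
```

With real, noisy measurements the recovered values would only approximate
the underlying ones, and a finer grid or a proper nonlinear solver would
be needed.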

Step 3. Vector representation for machine learning method

From the first two steps, we have obtained a "functional" representation
of the observed data, i.e., a function fitting the data. Consequently,
with respect to the fixed type, each function is represented by the set of
coefficients calculated in Step 2. Suppose the observed data for an object
A fit well with the model
y = Cexp(kt)sin(mt),
where y is the expression rate. Then we can say that the set of numbers
(C, k, m) represents the object A. In other words, the object may be
identified with the triple (C, k, m), analogous to students and their ID
numbers. In the general case, i.e., the expression rates of n genes, we
get (C1, k1, m1), (C2, k2, m2), ..., (Cn, kn, mn), which together form a
vector.
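
One possible way to assemble such a vector is sketched below in Python.
The helper name and the coefficient values are hypothetical, and
concatenating the triples in gene order is an assumption about the layout,
which the summary does not spell out.

```python
def to_feature_vector(triples):
    """Flatten [(C1, k1, m1), ..., (Cn, kn, mn)] into one list of length 3n."""
    vec = []
    for C, k, m in triples:
        vec.extend([C, k, m])
    return vec

# Hypothetical fitted coefficients for three genes measured in one sample.
sample = [(2.0, -0.2, 1.5), (0.7, 0.05, 0.9), (1.3, -0.4, 2.1)]
x = to_feature_vector(sample)
print(x)
```

Each sample then becomes a point in 3n-dimensional space, provided the
genes are listed in the same order for every sample.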

Conclusion: Once we have the vector representation, we can apply SVMs
(support vector machines) to obtain a criterion, which may be used for the
diagnosis of a disease, etc.