Sounds like curve clustering with nonlinear (or functional data)
regression models. Simple to fit in R, http://www.r-project.org/.

best,
-tony

"MyungHo Kim" <bio_front@hotmail.com> writes:

> Currently DNA micro-array experiments are being done extensively in genomic
> research, and analyzing those data is a challenging problem. Here we would
> like to suggest a representation method for the data of a single gene
> expression over a certain period. This is a summary of the paper, whose full
> text is available at www.biofront.biz or http://arxiv.org/abs/cs.CC/0305008.
>
> Note: DNA micro-array techniques convert the expression rates into densities
> of stained images, which may be recorded as a series of numbers. All the
> experiments and phenomena suggest that the numbers fluctuate over time.
>
> Step 1. Function representation fitting with data
>
> Since we need a periodic-looking, fluctuating function, it is wise to
> start with sin(t) or cos(t), while, for increasing and decreasing effects,
> the exponential function is a feasible choice. Consequently, a
> possible function for representing the changes of the expression rate would
> be of the form exp(kt)sin(mt) or exp(kt)cos(mt), where k and m are real
> constants. Exponential functions have their share in science,
> especially in modeling problems and theories, so it is not surprising that
> they make an appearance here. Although we are
> comfortable and familiar with the function, once in a while one might ask
> the following question: why do exponential functions appear so often?
> Although the answer to this question is not obvious, I would like to justify
> my choice of the exponential function here. The clue could be in a
> profound experimental fact, i.e., that radioactive decay is measured in
> terms of half-life, the number of years required for half of the atoms in a
> sample of radioactive material to decay.
>
> Mathematically this is expressed as y' = ky
>
> Here y represents the mass and k is the rate constant. Then the general
> solution has the form y = Cexp(kt), where t represents time and C is a
> constant. This might be extended to describe some sort of life expectancy of
> a certain phenomenon or behavior.
>
> Step 2. Determining coefficients of the functional representation
>
> Once we fix a candidate function, it remains to determine the coefficients,
> the C's, k's, etc., for each set of data. This may be achieved by using the
> least-squares principle, with accuracy set to our own standard.
> Commercial software, such as SAS and SPSS, is available for such calculations,
> including R-squared. The least-squares method, the most popular one for
> fitting a curve/function to experimental data, finds the coefficients of a
> function of a given type by minimizing the sum of squared errors, or
> deviations. More precisely, given a set of data points (x1, y1), (x2,
> y2), ..., (xn, yn) and a candidate function f with undetermined coefficients,
> the unknowns in f are determined so that the sum of squares of
> the errors, i.e., the differences between f(xi) and yi, is minimized. Note
> that the exponent two is the smallest that is good for further manipulation,
> i.e., we can use many tools from calculus involving differentiation, unlike
> with the absolute value function, | |.
>
> Step 3. Vector representation for machine learning method
>
> From the first two steps, we have obtained a "functional" representation
> for the observed data, i.e., a function fitted to the data. Consequently,
> with respect to the fixed type, each function is represented as a set of
> coefficients calculated in step two. Suppose the data fit well with the
> model, y = Cexp(kt)sin(mt),
> where y is the expression rate. Then we can say that the set of numbers (C,
> k, m) represents the object, A.
> In other words, the object may be identified
> with the triple (C, k, m), analogous to students and their corresponding ID
> numbers. For the general case, i.e., the expression rates of n
> genes, we get (C1, k1, m1), (C2, k2, m2), ..., (Cn, kn, mn), which form a
> vector.
>
> Conclusion: Once we have the vector representation, by using SVMs, we would
> get a criterion, which may be applied to the diagnosis of a disease, etc.
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>

-- 
A.J. Rossini / rossini@u.washington.edu / rossini@scharp.org
Biomedical/Health Informatics and Biostatistics, University of Washington.
Biostatistics, HVTN/SCHARP, Fred Hutchinson Cancer Research Center.
FHCRC: 206-667-7025 (fax=4812)|Voicemail is pretty sketchy/use Email
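
For anyone who wants to try the three steps from the quoted summary, here is a minimal sketch. It is in Python rather than R purely for illustration, and it is not the paper's own procedure: the names `model` and `fit_damped_sine`, the grid ranges, the two gene labels, and the synthetic noise-free data are all assumptions made up for this example. It relies on the model y = C exp(kt) sin(mt) being linear in C, so for each trial pair (k, m) the least-squares C has a closed form, and a plain grid search over (k, m) stands in for a proper nonlinear optimizer.

```python
import math


def model(t, c, k, m):
    """Candidate function from Step 1: y = C * exp(k*t) * sin(m*t)."""
    return c * math.exp(k * t) * math.sin(m * t)


def fit_damped_sine(ts, ys, k_grid, m_grid):
    """Step 2: least-squares fit of (C, k, m) by grid search over (k, m).

    For fixed (k, m) the model is linear in C, so the C that minimizes
    the sum of squared errors is sum(y*g) / sum(g*g), where
    g(t) = exp(k*t) * sin(m*t). The grid point with the smallest sum of
    squared errors is returned.
    """
    best = None
    for k in k_grid:
        for m in m_grid:
            g = [math.exp(k * t) * math.sin(m * t) for t in ts]
            gg = sum(v * v for v in g)
            if gg == 0.0:
                continue  # basis is zero at every sample, C undetermined
            c = sum(y * v for y, v in zip(ys, g)) / gg
            sse = sum((y - c * v) ** 2 for y, v in zip(ys, g))
            if best is None or sse < best[0]:
                best = (sse, c, k, m)
    if best is None:
        raise ValueError("no usable (k, m) pair on the grid")
    sse, c, k, m = best
    return c, k, m, sse


# Hypothetical, noise-free "expression" series for two made-up genes.
ts = [0.25 * i for i in range(1, 41)]
true_params = {"gene_a": (2.0, -0.3, 1.5), "gene_b": (1.0, 0.1, 0.7)}
k_grid = [i / 20 for i in range(-20, 21)]  # k from -1.0 to 1.0, step 0.05
m_grid = [i / 10 for i in range(1, 31)]    # m from 0.1 to 3.0, step 0.1

fitted = {}
for name, (c, k, m) in true_params.items():
    ys = [model(t, c, k, m) for t in ts]
    fitted[name] = fit_damped_sine(ts, ys, k_grid, m_grid)[:3]

# Step 3: concatenate the per-gene triples into one feature vector.
feature_vector = [value for name in sorted(fitted) for value in fitted[name]]
print(feature_vector)
```

With real, noisy data a nonlinear least-squares routine (for example `nls` in R) would normally replace the grid search, and a vector like `feature_vector` is what would be handed to the SVM in the concluding step.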