Let's data mining together: [Research] Assumption of random distribution in motifs

Buhler, J. Chapter 3. Finding motifs using random projection. pp. 90–129. PhD thesis. 2002

This chapter of the thesis deals with a problem from bioinformatics that is similar to ours: motif discovery. Buhler defines a consensus model and a weight matrix model, each of which represents a set of genomic sequences by a single representative sequence. Intuitively, the consensus model places the most likely element at each position of the sequence, where "most likely" means the most frequent base at position j. In that setting, all the sequences have the same length. In the random projection method, each random representation is hashed and stored in a bucket, and the buckets with the most elements are considered to hold similar vectors.

In our specific case, we want to model our problem in a similar way: given a set of time series of different lengths, we want to extract a time series that captures the most important attributes of the set and use it as the representative time series of that set. I have been thinking of two methods for finding these important, common attributes without picking random elements of the sequence: Fourier and wavelet transforms. In pursuit of a paper that has dealt with this problem, I looked at Matias et al. (1998), cited below.
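To make the two ideas from Buhler's chapter concrete, here is a minimal Python sketch. It is not Buhler's implementation. `consensus` takes the most frequent symbol in each column of a set of equal-length sequences, and `random_projection_buckets` hashes every window of length `l` by `k` randomly chosen positions. The function names, the toy sequences, and the parameters `l`, `k`, and `seed` are my own assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def consensus(sequences):
    """Most frequent symbol at each position of equal-length sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    result = []
    for j in range(length):
        column = Counter(s[j] for s in sequences)
        # Ties are resolved in favour of the symbol seen first in the column.
        result.append(column.most_common(1)[0][0])
    return "".join(result)


def random_projection_buckets(sequence, l, k, seed=0):
    """Hash every window of length l by k randomly chosen positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for start in range(len(sequence) - l + 1):
        window = sequence[start:start + l]
        key = "".join(window[p] for p in positions)
        buckets[key].append(start)  # remember where the window came from
    return positions, buckets


motif_instances = ["ACGTAC", "ACGTTC", "ACCTAC", "TCGTAC"]
print(consensus(motif_instances))  # ACGTAC

text = "ACGTACGTTCACCTAC"
positions, buckets = random_projection_buckets(text, l=6, k=3)
# Buckets holding several start positions are the candidates worth inspecting.
for key, starts in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
    print(key, starts)
```

Note that both functions require windows of one fixed length, which is exactly the limitation that matters for our variable-length time series.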

Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD international Conference on Management of Data (Seattle, Washington, United States, June 01 - 04, 1998). A. Tiwary and M. Franklin, Eds. SIGMOD '98. ACM, New York, NY, 448-459.

Rather than choosing random elements from a vector, this paper proposes using wavelet coefficients to estimate the most important attributes in a traditional database schema. Since each sequence is represented as an overall average plus a set of detail coefficients, the authors approximate the histogram of the attribute values by dropping the least significant detail coefficients. I like this approach because it takes local changes in the set of vectors into account. Again, this work only considers vectors of the same length.
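The "average plus detail coefficients" idea can be sketched with an unnormalized Haar decomposition based on pairwise averages and half-differences. This is a simplified illustration, not the exact procedure of Matias et al.: the function names are mine, the input length is assumed to be a power of two, and `keep_largest` ranks the detail coefficients by raw magnitude without any per-level normalization.

```python
def haar_decompose(values):
    """Return [overall average, detail coefficients from coarse to fine]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser details go in front
        current = averages
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        finer = []
        for a, d in zip(current, diffs):
            finer.extend([a + d, a - d])
        pos += len(current)
        current = finer
    return current


def keep_largest(coeffs, m):
    """Keep the average and the m largest-magnitude details; zero the rest."""
    ranked = sorted(range(1, len(coeffs)),
                    key=lambda i: abs(coeffs[i]), reverse=True)
    keep = {0, *ranked[:m]}
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(signal)
print(coeffs)  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
approx = haar_reconstruct(keep_largest(coeffs, 3))
print(approx)
```

With all coefficients kept, `haar_reconstruct` returns the original signal. With only a few kept, the result is a piecewise-constant approximation that still follows the local changes.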

In summary, both Fourier and wavelet approaches could be considered for estimating the most important elements of a set of time series of variable length. We would like to generate the most representative time series for each set and use it as a centroid. New time series are then compared against these representative time series.
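One possible way to set this up, sketched below under my own assumptions: resample every series to a common length by linear interpolation, describe it by its first few (low-frequency) Fourier coefficients, and average those coefficient vectors to get a centroid. The resampling step, the function names, and the values `n=16` and `k=4` are not taken from either paper. They are placeholders to show the shape of the idea.

```python
import cmath


def resample(series, n):
    """Linearly interpolate `series` onto n evenly spaced points (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(series) == 1:
        return [float(series[0])] * n
    out = []
    for i in range(n):
        t = i * (len(series) - 1) / (n - 1)
        lo = int(t)
        hi = min(lo + 1, len(series) - 1)
        frac = t - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def dft_prefix(series, k):
    """First k DFT coefficients (lowest frequencies), scaled by 1/n."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
            for t, x in enumerate(series)) / n
        for f in range(k)
    ]


def features(series, n=16, k=4):
    return dft_prefix(resample(series, n), k)


def centroid(series_set, n=16, k=4):
    """Average the low-frequency coefficients of all series in the set."""
    feats = [features(s, n, k) for s in series_set]
    return [sum(column) / len(feats) for column in zip(*feats)]


def distance(a, b):
    """Euclidean distance between two complex coefficient vectors."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Two toy sets of different-length series: rising ramps and flat lines.
ramps = [[i / (m - 1) for i in range(m)] for m in (5, 8, 12)]
flats = [[0.5] * m for m in (6, 9)]
ramp_centroid = centroid(ramps)
flat_centroid = centroid(flats)

query = features([i / 9 for i in range(10)])  # a new ramp of length 10
print(distance(query, ramp_centroid), distance(query, flat_centroid))
```

In this toy case the new ramp lands closer to the ramp centroid than to the flat one. Whether resampling is acceptable for real data, where the length itself may carry information, is an open question.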

Next activities:

- Measure the normality assumption of the time series using the Q-Q plot method. We want to check whether the time series are normally distributed, as is assumed in random projection.

- Implement a method to generate a representative time series from a set of variable-length time series.
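For the first activity, a Q-Q check can be sketched with the Python standard library alone. The points of the plot are the sorted, standardized values paired with standard normal quantiles, and the correlation between the two gives a rough numeric summary (close to 1 when the points lie on a straight line). The plotting positions `(i + 0.5) / n`, the function names, and the synthetic test data are my own choices for this sketch, and no particular cut-off value for the correlation is implied.

```python
import random
from statistics import NormalDist, mean, pstdev


def qq_points(values):
    """Pair standard normal quantiles with the sorted, standardized values."""
    n = len(values)
    mu, sigma = mean(values), pstdev(values)
    ordered = sorted((v - mu) / sigma for v in values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, ordered


def qq_correlation(values):
    """Pearson correlation of the Q-Q points; near 1 suggests normality."""
    xs, ys = qq_points(values)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(500)]
skewed = [rng.expovariate(1.0) for _ in range(500)]

print(qq_correlation(gaussian))  # expected to be close to 1
print(qq_correlation(skewed))    # expected to be noticeably lower
```

To get the actual plot, the pairs returned by `qq_points` can be passed to any scatter-plot tool and compared against the diagonal.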

Monday, March 2, 2009
