Monday, March 2, 2009

[Research] Assumption of random distribution in motifs

Buhler, J. Chapter 3. Finding motifs using random projection. pp. 90–129. PhD thesis. 2002 

This chapter of the thesis models a similar problem found in bioinformatics: motif discovery. Buhler defines a consensus model and a weight matrix model to represent a set of genomic sequences by a single representative sequence. Intuitively, the consensus model puts at each position of the sequence the most likely element, where "likely" means the most frequent base at position j. In that case, all the sequences have the same length. In our specific case, we want to model our problem in a similar way: given a set of time series of different lengths, we want to extract a time series that captures the most important attributes of this set and use it as a representative of the set. Buhler, in contrast, hashes each random representation and stores it in a bucket; the buckets with more elements are considered to store similar vectors. I was thinking of two methods to find these important and common attributes without taking random elements from the sequence: Fourier and wavelet transforms. In pursuit of a paper that has dealt with this problem, I looked at the following paper.
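To make the consensus model concrete, here is a minimal sketch of my own (not Buhler's code): at each position j, take the most frequent symbol across all equal-length sequences.

```python
from collections import Counter

def consensus(sequences):
    """Consensus model: at each position j, keep the most frequent
    element across all sequences (all sequences must have equal length)."""
    length = len(sequences[0])
    assert all(len(s) == length for s in sequences)
    return "".join(
        Counter(s[j] for s in sequences).most_common(1)[0][0]
        for j in range(length)
    )

print(consensus(["ACGT", "ACGA", "TCGT"]))  # → ACGT
```

This only works for equal-length sequences, which is exactly the restriction we would need to lift for variable-length time series.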

Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD international Conference on Management of Data (Seattle, Washington, United States, June 01 - 04, 1998). A. Tiwary and M. Franklin, Eds. SIGMOD '98. ACM, New York, NY, 448-459. 

Rather than choosing random elements from a vector, this paper proposes wavelet coefficients to estimate the most important attributes in a traditional database scheme. Since each sequence is defined as an average plus a set of detail coefficients, the authors estimate the most important attributes by removing some of the less significant details from the histogram representation of the attribute values. I like this approach since it considers local changes in the set of vectors. Again, this work also considers vectors of the same length.
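The "average plus details" idea can be sketched with a plain Haar decomposition; this is my own toy version of the general technique, not the paper's algorithm. Dropping the smallest-magnitude details gives the kind of compressed representation the authors build histograms from.

```python
def haar_decompose(x):
    """Full Haar decomposition of a length-2^k signal into an overall
    average followed by detail coefficients, coarse to fine."""
    coeffs = []
    while len(x) > 1:
        avg = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
        det = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
        coeffs = det + coeffs  # coarser levels end up to the left
        x = avg
    return x + coeffs  # [overall average, details coarse→fine]

def compress(x, keep):
    """Zero out all but the `keep` largest-magnitude detail coefficients,
    i.e. discard the less significant details."""
    c = haar_decompose(x)
    by_magnitude = sorted(range(1, len(c)), key=lambda i: abs(c[i]), reverse=True)
    for i in by_magnitude[keep:]:
        c[i] = 0.0
    return c

print(haar_decompose([2, 4, 6, 8]))  # → [5.0, -2.0, -1.0, -1.0]
print(compress([2, 4, 6, 8], keep=1))  # → [5.0, -2.0, 0.0, 0.0]
```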

In summary, Fourier and wavelet approaches can similarly be considered to estimate the most important elements of a set of time series of variable length. We would like to generate the most representative time series and use it as a centroid; new time series would then be compared against these representative time series.

Next activities:

- Measure the normality assumption of the time series using the Q-Q plot method. We want to check whether the time series are normally distributed, as is assumed in random projection.

- Implement a method to generate a representative time series from a set of variable-length time series.
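For the first activity, a numeric stand-in for the visual Q-Q plot is the correlation between the sorted sample and the theoretical normal quantiles: values near 1 support normality. This is a sketch on synthetic data, not our motion series.

```python
import random
from statistics import NormalDist, mean

def qq_correlation(sample):
    """Correlation between sorted sample values and theoretical normal
    quantiles; a value near 1 supports the normality assumption."""
    n = len(sample)
    xs = sorted(sample)
    # theoretical quantiles at plotting positions (i + 0.5) / n
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = mean(xs), mean(qs)
    cov = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sq = sum((q - mq) ** 2 for q in qs) ** 0.5
    return cov / (sx * sq)

random.seed(0)
normal_sample = [random.gauss(0, 1) for _ in range(500)]
print(qq_correlation(normal_sample))  # close to 1 for normal data
```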

Monday, January 26, 2009

[Research] Data streams as time series


I borrowed a book from the Library Service: 

Learning from Data Streams: Processing Techniques in Sensor Networks. Joao Gama, M.M. Gaber. Springer, November 2007, 255 pages, ISBN 354073678


This book has interesting chapters on time series processing and stream mining. I read many of them; the most interesting are:

Chapter 3, Data Stream Processing. In this chapter the authors present the constraints of working with data streams, such as efficient use of memory, limited storage, and real-time processing, and cover traditional techniques to deal with these problems: landmark windows, sliding windows, tilted windows, sampling, data summaries, hashing, wavelets, and histograms.
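As a reminder of how simple these techniques can be, here is a sketch (mine, not from the book) of the sliding-window idea: bounded memory, constant-time updates.

```python
from collections import deque

class SlidingWindowMean:
    """Maintain the mean of the last `size` stream elements with O(1)
    work per update and bounded memory, as stream processing requires."""
    def __init__(self, size):
        self.window = deque()
        self.size = size
        self.total = 0.0

    def update(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)

w = SlidingWindowMean(3)
print([w.update(x) for x in [1, 2, 3, 4]])  # → [1.0, 1.5, 2.0, 3.0]
```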

Chapter 9, Clustering Techniques in Sensor Networks. This chapter presents some of the open problems in data stream clustering, including time series: incremental versions of clustering techniques that avoid assuming we know the time series in their full extent, clustering of entire time series rather than only subsequences, and discovery of structures in data over time. It is interesting to note that defining representatives such as centroids and medoids with the goal of reducing dimensionality is not always natural. Hence, random projection seems a viable alternative for clustering time series.

J. Lin and D. Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, May 2003.

Here random projection is employed to reduce dimensionality in a traditional information retrieval task, document classification. I read this paper quickly; it mainly serves to understand the role of Singular Value Decomposition alongside the random projection technique, since both are extensively used to reduce dimensionality in the vector space. It is interesting to note that orthogonal random vectors are desirable for building the projection, but they are expensive to obtain. However, since a high-dimensional space contains many more almost-orthogonal vectors than orthogonal ones, the random vectors may be close enough to orthogonal to offer a reasonable approximation of the original vectors. I plan to check that assumption on our time series to see if random projection can be used as a clustering method.
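The near-orthogonality claim is easy to check empirically; this small experiment (my own sketch) draws random Gaussian vectors in high dimension and measures the largest pairwise cosine.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two vectors; 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

random.seed(1)
dim = 1000
vecs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(20)]
cosines = [abs(cosine(vecs[i], vecs[j]))
           for i in range(20) for j in range(i + 1, 20)]
# in high dimension, random vectors are nearly orthogonal
print(max(cosines))  # small, on the order of 1/sqrt(dim)
```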

Monday, January 12, 2009

[Research] Incremental Clustering of Time Series and Motif Discovery

Iterative Incremental Clustering of Time Series. Jessica Lin, Michail Vlachos, Eamonn J. Keogh, Dimitrios Gunopulos. Advances in Database Technology. 2004

In this paper, the authors present an interesting way of clustering time series. They mention some of the problems I ran into when clustering human motion time series: slow convergence, long running times, unstable clusters. As the authors state, k-means has a complexity of O(kNrD), with k = number of clusters, N = number of objects, r = number of iterations, and D = number of dimensions. However, all the algorithms (including the one in this paper) assume that the time series have the same length. Reading this paper, I understand that the repeated comparisons of variable-length time series make convergence, and the computation of centroids, difficult. The approach introduced in this paper is based on the multiresolution properties of Haar wavelets, and I think it could be studied further to preserve convergence when clustering time series of different lengths.
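The multiresolution idea the paper builds on can be sketched as follows (my own illustration, not the paper's implementation): Haar averaging yields each series at every resolution level, so clustering can start on the coarsest level and refine, reusing centroids as seeds for the next level.

```python
def resolutions(x):
    """All resolution levels of a length-2^k series via pairwise Haar
    averaging, from coarsest (single average) to the full series.
    Multiresolution clustering runs level by level over this list."""
    levels = [list(x)]
    while len(x) > 1:
        x = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
        levels.append(x)
    return levels[::-1]

print(resolutions([1, 3, 5, 7]))  # → [[4.0], [2.0, 6.0], [1, 3, 5, 7]]
```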

Scaling and time warping in time series querying. Fu, A. W., Keogh, E., Lau, L. Y., and Ratanamahatana, C. A. 2005. In Proceedings of the 31st international Conference on Very Large Data Bases (Trondheim, Norway, August 30 - September 02, 2005). Very Large Data Bases. VLDB Endowment, 649-660.

This work studies the properties of DTW relative to the Euclidean distance in the context of time series queries. Since response time is the primary goal, the authors present lower-bounding techniques, such as the Sakoe-Chiba band and LB_Keogh, to reduce the number of times the expensive DTW distance must be evaluated. The reduction of the dimensionality of the time series by component analysis is also emphasized.
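A minimal sketch of the LB_Keogh bound (my own simplified version, not the paper's code): build a min/max envelope around the query within a Sakoe-Chiba band of radius r, and accumulate only the candidate's distance to that envelope. If the bound already exceeds the best-so-far distance, the full DTW computation can be skipped.

```python
def lb_keogh(query, candidate, r):
    """LB_Keogh lower bound on the squared DTW distance between two
    equal-length series, with band radius r."""
    total = 0.0
    n = len(query)
    for i, c in enumerate(candidate):
        window = query[max(0, i - r):min(n, i + r + 1)]
        lo, hi = min(window), max(window)
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (c - lo) ** 2
    return total

print(lb_keogh([1, 2, 3, 4], [1, 2, 3, 4], 1))  # → 0.0 (inside the envelope)
print(lb_keogh([0, 0, 0, 0], [2, 2, 2, 2], 1))  # → 16.0
```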

Detecting time series motifs under uniform scaling. Yankov, D., Keogh, E., Medina, J., Chiu, B., and Zordan, V. 2007. In Proceedings of the 13th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 844-853. 

This paper was recently published and deals specifically with detecting time series motifs under uniform scaling. The basic idea is to rescale the time series both in amplitude and in time so that time series of different lengths can be compared. The authors do not use Fourier techniques for that purpose, only a resampling of the time series up to the desired length. In this paper they also use random projection to detect motifs. I noticed that the random projection approach is essentially a clustering approach that organizes groups of time series into different buckets, using a hash function based on a random selection of elements from each time series segment. The main assumption is that the time series are uniformly distributed, so that a random distribution of elements in the time series is obtained. I want to know whether this assumption holds for human motion time series extracted from video data, since we have noticed that we obtain fairly regular shapes over time, not random time series.
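The bucketing step can be sketched like this (a simplified illustration of the general random projection idea over symbolic words, not the authors' implementation): hash each fixed-length word by the symbols at k randomly chosen positions; words that agree on those positions collide, and over many random masks, heavily hit buckets suggest motifs.

```python
import random
from collections import defaultdict

def random_projection_buckets(words, k, seed=0):
    """Hash equal-length words by the symbols at k randomly chosen
    positions (one random mask); colliding words share a bucket."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(words[0])), k))
    buckets = defaultdict(list)
    for idx, w in enumerate(words):
        key = tuple(w[p] for p in positions)
        buckets[key].append(idx)
    return buckets

words = ["abcab", "abdab", "xyzzy", "abcab"]
buckets = random_projection_buckets(words, k=3)
largest = max(buckets.values(), key=len)
print(largest)  # indices of the most-collided words (a motif candidate)
```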

Chiu, B., Keogh, E., and Lonardi, S. 2003. Probabilistic discovery of time series motifs. In Proceedings of the Ninth ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24 - 27, 2003). KDD '03. ACM, New York, NY, 493-498. 

In this paper the authors stress the importance of a random distribution in time series. Under that assumption, they use statistical parameters (mean, standard deviation, correlation) to interpret missing or noisy values in the time series. Random projection is also used here. I wonder whether other algorithms could be used, since this method only works well in roughly 10-20 dimensions; above that range, a random distribution cannot be assumed.