Let's data mining together: January 2009

I borrowed a book from the Library Service:

Learning from Data Streams: Processing Techniques in Sensor Networks. Joao Gama, M.M. Gaber. Springer, November 2007, 255 pages, ISBN 354073678

This book have interesting chapters about processing of time series and stream mining. I read many of them, but the most interesting are:

Chapter 3, Data Stream Processing. In this chapter the authors present the constrints of considering data streams such as efficient use of memory, limited storage, real time processing and provide tradicional techniques to deal with these problems such as landmark windows, sliding windows, tilted windows, sampling, data summary, hashing, wavelets, histograms

Chapter 9, Clustering techniques in Sensor Networks. This chapter presents some of the open problems in data stream clustering inclusing time series. Incremental versions of clustering techniques by avoiding assuming that we know the time series in all their extent, clustering of entire time series and not only subsequences, discovery of structures in data over time are some of the open problem within clustering of data streams. It is interesting to note that the definition of representatives such as centroids and medoids with the goal of reducing dimensionallity is not always natural. Hence, Random Projection seems a viable alternative for clutering time series.

J. Lin and D. Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, May 2003.

Here Random projection is employed to reduce the dimensionality present a traditional information retrieval task like document classification. This paper was quickly read and only works for understanding the usefulness of Singular Value Decomposition into the random projection technique since both techniques are extensively used to reduce dimensionality in the vector space. It is interesting to note that orthogonal vectors to feed the RM index up are desirable, but they are expensive to achieve. However, since in high-dimensional space contains a larger number of almost orthogonal vectors than orthogonal vectors, the random vectors might be sufficiently close enough to orthogonal to offer a reasonable approximation of the original vectors. I plan to check that assumption in our time series to see if random projection can be used as a clustering method.

Iterative Incremental Clustering of Time Series. Jessica Lin, Michail Vlachos, Eamonn J. Keogh, Dimitrios Gunopulos. Advances in Database Technology. 2004

In this paper, the authors present an interesting way to clustering time series. They mention some of the problems I had when dealing with clustering of human motion time series: low convergence, long times, variable clusters. As the authors said: Kmeans has a complexity of O(kNrD), k=number of clusters, N=number of abjetcs, r=number of iterations, and D=number of dimensions. However, all the algorithms (and even this paper) assume that time series have the same length. Reading this paper I understand that several comparisons of length-variable time series makes difficult the convergence and the obtaining of centroids. The approach introduced in this paper is based on the multiresolutions properties de Haar Wavelets and I think they can be more studied to preserve the convergence when clustering time series of different lengths.

Scaling and time warping in time series querying. Fu, A. W., Keogh, E., Lau, L. Y., and Ratanamahatana, C. A. 2005. In Proceedings of the 31st international Conference on Very Large Data Bases (Trondheim, Norway, August 30 - September 02, 2005). Very Large Data Bases. VLDB Endowment, 649-660.

This work studies the properties of DTW with respect to Euclidean distance in the context of time series queries. Since response time is the primary goal, the authors present some lower bound techniques (based on triangular inequality) such as: Sakoe-Chiba band and LB-Keaogh, to reduce the number of time the DTW distance is evaluared. The reduction in the dimensionality of the time series by component analysis is emphasized.

Detecting time series motifs under uniform scaling. Yankov, D., Keogh, E., Medina, J., Chiu, B., and Zordan, V. 2007. In Proceedings of the 13th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 844-853.

This paper was recently published and especifically deal with detecting time series motifs with the technique Uniform Scaling. The basic idea is to resample the time series both in amplitude and also in frequency to be able to compare time series of different lengths. The authors do not use Fourier techniques for that purpose but only a sampling of the time series upto obtain a desired length. In this paper they also used random projection to detec motifs. I noticed that the random projection approach is basically a clustering approach that organize range of time series in different buckets by using a hashing function that is based on the random election of elements in the time series segmenet. The main assumption uis that the time series are uniformly distributed and then a random distribution of elements in the time series is obtained. I want to know if this assumption is valid in case of human motion time series from video data since we noticed that we obtain sort of regular shapes over time, but not random time series.

Chiu, B., Keogh, E., and Lonardi, S. 2003. Probabilistic discovery of time series motifs. In Proceedings of the Ninth ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24 - 27, 2003). KDD '03. ACM, New York, NY, 493-498.

In this paper the authors introduce the importance of a random distribution in time series. By assuming that, they use statistical parameters (mean, standart deviation, correlation) to interpret missing or noisy values in the time series. Random projection is also used here, I wonder if other algoritms can be used since this method only works well in [10-20] dimensions since above that interval, a random distribution is not assumed.

Let's data mining together

Monday, January 26, 2009

[Research] Data streams as time series

Monday, January 12, 2009

[Research] Incremental Clustering of Time Series and Motif Discovery

Visitors