Monday, January 26, 2009

[Research] Data streams as time series


I borrowed a book from the Library Service: 

Learning from Data Streams: Processing Techniques in Sensor Networks. João Gama and M. M. Gaber. Springer, November 2007, 255 pages, ISBN 354073678


This book has interesting chapters on time-series processing and stream mining. I read many of them; the most interesting are:

Chapter 3, Data Stream Processing. In this chapter the authors present the constraints of processing data streams, such as efficient use of memory, limited storage, and real-time processing, and survey traditional techniques to deal with these problems: landmark windows, sliding windows, tilted windows, sampling, data summaries, hashing, wavelets, and histograms.
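To fix the ideas, here is a minimal sketch (mine, not from the book) of two of these techniques: a sliding window, which keeps only the most recent items, and reservoir sampling, which maintains a uniform sample from a stream of unknown length:

```python
import random

def sliding_window(stream, size):
    """Yield the last `size` items seen at each step of the stream."""
    window = []
    for item in stream:
        window.append(item)
        if len(window) > size:
            window.pop(0)          # discard the oldest item
        yield list(window)

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, i)            # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

windows = list(sliding_window(range(5), 3))  # last window is [2, 3, 4]
sample = reservoir_sample(range(1000), 10)   # 10 items drawn uniformly
```

Both run in constant memory with respect to the stream length, which is exactly the point of these summarization techniques.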

Chapter 9, Clustering Techniques in Sensor Networks. This chapter presents some of the open problems in data stream clustering, including time series. Among them: incremental versions of clustering techniques that avoid assuming we know the time series in their full extent, clustering of entire time series rather than only subsequences, and discovery of structures in the data over time. It is interesting to note that defining representatives such as centroids and medoids with the goal of reducing dimensionality is not always natural. Hence, random projection seems a viable alternative for clustering time series.
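As a quick sketch of what random projection looks like for a time series (my own toy code, assuming a Gaussian projection matrix, which is one common choice):

```python
import random

def random_projection(series, out_dim, seed=0):
    """Project a time series (list of floats) from len(series) dimensions
    down to out_dim dimensions with a Gaussian random matrix, as a cheap
    alternative to SVD-based reduction."""
    rng = random.Random(seed)
    n = len(series)
    # Each output coordinate is a random linear combination of the input values.
    R = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(out_dim)]
    scale = (1.0 / out_dim) ** 0.5   # keeps expected norms comparable
    return [scale * sum(r[i] * series[i] for i in range(n)) for r in R]

ts = [float(i % 7) for i in range(128)]   # toy series of length 128
reduced = random_projection(ts, 16)       # 16-dimensional representation
```

The appeal for streams is that the projection matrix is data-independent: it can be fixed in advance and applied to each incoming series in a single pass, unlike centroid- or SVD-based representatives.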

J. Lin and D. Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, May 2003.

Here random projection is employed to reduce dimensionality in a traditional information retrieval task, document classification. I read this paper quickly; it mainly served to clarify the relationship between Singular Value Decomposition and random projection, since both techniques are extensively used to reduce dimensionality in the vector space model. It is interesting to note that orthogonal random vectors would be desirable for building the projection, but they are expensive to obtain. However, since a high-dimensional space contains many more nearly orthogonal vectors than exactly orthogonal ones, the random vectors may be sufficiently close to orthogonal to offer a reasonable approximation of the original vectors. I plan to check that assumption on our time series to see if random projection can be used for clustering.
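The near-orthogonality assumption is easy to check empirically. The sketch below (my own, with arbitrary dimensions) draws random Gaussian vectors in a 1000-dimensional space and measures the largest absolute cosine between any pair; it comes out close to zero, i.e., the vectors are almost orthogonal without any orthogonalization step:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors; 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

rng = random.Random(42)
dim, count = 1000, 20
vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

# Largest |cosine| over all pairs: concentrates near 0 as dim grows.
max_cos = max(abs(cosine(vectors[i], vectors[j]))
              for i in range(count) for j in range(i + 1, count))
```

For Gaussian vectors the pairwise cosine has standard deviation about 1/sqrt(dim), so at dim = 1000 even the worst pair stays small; the same check could be run on projections of our time series.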
