Monday, August 30, 2010

What is going on with P = NP?



P = NP is a major unsolved problem in computer science. Prompted by the recent attempt to solve this problem by Vinay Deolalikar (August 6, 2010), in this post we will briefly review what P = NP is about, the consequences if that statement were true, why so many people were excited over the possible solution announced recently, and some interesting observations on how our society reacted to it. The answer to this question seems to be P != NP, and it looks like Vinay's recent proof is not correct. Moreover, it has some issues that make it hard to fix.

What do P and NP mean?

Given a problem, first we want to discover candidate solutions and then verify them. The cost of these discovery and verification tasks commonly depends on limited resources such as time and space. Thus, the time spent discovering a solution (or verifying it) can be efficient or inefficient. Having said that, P and NP are sets of computational problems. A problem in P can always be solved in time polynomial in the size of the input. A problem in NP can always be verified in polynomial time. NP problems are interesting because, although they can be quickly verified, they may not be quickly solved. Hence, P is always included in NP.

Stating that P = NP would mean that every NP problem can be both quickly verified and quickly solved. If you find such a problem difficult to solve, there would be a theoretical guarantee that an efficient (at most polynomial) way to solve it exists. If you are struggling with your Sudoku, you could be confident that there is an efficient way to solve it, even if you cannot find it. Here, we are not considering NP-hard problems outside NP, whose solutions cannot even be verified in polynomial time.
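
To make the asymmetry concrete, here is a minimal sketch (my own toy example, unrelated to Vinay's paper) using subset sum: verifying a proposed certificate is a quick polynomial-time check, while the naive search inspects exponentially many subsets.

```python
# Toy illustration of "easy to verify, hard to search" (my own example).
from itertools import combinations

def verify(numbers, target, candidate):
    """Polynomial-time check: does the candidate subset sum to the target?"""
    return sum(candidate) == target and all(x in numbers for x in candidate)

def brute_force_solve(numbers, target):
    """Exponential-time search over all 2^n subsets."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

numbers = [3, 34, 4, 12, 5, 2]
certificate = [4, 5]                     # someone hands us this candidate
print(verify(numbers, 9, certificate))   # fast check: True
print(brute_force_solve(numbers, 9))     # slow search: [4, 5]
```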

What made Vinay's attempt special this time?

This is a good question. To frame the discussion: people propose proofs that P = NP or P != NP all the time. For example, on arXiv.org you can easily find recent attempts to show one of those statements (http://arxiv.org/find/all/1/all:+AND+proof+AND+p+np/0/1/0/all/0/1). So this is not a sporadic effort; it is a well-studied theoretical problem that many people are trying to solve. Why, then, did the news propagate so fast this time? Here are my guesses:

  1. The way Vinay introduced his work to the community was unusual. Unlike the traditional route in academia, where submitting a paper for peer review is the rule, he sent the manuscript directly to 22 experts in the field, who quickly recognized it as a serious attempt to solve the problem.
  2. The method Vinay employed was novel. The use of random k-SAT was a path few people had traversed before. Many people agree that if there is a solution to this problem, it should approach it from a new perspective; after all, this is an old problem, and many well-known techniques have been shown to fail to resolve it.
  3. Internet time. It is now easier than ever to receive news and share our own impressions. Vinay's proof was made public, and he started receiving feedback the day after he sent out his work. Here is a summary of the timeline:

August 6: Document sent to 22 people
August 7: Greg Baker's blog post "P != NP"
August 8: Slashdot
August 9: Wikipedia article about Vinay. Richard Lipton wrote a post "Issues in the proof that P != NP"
August 10: First discussions to create a Wikipedia article about Vinay's attempt to solve the problem (340 edits so far)
August 11: Wikinews: "Researcher claims solution to P vs NP math problem"
August 15: Richard Lipton blog post saying that the proof seems incorrect
August 17: Author said the paper was sent out for refereeing

The paper was less than one week old and had already produced blog posts and technical discussions on Wikipedia, Wikinews, and Slashdot. After one week the proof seemed to have some issues, and people, including Vinay, tried to fix them to see whether it could settle P = NP. This seems to indicate that although complex knowledge is currently fragmented, we can now collaborate on such a short time scale that someone can ask a complex question and receive meaningful feedback very soon.

Community refereeing

News now propagates faster than ever before. Once those 22 experts received the work, they spent time analyzing it. The manuscript was made public by the author, and many people around the world also read it; some of them discovered interesting issues in the proof and provided nice counterexamples. I wonder whether we are in front of a new model for reviewing academic work. Is community refereeing better than the traditional peer review process?

Finally, I would say that Vinay did the right thing. He produced an interesting piece of work, discussed it with his colleagues, sent it to experts in the field, received their feedback, and submitted it for publication. Even if the proof turns out to be rejected, his work should be recognized. Everyone agrees that it is easier to verify the validity of an academic work than to produce it. Unless someone proves P = NP :).

(*) Thanks to Kenneth Clarkson, Ronald Fagin, and Ryan Williams at the IBM Almaden Research Center for their wonderful talk. It was a nice way to get insight into the proof and understand its implications for real problems like databases and cryptography. Ron was one of the 22 who initially received the work for consideration.

Thursday, June 17, 2010

Top 10 Data Mining Algorithms (in 2006)

ICDM 2006 December 18 - 22, 2006, Hong Kong

A colleague drew my attention to the top 10 algorithms that, according to Jack Dongarra and Francis Sullivan, have had the greatest influence on the development and practice of science and engineering in the 20th century.

As expected, the list is somewhat historical, trying to cover quite wide areas of engineering. It is also nice to find Francis Sullivan as a coauthor of that work. Please read their top-10 document here, along with the corresponding discussions; it is worth it. Here is their list:
  1. The Monte Carlo method or Metropolis algorithm
  2. The simplex method of linear programming
  3. The Krylov Subspace Iteration method
  4. The Householder matrix decomposition
  5. The Fortran compiler
  6. The QR algorithm for eigenvalue calculation
  7. The Quicksort algorithm
  8. The Fast Fourier Transform
  9. The Integer Relation Detection Algorithm (given N real values x_1, ..., x_N, are there nontrivial integer coefficients a_1, ..., a_N such that a_1*x_1 + ... + a_N*x_N = 0?)
  10. The fast Multipole algorithm (to calculate gravitational forces in an N-body problem)

It would be interesting to have such a list for data mining, wouldn't it? Indeed, we have one: the famous top 10 algorithms for data mining, which I skimmed some years ago and which is now the perfect excuse for this post. Unlike the previous work, this list was made by the consensus of 494 people in 2006. This is the story.

Nomination

During the 2006 IEEE International Conference on Data Mining, held in Hong Kong, some of the most representative algorithms used for data mining were presented. Although controversial rankings can be found on the internet, on this occasion the algorithms were nominated by qualified professionals (ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners) associated with two of the most important conferences in this area. Every winner listed up to 10 data mining algorithms that they considered widely used in computer science.

Voting

To reduce the number of candidate algorithms, those with fewer than 50 citations in Google Scholar (October 2006) were filtered out. The remaining 18 algorithms were then voted on to determine the top 10 by the Program Committee members of KDD-06, ICDM '06, and SDM '06, as well as winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award, for 494 votes in total. The results were presented at the 2006 edition of ICDM (whose logo appears at the beginning of this post). Here is the list with the corresponding votes.

  1. C4.5 (61 votes), presented by Hiroshi Motoda
  2. K-Means (60 votes), presented by Joydeep Ghosh
  3. SVM (58 votes), presented by Qiang Yang
  4. Apriori (52 votes), presented by Christos Faloutsos
  5. EM (48 votes), presented by Joydeep Ghosh
  6. PageRank (46 votes), presented by Christos Faloutsos
  7. AdaBoost (45 votes), presented by Zhi-Hua Zhou
  8. kNN (45 votes), presented by Vipin Kumar
  9. Naive Bayes (45 votes), presented by Qiang Yang
  10. CART (34 votes), presented by Dan Steinberg
For data mining people, the above list will always look like an invitation to seek hidden relationships; it is hard to resist, sorry :).

The most and least voted algorithms are C4.5 and CART, respectively. This seems to indicate the importance of decision trees as a tool for classification and regression in data mining, especially some decades ago. Both build decision trees from rules. CART selects splits using error-based impurity measures, while C4.5 measures the entropy reduction (information gain) of each candidate split; this concept is arguably more intuitive, which is probably why more people have adopted C4.5 rather than CART.
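
As a small illustration, here is a minimal sketch of the two splitting criteria just mentioned (the class counts are made-up numbers, only for the example):

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, used by C4.5 (information gain)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity, the default splitting measure in CART."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical split of 20 objects into two child nodes.
left, right = [8, 2], [1, 9]
for name, f in [("entropy", entropy), ("gini", gini)]:
    weighted = (sum(left) * f(left) + sum(right) * f(right)) / (sum(left) + sum(right))
    print(f"{name}: weighted impurity after split = {weighted:.3f}")
```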

A different approach to the same problem is SVM, second in preference among the voters. While an SVM separates classes with non-linear boundaries through a kernel function, C4.5 is based on rules that commonly separate classes with axis-parallel hyperplanes as decisions. This often yields more accurate results for SVMs than for neural networks or decision trees. A more sophisticated alternative to C4.5 would be the Random Forest technique. The remaining algorithms seem to be equally important to the consensus.

The above list is more focused on a few particular aspects of computer science studied in data mining, mostly oriented toward knowledge discovery. Personally, I would have liked to see some data stream algorithms, like Lossy Counting. Their absence, however, makes sense considering that this area has only taken off since around 2000, unlike other, longer-established techniques. In data mining we mainly study four general problems, and each of our 10 algorithms falls into one of them:
  1. Clustering
  2. Classification
  3. Regression
  4. Association rule learning
But it is interesting to see how some algorithms can solve two or more of these problems, which is common evidence of how transversal many problems in computer science are. As an example, kNN queries are commonly used in information retrieval, but they are also employed for fast, incremental classification. Similarly, the notion of a metric distance is used for fast approximate clustering, and kNN queries over hash tables can speed up association rule mining.
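
As a small illustration of that versatility, here is a minimal sketch (toy data and a brute-force search, my own code) that reuses the same k-nearest-neighbour query first for classification and then to pick a medoid-like representative of the set:

```python
import math

def knn(query, data, k):
    """Brute-force k-nearest-neighbour query under the Euclidean distance."""
    return sorted(data, key=lambda item: math.dist(query, item[0]))[:k]

# Toy labelled points: ((x, y), label).
points = [((1, 1), "A"), ((1, 2), "A"), ((6, 6), "B"), ((7, 5), "B")]

# 1) Classification: majority label among the k nearest neighbours.
labels = [label for _, label in knn((2, 1), points, k=3)]
print(max(set(labels), key=labels.count))          # -> "A"

# 2) A clustering-flavoured reuse: the point whose k neighbours are closest
#    on average is a cheap, medoid-like representative of the whole set.
def avg_neighbour_dist(item):
    neighbours = knn(item[0], points, k=3)
    return sum(math.dist(item[0], coords) for coords, _ in neighbours) / len(neighbours)

print(min(points, key=avg_neighbour_dist))         # -> ((1, 2), "A")
```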

- Is there any other algorithm you would have liked to see included there?
- Is there a similar ranking in your own area, say computer vision, artificial intelligence, or bioinformatics?

- Omar.

Tuesday, June 8, 2010

Hello, again.

Hello... many things have happened since I wrote my last post. My major professor moved to a different university, and I changed my research orientation a little. It is still data mining, but now more related to databases. That change is to be expected, since I am now in a databases research group. However, I am still mixing computer vision with database and data mining techniques. I hope you like this new orientation; don't worry, it is going to be fun for the readers and also for the author of this blog.

Generally, I do not like to say either hello or goodbye: I like to act as if I have known a person for a long time, and I do not like to think that I will never see that person again. So this post is intended to be short :).

More posts are coming soon. Sorry, the previous ones were rather technical reviews of interesting papers; from now on things are going to be more intuitive. My word.

- Omar.

Monday, March 2, 2009

[Research] Assumption of random distribution in motifs

Buhler, J. Chapter 3. Finding motifs using random projection. pp. 90–129. PhD thesis. 2002 

This chapter of the thesis models a problem similar to one found in bioinformatics: motif discovery. Buhler defines a consensus model and a weight matrix model to represent a set of genomic sequences by a single representative sequence. Intuitively, the consensus model puts at each position of the representative the most likely element, where "most likely" means the most frequent base at position j; in that setting, all the sequences have the same length. In Buhler's random projection approach, each randomly projected representation is hashed and stored in a bucket, and the buckets with many elements are considered to hold similar vectors. In our specific case, we want to model our problem in a similar way: given a set of time series of different lengths, we want to extract a time series that captures the most important attributes of the set and use it as a representative of the set. I was thinking of two methods to find these important and common attributes without taking random elements of the sequence: Fourier and wavelet transforms. In pursuit of a paper that has dealt with this problem, I looked at the one cited below.
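
Before that, here is a minimal sketch of Buhler's consensus idea for equal-length sequences (my own toy code, not his implementation):

```python
from collections import Counter

def consensus(sequences):
    """Column-wise consensus: keep the most frequent symbol at each position.
    Assumes all sequences have the same length, as in Buhler's setting."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*sequences))

seqs = ["ACGTAC", "ACGTTC", "ACCTAC", "TCGTAC"]
print(consensus(seqs))   # -> "ACGTAC"
```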

Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD international Conference on Management of Data (Seattle, Washington, United States, June 01 - 04, 1998). A. Tiwary and M. Franklin, Eds. SIGMOD '98. ACM, New York, NY, 448-459. 

Rather than choosing random elements from a vector, this paper proposes using wavelet coefficients to estimate the most important attributes in a traditional database setting. Since each sequence is expressed as an overall average plus a set of detail coefficients, the authors estimate the most important attributes by removing some of the less significant details from the histogram representation of the attribute values. I like this approach because it captures local changes in the set of vectors. Again, this work also assumes vectors of the same length.
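
To get a feel for the idea, here is a minimal, hand-rolled sketch of a Haar decomposition where only the few largest detail coefficients are kept before reconstruction (toy data whose length is a power of two; not the authors' implementation, which works on histograms of attribute values):

```python
def haar(values):
    """One full Haar decomposition: returns (overall average, list of detail coeffs)."""
    details = []
    while len(values) > 1:
        averages = [(a + b) / 2 for a, b in zip(values[::2], values[1::2])]
        details = [(a - b) / 2 for a, b in zip(values[::2], values[1::2])] + details
        values = averages
    return values[0], details

def inverse_haar(average, details):
    """Rebuild the vector from the average and the (possibly thresholded) details."""
    values = [average]
    i = 0
    while i < len(details):
        level = details[i:i + len(values)]
        values = [v + s * d for v, d in zip(values, level) for s in (1, -1)]
        i += len(level)
    return values

avg, details = haar([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
# Keep only the 3 largest-magnitude details and zero the rest (the "histogram" idea).
kept = sorted(range(len(details)), key=lambda i: abs(details[i]), reverse=True)[:3]
compressed = [d if i in kept else 0.0 for i, d in enumerate(details)]
print(inverse_haar(avg, compressed))   # approximate reconstruction of the input
```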

In summary, Fourier and wavelet approaches can likewise be considered to estimate the most important elements of a set of time series of variable length. We would like to generate the most representative time series and use it as a centroid; new time series would then be compared against these representatives.

Next activities:

- Measure the normality of the time series using the QQ-plot method (see the sketch after this list). We want to check whether the time series are normally distributed, as random projection assumes.

- Implement a method to generate a representative time series from a set of variable-length time series.
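
For the first item, a minimal sketch of the QQ-plot check using scipy (synthetic data standing in for our motion time series):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

series = np.random.normal(loc=0.0, scale=1.0, size=500)  # placeholder for a real series
stats.probplot(series, dist="norm", plot=plt)             # points near the line => roughly normal
plt.title("QQ-plot against a normal distribution")
plt.show()
```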

Monday, January 26, 2009

[Research] Data streams as time series


I borrowed a book from the Library Service: 

Learning from Data Streams: Processing Techniques in Sensor Networks. Joao Gama, M.M. Gaber. Springer, November 2007, 255 pages, ISBN 354073678


This book has interesting chapters on time series processing and stream mining. I read many of them; the most interesting are:

Chapter 3, Data Stream Processing. In this chapter the authors present the constraints of working with data streams, such as efficient use of memory, limited storage, and real-time processing, and review traditional techniques to deal with them, such as landmark windows, sliding windows, tilted windows, sampling, data summaries, hashing, wavelets, and histograms.
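
As a tiny illustration of one of those techniques, here is a minimal sliding-window sketch (my own toy code, not from the book) that keeps a bounded amount of state no matter how long the stream is:

```python
from collections import deque

class SlidingWindowMean:
    """Keep only the last `size` stream items and their running mean."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # oldest item is about to be evicted
        self.window.append(value)
        self.total += value

    def mean(self):
        return self.total / len(self.window)

w = SlidingWindowMean(size=3)
for x in [1, 2, 3, 10, 20]:
    w.add(x)
print(w.mean())   # mean of the last 3 items: (3 + 10 + 20) / 3 = 11.0
```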

Chapter 9, Clustering Techniques in Sensor Networks. This chapter presents some of the open problems in data stream clustering, including time series: incremental versions of clustering techniques that avoid assuming we know each time series in its full extent, clustering of entire time series rather than only subsequences, and discovery of structure in data over time. It is interesting to note that defining representatives such as centroids and medoids with the goal of reducing dimensionality is not always natural. Hence, random projection seems a viable alternative for clustering time series.

J. Lin and D. Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, May 2003.

Here random projection is employed to reduce dimensionality in a traditional information retrieval task, document classification. I only skimmed this paper, mainly to understand the relationship between Singular Value Decomposition and the random projection technique, since both are extensively used to reduce dimensionality in the vector space. It is interesting to note that orthogonal vectors for the random projection matrix are desirable but expensive to obtain. However, since a high-dimensional space contains many more nearly orthogonal vectors than exactly orthogonal ones, random vectors are likely to be close enough to orthogonal to offer a reasonable approximation of the original vectors. I plan to check that assumption on our time series to see whether random projection can be used as a clustering method.
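
A minimal numpy sketch of that near-orthogonality claim (my own experiment; the dimensions d and k are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 50                       # original and reduced dimensionality

# Random Gaussian directions are nearly orthogonal in high dimensions.
u, v = rng.normal(size=d), rng.normal(size=d)
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"cosine between two random {d}-d vectors: {cos:.3f}")   # close to 0

# Hence random projection roughly preserves pairwise distances.
R = rng.normal(size=(d, k)) / np.sqrt(k)
x, y = rng.normal(size=d), rng.normal(size=d)
print(np.linalg.norm(x - y), np.linalg.norm(x @ R - y @ R))    # similar values
```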

Monday, January 12, 2009

[Research] Incremental Clustering of Time Series and Motif Discovery

Iterative Incremental Clustering of Time Series. Jessica Lin, Michail Vlachos, Eamonn J. Keogh, Dimitrios Gunopulos. Advances in Database Technology. 2004

In this paper, the authors present an interesting way of clustering time series. They mention some of the problems I ran into when clustering human motion time series: slow convergence, long running times, and unstable clusters. As the authors note, k-means has a complexity of O(kNrD), where k is the number of clusters, N the number of objects, r the number of iterations, and D the number of dimensions. However, all the algorithms (including this paper) assume that the time series have the same length. Reading this paper, I understand that repeated comparisons of variable-length time series make convergence and the computation of centroids difficult. The approach introduced in this paper is based on the multiresolution properties of Haar wavelets, and I think it deserves further study to preserve convergence when clustering time series of different lengths.
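
A minimal sketch of the multiresolution ingredient the paper exploits (my own toy code, assuming series whose length is a power of two): each Haar level halves the resolution by pairwise averaging, so clustering can start on the coarsest level and then be refined, reusing the previous centroids as seeds.

```python
import numpy as np

def haar_levels(series):
    """Coarse-to-fine Haar approximations of a series whose length is a power of 2:
    each level halves the resolution by pairwise averaging."""
    levels = [np.asarray(series, dtype=float)]
    while len(levels[-1]) > 1:
        x = levels[-1]
        levels.append((x[0::2] + x[1::2]) / 2.0)
    return levels[::-1]   # coarsest level first

series = [2, 2, 0, 2, 3, 5, 4, 4]
for level in haar_levels(series):
    print(level)
# The paper's trick: cluster on the coarsest level first, then reuse those
# centroids to initialize k-means at the next, finer level.
```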

Scaling and time warping in time series querying. Fu, A. W., Keogh, E., Lau, L. Y., and Ratanamahatana, C. A. 2005. In Proceedings of the 31st international Conference on Very Large Data Bases (Trondheim, Norway, August 30 - September 02, 2005). Very Large Data Bases. VLDB Endowment, 649-660.

This work studies the properties of DTW with respect to the Euclidean distance in the context of time series queries. Since response time is the primary goal, the authors present lower-bounding techniques, such as the Sakoe-Chiba band and LB_Keogh, to reduce the number of times the full DTW distance is evaluated. Reducing the dimensionality of the time series by component analysis is also emphasized.
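
Here is a minimal sketch of the LB_Keogh idea as I understand it (my own simplified code, assuming equal-length series and a warping window of half-width r):

```python
def lb_keogh(query, candidate, r):
    """Lower bound on DTW(query, candidate): squared distance from the candidate
    to the upper/lower envelope of the query under a band of half-width r."""
    total = 0.0
    for i, c in enumerate(candidate):
        window = query[max(0, i - r): i + r + 1]
        lo, hi = min(window), max(window)
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (c - lo) ** 2
    return total

q = [0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 2.5, 2.0, 1.0, -0.5]
print(lb_keogh(q, c, r=1))   # if this already exceeds the best-so-far DTW, skip the full DTW
```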

Detecting time series motifs under uniform scaling. Yankov, D., Keogh, E., Medina, J., Chiu, B., and Zordan, V. 2007. In Proceedings of the 13th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 844-853. 

This paper was published recently and deals specifically with detecting time series motifs under uniform scaling. The basic idea is to rescale the time series, in amplitude and in length, so that time series of different lengths can be compared. The authors do not use Fourier techniques for this purpose, but simply resample the time series until the desired length is obtained. In this paper they also use random projection to detect motifs. I noticed that the random projection approach is basically a clustering approach that organizes time series subsequences into different buckets using a hash function based on a random selection of elements from each subsequence. The main assumption is that the elements of the time series look randomly distributed, so that projecting onto random positions is informative. I want to know whether this assumption is valid for human motion time series from video data, since we noticed that we obtain rather regular shapes over time, not random time series.
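
A minimal sketch of that random-projection step (toy symbolic words standing in for discretized subsequences; the words and `n_positions` are my own invented examples):

```python
import random
from collections import defaultdict

def random_projection_buckets(words, n_positions, seed=0):
    """Hash symbolic subsequences by a random subset of their positions;
    heavily populated buckets are candidate motif groups."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(words[0])), n_positions))
    buckets = defaultdict(list)
    for i, word in enumerate(words):
        key = tuple(word[p] for p in positions)
        buckets[key].append(i)
    return buckets

# Discretized (e.g. SAX-like) versions of time series subsequences.
words = ["abbab", "abbab", "abcab", "cbbaa", "abbbb"]
for key, members in random_projection_buckets(words, n_positions=3).items():
    print(key, members)   # buckets with several members suggest a motif
```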

Chiu, B., Keogh, E., and Lonardi, S. 2003. Probabilistic discovery of time series motifs. In Proceedings of the Ninth ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24 - 27, 2003). KDD '03. ACM, New York, NY, 493-498. 

In this paper the authors highlight the importance of a random distribution in time series. Under that assumption, they use statistical parameters (mean, standard deviation, correlation) to handle missing or noisy values in the time series. Random projection is also used here; I wonder whether other algorithms could be used, since this method reportedly only works well in about 10-20 dimensions, and above that range the randomness assumption no longer holds.

Tuesday, December 2, 2008

[Research] Distribution of spatiotemporal events to describe movement in video


Since I am taking Applied Spatial Statistics, I am starting to consider a statistical approach to modeling the presence of events in video. Rather than describing Laptev's method, this approach will be a mechanism for formally discussing, in the paper, the existence of gradients in human movement.


Gradients will be treated as random variables that may appear at any pixel of the video. I plan to model those events with a probability distribution and to find its maximum likelihood estimates. I am reading about the Expectation-Maximization algorithm and the book "Interactive Spatial Data Analysis" in order to classify types of movement based on the probability distribution of events.
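
As a first concrete step in that direction, here is a minimal sketch fitting a two-component Gaussian mixture with EM via scikit-learn (synthetic (x, y) event locations; the real input would be the gradient positions extracted from the videos):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "event" locations: two groups of spatiotemporal gradients.
events = np.vstack([
    rng.normal(loc=[20, 30], scale=3.0, size=(100, 2)),
    rng.normal(loc=[60, 10], scale=5.0, size=(100, 2)),
])

# EM finds the maximum-likelihood mixture parameters.
gmm = GaussianMixture(n_components=2, random_state=0).fit(events)
print(gmm.means_)                # estimated centres of the two movement types
print(gmm.predict(events[:5]))   # hard assignment of the first few events
```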

The idea of using probability distributions in videos can also be found in [1]. The authors use this approach to synchronize video recordings of the same scene taken from different viewpoints. A weak point of this approach is that the authors compare two distributions (histograms) only by directly subtracting corresponding histogram bins.


I have also defined the "conflictive" spatiotemporal gradients as the ones that fall within [middle line +- 2*spatial variance]. These events are removed, and I am now computing them for all the videos.
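
For reference, a minimal sketch of that filtering rule as I intend to apply it (the names `middle_line` and `spatial_var` are placeholders for the per-video statistics):

```python
import numpy as np

def remove_conflictive(gradients_y, middle_line, spatial_var):
    """Drop gradients whose vertical position falls within
    [middle_line - 2*spatial_var, middle_line + 2*spatial_var]."""
    gradients_y = np.asarray(gradients_y, dtype=float)
    lo, hi = middle_line - 2 * spatial_var, middle_line + 2 * spatial_var
    keep = (gradients_y < lo) | (gradients_y > hi)
    return gradients_y[keep]

print(remove_conflictive([10, 48, 50, 52, 90], middle_line=50, spatial_var=1.0))
# -> [10. 90.]
```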


[1] J. Yan, M. Pollefeys, Video Synchronization via Space-Time Interest Point Distribution, Advanced Concepts for Intelligent Vision Systems, 2004.