Monday, August 30, 2010

What is going on with P = NP?



P = NP is a major unsolved problem in computer science. Prompted by Vinay Deolalikar's recent attempt to solve it (August 6, 2010), in this post we will briefly review what P = NP is about, the consequences if the statement is true, why so many people got excited over the recently announced solution, and some interesting observations on how our society reacted to it. The expected answer seems to be P != NP, and it looks like Vinay's recent proof is not correct. Moreover, it has some issues that make it hard to fix.

What do P and NP mean?

Given a problem, first we want to discover candidate solutions and then verify them. How well these discovery and verification tasks can be done commonly depends on limited resources such as time and space. Thus, the time spent to discover a solution (or to verify one) can be efficient or inefficient. Having said that, P and NP are sets of computational problems. A problem in P can always be solved in time polynomial in the size of the input. A problem in NP is one whose candidate solutions can always be verified in polynomial time. NP problems are interesting because although they can be quickly verified, they may not be quickly solved. Hence, P is always included in NP.

However, stating that P = NP would mean that every problem in NP can be both quickly solved and quickly verified. If you find such a problem difficult to solve, there would be a theoretical guarantee that an efficient (at most polynomial-time) way to solve it exists. If you are struggling with a (generalized) Sudoku, you could be confident that there is an efficient way to solve it, even if you cannot find it. Note that this says nothing about NP-hard problems outside NP: those are at least as hard as everything in NP, and their solutions may not even be verifiable in polynomial time.
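To make the verify-versus-solve asymmetry concrete, here is a minimal sketch in Python (my own toy example, nothing to do with Vinay's techniques): checking a completed 9x9 Sudoku grid takes one polynomial-time pass over rows, columns, and boxes, while finding a solution for the generalized n^2 x n^2 puzzle is NP-complete.

```python
# Verification is cheap: a single pass over rows, columns, and 3x3 boxes.
# (Toy example; `grid` is assumed to be a 9x9 list of lists of digits 1..9.)

def is_valid_sudoku(grid):
    expected = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in (0, 3, 6) for c in (0, 3, 6)
    ]
    # Each of the 27 groups must contain exactly the digits 1..9.
    return all(group == expected for group in rows + cols + boxes)
```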

What made Vinay's attempt special this time?

This is a good question. To frame the discussion: people propose proofs that P = NP or P != NP all the time. For example, on arXiv.org you can easily find recent attempts to show one of those statements (http://arxiv.org/find/all/1/all:+AND+proof+AND+p+np/0/1/0/all/0/1). So this is not a sporadic effort; it is a well-studied theoretical problem with many people trying to solve it. Why, then, did the news propagate so fast this time? Here are my hints:

  1. The way Vinay introduced his work to the community was unusual. Unlike the traditional route in academia, where submission and acceptance of a paper after peer review is the rule, he sent the manuscript directly to 22 experts in the field, who quickly recognized it as a serious attempt to solve the problem.
  2. The method employed by Vinay was novel. The use of random k-SAT was a path that few people had traversed before. If there is a solution to this problem, many people agree that it should look at the problem from a new perspective. After all, this is an old problem, and many well-known techniques have been shown to fail.
  3. Internet time. It is now easier than ever to receive news and share our own impressions. The proof was made public, and Vinay received feedback the day after he sent out his work. Here is a summary:

August 6: Document sent to 22 people
August 7: Greg Baker's blog post "P != NP"
August 8: Slashdot
August 9: Wikipedia article about Vinay. Richard Lipton wrote a post "Issues in the proof that P != NP"
August 10: First discussions to create a Wikipedia article about Vinay's attempt to solve the problem (340 edits so far)
August 11: Wikinews: "Researcher claims solution to P vs NP math problem"
August 15: Richard Lipton blog posts saying that the proof seems incorrect
August 17: Author said the paper was sent out for refereeing

The paper was less than one week old and had already produced blog posts and technical discussions on Wikipedia, Wikinews, and Slashdot. After one week the proof seemed to show some issues, and people, including Vinay, tried to fix those problems to see if the approach could still settle P = NP. This seems to indicate that although complex knowledge is currently fragmented, we can now collaborate so quickly that someone can ask a complex question and receive meaningful feedback very soon.

Community refereeing

The propagation of news is now faster than ever before. Once those 22 experts received the work, they spent time analyzing it. The manuscript was made public by the author, and many people around the world also read it; some of them discovered interesting issues in the proof and provided nice counterexamples. I wonder whether we are in front of a new model for reviewing academic work. Is community refereeing better than the traditional peer review process?

Finally, I would say that Vinay did the right thing. He produced an interesting piece of work, discussed it with his colleagues, sent it to experts in the field, received their feedback, and submitted the work for publication. Even if the proof turns out to be rejected, his effort should be recognized. Everyone agrees that it is easier to verify the validity of an academic work than to produce it. Unless someone proves P = NP :).

(*) Thanks to Kenneth Clarkson, Ronald Fagin, and Ryan Williams at the IBM Almaden Research Center for their wonderful talk. It was a nice way to get insight into the proof and understand its implications for real problems like databases and cryptography. Ron was one of the 22 experts who initially received the work for consideration.

Thursday, June 17, 2010

Top 10 Data Mining Algorithms (in 2006)

ICDM 2006, December 18-22, 2006, Hong Kong

A colleague drew my attention to the top-10 algorithms that most greatly influenced the development and practice of science and engineering in the 20th century, according to Jack Dongarra and Francis Sullivan.

As expected, the list is somewhat historical, trying to cover quite wide areas of engineering. It is also nice to find Francis Sullivan as coauthor of that work (not to be confused with John G. F. Francis, one of the inventors of the famous QR algorithm for computing the eigenvalues and eigenvectors of matrices, which appears on the list). Please read their top-10 document here and the corresponding discussions; it is worth it. Here is their list:
  1. The Monte Carlo method or Metropolis algorithm
  2. The simplex method of linear programming
  3. The Krylov Subspace Iteration method
  4. The Householder matrix decomposition
  5. The Fortran compiler
  6. The QR algorithm for eigenvalue calculation
  7. The Quicksort algorithm
  8. The Fast Fourier Transform
  9. The Integer Relation Detection Algorithm (given N real values x_1, ..., x_N, is there a nontrivial set of integer coefficients a_1, ..., a_N such that a_1*x_1 + ... + a_N*x_N = 0? See the sketch after this list.)
  10. The fast Multipole algorithm (to calculate gravitational forces in an N-body problem)
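As a toy illustration of the integer relation problem in item 9 (my own brute-force sketch; the algorithm the list actually refers to is PSLQ, which is far smarter than this):

```python
from itertools import product

def integer_relation(xs, bound=5, tol=1e-9):
    """Brute-force search for integers a_1..a_N (not all zero, |a_i| <= bound)
    with a_1*x_1 + ... + a_N*x_N ~ 0. Illustrative only."""
    for coeffs in product(range(-bound, bound + 1), repeat=len(xs)):
        if any(coeffs) and abs(sum(a * x for a, x in zip(coeffs, xs))) < tol:
            return coeffs
    return None

# x + y - z = 0 holds for (0.5, 1.5, 2.0); the search returns a multiple
# of that relation, e.g. (-5, -5, 5).
print(integer_relation([0.5, 1.5, 2.0]))
```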

It would be interesting to have such a list for data mining, wouldn't it? Indeed, we have it: the famous top-10 algorithms for data mining, which I quickly read some years ago and which is now the perfect excuse for this post. Unlike the previous work, this list was built by the consensus of 494 people in 2006. This is the story.

Nomination

During the 2006 IEEE International Conference on Data Mining, held in Hong Kong, some of the most representative data mining algorithms were presented. Although some controversial rankings can be found on the internet, on this occasion the algorithms were nominated by qualified professionals (ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners) connected to two of the most important conferences in this area. Each winner listed up to 10 data mining algorithms that they considered widely used in computer science.

Voting

To reduce the number of candidate algorithms, those with fewer than 50 citations in Google Scholar (October 2006) were filtered out. The remaining 18 algorithms were then voted on by the Program Committee members of KDD-06, ICDM '06, and SDM '06, as well as by the ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners: 494 votes in total. The results were presented at the 2006 edition of ICDM (whose logo is at the beginning of this post). Here is the list with the corresponding votes.

  1. C4.5 (61 votes), presented by Hiroshi Motoda
  2. K-Means (60 votes), presented by Joydeep Ghosh
  3. SVM (58 votes), presented by Qiang Yang
  4. Apriori (52 votes), presented by Christos Faloutsos
  5. EM (48 votes), presented by Joydeep Ghosh
  6. PageRank (46 votes), presented by Christos Faloutsos
  7. AdaBoost (45 votes), presented by Zhi-Hua Zhou
  8. kNN (45 votes), presented by Vipin Kumar
  9. Naive Bayes (45 votes), presented by Qiang Yang
  10. CART (34 votes), presented by Dan Steinberg
For data mining people, the above numbers will always look like an invitation to seek hidden relationships; hard to avoid it, sorry :).

The most and least voted algorithms are C4.5 and CART, respectively. This seems to indicate the importance of decision trees as a tool to classify information and perform regression in data mining, especially some decades ago. Both build decision trees whose paths can be read as rules. CART selects splits using measures such as the Gini impurity, while C4.5 selects the split that most reduces entropy (its information gain). The entropy-based criterion is arguably more intuitive, and that is probably one reason why more people have adopted C4.5 rather than CART.
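To make the two split criteria concrete, here is a small sketch (plain Python with made-up labels; note that C4.5 actually normalizes the gain into a gain ratio, which this toy version skips):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy: the impurity behind C4.5's information gain.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity: the split criterion typically used by CART.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# A candidate split of 10 labels into two children.
left = ['a'] * 4 + ['b'] * 1
right = ['a'] * 1 + ['b'] * 4
parent = left + right
n = len(parent)

gain = entropy(parent) - sum(len(c) / n * entropy(c) for c in (left, right))
drop = gini(parent) - sum(len(c) / n * gini(c) for c in (left, right))
print(f"information gain = {gain:.3f}, Gini decrease = {drop:.3f}")
```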

A different approach to the same problem is SVM, third in the ranking. While an SVM separates classes with non-linear boundaries through a kernel function, C4.5 is based on rules that commonly separate classes with axis-parallel hyperplanes. This often lets SVMs achieve more accurate results than neural networks or decision trees, although it depends on the data. A more sophisticated alternative to C4.5 would be the Random Forest technique. The remaining algorithms seem to be equally important for the consensus.
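A hedged sketch of that contrast (assuming scikit-learn is installed; the dataset and parameters are arbitrary illustrations, not a benchmark):

```python
# Compare a kernel SVM against a C4.5-style entropy tree on a toy
# non-linearly-separable dataset. Numbers will vary with the seed.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel='rbf').fit(X_tr, y_tr)  # smooth non-linear boundary
tree = DecisionTreeClassifier(criterion='entropy').fit(X_tr, y_tr)  # axis-parallel cuts

print("SVM accuracy: ", svm.score(X_te, y_te))
print("Tree accuracy:", tree.score(X_te, y_te))
```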

The above list looks focused on a few particular aspects of computer science studied in data mining, mostly oriented to knowledge discovery. Personally, I would have liked to see some data stream algorithms, like Lossy Counting. Their absence, however, makes sense considering that the area only took off around 2000, unlike the other, longer-established techniques. In data mining we mainly study 4 general problems, and each of our 10 algorithms falls into one of them:
  1. Clustering
  2. Classification
  3. Regression
  4. Association rule learning
But it is interesting to see how some algorithms can solve two or more of these problems, a nice piece of evidence of how cross-cutting many problems in computer science are. As an example, kNN queries are commonly used for information retrieval, but they are also employed to perform fast and incremental classification. Similarly, the notion of a metric distance can be used to compute fast approximate clusterings, and kNN queries over hash tables have been used to speed up association rule mining.
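As a concrete toy version of kNN classification (my own sketch, assuming Euclidean distance and a plain linear scan; real systems use kd-trees or other metric indexes):

```python
import math
from collections import Counter

def knn_classify(query, data, k=3):
    # data: list of (point, label) pairs; majority vote among the
    # k nearest neighbors of `query` under Euclidean distance.
    neighbors = sorted(data, key=lambda pl: math.dist(query, pl[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

points = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 5), 'b'), ((6, 5), 'b')]
print(knn_classify((1, 1), points))  # -> 'a'
print(knn_classify((5, 6), points))  # -> 'b'
```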

- Is there any other algorithm you would have wanted to see included there?
- Is there a similar ranking in your own area, let's say computer vision, artificial intelligence, or bioinformatics?

- Omar.

Tuesday, June 8, 2010

Hello, again.

Hello... many things have happened since I wrote my last post. My major professor moved to a different university, and I changed my research orientation a bit. It is still data mining, but now more related to databases. That change is to be expected, since I am now in a databases research group. However, I am still mixing computer vision with databases and data mining techniques. I hope you like this new orientation; don't worry, it is going to be fun both for the readers and for the author of this blog.

Generally, I do not like to say either hello or goodbye: I like to act as if I have known a person for a long time, and I do not like to think that I will never see that person again. So this post is intended to be short :).

More posts are coming soon. Sorry, the previous ones were more technical reviews of interesting papers; from now on, things are going to be more intuitive. My word.

- Omar.