Thursday, June 17, 2010

Top 10 Data Mining Algorithms (in 2006)

ICDM 2006 December 18 - 22, 2006, Hong Kong

A colleague drew my attention to the top-10 algorithms that most influenced the development and practice of science and engineering in the 20th century, according to Jack Dongarra and Francis Sullivan.

As expected, the list is somewhat historical, covering quite a wide range of areas used in engineering. It is also nice to find Francis Sullivan as coauthor of that work (though note that the QR algorithm in the list is due to John G. F. Francis, a different Francis). Please read their top-10 document and the corresponding discussions; it is worth it. Here is their list:
  1. The Monte Carlo method or Metropolis algorithm
  2. The simplex method of linear programming
  3. The Krylov Subspace Iteration method
  4. The Householder matrix decomposition
  5. The Fortran compiler
  6. The QR algorithm for eigenvalue calculation
  7. The Quicksort algorithm
  8. The Fast Fourier Transform
  9. The Integer Relation Detection Algorithm (given N real values x_1, ..., x_N, are there nontrivial integer coefficients a_1, ..., a_N such that a_1*x_1 + ... + a_N*x_N = 0?)
  10. The fast Multipole algorithm (to calculate gravitational forces in an N-body problem)
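To give a flavor of item 1, here is a minimal Monte Carlo sketch (my own illustration, not from the original article): estimating π by sampling points in the unit square and counting how many land inside the quarter circle.

```python
import random

def monte_carlo_pi(n_samples, seed=0):
    """Estimate pi: the fraction of random points in the unit square
    that fall inside the quarter circle approaches pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

print(monte_carlo_pi(100_000))  # converges toward 3.14159... as n grows
```

The error shrinks as 1/sqrt(n), which is slow but independent of dimension — the reason Monte Carlo methods shine on high-dimensional problems.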

It would be interesting to have such a list for data mining, wouldn't it? Indeed, we have one: the famous top-10 algorithms for data mining, which I quickly read some years ago and which is now the perfect excuse for this post. Unlike the previous work, this list was built by the consensus of 494 voters in 2006. This is the story.

Nomination

During the 2006 IEEE International Conference on Data Mining, held in Hong Kong, some of the most representative data mining algorithms were presented. Although controversial rankings can be found on the internet, on this occasion the algorithms were nominated by qualified professionals (ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners) tied to two of the most important conferences in the area. Each winner listed up to 10 data mining algorithms that they consider widely used in computer science.

Voting

To reduce the number of candidate algorithms, those with fewer than 50 citations in Google Scholar (as of October 2006) were filtered out. The remaining 18 algorithms were then voted on by the Program Committee members of KDD-06, ICDM '06, and SDM '06, as well as the ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners: 494 votes in total. The results were presented at the 2006 edition of ICDM (whose logo is at the beginning of this post). Here is the list with the corresponding votes.

  1. C4.5 (61 votes), presented by Hiroshi Motoda
  2. K-Means (60 votes), presented by Joydeep Ghosh
  3. SVM (58 votes), presented by Qiang Yang
  4. Apriori (52 votes), presented by Christos Faloutsos
  5. EM (48 votes), presented by Joydeep Ghosh
  6. PageRank (46 votes), presented by Christos Faloutsos
  7. AdaBoost (45 votes), presented by Zhi-Hua Zhou
  8. kNN (45 votes), presented by Vipin Kumar
  9. Naive Bayes (45 votes), presented by Qiang Yang
  10. CART (34 votes), presented by Dan Steinberg
For data mining people, the information above will always invite a search for hidden relationships; hard to resist, sorry :).
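Among the entries above, PageRank is compact enough to sketch in full. The following is a minimal, illustrative power iteration (the damping factor and toy graph are my own choices, not from the original survey):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping each page to the
    list of pages it links to. Dangling pages spread rank uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling node: distribute its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# "C" ends up with the highest rank: it is linked from both A and B
```

The ranks form a probability distribution (they sum to 1), which is exactly the stationary distribution of a random surfer who follows links with probability 0.85 and jumps to a random page otherwise.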

The most and least voted algorithms are C4.5 and CART, respectively. This seems to indicate the importance of decision trees as a tool to classify information and perform regression in data mining, especially some decades ago. Both build decision trees by recursively splitting the data into rules. CART selects splits using the Gini impurity (and provides error-based pruning), while C4.5 selects the split that most reduces entropy (information gain); the entropy criterion is arguably more intuitive, which is probably one reason more people have adopted C4.5 rather than CART.
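To make the two split criteria concrete, here is a minimal sketch of the impurity measures themselves (only the measures, not the full tree-growing algorithms):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label multiset: C4.5 picks the split
    that most reduces this quantity (information gain)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: CART's default split criterion."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A pure node scores 0 under both; a 50/50 node is maximally impure.
print(entropy(["a", "a", "b", "b"]))  # 1.0
print(gini(["a", "a", "b", "b"]))     # 0.5
print(entropy(["a", "a", "a", "a"]))  # 0.0 (pure node)
```

Both measures rank splits similarly in practice; the choice between them matters less than pruning and the handling of continuous attributes.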

A different approach to the same problem is SVM, third in the overall ranking. While an SVM separates classes with non-linear boundaries via a kernel function, C4.5 is based on rules that typically separate classes with axis-parallel hyperplanes. This often yields more accurate results for SVMs than for neural networks or decision trees. A more sophisticated alternative to C4.5 would be the Random Forest technique. The remaining algorithms seem to be equally important for the consensus.
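To illustrate the kernel idea without the full SVM machinery, here is a toy kernelized perceptron (my own sketch: the same kernel trick, but with a much simpler optimizer and no margin maximization) learning XOR, a problem no linear boundary can solve:

```python
from math import exp

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_perceptron(X, y, kernel, epochs=20):
    """Train a kernelized perceptron; returns dual coefficients alpha,
    one per training point (the kernel-trick analogue of a weight vector)."""
    alpha = [0.0] * len(X)
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            s = sum(a * yj * kernel(xj, xi)
                    for a, yj, xj in zip(alpha, y, X))
            if yi * s <= 0:  # misclassified: strengthen this point
                alpha[i] += 1.0
    return alpha

def predict(x, X, y, alpha, kernel):
    s = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X))
    return 1 if s >= 0 else -1

# XOR: not linearly separable, but separable with an RBF kernel.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
alpha = kernel_perceptron(X, y, rbf)
print([predict(x, X, y, alpha, rbf) for x in X])  # [-1, 1, 1, -1]
```

An SVM uses the same dual representation but chooses the alphas to maximize the margin, which is what gives it the accuracy advantage mentioned above.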

The list above looks focused on a few particular aspects of computer science studied in data mining, mostly oriented to knowledge discovery. Personally, I would have liked to see some data stream algorithms, like Lossy Counting. Their absence, however, makes sense considering that this area only took off around 2000, unlike other longer-established techniques. In data mining we mainly study four general problems, and most of our 10 algorithms fall into one of them:
  1. Clustering
  2. Classification
  3. Regression
  4. Association rule learning
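The fourth problem is the home of Apriori from the list above. Here is a minimal frequent-itemset miner sketching its key idea (illustrative only; real implementations add subset-based candidate pruning and much faster support counting):

```python
def apriori(transactions, min_support):
    """Frequent-itemset mining via the Apriori property: a k-itemset
    can only be frequent if all of its (k-1)-subsets are frequent."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    frequent = {}
    current = [frozenset([i]) for i in items]
    while current:
        current = [s for s in current if support(s) >= min_support]
        for s in current:
            frequent[s] = support(s)
        # candidate generation: join frequent k-sets into (k+1)-sets
        current = list({a | b for a in current for b in current
                        if len(a | b) == len(a) + 1})
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "eggs"}]
freq = apriori(baskets, min_support=0.5)
# every single item and every pair appears in >= 50% of baskets,
# but the triple {milk, bread, eggs} does not, so it is pruned
```

Association rules such as "bread => milk" are then read off the frequent itemsets by comparing supports.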
But it is interesting to see how some algorithms can solve two or more of these problems, evidence of how transversal many problems in computer science are. As an example, kNN queries are commonly used in information retrieval, but they are also employed to perform fast, incremental classification. Similarly, the notion of a metric distance is used to compute fast approximate clusterings, and kNN queries over hash tables can speed up association rule mining.
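The kNN classification mentioned above fits in a few lines (a brute-force sketch; real systems use index structures such as k-d trees or hashing to speed up the neighbor search):

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_classify(query, points, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    nearest = sorted(range(len(points)),
                     key=lambda i: dist(query, points[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_classify((1, 1), points, labels))      # "low"
print(knn_classify((5.5, 5.5), points, labels))  # "high"
```

Because there is no training phase at all, new labeled points can be added incrementally, which is exactly what makes kNN attractive for the incremental classification scenario described above.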

- Is there any other algorithm you would have liked to see included?
- Is there a similar ranking in your own area, say computer vision, artificial intelligence, or bioinformatics?

- Omar.
