Boil boil toil and trouble – 3 Data Mining recipes
As part of the work we are doing to develop our one-day workshop titled ‘from data analyst to data scientist’, we have been researching the CRISP-DM methodology as it’s a core part of one of the workshop modules.
There are three main data mining methodologies that are generally used:
CRISP-DM (CRoss-Industry Standard Process for Data Mining)
SEMMA (Sample, Explore, Modify, Model, Assess)
KDD (Knowledge Discovery in Databases)
There are lots of articles online that explain each of these methodologies at both a high level and in some detail, not to mention a raft of data mining books that describe them.
There is also this good comparison article, “KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW”, which explains the differences between the three methods.
At the moment, we are focussed on using the CRISP-DM method for a few reasons. There is a wealth of open documentation available for CRISP-DM, including the IBM – SPSS Modeler CRISP-DM Guide and RapidMiner – Data Mining for the Masses.
According to this 2014 KDnuggets survey, CRISP-DM still seems to be the most widely used method.
As the SEMMA method is proprietary to SAS, teaching with it is not an option for us.
However, the KDD method is one we haven’t investigated much, so we will be exploring it as we develop the workshop content to see if it is a better fit for our Agile approach.
I’ll be sure to blog what we find.