Many teams consider using CRISP-DM for their data science projects, and many data science leaders explore CRISP-DM training and certification for their teams. In fact, CRISP-DM is the most commonly considered data science process framework.
However, while CRISP-DM defines a set of phases for doing data science, it does not describe an actual process data science teams should use to execute data science projects.
Specifically, the CRoss-Industry Standard Process for Data Mining (CRISP-DM) defines a set of tasks and deliverables (such as documentation and reports) for six key iterative phases of a data science project:
- Business Understanding: determine business objectives; assess situation; determine data mining goals; produce project plan
- Data Understanding: collect initial data; describe data; explore data; verify data quality
- Data Preparation (generally, the most time-consuming phase): select data; clean data; construct data; integrate data; format data
- Modeling: select modeling technique; generate test design; build model; assess model
- Evaluation: evaluate results; review process; determine next steps
- Deployment: plan deployment; plan monitoring and maintenance; produce final report; review project
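As a rough illustration, the six phases and their tasks listed above could be captured in a simple checklist structure. This is only a sketch: the dictionary layout and the `remaining_tasks` helper are hypothetical names chosen for the example and are not part of the CRISP-DM guide.

```python
# Phases and tasks copied from the list above. The data structure and the
# helper function are illustrative assumptions, not part of CRISP-DM itself.
CRISP_DM_PHASES = {
    "Business Understanding": [
        "determine business objectives",
        "assess situation",
        "determine data mining goals",
        "produce project plan",
    ],
    "Data Understanding": [
        "collect initial data",
        "describe data",
        "explore data",
        "verify data quality",
    ],
    "Data Preparation": [
        "select data",
        "clean data",
        "construct data",
        "integrate data",
        "format data",
    ],
    "Modeling": [
        "select modeling technique",
        "generate test design",
        "build model",
        "assess model",
    ],
    "Evaluation": [
        "evaluate results",
        "review process",
        "determine next steps",
    ],
    "Deployment": [
        "plan deployment",
        "plan monitoring and maintenance",
        "produce final report",
        "review project",
    ],
}


def remaining_tasks(completed):
    """Return, per phase, the tasks not yet in the `completed` set.

    `completed` is a set of (phase, task) tuples.
    """
    return {
        phase: [task for task in tasks if (phase, task) not in completed]
        for phase, tasks in CRISP_DM_PHASES.items()
    }


if __name__ == "__main__":
    done = {("Business Understanding", "determine business objectives")}
    for phase, tasks in remaining_tasks(done).items():
        print(f"{phase}: {len(tasks)} task(s) remaining")
```

A structure like this only tracks what the framework asks for. It says nothing about who does the work or when, which is the gap discussed in the rest of this article.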
CRISP-DM was defined in 1996, and in practitioner polls it has consistently been the most commonly used framework for analytics, data mining and data science projects. However, most of the polls people cite are old, such as the KDnuggets 2014 poll. Despite being close to 25 years old, and still well-known and popular, CRISP-DM has not been revised since its creation.
CRISP-DM’s focus on Business Understanding helps align technical work with business needs and steers data scientists away from jumping into a problem without properly understanding business objectives. Its final step, Deployment, likewise addresses important considerations for closing out the project and transitioning to maintenance and operations. However, thinking about deployment only at the end of a project is not a good idea, as there are many potential issues in simply handing the analysis to a DevOps team to run (e.g., has the team thought about model updates, and about the frequency and method of data refreshes?).
As shown in the image above, CRISP-DM’s flexible, cyclical nature can provide some of the benefits of agile in that the framework suggests that the team should keep looping through the project phases, each time gaining a deeper understanding of the data and the problem. For example, modeling might suggest additional data preparation that would be helpful, or evaluation of a model can lead to new business insights. This is a key aspect of agile data science – that the empirical knowledge gained from previous cycles should be fed into the following cycles.
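The looping described above can be sketched as a small state machine. This is a minimal sketch under stated assumptions: only the two loop-backs mentioned in the paragraph (modeling back to data preparation, evaluation back to business understanding) are encoded, the restart after Deployment is an assumption, and the names `LOOP_BACKS`, `next_phase` and `walk` are hypothetical.

```python
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Assumption: only the two loop-backs named in the text are modeled here.
# A real project may revisit other phases as well.
LOOP_BACKS = {
    "Modeling": "Data Preparation",
    "Evaluation": "Business Understanding",
}


def next_phase(current, loop_back=False):
    """Return the phase that follows `current`.

    If `loop_back` is True and a loop-back is defined for `current`,
    return the earlier phase; otherwise move forward one phase.
    """
    if loop_back and current in LOOP_BACKS:
        return LOOP_BACKS[current]
    index = PHASES.index(current)
    # Assumption: after Deployment, a new cycle starts at the first phase.
    return PHASES[(index + 1) % len(PHASES)]


def walk(start, decisions):
    """Follow a list of loop-back decisions (booleans) from `start`."""
    path = [start]
    for loop_back in decisions:
        path.append(next_phase(path[-1], loop_back))
    return path


if __name__ == "__main__":
    # Modeling reveals a need for more data preparation, then work proceeds.
    for phase in walk("Data Preparation", [False, True, False, False]):
        print(phase)
```

Note that the `loop_back` flag has to be supplied by the caller: the sketch can record that the team looped back, but, like CRISP-DM itself, it offers no rule for deciding when to do so.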
While CRISP-DM enables a level of agility (in that the team does not progress linearly through a project), there are also aspects of CRISP-DM that are a bit more waterfall-like. Specifically, while many people know about the six phases of a CRISP-DM project, the framework also documents specific reports that should be completed during each phase. For example, the CRISP-DM framework has 12 reports that should be created prior to data collection. Furthermore, CRISP-DM does not address how the team should decide when to loop back: there is no structure or guidance on how a team should coordinate work or get feedback from stakeholders.
Enhancements to CRISP-DM
There have been several attempts to improve CRISP-DM. Three are discussed below:
- In 2015, IBM introduced the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM). ASUM-DM is compatible with CRISP-DM, but focuses more on iterative analysis and on how to deploy the analysis. Specifically, it has five project phases, which are similar to CRISP-DM's, but with a more tightly integrated iterative phase (Analyze, Design, Configure & Build) and another tightly integrated deployment phase (Deploy, and Operate & Optimize). To date, there has been minimal adoption of ASUM-DM.
- In 2016, Nancy Grady of SAIC expanded upon CRISP-DM to publish Knowledge Discovery in Data Science (KDDS). KDDS extends CRISP-DM to address big data and to provide some integration with management processes. Specifically, KDDS defines four distinct phases (assess, architect, build, and improve) and five process stages (plan, collect, curate, analyze, and act). KDDS can be a useful expansion of CRISP-DM for big data teams. However, KDDS only addresses some of the shortcomings of CRISP-DM. For example, it is not clear how a team should iterate when using KDDS. To date, there has been minimal adoption of KDDS.
- Also in 2016, Microsoft introduced the Team Data Science Process (TDSP). Check out a more in-depth discussion of TDSP on our TDSP web page.
For more info on CRISP-DM, take a look at the published CRISP-DM guide.