The Need for Data Science Project Management
The Data Science Process Alliance (DSPA) views data science as encompassing the full project life cycle required to gain insight from data (i.e., understanding the business requirements; performing data collection, preparation, analysis, and visualization; and then storing the relevant datasets and analytical models). DSPA believes that a data science project is already becoming, and will increasingly be, a team effort, with people contributing across a range of roles and responsibilities. In addition, DSPA believes that the ethical conundrums that data science efforts need to consider (e.g., bias in predictive analytics) will continue to grow in number and importance, and that one of the best ways to ensure ethical oversight is to integrate ethics evaluation within the data science team’s process framework.
However, because data science is a new field, most of these themes are just starting to be explored, and there is minimal training and certification relating to the data science process (sometimes known as data science project management). A simple analogy is software development: during the past fifty years, the software development process has evolved from an individual, star-programmer approach to a team-focused approach. DSPA believes that data science projects will undergo a similar evolution, and that we are currently at the ‘star data analyst’ phase. For example, on many projects the lead data scientist is implicitly responsible for ensuring that all team members are working efficiently, that the “client” is properly engaged, and that there is no bias in any of the predictive analytical results.
Without an effective framework to communicate and collaborate, project execution challenges are bound to arise. Some of these potential challenges are summarized below:
- Ambiguous Project Goals and Priorities
The data science team needs to collectively understand the goal of the project. This understanding needs to be developed as a team, since stakeholders need to understand what type of analysis is possible, what might require significant upfront time or cost, and which metrics might be used to assess the results of the project. Stated another way, the product owner/stakeholder needs help to clearly define requirements and to prioritize among the possible approaches.
- When Is an Analysis “Good Enough”?
While a team should evaluate its project in terms of analytical results (e.g., predictive accuracy), it also needs to put those results into a business context. For example, is a 1% improvement in prediction worth the time and energy of the team, or are there other projects of higher potential value? In other words, teams need to be mindful of how they prioritize their time and how their actionable insight impacts the organization.
- Implementation Challenges
The time to think about how to deploy or operationalize a machine learning algorithm is not at the end of the project (i.e., there can be many challenges when a “data science group” just gives the results to the “IT group” after the model has been developed). Implementation challenges (such as data refreshes and incremental model learning) need to be discussed throughout the project.
- The Need for Agility
While many frameworks used to structure a data science project (such as CRISP-DM and TDSP) describe the data science process as a series of phases, it is important to realize that iteration is a key enabler of a team’s efficiency and effectiveness. To best address a problem, the team needs to iteratively explore different potential machine learning models and data attributes. In other words, it is unrealistic to expect the team to “know” the most important data attributes, as well as the best machine learning models, at the start of a project. So, when defining or selecting a framework, it is important to make sure there is a well-defined process for when and how the team decides to “loop back” and how it then collectively defines its next iteration.
- Ensuring Stakeholder Engagement
Many times, the product owner, senior stakeholder, or manager thinks of a new analysis to be done and asks the team to do it while still continuing the team’s existing work. While this person is thinking “it shouldn’t be too much work”, in real life the problem is that the impact of doing this analysis is not clearly understood. For example, it might take more time than expected (what seems easy to a stakeholder might not actually be easy in data science), and the impact on other team deliverables might not be clear. While all team members should be encouraged to suggest new ideas, there needs to be effective group communication and prioritization of those ideas, so that the full knowledge of the team is leveraged and the most important tasks are completed quickly.
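As a rough illustration of the prioritization questions raised above (whether a 1% improvement is "good enough", and how to weigh a new stakeholder request against existing work), a team could compare candidate tasks by estimated value per day of effort. The sketch below is a minimal example, not a DSPA method: the `Candidate` class, the `rank_by_value_per_day` function, and all task names and numbers are hypothetical and invented purely for illustration. In practice the estimates would come from a team discussion with the stakeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate piece of work; every field is a team estimate, not a measurement."""
    name: str
    expected_value: float  # estimated business value, in any consistent unit
    effort_days: float     # estimated team effort, in days


def rank_by_value_per_day(candidates):
    """Return candidates sorted by estimated value per day of effort, highest first."""
    return sorted(
        candidates,
        key=lambda c: c.expected_value / c.effort_days,
        reverse=True,
    )


# Hypothetical backlog; the numbers are assumptions chosen only for illustration.
backlog = [
    Candidate("Tune model for +1% accuracy", expected_value=20_000, effort_days=15),
    Candidate("Ad hoc analysis requested by stakeholder", expected_value=12_000, effort_days=5),
    Candidate("Automate data refresh for deployment", expected_value=30_000, effort_days=10),
]

ranked = rank_by_value_per_day(backlog)
for c in ranked:
    print(f"{c.name}: {c.expected_value / c.effort_days:,.0f} per day")
```

A single ratio like this is deliberately crude. Its main use is to make the team's assumptions explicit so that they can be discussed and challenged as a group, rather than to produce a definitive ordering.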