Data Science Project Example Walk-through: Predictive Analytics on London Cycle Hire Data using the CRISP-DM and TDSP process models: #2 Waterfall v Agile (Scrum v Kanban)

By | 27th September 2017

In Part #1 – Backgrounder we covered a basic introduction to CRISP-DM and Microsoft TDSP; in this article we put our project framework in place.

As Benjamin Franklin said, “By failing to prepare, you are preparing to fail”. I wasn’t expecting to talk about the project management side in much, if any, detail, but after I started writing this I realised it’s just as important as the process model itself. I’ll talk about Business Understanding in the next article. Planning is key to any project. In Data Science and Data Warehousing you often don’t know a lot of things up front, so it’s really difficult to plan a project accurately because things will undoubtedly crop up unexpectedly. Choosing the correct project methodology is a key component of achieving a successful outcome. Ironically, a successful outcome in Data Science may be the discovery that the task at hand is not doable, but finding that out early may save a shed load of time, resources and money!

Project framework – Agile or Waterfall?

Waterfall is sequenced project delivery, so for our two process models we would complete the Business Understanding phase, then Data Understanding, then the other phases in sequence. We design up-front, we build the entire house, then decorate it, and then live in it.

Agile gives a structured framework to an iterative “feedback and enhance” approach. When applied to TDSP / CRISP-DM it allows cross-phase improvements: for instance, we might find in the Data Understanding phase that something is not doable or requires more clarity, so the Business Understanding needs to be revisited. We modify as we go: we build the house room by room, decorating, re-building and living in it until it’s complete. There is rework, but it is expected and caught early, which makes it cheaper to fix than in Waterfall, where you would be knocking the entire house down and starting again.

When performing Data Science or even Business Intelligence it’s often difficult to fully scope what you are trying to do. In both disciplines you take data from a feed, and the quality of that data will not be known until you’ve got it loaded; in fact, loading it in the first place may be problematic! Take the London Cycle usage data: there are 118 CSV files amounting to 5.39GB of raw data. They all have a similar structure, but some CSV files have quoted text and others do not, one has an additional column in the middle, and some files contain text in what you’d think are numeric columns. You would not know any of that at the outset of the project, only once you had loaded the data (Data Understanding phase).
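The kinds of quirks described above (mixed quoting, a stray extra column, text in numeric columns) can be surfaced with a quick profiling pass before any real loading is attempted. The following is a minimal sketch in Python, not the approach used later in this series; the function name `profile_csv`, the sample rows and the column names are all made up for illustration and are not taken from the real feed.

```python
import csv
from collections import Counter


def profile_csv(text, numeric_columns=()):
    """Summarise structural quirks in the contents of one CSV file.

    `numeric_columns` lists header names we *expect* to be numeric;
    these names are assumptions supplied by the caller.
    """
    lines = text.splitlines()
    rows = list(csv.reader(lines))
    header, data = rows[0], rows[1:]

    # Count how many fields each data row has; more than one distinct
    # count hints at an extra or missing column somewhere in the file.
    field_counts = Counter(len(row) for row in data)

    # Crude quoting check: does any raw line contain a double quote?
    has_quoted_text = any('"' in line for line in lines)

    # For each expected-numeric column, collect values that fail to parse.
    bad_numeric = {}
    for name in numeric_columns:
        if name not in header:
            continue
        idx = header.index(name)
        bad = []
        for row in data:
            if idx < len(row) and row[idx].strip():
                try:
                    float(row[idx])
                except ValueError:
                    bad.append(row[idx])
        if bad:
            bad_numeric[name] = bad

    return {
        "header_width": len(header),
        "field_counts": dict(field_counts),
        "has_quoted_text": has_quoted_text,
        "bad_numeric": bad_numeric,
    }


# Tiny made-up sample; the column names are illustrative only.
SAMPLE = (
    'Rental Id,Duration,Bike Id,StartStation Name\n'
    '1,360,42,"Hyde Park Corner, Hyde Park"\n'
    '2,n/a,17,Waterloo Station\n'
    '3,120,99,extra,Kings Cross\n'
)

report = profile_csv(SAMPLE, numeric_columns=["Duration", "Bike Id"])
print(report)
```

Running something like this over each file in turn and comparing the reports is one cheap way to find the odd files out. Note the sketch splits on line breaks first, so it assumes no quoted field contains an embedded newline.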

Waterfall, for me, is a non-starter for Data Science and Data Warehousing; the iterative nature of Agile wins hands down, and it also reflects the iterative nature of CRISP-DM and TDSP. For example, once you get insights from the data (which may simply be data quality issues) the remit of the project may change, which requires revisiting the Business Understanding and Data Understanding phases. You don’t want to be too far down the track when issues are discovered. With Agile a project can be canned much earlier, and remember that a lot of DS work may not amount to something that gives you any benefit.

As you can probably guess, we will be using Agile for this project, either Scrum or Kanban, which we’ll discuss now.

Agile – Scrum v Kanban?

Scrum is a process framework for implementing Agile. Work is contained in fixed-length “sprints”, usually one or two weeks long, and planning is done around those sprints.

Kanban is a visual card-based system where the cards represent tasks/stories and are positioned on a board made up of columns and swim lanes. For example, the columns might be To-do, Blocked, In-Progress, Review, Complete and Closed, with the swim lanes aligned with our process model phases, i.e. Business Understanding, Data Understanding etc. The swim lanes aren’t strictly necessary; they just help you focus on where the stories sit.

Which flavour of Agile you use will depend entirely on your organisation, although from researching the topic I’ve found suggestions that Kanban boards are better suited to Data Science projects because they offer more dynamism in task prioritisation. Remember that with DS we will often have tasks created as we go through the process, and those often can’t wait for the next sprint. Below I’ve given you some links to background reading; for the purpose of this project I will use a Kanban board.

https://www.atlassian.com/agile/kanban
https://www.atlassian.com/agile/scrum

Microsoft Visual Studio Team Services: https://docs.microsoft.com/en-us/vsts/work/kanban/ or for Scrum: https://docs.microsoft.com/en-us/vsts/work/scrum/index

Team Data Science Process on Channel 9: https://channel9.msdn.com/Shows/Cloud+Cover/Episode-227-Team-Data-Science-Process

I will also be using Microsoft Visual Studio Team Services – Agile Tools for my project management. I have created a project and, using the Kanban board on “Stories”, set up my columns and swim lanes. I have the columns To-Do, Blocker, In-Progress (split Doing/Done), Review (split Doing/Done), Complete and Closed. I then have the swim lanes Business Understanding, Data Understanding, Data Preparation, Modelling and Evaluation, with the default lane as Assignable. Stories contain one or more tasks, and it is the story that moves between columns. I did not make the columns match the process model phases because I am using stories to group the tasks to be done within each phase of the process workflow.
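To make the board layout concrete, here is a minimal sketch of it as a Python data structure. It is purely illustrative and has nothing to do with how Visual Studio Team Services stores a board; the `Story` class, the `board_view` helper and the example story are hypothetical, and the Doing/Done split on In-Progress and Review is left out to keep it short.

```python
from dataclasses import dataclass, field

# Columns and swim lanes as described in the text above.
COLUMNS = ["To-Do", "Blocker", "In-Progress", "Review", "Complete", "Closed"]
SWIM_LANES = [
    "Assignable",  # default lane
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modelling",
    "Evaluation",
]


@dataclass
class Story:
    """A story groups one or more tasks; the story is what moves."""
    title: str
    lane: str = "Assignable"
    column: str = "To-Do"
    tasks: list = field(default_factory=list)

    def __post_init__(self):
        if self.lane not in SWIM_LANES:
            raise ValueError(f"unknown swim lane: {self.lane}")
        if self.column not in COLUMNS:
            raise ValueError(f"unknown column: {self.column}")

    def move_to(self, column):
        """Move the whole story (and so its tasks) to another column."""
        if column not in COLUMNS:
            raise ValueError(f"unknown column: {column}")
        self.column = column


def board_view(stories):
    """Group story titles by their (swim lane, column) cell."""
    cells = {}
    for story in stories:
        cells.setdefault((story.lane, story.column), []).append(story.title)
    return cells


# Hypothetical example story, for illustration only.
story = Story(
    "Profile the raw CSV files",
    lane="Data Understanding",
    tasks=["Load one file", "Check column counts"],
)
story.move_to("In-Progress")
print(board_view([story]))
```

The point of the sketch is the shape: the swim lane says which phase a story belongs to, the column says how far along it is, and the tasks travel with the story.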

Phases

We can categorise the phases in both CRISP-DM and TDSP into two components. The first is mainly {planning, information gathering, project feasibility, data set availability}; the second is mainly {implementation of the design, data munging, statistics, programming, deployment}. Those two categories need very different skill sets, with some cross-over. Consider the separation of phases when you create your Kanban stories and subsequent tasks. I’ve posted an IDEF-0 diagram showing the phase interaction; I’ve deliberately left out most of the detail, i.e. the mechanisms/controls (folk/tools doing the phases) and also the inputs/outputs.

Roles required in the Project

First and foremost, and a lot of this follows Data Warehouse principles, you need a Project Sponsor: somebody in the business who believes in what you are doing. Remember that a lot of Data Science projects will either fail, fail to provide business benefit, or be put to one side because the business is only doing it because it’s trendy and believes it ought to be. Do not treat a DS project like any other piece of development; treat it like building a Data Warehouse. By that I mean the profile needs to be raised within the business, so make sure there is visibility, make sure you know the people in the business who can help you and with what, and make sure you strike up a conversation. Educate them about what DS is and what it isn’t; perhaps do an introduction to DS where you talk at a high level about the steps involved (CRISP-DM / TDSP), in which phases you’ll need their help and with what tasks. Be on good terms with the folk providing your data, and when there are issues tell them and be helpful: one of the side effects of the cleansing and data prep side of a DS project is that it highlights issues with source systems.

The role of the Project Manager is to keep track of the Kanban board and make sure the process is followed, and to smooth out any politics around getting the source data or getting access to business resource for business and data domain knowledge. Business domain knowledge is knowledge about how the business runs, the process flow etc.; the Business Taxonomy flows from this. Data domain knowledge, on the other hand, is more the territory of a Database Designer/Developer: it’s what the source data set structures mean, and the Data Dictionary flows from this.

Other roles on the project will be a Data Scientist and a Data Engineer. Those are very different roles: the Data Engineer gets and prepares the data, whereas the Data Scientist creates the various models and has the statistics background to understand sample populations, model verification etc.

Summary

I’ve briefly discussed the project structure: we are going to be using a Kanban board, and I’ll be using Visual Studio Team Services Agile Tools for managing that aspect of the project.

In the next article we will get into actually doing something! We will tackle the Business Understanding phase.
