I’m embarking on a series of articles covering an end-to-end example of applying Data Science to a problem. We don’t just throw data at a tool; like any other form of development work we follow a standard process, and in the data space there are just a handful of process models to guide us through our project: CRISP-DM (Cross Industry Standard Process for Data Mining), SEMMA (Sample – Explore – Modify – Model – Assess), KDD (Knowledge Discovery in Databases) and finally Microsoft’s own TDSP (Team Data Science Process). Because I want to use one of the fuller models I’ll be using CRISP-DM and Microsoft’s Team Data Science Process. TDSP appears to have been seeded from CRISP-DM; by looking at the two, hopefully we’ll get a better understanding of how TDSP better suits today’s environment.
What is Data Science?
Well, it’s not taking your raw data, pushing it straight through an R, Python or Azure ML library and then starting to predict stuff. There are many parts to this, which is why it’s good to follow CRISP-DM / TDSP – it keeps you honest.
Data Science is not just pushing data through R and Python Machine Learning libraries, nor through Azure Machine Learning Studio with a visualisation over the top. In fact, if you don’t have an understanding of the domain you are modelling, aren’t doing tasks such as data cleansing, can’t reason about model choice and model validation, aren’t choosing samples correctly, and don’t know that correlation does not equal causation (google that last one, it’s really amusing), then you’ve a lot of learning to do. Data Science is a super-set of roles/techniques such as Domain Knowledge, Project Management, Machine Learning, Analytics, Visualisations, Data Mining and Business Analysis.

Don’t let yourself be fooled by the hype around Data Science: you can do Data Science with just Excel. The practice has been around for decades and is not tied to Big Data (OK, maybe Variety, because you will end up pulling in multiple data sets from different sources, internal and third party). Notice no mention of Hadoop – Big Data doesn’t require Hadoop. Applying Data Science requires domain experience, the ability to apply statistical techniques, and the ability to chop and refine data. You aren’t restricted to R or Python; those languages are common because of their freely available libraries. It’s unusual for somebody to have all the skills required to be a Data Scientist, which is why you need a team of people.
Deliverables
Anyway, what are we doing in this series of articles? Adopting the CRISP-DM (Cross Industry Standard Process for Data Mining) and TDSP process models, we’ll (hopefully) perform some Predictive Analytics on the Transport for London Cycle Usage data, and we’ll hopefully throw out some additional insights along the way. I’ll be using a number of different data modelling techniques (relational and graph), because remember we don’t just throw the data in a database – we have to model it. The cycle data lends itself to some graph-type data processing (likely a mix of Cosmos DB, Neo4j and SQL Server 2017 graph data types for comparative purposes) as well as traditional relational (we can talk through good ways of holding the data and using SQL efficiently). I’ll be using SSIS for ingesting the data and possibly something else like Perl or Python; for the prediction bits we’ll likely stick to R, F# and Python, as well as Azure ML of course!
Hopefully this will give you insight into how you’d go about doing this in your own company; it will give you a fully worked example of using CRISP-DM and TDSP.
Once we have the model we can talk about how we are going to implement it, perhaps writing a little Android app, at which point we can bring in DevOps and AnalyticOps and discuss why your model isn’t static and why you’ll need to update and release it as you would any other code.
I’m more than happy for people to help me with this – just get in touch. Perhaps take the data and do your own processing inside the CRISP-DM / TDSP framework; if you write it up I’ll put some summary text in this series linking out to your own.
Note: I’m doing this independently of Transport for London, using their public data API under the general terms and conditions of use; see their site: https://tfl.gov.uk/info-for/open-data-users/.
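As a taste of what’s available, here’s a minimal Python sketch that pulls the current bike point list from TfL’s unified API. The BikePoint endpoint and the property names (NbBikes, NbEmptyDocks) are taken from my reading of the TfL docs, so treat them as assumptions and check the docs (and any app-key requirements) before relying on them:

```python
# Minimal sketch: list bike points from TfL's unified API.
# Endpoint and property names assumed from the TfL docs - verify before use.
import requests

resp = requests.get("https://api.tfl.gov.uk/BikePoint", timeout=30)
resp.raise_for_status()

for point in resp.json()[:5]:  # just peek at the first few bike points
    # Each bike point carries extra key/value properties, including the
    # number of bikes currently docked and the number of empty docks.
    props = {p["key"]: p["value"] for p in point.get("additionalProperties", [])}
    print(point["commonName"], props.get("NbBikes"), props.get("NbEmptyDocks"))
```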
What is CRISP-DM (Cross Industry Standard Process for Data Mining)?
CRISP-DM was devised in the late nineties to provide a standard framework for Data Mining (the forerunner to Data Science – in fact Data Mining is one of the aspects of DS), so it is natural that CRISP-DM is useful to us while working on Data Engineering and Data Science projects.
Wikipedia gives a lot of good links and a very good introductory description: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining.
You will need the actual process model guide, but sadly the original crisp-dm.org website is no more; plenty of sites still host the PDF, e.g. https://www.the-modeling-agency.com/crisp-dm.pdf.
What are the basics of CRISP-DM?
We’ve all no doubt been there: the user asks for ‘x’, you give them ‘x’ (which is actually y), but they really wanted ‘z’. Translating business requirements/understanding into technical implementation isn’t an art, it’s a technique; some people are good at it, some aren’t. With time pressures on the business user, gaining clarity is always a difficult task, so following a standard, iterative framework is always going to ease that process.
A common understanding on both sides will improve the odds of a successful outcome, especially in Data Science, where slight misunderstandings can send the team off in the wrong direction.
The first phase we come to in CRISP-DM is “Business Understanding”, where we nail down the project objectives and requirements from the viewpoint of the Business and then convert that into a Data Science problem definition and a preliminary plan. Remembering this is an iterative process, our first cut is this: the Business would like to improve the availability of cycles so that hirers don’t visit an empty bike point, so our task is to predict the availability of bikes at a given bike point at specific times of the day, with a view to a) providing the user with a visualisation of busy times and b) providing the business with information on stocking levels.
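To make that problem definition concrete, here’s a sketch of the prediction interface we’re working towards. The name and signature are purely illustrative – the actual model comes much later, in the Modelling phase:

```python
# Illustrative only: the shape of the prediction agreed in Business
# Understanding, not an implementation (that comes in Modelling).
from datetime import datetime

def predict_available_bikes(bike_point_id: str, at: datetime) -> float:
    """Expected number of bikes docked at bike_point_id at time `at`.

    This captures the problem definition: availability of a given bike
    point at a specific time of day, feeding a) the busy-times
    visualisation and b) the stocking-level reporting.
    """
    raise NotImplementedError("model is built in the Modelling phase")
```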
The next phase, “Data Understanding”, concerns what datasets are available for us to use; remember, our model accuracy depends on having data suitable to the task defined in “Business Understanding”. TfL provide the cycle usage data, which gives us details of each time a bike is taken out of a bike point, which bike point it is returned to, and the duration of the hire. There are a ton of other datasets that may help in our task here – what things would affect cycle hire? Perhaps the weather, whether the bike point is near a station, travel problems, the day of the week (working/non-working days), seasonal periods etc. Data Understanding is about finding this data and making it available for the next phase. This is our play-about stage; for instance, I’ve already discovered that one of the many cycle usage CSV files is in a different format to the others, and in some files we have start dates set to 1901 with an end_station_id of 0 – there are reasons for this and we need to understand what they are.
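Here’s a hedged sketch of that play-about stage using Python and pandas: compare the column sets across the CSV files to spot the odd one out, then count the anomalous rows. The folder layout and column names (Start Date, EndStation Id) are assumptions based on the TfL extracts and may differ from file to file – which is rather the point:

```python
# Data-understanding pass over the cycle usage CSVs.
# Folder and column names are assumed; verify against your own download.
import glob

import pandas as pd

frames = []
for path in sorted(glob.glob("cycle-usage/*.csv")):
    df = pd.read_csv(path)
    print(path, sorted(df.columns))  # the odd file shows up as a different column set
    frames.append(df)

usage = pd.concat(frames, ignore_index=True, sort=False)
usage["Start Date"] = pd.to_datetime(usage["Start Date"], dayfirst=True, errors="coerce")

# The anomalies mentioned above: 1901 start dates and end station id 0.
print("1901 start dates:", (usage["Start Date"].dt.year == 1901).sum())
print("end station id 0:", (usage["EndStation Id"] == 0).sum())
```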
“Data Preparation” requires us to cleanse the data and get it in a state ready to be used by our chosen algorithms; for instance, we may need to convert some items from continuous to categorical data. In this stage we also need to start thinking about the training and validation samples, and again we need to understand our data – it might be skewed. For instance, if 90% of the collected data is Female and we are modelling for a Male-specific construct, then we’ll likely want to pick from the Male population rather than just picking 50,000 random rows from the dataset (you’d likely end up 90% Female).
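Continuing with the usage frame loaded in the Data Understanding sketch, here are those two preparation steps in pandas – binning a continuous column into categories, and stratified sampling so the skew of the full dataset doesn’t simply carry over into the sample. The column name and bin edges are again assumptions:

```python
import pandas as pd

# Continuous -> categorical: bucket hire duration (assumed to be seconds)
# into bands the model can treat as categories.
usage["duration_band"] = pd.cut(
    usage["Duration"],  # column name assumed from the TfL extract
    bins=[0, 600, 1800, 3600, float("inf")],
    labels=["<10min", "10-30min", "30-60min", ">60min"],
)

# Stratified sample: the same fraction from each band, rather than 50,000
# random rows that would simply mirror the skew of the full dataset.
sample = usage.groupby("duration_band", group_keys=False, observed=True).apply(
    lambda g: g.sample(frac=0.05, random_state=42)
)
```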
The “Modelling” phase is, well, the modelling bit – choosing which techniques to run against your data. Then we have “Evaluation”, which deals with verification of the model, and finally we have the “Deployment” phase.
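To give a flavour of Modelling and Evaluation together, here’s a hedged sketch using scikit-learn. The features DataFrame, its column names and the choice of a random forest are all placeholders – the real features come out of Data Preparation, and the model choice needs the reasoning discussed earlier:

```python
# Placeholder modelling/evaluation loop; feature and target names assumed.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = features[["hour_of_day", "day_of_week", "is_working_day"]]  # assumed features
y = features["bikes_available"]                                 # assumed target

# Hold back a validation sample - Evaluation needs data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Mean absolute error in bikes: an interpretable first check of the model.
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```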
I’ll talk about each phase as we do them.
What is TDSP (Team Data Science Process)?
Expanding on CRISP-DM, TDSP is a collection of process flows, tools and utilities to help not just you but your whole team provide the Data Science component of your Enterprise Data Platform. Microsoft have a Team Data Science Process site on GitHub which contains all the bits you need; start with the root page (https://github.com/Azure/Microsoft-TDSP), but the actual detail is in the README.md: https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/README.md.
What are the basics of TDSP?
Reviewing the TDSP, CRISP-DM, KDD and SEMMA process models you will see the commonality – start with business understanding [of the problem in hand], understand what data you have or may need, prepare that data (clean it and put it in a form suitable for each of the models you will try), then Model, Evaluate and finally Deploy. CRISP-DM and TDSP are both task-based: the process model defines the tasks you should be doing and in which phases.
At this point if you’ve looked at https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md you will see the phases match those of CRISP-DM.
Summary
Hopefully I’ve set the scene for what we are going to do over the coming weeks; feel free to interact with me in the comments or feedback to me tonyrogerson @ trover.net. The next article will deal with Project Initiation and Business Understanding.