UK SQL Server User Group – https://sqlserverfaq.com – Community of Microsoft Data Platform Professionals

Data Science Project Example Walk-through: Predictive Analytics on London Cycle Hire Data using the CRISP-DM and TDSP process models: #2 Waterfall v Agile (Scrum v Kanban)
https://sqlserverfaq.com/tonyrogerson/2017/09/27/data-science-project-example-walk-through-predictive-analytics-on-london-cycle-hire-data-using-the-crisp-dm-and-tdsp-process-models-2-waterfall-v-agile-scrum-v-kanban/ – Wed, 27 Sep 2017

In Part #1 – Backgrounder we covered a basic introduction to CRISP-DM and Microsoft TDSP; in this article we put our project framework in place.

As Benjamin Franklin said, “By failing to prepare, you are preparing to fail.” I wasn’t expecting to cover the project management side in much, if any, detail, but after I started writing this I realised it’s just as important as the process model itself; I’ll talk about Business Understanding in the next article. Planning is key to any project. In Data Science and Data Warehousing you often don’t know a lot up front, so it’s really difficult to plan a project accurately – things will undoubtedly crop up unexpectedly. Choosing the correct project methodology is a key component of achieving a successful outcome; ironically, a successful outcome in Data Science may be discovering that the task at hand is not doable, but that may save a shed load of time, resources and money!

Project framework – Agile or Waterfall?

Waterfall is sequenced project delivery: for our two process models we would complete the Business Understanding phase, then Data Understanding, then the other phases in sequence – we design up-front, build the entire house, then decorate it, and then live in it.

Agile gives a structured framework to an iterative “feedback and enhance” approach. When applied to TDSP / CRISP-DM it allows cross-phase improvements; for instance, we might find in the Data Understanding phase that something is not doable or requires more clarity, so the Business Understanding needs to be revisited. We modify as we go – we build the house room by room, decorating, re-building and living in it until it’s complete. As you can see there is rework, but that’s expected and captured early on, making it cheaper to fix than in waterfall, where we would be knocking the entire house down and starting again.

When performing Data Science or even Business Intelligence it’s often difficult to fully scope what you are trying to do. In both disciplines you take data from a feed, and the quality of that data will not be known until you’ve got it loaded – in fact loading it in the first place may be problematic! Take the London Cycle usage data: there are 118 CSV files amounting to 5.39GB of raw data. They are all of a similar structure, but some CSV files have text quoted and others not, one has an additional column in the middle, and some files contain text in what you’d think are numeric columns. You would not know that information at the outset of the project, only once you had loaded the data (Data Understanding phase).
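As a concrete illustration of the defensive pattern this leads to (a minimal sketch only – the table, column names and file path below are hypothetical, not the real TfL layout), one approach is to land each CSV into an all-varchar staging table so that rogue text in numeric columns cannot fail the load, and then profile the values with TRY_CONVERT before promoting them to typed tables:

-- Staging table: every column is varchar so a dirty file still loads
create table stg_cycle_hire (
	rental_id        varchar(50),
	duration         varchar(50),
	bike_id          varchar(50),
	end_date         varchar(50),
	end_station_id   varchar(50),
	start_date       varchar(50),
	start_station_id varchar(50)
);

-- Hypothetical path; quoted-text files may need a format file or a different loader
bulk insert stg_cycle_hire
from 'D:\data\cycle\journey-extract-001.csv'
with ( firstrow = 2, fieldterminator = ',', rowterminator = '\n' );

-- Profile what will not convert cleanly before moving to typed tables
select count(*) as bad_duration_rows
from stg_cycle_hire
where try_convert(int, duration) is null;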

Waterfall for me is a non-starter for Data Science and Data Warehousing; the iterative nature of Agile wins hands down, and it also reflects the iterative nature of CRISP-DM and TDSP. For example, once you get insights from the data – which may be data quality issues – the remit of the project may change, which requires revisiting the Business Understanding and Data Understanding phases. You don’t want to be too far down the track when issues are discovered; a project can be canned much earlier, and remember a lot of DS work may not amount to something that gives you benefit.

As you can probably guess we will be using Agile for this project, either Scrum or Kanban which we’ll discuss now.

Agile – Scrum v Kanban?

A subset of Agile, Scrum is a process framework for implementing it. Fixed-length “sprints”, usually one or two weeks, are used to contain the workload, and planning is done around those sprints.

Kanban is a visual card-based system where the cards represent tasks/stories and are positioned on a board made up of columns and swim lanes – for example the columns To-do, Blocked, In-Progress, Review, Complete and Closed, with the swim lanes aligned to our process model phases, i.e. Business Understanding, Data Understanding etc. The swim lanes aren’t strictly necessary; they just help you see where the stories sit.

Which flavour of Agile you use will depend entirely on your organisation, although in researching the topic I’ve found suggestions that Kanban boards are better suited to Data Science projects because they offer more dynamism in task prioritisation – remember that with DS we will often have tasks created as we go through the process, and those often can’t wait between sprints. Below I’ve given you some links to background reading; for the purpose of this project I will use a Kanban board.

https://www.atlassian.com/agile/kanban
https://www.atlassian.com/agile/scrum

Microsoft Visual Studio Team Services: https://docs.microsoft.com/en-us/vsts/work/kanban/ or for Scrum: https://docs.microsoft.com/en-us/vsts/work/scrum/index

Team Data Science Process on Channel 9: https://channel9.msdn.com/Shows/Cloud+Cover/Episode-227-Team-Data-Science-Process

I will also be using Microsoft Visual Studio Team Services – Agile Tools for my project management. I have created a project and, using the Kanban board on “Stories”, set up my columns and swim lanes: the columns To-Do, Blocker, In-Progress (split Doing/Done), Review (split Doing/Done), Complete and Closed, and the swim lanes Business Understanding, Data Understanding, Data Preparation, Modelling and Evaluation, with the default lane as Assignable. Stories contain one or more tasks, and it is the story that moves between columns. I did not make the columns match the process model phases because I am using stories to group the tasks to be done within each phase of the process workflow.

Phases

We can categorise the phases in both CRISP-DM and TDSP into two components: the first is mainly {planning, information gathering, project feasibility, data set availability}, the second is mainly {implementation of the design, data munging, statistics, programming, deployment}. Those two categories have very different skill sets, but with some crossover. Consider the separation of phases when you create your Kanban stories and subsequent tasks. I’ve posted an IDEF-0 diagram showing the phase interaction; I’ve deliberately missed out most of the detail, i.e. the mechanisms/controls (the folk/tools doing the phases) and also the inputs/outputs.

Roles required in the Project

First and foremost – and a lot of this follows Data Warehouse principles – you need a Project Sponsor, somebody in the business who believes in what you are doing. Remember a lot of Data Science projects will either fail, fail to provide business benefit, or be put to one side because the business is only doing it because it’s trendy and believes it ought to be. Do not treat a DS project like any other piece of development; treat it like building a Data Warehouse. By that I mean the profile needs to be raised within the business, so make sure there is visibility, make sure you know the people in the business who can help you and with what, make sure you strike up a conversation, and educate them as to what DS is and what it isn’t – perhaps do an introduction to DS where you talk at a high level about the steps involved (CRISP-DM / TDSP) and in which phases you’ll need their help and with what tasks. Be on good terms with the folk providing your data, and when there are issues tell them and be helpful; one of the side effects of the cleansing and data prep side of a DS project is to highlight issues with source systems.

The role of the Project Manager is to keep track of the Kanban board and make sure the process is followed, and to smooth out any politics in getting the source data or getting access to business resources for business and data domain knowledge. The term business domain means knowledge about how the business runs, the process flow etc.; the Business Taxonomy flows from this. Data domain knowledge, on the other hand, is more related to a Database Designer/Developer – it’s what the source data set structures mean, and the Data Dictionary flows from this.

Other roles on the project will be a Data Scientist and a Data Engineer. Those are very different roles: the Data Engineer gets and prepares the data, whereas the Data Scientist creates the various models and has the statistics background to understand sample populations, model verification etc.

Summary

I’ve briefly discussed the project structure; we are going to be using a Kanban board, and I’ll be using Visual Studio Team Services Agile Tools for managing that aspect of the project.

In the next article we will get into actually doing something! We will tackle the Business Understanding phase.

Data Science Project Example Walkthrough: Predictive Analytics on London Cycle Hire Data using the CRISP-DM and TDSP process models: #1 Backgrounder
https://sqlserverfaq.com/tonyrogerson/2017/09/15/crisp_dm_microsoft_tdsp_data_science/ – Fri, 15 Sep 2017

I’m embarking on a series of articles covering an end-to-end example of applying Data Science to a problem. We don’t just throw data at a tool; like any other form of development work we follow a standard process. In the data space there are just a handful of process models to guide us through our project: CRISP-DM (Cross Industry Standard Process for Data Mining), SEMMA (Sample – Explore – Modify – Model – Assess), KDD (Knowledge Discovery and Data Mining) and finally Microsoft’s own TDSP (Team Data Science Process). Because I want to use one of the fuller models I’ll be using CRISP-DM and Microsoft’s Team Data Science Process. TDSP appears to have been seeded from CRISP-DM; by looking at the two, hopefully we’ll get a better understanding of how TDSP better suits today’s environment.

What is Data Science?

Well, it’s not taking your raw data, pushing it straight through an R, Python or Azure ML library and then predicting stuff. There are many parts to this, which is why it’s good to follow CRISP-DM / TDSP – it keeps you honest.

Data Science is not just pushing data through R and Python Machine Learning libraries, or through Azure Machine Learning Studio, and putting a visualisation over the top. In fact, if you don’t have an understanding of the domain you are modelling, aren’t doing tasks such as data cleansing, don’t have reasoning around model choice, aren’t doing model validation, choosing samples correctly, or aren’t aware that correlation does not equal causation (google that last one, it’s really amusing), then you’ve a lot of learning to do. Data Science is a super-set of roles/techniques such as Domain Knowledge, Project Management, Machine Learning, Analytics, Visualisation, Data Mining and Business Analysis. Don’t be fooled by the hype around Data Science: you can do Data Science with just Excel, the practice has been around for decades, and it is not tied to Big Data (ok, maybe Variety, because you will end up pulling in multiple data sets from different sources, internal and third party). Notice no mention of Hadoop, because Big Data doesn’t require Hadoop. Applying Data Science requires domain experience, the ability to apply statistical techniques, and the ability to chop and refine data – you aren’t restricted to R or Python; those languages are common because of the freely available libraries. It’s unusual for somebody to have all the skills required to be a Data Scientist, which is where you need a team of people.

Deliverables

Anyway, what are we doing in this series of articles? Adopting the CRISP-DM (Cross Industry Standard Process for Data Mining) and TDSP process models, we will (hopefully) perform some Predictive Analytics on the Transport for London Cycle Usage data, and we’ll hopefully throw out some additional insights along the way. I’ll be using a number of different data modelling techniques (relational and graph) because, remember, we don’t just throw the data into a database – we have to model it. The cycle data lends itself to some graph-type data processing (likely a mix of Cosmos DB, Neo4j and SQL Server 2017 graph data types for comparative purposes) as well as traditional relational (we can talk through good ways of holding the data and using SQL efficiently). I’ll be using SSIS for ingesting the data and possibly something else like Perl or Python; for the prediction bits we’ll likely stick to R, F# and Python, as well as Azure ML of course!

Hopefully this will give you insight into how you’d go about doing this in your own company, and it will give you a fully worked example of using CRISP-DM and TDSP.

Once we have the model we can talk about how we are going to implement it, perhaps write a little Android app at which point we can bring in DevOps and AnalyticOps and discuss why your model isn’t static and you’ll need to be able to update and release as you would any other code.

I’m more than happy for people to help me with this – just get in touch. Perhaps take the data and do your own processing inside the CRISP-DM / TDSP framework; if you write it up I’ll put some summary text in this series linking out to your own.

Note: I’m doing this independently of Transport for London using their public data API under the general terms and conditions of use, see their site: https://tfl.gov.uk/info-for/open-data-users/.

What is CRISP-DM (Cross Industry Process for Data Mining)?

CRISP-DM was devised in the late nineties to provide a standard framework for Data Mining (the forerunner to Data Science); in fact Data Mining is one of the aspects of DS, so it is natural that CRISP-DM is useful to us while working on Data Engineering and Data Science projects.

Wikipedia gives a lot of good links and a very good introductory description: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining.

You will need the actual process model guide, but sadly the original crisp-dm.org website is no more; there are plenty of sites hosting the PDF, e.g. https://www.the-modeling-agency.com/crisp-dm.pdf.

What are the basics of CRISP-DM?

We’ve all no doubt been there: the user asks for ‘x’, you give them ‘x’ (which is actually y), but they really wanted ‘z’. Translating business requirements/understanding into a technical implementation isn’t an art, it’s a technique; some people are good at it, some aren’t. With time pressures on the business user, gaining clarity is always a difficult task, so following a standard iterative framework is always going to ease that process.

A common understanding on both sides will improve the odds of a successful outcome, especially in Data Science where slight misunderstandings can mean the team going off in the wrong direction.

The first phase we come to in CRISP-DM is “Business Understanding”, where we nail down the project objectives and requirements from the viewpoint of the business and then convert that into a Data Science problem definition and a preliminary plan. Remembering this is an iterative process, our first cut is this: the business would like to improve the availability of cycles so that hirers don’t visit an empty bike point, so our task is to predict the availability of bikes at a given bike point at specific times of the day, with a view to a) providing the user with a visualisation of busy times and b) providing the business with information on stocking levels.

The next phase, “Data Understanding”, relates to what datasets are available for us to use; remember our model accuracy depends on having data suitable for the task defined in “Business Understanding”. TfL provide the cycle usage data, which gives us details of each time a bike is taken out of a bike point, which bike point it is returned to and the duration of the hire. There are a ton of other datasets that may help in our task here – what things would affect cycle hire? Perhaps the weather, whether the bike point is near a station, travel problems, the day of the week (working/non-working days), seasonal periods etc. Data Understanding is about finding this data and making it available for the next phase. This is our play-about stage; for instance I’ve already discovered that one of the many cycle usage CSV files is in a different format to the others, and that in some files we have start dates set to 1901 with an end_station_id of 0 – there are reasons for this and we need to understand what they are.
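To make that concrete, a data-profiling query of the kind used in this phase might look like the following (a sketch only – it assumes the hypothetical staging table and column names from the earlier CSV example, and a dd/mm/yyyy date format, which may not match the real files):

-- How many journeys have the suspicious 1901 start date or an end_station_id of 0?
select	year(try_convert(datetime, start_date, 103)) as start_year,
	end_station_id,
	count(*) as journeys
from	stg_cycle_hire
group by year(try_convert(datetime, start_date, 103)), end_station_id
having	year(try_convert(datetime, start_date, 103)) = 1901
	or end_station_id = '0'
order by journeys desc;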

“Data Preparation” requires us to cleanse the data and get it into a state ready to be used by our chosen algorithms; for instance we may need to convert some items from continuous to categorical data. In this stage we also need to start thinking about the training and validation samples, and again we need to understand our data – it might be skewed. For instance, if 90% of the collected data is Female and we are modelling for a Male-specific construct, then we’ll likely want to pick from the Male population rather than just picking 50,000 random rows from the dataset (otherwise you’ll likely end up roughly 90% Female).
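As a hedged illustration of that sampling point (the table and column names are made up purely for the example), a stratified pick from the under-represented group in T-SQL can be as simple as:

-- Naive sampling from a 90% Female table returns roughly 90% Female rows;
-- sampling the Male stratum explicitly avoids that skew.
select top (50000) *
from	dbo.survey_responses	-- hypothetical table
where	gender = 'M'
order by newid();		-- random order; fine for modest table sizes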

The “Modelling” phase is, well, the modelling bit – what techniques to use against your data. Then we have “Evaluation”, which deals with verification of the model, and finally the “Deployment” phase.

I’ll talk about each phase as we do them.

What is TDSP (Team Data Science Process)?

Expanding on CRISP-DM, TDSP is a collection of process flows, tools and utilities to assist not only you but your team in providing the Data Science component of your Enterprise Data Platform. Microsoft have a Team Data Science Process site on GitHub which contains all the bits you need; start with the root page (https://github.com/Azure/Microsoft-TDSP), but the actual detail is in the README.md: https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/README.md.

What are the basics of TDSP?

Reviewing the TDSP, CRISP-DM, KDD and SEMMA process models you will see the commonality: start with business understanding [of the problem in hand], understand what data you have or may need, prepare that data – clean it and put it in a form suitable for each of the models you will try – then model it, evaluate it and finally deploy it. CRISP-DM and TDSP are both task based; the process model defines the tasks you should be doing and in which phases.

At this point if you’ve looked at https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md you will see the phases match those of CRISP-DM.

Summary

Hopefully I’ve set the scene for what we are going to do over the coming weeks; feel free to interact with me in the comments or send feedback to tonyrogerson @ trover.net. The next article will deal with Project Initiation and Business Understanding.

How to Repair Corrupt MDF File in SQL Server Database
http://sqlserver-qa.net/2017/07/31/repair-corrupt-mdf-file/ – Mon, 31 Jul 2017



Learn to Repair Corrupt MDF File with Perfection and Success

A damaged MDF file is not a rare thing to encounter, because the file is prone to corruption. When this occurs, all data saved in the server database becomes inaccessible, and therefore it leads to data loss. In such circumstances, it is important to repair the corrupt MDF file of the SQL Server database.

This segment will discuss how to execute the repair process. In addition, it will also give you in-depth information about the MDF file, the reasons for corruption, and the ways to repair a damaged MDF file and return it to a healthy state. Read further to know more.

Quick Glance on MDF File

The MDF file is the primary data file of a SQL Server database and stores the database’s data; you could even call it the master data file of MS SQL Server. Every SQL Server database has at least one .mdf file.

It stores components such as tables, views, indexes, XML indexes, stored procedures, triggers, rules, keys, user-defined functions, sparse columns, column set property data, data types and FILESTREAM data. The .mdf file is therefore a primary element in managing a SQL Server database.

Reasons of MDF File Corruption

There are various causes that can damage the primary data file of the server:

  • Unexpected power failure.
  • Various bugs in server.
  • Defective Operating System.
  • Sudden shutdown of the machine.
  • Issues with hard drive
  • Virus outbreaks.

Thus, the reasons for your .mdf file becoming corrupt can be anything from hardware malfunction to software faults, and so it is necessary to know how to fix a corrupted MDF file.

Technique to Repair Corrupt MDF File

Method 1: In-built Tool

SQL Server includes some tools that make it easy for users to fix a corrupted MDF file and make the saved data available again. These tools are a series of commands in the T-SQL programming language called DBCC (Database Console Commands). The purpose of these DBCC statements is to test the physical and logical consistency of MS SQL Server database files and fix any problems found.

DBCC CHECKDB is a command with which you can check the logical and physical integrity of all objects in a specific MS SQL Server database. It does this by performing the following operations in sequence:

  • Run DBCC CHECKALLOC command in the database.
  • Run DBCC CHECKCATALOG command in the database.
  • Run DBCC CHECKTABLE on every view & table in the database.
  • Verifying content of every indexed view present in the database.
  • Validating link-level consistency between table metadata and file system directories and files when storing varbinary(max) data using FILESTREAM in the file system.
  • Confirming Service Broker data in the database.

Once the above steps to repair the damaged MDF file are completed, if the command finds any corruption or errors it recommends the use of one of the repair options to fix them. The repair options are:

  • REPAIR_FAST

It maintains syntax for backward compatibility only; no repair actions are actually executed. The syntax for this repair option is: DBCC CHECKDB (‘DB Name’, REPAIR_FAST).

  • REPAIR_REBUILD

This repair option performs repairs that have no possibility of data loss. These can be quick repairs, such as repairing missing rows in non-clustered indexes, or more time-consuming repairs such as rebuilding indexes. The syntax is DBCC CHECKDB (‘DB Name’, REPAIR_REBUILD).

  • REPAIR_ALLOW_DATA_LOSS

This option attempts to fix all the issues that are reported. However, as the name of the option indicates, it can cause loss of data. The syntax is: DBCC CHECKDB (‘DB Name’, REPAIR_ALLOW_DATA_LOSS).

Limitations:
1. The database must be in single-user mode to execute any of the three repair options.
2. DBCC repair commands are not supported for memory-optimized tables.
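Putting those pieces together, a minimal hedged sketch might look like this (it assumes a database called MyDB; always take a backup first, and treat REPAIR_ALLOW_DATA_LOSS as a last resort):

-- Put the database into single-user mode, rolling back open connections
ALTER DATABASE MyDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Check integrity first without repairing, to see what is actually wrong
DBCC CHECKDB ('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Attempt the repair level that has no possibility of data loss
DBCC CHECKDB ('MyDB', REPAIR_REBUILD);

-- Return the database to normal access
ALTER DATABASE MyDB SET MULTI_USER;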

Method 2: Trouble-Free Solution

It is possible that when your SQL Server database MDF file is corrupt and you try to connect to SQL Server, you will find the database marked as SUSPECT. In such scenarios, or those discussed above, you will not be able to connect to the database. The best option to repair a corrupt SQL Server MDF database file is to restore from a recent full database backup, if one is available.
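When a recent full backup does exist, the restore itself is straightforward; a minimal sketch (the database name and backup path are illustrative only):

-- Overwrite the damaged database from the last known good full backup
RESTORE DATABASE MyDB
FROM DISK = N'D:\Backups\MyDB_Full.bak'
WITH REPLACE, RECOVERY;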

If no recent backup is available, one possible approach is to repair the corrupted MDF file of Microsoft SQL Server 2000, 2005, 2008, 2012 or 2016 with a third-party SQL database repair tool; such tools are marketed as highly used and recommended, with claims of up to 99% guaranteed recovery.

Observational Verdict

In this article we have gone through the possible solutions you can follow to recover a database when the MDF file gets corrupted or the database is marked as SUSPECT, covering both the built-in DBCC repair options and the restore/third-party routes to repair a damaged MDF database file.

Compare The Market – Data Science vs Analytical Skills
http://sqlserver-qa.net/2016/10/19/compare-the-market-data-science-vs-analytical-skills/ – Wed, 19 Oct 2016



“You can have data without information, but you cannot have information without data.” – Daniel Keys Moran

From my experience it is essential to build your argument/position with supporting data at hand; we can’t win arguments with ‘..I think..’, but rather with ‘…here is the proof…’. If you cannot explain it simply, you don’t understand it well enough!

Data management is a key differentiator in any business; it is what sets today’s thriving organizations apart.

Data in all forms & sizes is being generated faster than ever before

At many organisations there is a constant struggle between IT and the Business: they have to run a huge portfolio of apps, the business always wants more apps, but IT is struggling just to keep running what it already has. Recognising this is also a way to cement that you are an expert and that you understand their challenges. So the ideal approach is not to treat data management as just another project, but rather to design the solution as an evolving process.

  • What makes data science so special that it is becoming popular?
  • How can you elevate data mining/machine learning skills for data science?
  • Where can statistical and operational research help you to accomplish a stepping-stone career in data science?

At the moment new job titles in the industry are buzz words, as core job roles and responsibilities have become associated with them. Broadly, whoever is dealing with data, either collecting or analysing it, can be called a data analyst. You will need to draw a line (or a virtual wall) between a data analyst and a business intelligence developer. Based on the Big Data University reference, these are the essential skills and tools of which data analysts need a baseline understanding:

  • Skills: statistics, data munging, data visualization, exploratory data analysis
  • Tools: Excel, SPSS, SPSS Modeler, SAS, SAS Miner, SQL, Access, Tableau, SSAS.

What differentiates Data Engineers is that they prepare the ‘data’ infrastructure that will be analysed by the Data Scientists. Not to mention that software engineers are essential to design, build and integrate data from different resources. Writing complex queries on the data makes it possible to access, analyse and process the data to optimise business performance – this is what the Big Data ecosystem is.

Over time, from the maturity of the RDBMS platform, both data warehousing and business intelligence have evolved as a key route to organisational success and business growth. Taking this as the baseline, IT skills must be built across several analytical disciplines that can help the organisation grow within the Data Platform.

So mathematical skills are essential in this discipline; they create the differences and denominators, by design. There are multiple categories showing how the data science and data scientist domain is growing, see here.

 

A key skill to develop in Analytics is to build knowledge across the whole spectrum of business acumen and domain expertise. So there is no doubt that mathematical skills built upon your academic background will help you step into the data science world in a better place. Data science sprawls across multiple disciplines and domains. Based on my research and collection, the following are highlights of where one can begin their data science journey:

  • Computer Science – branched into multiple sectors of software, hardware, application and business arenas. The new concepts are data plumbing (in-memory analytics), machine learning programming, modeling (Python, R etc.) and RFID/Streaming data analytics.
  • Statistician – a baseline to perform series of experiments by testing, cross validating, sampling and programming methods.
  • Data Mining – there is evidence that data mining and machine learning overlap; either of these will land you in the core of data science.
  • Research – operational research and building optimisations and techniques will lead you into data analytics & data science.
  • Business Intelligence/Data Warehousing – from the matured RDBMS world, both of these areas have good benchmarks in designing and creating database schemas, generating KPIs, dashboard design and visualisations based on data-driven strategies to build/optimize/abstract better decisions & ROI.
  • Machine Learning – there is a need to keep up with the new changes in the IT field with this discipline, which is closely related to data mining. This trade is very specific in building algorithms and designing automated prototypes based on data sets. A further dive into building core algorithms includes clustering and supervised classification, rule systems, and scoring techniques – a hot trade now – and a flavour of AI (artificial intelligence) is a bonus for you. This is where Python and R balance out.

Few more references from the world wide web (mainly from Analytical Bridge website):

  •  Data mining: This discipline is about designing algorithms to extract insights from rather large and potentially unstructured data (text mining), sometimes called nugget discovery. Techniques include pattern recognition, feature selection, clustering and supervised classification, and it encompasses a few statistical techniques (though without the p-values or confidence intervals attached to most statistical methods being used). Data mining is applied computer engineering rather than a mathematical science. Data miners use open source software and tools such as RapidMiner.
  •  Predictive modeling: Not a discipline per se; predictive modeling projects occur in all industries across all disciplines. Predictive modeling applications aim at predicting the future based on past data, usually but not always based on statistical modeling. Predictions often come with confidence intervals. The roots of predictive modeling are in statistical science.
  •  Statistics. Currently, statistics is mostly about surveys (typically performed with SPSS software), theoretical academic research, bank and insurance analytics (marketing mix optimization, cross-selling, fraud detection, usually with SAS and R), statistical programming, social sciences, global warming research (and space weather modeling), economic research, clinical trials (pharmaceutical industry), medical statistics, epidemiology, bio-statistics and government statistics.

Jobs requiring a security clearance are well paid and relatively secure, but the well-paid jobs in the pharmaceutical industry (the golden goose for statisticians) are threatened by a number of factors – outsourcing, company mergers, and pressure to make healthcare affordable.

  • Mathematical optimization. Solves business optimization problems with techniques such as the simplex algorithm, Fourier transforms (signal processing), differential equations, and software such as Matlab. These applied mathematicians are found in big companies such as IBM, research labs, NSA (cryptography) and in the finance industry (sometimes recruiting physics or engineer graduates). Mathematical optimization is however closer to operations research than statistics, the choice of hiring a mathematician rather than another practitioner (data scientist) is often dictated by historical reasons, especially for organizations such as NSA or IBM.
  •  Actuarial sciences. Just a subset of statistics focusing on insurance (car, health, etc.) using survival models: predicting when you will die and what your health expenditures will be, based on your health status (smoker, gender, previous diseases), to determine your insurance premiums. Actuaries have seen their average salary increase nicely over time: access to the profession is restricted and regulated, just like for lawyers, for no reason other than protectionism to boost salaries and reduce the number of qualified applicants to job openings. Actuarial science is indeed data science (a sub-domain).
  •  HPC. High performance computing, not a discipline per se, but should be of concern to data scientists, big data practitioners, computer scientists and mathematicians, as it can redefine the computing paradigms in these fields. HPC should not be confused with Hadoop and Map-Reduce. HPC is hardware-related, Hadoop is software-related (though heavily relying on Internet bandwidth and servers configuration and proximity).
  •  Six sigma. It’s more a way of thinking (a business philosophy, if not a cult) than a discipline, used for quality control and to optimize engineering processes. Applied, simple statistics are used (simple stuff works most of the time, I agree), and the idea is to eliminate sources of variance in business processes, to make them more predictable and improve quality.
  • Artificial intelligence. It’s coming back. The intersection with data science is pattern recognition (image analysis) and the design of automated (some would say intelligent) systems to perform various tasks, in machine-to-machine communication mode, such as identifying the right keywords (and right bid) on Google AdWords (pay-per-click campaigns involving millions of keywords per day).
  • Data engineering. The new kid on the block, performed by software engineers (developers) or architects (designers) in large organizations (sometimes by data scientists in tiny companies); this is the applied part of computer science that allows all sorts of data to be easily processed in-memory or near-memory, and to flow nicely to (and between) end-users, including heavy data consumers such as data scientists.
  • Business intelligence. Abbreviated as BI. Focuses on dashboard creation, metric selection, producing and scheduling data reports (statistical summaries) sent by email or delivered/presented to executives, competitive intelligence (analyzing third party data), as well as involvement in database schema design (working with data architects) to collect useful, actionable business data efficiently.
  • Data analysis. This is the new term for business statistics since at least 1995, and it covers a large spectrum of applications including fraud detection, advertising mix modeling, attribution modeling, sales forecasts, cross-selling optimization (retails), user segmentation, churn analysis, computing long-time value of a customer and cost of acquisition, and so on.
  • Business analytics. Same as data analysis, but restricted to business problems only. Tends to have a bit more of a financial, marketing or ROI flavor.

The first step is to discover whether you are an analyst ‘by nature’ or a developer by inclination within the IT world. Sometimes the job title will mislead, so it is better to read the definition of the role and list out where you will excel. The four pillars for gaining excellence are: a university degree, technical skills, business skills (a new requirement) and professional certification.

Finally, networking is essential to keep up with the latest happenings in the world and to see how a simple business is attempting to make a big change in day-to-day life. If you are a ‘geek’ then participate in ‘hackathon’-type events, or as a developer you could contribute to the technical community via open source projects (search GitHub).

Microsoft has started a new professional program called the Data Science Degree program; the requirements and the course schedule are now available.

Leeds SQL Server User Group
https://sqlserverfaq.com/blog/2016/10/09/leeds-sql-server-user-group/ – Sat, 08 Oct 2016

SQL Server User Group based in Leeds with presentations on SQL Server related topics covering all areas such as Azure, Big Data, BI, Programming, Performance Tuning, Management, Architecture and Design.

Coming Events

Who did what to my database and when…
http://dataidol.com/davebally/2016/08/22/who-did-what-to-my-database-and-when/ – Mon, 22 Aug 2016

One of the most popular questions on forums / SO etc. is, “How can I find out who dropped a table, truncated a table, dropped a procedure…?” I’m sure we have all been there: something changes (maybe schema or data) and we have no way of telling who did it and when.

SQL auditing can get you there, and I’m not suggesting you shouldn’t use that, but what if it is not set up to monitor the objects you are interested in?

If you look back through my posts, I have been quite interested in ScriptDom and TSQL parsing,  so if you have a trace of all the actions that have taken place over a time period, you can parse the statements to find the activity.

Drawing inspiration (cough, stealing) from Ed Elliot’s dacpac explorer, I have created a git repo for a simple parser that is created using T4 templates. T4 templates are an ideal fit here as the full language domain can be exploded out automatically and then you, as a developer, can add in your code to cherry-pick the parts you are interested in.

At this time the project has hooks to report on SchemaObjectName objects, so any time any object is referenced – be that in a CREATE, SELECT, DROP, MERGE … – it will fall into the function OnSchemaObjectName; there is also OnDropTableStatement, which will be hit when DROP TABLE is used.

This is not intended to be a full end-to-end solution, not least because the SQL to be parsed could be coming from any number of sources, but if you have the skills to run it as-is, you probably have enough to tailor it for your requirements.

As ever, let me know what you think; any comments gratefully received.

The GitHub repo is: https://github.com/davebally/TSQLParse

 

Solving The Issue Of SQL Server Physical File Fragmentation
http://sqlserver-qa.net/2016/08/01/solving-physical-file-fragmentation-issue/ – Mon, 01 Aug 2016



SQL Server physical file fragmentation causes major performance issues in a database. It happens when data is deleted from a drive, leaving small gaps to be filled by new data files. With file fragmentation, logically sequential pages do not exist in physical sequence. When there is physical file fragmentation, auto-growing files cannot get sufficient contiguous space, so the files get scattered throughout the hard drive.

Physical file fragmentation causes slow access because seek time increases: the time taken to access the data goes up, and the system needs to find all fragments of a file before opening it.

In addition, data file pages being out of order also increases seek time. To lessen the seek time, the user can defrag the fragmented file. In this article we will discuss the problem of SQL Server physical file fragmentation and the way to defrag the file.

Problem

Usually DBAs do not consider SQL Server physical file fragmentation a big issue. However, it takes a lot longer to access a fragmented file compared to a file stored in contiguous storage space.

If the auto-grow option is enabled and the file is heavily fragmented, the file may not be able to grow beyond a certain limit, which may cause error 665.

Cause of File Fragmentation

  • If the DBA performs backup operations repeatedly, this can lead to physical file fragmentation in SQL Server.
  • If the DBA shares database server space with other applications such as a web server, SharePoint, etc., this causes disk file fragmentation, as the space allocated to these applications is not contiguous.

Solution

SQL Server physical file fragmentation can be fixed with the help of Windows utilities. There is a tool called Sysinternals Contig (contig.exe), a free utility from Microsoft, which will create new files that are contiguous in nature.

It is a great tool that will show the fragmentation of files and allow them to be defragmented.

DBAs can easily deploy this tool; to analyse the fragmentation of a specific file, use the contig -a option.

[Screenshot: contig.exe output – SQL-Server-Physical-File-Fragmentation]

To defrag the file, the DBA can run the simple Contig command against it.

Note: To defrag any database, it must be in Offline state.

User Can Follow The Given Steps To Defrag The Database:

  • In order to defrag the database, bring it OFFLINE: ALTER DATABASE [Database name] SET OFFLINE
  • Use the Contig [Filename] command to defrag the file
  • Bring the database back ONLINE: ALTER DATABASE [Database name] SET ONLINE
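As an end-to-end sketch (the database name and file path are illustrative; the contig.exe commands are run from an elevated command prompt, not from T-SQL):

-- 1. Take the database offline so SQL Server releases its lock on the files
ALTER DATABASE SalesDB SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- 2. From an elevated command prompt (not T-SQL):
--      contig -a "D:\Data\SalesDB.mdf"    -- analyse fragmentation only
--      contig    "D:\Data\SalesDB.mdf"    -- defragment the file

-- 3. Bring the database back online
ALTER DATABASE SalesDB SET ONLINE;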

Other Practices That Resist Fragmentation

  • By keeping data files and log file on different physical disk arrays.
  • The user can fix the problem of out-of-order pages (logical fragmentation, which arises when data file pages are out of order) by reorganising or rebuilding the index with ALTER INDEX statements, or with the help of a SQL Server maintenance plan – see the sketch after this list.
  • The database files should be sized well and autogrowth should be set to a suitable value.
  • Monitor fragmentation with the help of Microsoft tools.
  • Set up plans for the SQL server maintenance.
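For the out-of-order (logical) page problem mentioned above, a minimal sketch is shown below; it uses the standard index physical stats DMV, and the index/table names in the ALTER INDEX line are hypothetical:

-- Find noticeably fragmented indexes in the current database
SELECT	OBJECT_NAME(ips.object_id)	AS table_name,
	i.name				AS index_name,
	ips.avg_fragmentation_in_percent
FROM	sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN	sys.indexes AS i
	ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE	ips.avg_fragmentation_in_percent > 10
	AND i.name IS NOT NULL
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Example fix for one index (use REBUILD instead for heavy fragmentation)
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders REORGANIZE;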

Conclusion

The issue of SQL Server physical file fragmentation is a curable problem; a DBA can easily fix it with the help of the Microsoft tools described above.

Guidelines For SQL Database Performance Monitoring
http://sqlserver-qa.net/2016/07/19/database-performance-monitoring/ – Tue, 19 Jul 2016



Overview

Monitoring SQL Server databases and instances provides the necessary information to diagnose and troubleshoot performance issues. Once performance is tuned, the server still has to be constantly monitored, because everyday data, schema and configuration changes may lead to a situation where additional manual tuning is required. In the following section, we will discuss SQL database performance monitoring.

What are the Metrics to Monitor on SQL Server?

Which metrics to monitor depends upon your performance goals. However, there is a set of commonly monitored metrics that offers sufficient information for basic troubleshooting; based on their values, additional, more specific metrics – memory and processor usage, disk activity, and network traffic – can be monitored to find the root cause. SQL Server offers two built-in monitoring features: Activity Monitor and the Data Collector.

Activity Monitor

It tracks the most useful performance metrics of SQL Server. To get them, it executes queries against its host SQL Server instance every 10 seconds. Performance is monitored only while Activity Monitor is open, which makes it a lightweight solution with little overhead. The metrics are shown in five collapsible panes:

  • Overview: Shows the processor time percentage, the number of waiting tasks, the number of batch requests, and database I/O operations.
  • Processes: Shows the currently running SQL Server processes for each database on the instance, with information on the application, login, task state, wait time, command, host used, etc. The grid can be filtered on specific column values. It offers useful features for troubleshooting and deeper analysis, such as tracing the selected process in SQL Server Profiler.
  • Resource Waits: Shows waits for various resources, e.g. memory, network, compilation, etc. It displays the wait time, cumulative wait time, recent wait time, and average waiter count.
  • Data File I/O: Shows a list of all database files – MDF, NDF and LDF – their paths and names, recent read and write activity, and response time.
  • Recent Expensive Queries: Shows the queries executed in the last 30 seconds which used the most hardware resources: memory, disk, network, and processor. It enables opening the query in a query tab of Management Studio and opening its execution plan.
Way to Utilize Activity Monitor

It can be opened via the Activity Monitor icon on the SQL Server Management Studio toolbar, the context menu of the SQL Server instance in Object Explorer, or the keyboard shortcut Ctrl+Alt+A. It only tracks a pre-defined set of important metrics: additional metrics cannot be added and the monitored ones cannot be removed. Only real-time monitoring is possible, and there is no option to store history for future use. Therefore, Activity Monitor is useful for real-time monitoring and basic troubleshooting; for threshold-based alerting, selectable metrics, or historical analysis, a monitoring tool with data storage is necessary.
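For ad-hoc checks outside Activity Monitor, a similar view of waits can be taken straight from the DMVs; a minimal sketch (the list of benign wait types to exclude is deliberately abbreviated here):

-- Top cumulative waits since the instance started (or wait stats were cleared)
SELECT TOP (10)
	wait_type,
	wait_time_ms / 1000.0		AS wait_time_s,
	signal_wait_time_ms / 1000.0	AS signal_wait_s,
	waiting_tasks_count
FROM	sys.dm_os_wait_stats
WHERE	wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'BROKER_TO_FLUSH',
			  'XE_TIMER_EVENT', 'REQUEST_FOR_DEADLOCK_SEARCH')
ORDER BY wait_time_ms DESC;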

Data Collectors

The Data Collector is another built-in performance monitoring and tuning feature of Management Studio. It collects performance metrics from SQL Server instances and stores them in a local repository (the Management Data Warehouse), so they can be used for later analysis. It uses SQL Server Agent, Integration Services, and the Management Data Warehouse. It allows the user to specify the metrics to monitor, and provides three built-in data collection sets covering the most important and commonly monitored performance metrics. If additional metrics need to be monitored, custom data collectors can be created using T-SQL code or the API.

Method to use Data Collector

Users can follow the steps below, but must first check that data collection, the Management Data Warehouse, and SQL Server Agent are enabled, and that SQL Server Integration Services is installed.

  • In Management Studio Object Explorer, expand Management.
  • Choose the Configure Management Data Warehouse option under Data Collection.
  • Choose an option to set up data collection.
  • Choose the server instance name and database, which will host the management data warehouse, and the local folder in which the collected data can be cached.
  • Select the next option » Review Settings » Finish.

Data collection offers three pre-defined sets that are available in Object Explorer under Management, in the System Data Collection Sets folder under Data Collection: Query Statistics, Disk Usage, and Server Activity. Each has its own built-in report, as discussed below:

  • The Disk Usage collection set collects data about database data files, transaction log files, and I/O statistics.
  • The Disk Usage reports are available from the context menu of Data Collection. They show space usage by database file, growth trends, and average daily growth.
  • The Query Statistics collection set collects query code, query activity, and query execution plans for the ten most expensive queries.
  • The Server Activity collection set collects data about disk I/O, processor, memory, and network usage. The report displays CPU, disk I/O, network usage, memory, SQL Server instance activity, Server waits, and operating system activity.

Note: Data collection has to be configured and started before it captures data. Unlike Activity Monitor, there are no real-time graphs; however, the captured data can be stored for a specified number of days.
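Once configured, the state of the collection sets can also be checked from T-SQL; a small sketch, assuming the standard Data Collector catalog views in msdb:

-- Which collection sets exist and whether they are currently running
SELECT	name, is_running
FROM	msdb.dbo.syscollector_collection_sets;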

Conclusion

In the above discussion, SQL Server performance monitoring was described. The two main built-in features that help monitor the server – Activity Monitor and the Data Collector – were discussed; each offers different functionality for monitoring a SQL Server database.

SQL Server – Making Backup Compression work with Transparent Data Encryption by using Page Compression
https://sqlserverfaq.com/tonyrogerson/2016/07/15/sql-server-making-backup-compression-work-with-transparent-data-encryption-by-using-page-compression/ – Fri, 15 Jul 2016

Encrypted data does not compress well, if at all, so using the BACKUP WITH COMPRESSION feature will be ineffective on a database encrypted with Transparent Data Encryption (TDE). This post deals with a method of combining Page Compression with TDE and getting the best of both worlds.

The Transparent Data Encryption (TDE) feature encrypts data at rest, i.e. the SQL Server Storage Engine encrypts on write to storage and decrypts on read – data resides in the buffer pool decrypted.

Page compression is a table-level feature that provides page-level dictionary compression and row/column-level data type compression. Pages read from storage reside in the buffer pool in their compressed state until a query reads them, and only at that point are they expanded, which gives better memory utilisation and reduces IO. You need to be aware that both encryption and Page Compression add to the CPU load on the box; the additional load will depend on your access patterns – basically you need to test and base your decision on that evidence. Without turning this into a discussion around storage tuning, you tend to find that using Page Compression moves a query bottleneck from storage into CPU, simply because less data is being transferred to/from storage so latency is dramatically reduced. Don’t be put off if your box is regularly consuming large amounts of CPU – it’s better to have a query return in 10 seconds with 100% CPU than in 10 minutes with 10% CPU and storage the bottleneck!
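Before committing to Page Compression across a whole database, the likely saving can be estimated per table with the built-in estimation procedure; a minimal sketch (run against the demo table used later in this post):

-- Estimate how much space PAGE compression would save for one table
EXEC sp_estimate_data_compression_savings
	@schema_name      = 'dbo',
	@object_name      = 'test',
	@index_id         = NULL,
	@partition_number = NULL,
	@data_compression = 'PAGE';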

Why is data not held encrypted in the Buffer Pool? TDE encrypts the entire page of data at rest on storage and decrypts it on load into memory (https://msdn.microsoft.com/en-us/library/bb934049.aspx); if pages were to reside encrypted in the Buffer Pool they would still require page header information in a decrypted state so that things like checkpoint and inter-page linkage work, and as such there would be a security risk. Page Compression does not compress page header information, which is why pages can reside in the Buffer Pool in their compressed state (see https://msdn.microsoft.com/en-us/library/cc280464.aspx).

Coupling Page Compression with TDE gives you the benefit of encryption (because that phase is done on write to / read from storage) and compression (the page is compressed when it is read from or written to storage via the TDE component in the Storage Engine) – basically your data is always decrypted by the time it comes to compressing/decompressing.

The table below shows a space comparison and timings between normal (no TDE or compression), Page Compression on its own, and TDE coupled with Page Compression. You will see that only data in the database is compressed – log writes are not compressed, so there is only a marginal improvement from using Page Compression there – but you can also see that TDE has no effect on Page Compression for the transaction log either.

[Image: space and timing comparison table – PageCompressionWithTDE]

If you use this approach then compress all the database tables with Page Compression; also, stop using the COMPRESSION option of BACKUP – that will save resources, because SQL Server won’t be trying to compress something that is already compressed!
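A quick way to see which tables still need attention is to check the compression state of the base data in sys.partitions; a small helper query (it only looks at heap/clustered index partitions of user tables):

-- Tables/partitions whose base data is not yet PAGE compressed
SELECT	OBJECT_SCHEMA_NAME(p.object_id)	AS schema_name,
	OBJECT_NAME(p.object_id)	AS table_name,
	p.partition_number,
	p.data_compression_desc
FROM	sys.partitions AS p
WHERE	p.index_id IN (0, 1)			-- heap or clustered index
	AND p.data_compression_desc <> 'PAGE'
	AND OBJECTPROPERTY(p.object_id, 'IsUserTable') = 1
ORDER BY table_name, p.partition_number;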

Example

Prepare the test database; we use FULL recovery to show the performance and space used in the transaction log:

 

CREATE DATABASE [TEST_PAGECOMP]
 CONTAINMENT = NONE
 ON  PRIMARY 
( NAME = N'TEST_PAGECOMP', 
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.S2016\MSSQL\DATA\TEST_PAGECOMP.mdf' , 
  SIZE = 11534336KB , 
  MAXSIZE = UNLIMITED, FILEGROWTH = 65536KB )
 LOG ON 
( NAME = N'TEST_PAGECOMP_log', 
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.S2016\MSSQL\DATA\TEST_PAGECOMP_log.ldf' , 
  SIZE = 10GB , FILEGROWTH = 65536KB )
GO

alter database TEST_PAGECOMP set recovery FULL
go

backup database TEST_PAGECOMP to disk = 'd:\temp\INITIAL.bak' with init;
go

 

Test with SQL Server default of no-compression and no-TDE.
 

use TEST_PAGECOMP
go


--
--	No compression
--
create table test (
	id int not null identity primary key clustered,

	spacer varchar(1024) not null
) ;

set nocount on;

declare @i int = 0;

begin tran;

while @i <= 5 * 1000000
begin
	insert test ( spacer ) values( replicate( ' a', 512 ) )

	set @i = @i + 1;

	if @i % 1000 = 0
	begin
		commit tran;
		begin tran;

	end
	
end

if @@TRANCOUNT > 0
	commit tran;
go


dbcc sqlperf(logspace) 
--	Size		Percent Full
-- 6407.992		99.49185

backup log TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_NOComp.trn'
--Processed 811215 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 3.
--BACKUP LOG successfully processed 811215 pages in 15.170 seconds (417.772 MB/sec).

checkpoint
go

backup database TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_NOComp.bak' with init;
go

--Processed 718128 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP' on file 1.
--Processed 4 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP DATABASE successfully processed 718132 pages in 12.157 seconds (461.495 MB/sec).

Test with Page Compression only.

--
--	Page will use dictionary so really strong compression
--
drop table test
go

create table test (
	id int not null identity primary key clustered,

	spacer varchar(1024) not null
) with ( data_compression = page );

set nocount on;

declare @i int = 0;

begin tran;

while @i <= 5 * 1000000
begin
	insert test ( spacer ) values( replicate( ' a', 512 ) )

	set @i = @i + 1;

	if @i % 1000 = 0
	begin
		commit tran;
		begin tran;

	end
	
end

if @@TRANCOUNT > 0
	commit tran;
go

backup log TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_Comp.trn' with init;
--Processed 706477 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP LOG successfully processed 706477 pages in 12.035 seconds (458.608 MB/sec).

checkpoint
go

backup database TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_Comp.bak' with init;
--Processed 9624 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP' on file 1.
--Processed 2 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP DATABASE successfully processed 9626 pages in 0.207 seconds (363.274 MB/sec).

go

Test with both Page Compression and TDE:

drop table test
go

checkpoint
go


USE master;  
GO  
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '**************';  
go  
CREATE CERTIFICATE MyServerCert WITH SUBJECT = 'My DEK Certificate';  
go  

USE TEST_PAGECOMP;  
GO  
CREATE DATABASE ENCRYPTION KEY  
WITH ALGORITHM = AES_128  
ENCRYPTION BY SERVER CERTIFICATE MyServerCert;  
GO  

ALTER DATABASE TEST_PAGECOMP  
SET ENCRYPTION ON;  
GO  

/* The value 3 represents an encrypted state   
   on the database and transaction logs. */  
SELECT *  
FROM sys.dm_database_encryption_keys  
WHERE encryption_state = 3;  
GO  



--
--	Now repeat the page compression version
--

create table test (
	id int not null identity primary key clustered,

	spacer varchar(1024) not null
) with ( data_compression = page );

set nocount on;

declare @i int = 0;

begin tran;

while @i <= 5 * 1000000
begin
	insert test ( spacer ) values( replicate( ' a', 512 ) )

	set @i = @i + 1;

	if @i % 1000 = 0
	begin
		commit tran;
		begin tran;

	end
	
end

if @@TRANCOUNT > 0
	commit tran;
go

backup log TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_CompTDE.trn' with init;
--Processed 722500 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP LOG successfully processed 722500 pages in 12.284 seconds (459.502 MB/sec).

checkpoint
go

backup database TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_CompTDE.bak' with init;
--Processed 55296 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP' on file 1.
--Processed 7 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP DATABASE successfully processed 55303 pages in 1.118 seconds (386.450 MB/sec).

go

 

SSMS July 2016 – get on with it and SQL Server 2014 SP2 – available now
http://sqlserver-qa.net/2016/07/13/ssms-july-2016-get-on-with-it-and-sql-server-2014-sp2-on-the-way/ – Tue, 12 Jul 2016



It being the summer holiday season in most of the western world, it is a good time to keep an eye on what’s new in the data platform world. When it comes to SQL Server, you may have seen a flurry of announcements since SQL Server 2016 was RTM’d.

In this month alone there are a few announcements you may want to keep an eye on, such as the SQL Server Management Studio July 2016 release, with enhancements/additions like:

  1.  Improved support for SQL Server 2016 (1200 compatibility level) tabular databases in the Analysis Services Process dialog and the Analysis Services Deployment wizard.
  2. Support for Azure SQL Data Warehouse in SSMS.
  3. Significant updates to the SQL Server PowerShell module. This includes a new SQL PowerShell module and new CMDLETs for Always Encrypted, SQL Agent, and SQL Error Logs. You can find out more in the SQL PowerShell update blogpost.
  4. Support for PowerShell script generation in the Always Encrypted wizard.
  5. Significantly improved connection times to Azure SQL databases.
  6. New ‘Backup to URL’ dialog to support the creation of Azure storage credentials for SQL Server 2016 database backups. This provides a more streamlined experience for storing database backups in an Azure storage account.
  7. New Restore dialog to streamline restoring a SQL Server 2016 database backup from the Microsoft Azure storage service. The dialog eliminates the need to memorize or save the Shared Access signature for an Azure storage account in order to restore a backup.
  8. …and a few more bug fixes; see the SSMS download page for additional details and the full changelog.

Another piece of news to note (be prepared when you are back from holidays) is the announcement of Service Pack 2 for SQL Server 2014. The engineering team at Microsoft is working to bring you SQL Server 2014 Service Pack 2 (SP2); it will be equipped with a rollup of released hotfixes and contains 20+ improvements centred around performance, scalability and diagnostics, based on feedback from customers and the SQL community. SQL Server 2014 SP2 will include:

  • All fixes and CUs for SQL 2014 released to date.
  • Key performance, scale and supportability improvements.
  • New improvements based on connect feedback items filed by the SQL Community.
  • Improvements originally introduced in SQL 2012 SP3, after SQL 2014 SP1 was released

Here is the SQL Server 2014 Service Pack 2 download page, and don’t forget to read the Microsoft SQL Server 2014 SP2 Release Notes page.

To give feedback or raise any question, you can contribute/search at the Microsoft Connect page, and if you are a little more enthusiastic you can tweet the Engineering Manager at @sqltoolsguy on Twitter.

 
