Tony Rogerson SQL Server Data Platform Ramblings – Microsoft Data Platform ramblings within the context of Microsoft SQL Server

Azure UK Data Centres – Azure SQL Database – Features not available because of Poor Deployment practice (Fri, 20 Jul 2018)

We use the UK South data centre for our Azure SQL Databases. We use the S3 pricing tier because we require Columnstore indexes and that is the minimum tier that supports them; we up-scale overnight when doing our main ETL run.

We've reported this time after time to Microsoft, and I've even contacted the Product Team myself, but the issue is still unresolved – so here's one to watch out for if you are going to be using Columnstore (and possibly other features) in the UK data centres.

Ordinarily you get the error message below when you try to select from a table that has a clustered columnstore index on it while using an instance of SQL Server that doesn't support columnstore indexing – less than S3, for instance.

Msg 35340, Level 16, State 9, Procedure vwPurchaseOrder, Line 7 [Batch Start Line 0]
LOB columns are disabled in columnstore.
Msg 4413, Level 16, State 1, Line 39
Could not use view or function ‘Tabular.vwPurchaseOrder’ because of binding errors.

When you change the pricing tier (up/down scale) you are essentially moving from one SQL instance to another. Sadly you will find that on occasion you land on an instance that hasn't been deployed properly – the switches that enable columnstore aren't set correctly.

The solution: down-scale, then scale back up to the tier you actually want, in the hope that it works.
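One way to sanity-check the instance you land on after a scale operation is a quick probe – a sketch, where the table name is illustrative (DATABASEPROPERTYEX's ServiceObjective property reports the current tier on Azure SQL Database):

```sql
-- What tier did we actually land on?
SELECT DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') AS current_tier;

-- Probe a columnstore table; on a badly deployed instance this
-- raises the columnstore error instead of returning rows.
BEGIN TRY
    SELECT TOP (1) * FROM dbo.PurchaseOrder;   -- illustrative table name
    PRINT 'Columnstore OK';
END TRY
BEGIN CATCH
    PRINT 'Columnstore broken: ' + ERROR_MESSAGE();
END CATCH;
```

Running this straight after the scale completes tells you immediately whether you need to bounce tiers again.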

If you hit this scenario yourself then make sure you report it; frankly I think it's really poor that what should be a straightforward deployment process can't be done correctly.

Data Science Project Example Walk-through: Predictive Analytics on London Cycle Hire Data using the CRISP-DM and TDSP process models: #2 Waterfall v Agile (Scrum v Kanban) (Wed, 27 Sep 2017)

In Part #1 – Backgrounder we covered a basic introduction to CRISP-DM and Microsoft TDSP; in this article we put our project framework in place.

As Benjamin Franklin said, "By failing to prepare, you are preparing to fail." I wasn't expecting to cover the project management side in much, if any, detail, but after I started writing this I realised it's just as important as the process model itself; I'll talk about Business Understanding in the next article. Planning is key to any project. In Data Science and Data Warehousing you often don't know a lot up front, and it's really difficult to plan a project accurately because things will undoubtedly crop up unexpectedly. Choosing the correct project methodology is a key component of achieving a successful outcome. Ironically, a successful outcome in Data Science may be discovering that the task at hand is not doable – but that may save a shed load of time, resources and money!

Project framework – Agile or Waterfall?

Waterfall is sequenced project delivery, so for our two process models we would complete the Business Understanding phase then Data Understanding then the other phases in sequence – we design up-front, we build the entire house, then decorate it, and then live in it.

Agile gives a structured framework to an iterative "feedback and enhance" approach. Applied to TDSP / CRISP-DM it allows cross-phase improvements: we might find in the Data Understanding phase that something is not doable or requires more clarity, so Business Understanding needs to be revisited. We modify as we go – we build the house room by room, decorating, re-building and living in it until it's complete. There is rework, but it's expected and captured early on, which makes it cheaper to fix than in waterfall, where you'd be knocking the entire house down and starting again.

When performing Data Science, or even Business Intelligence, it's often difficult to fully scope what you are trying to do. In both disciplines you take data from a feed, and the quality of that data will not be known until you've got it loaded – in fact loading it in the first place may be problematic! Take the London Cycle usage data: there are 118 CSV files amounting to 5.39 GB of raw data. They are all of a similar structure, but some CSV files have quoted text and others don't, one has an additional column in the middle, and some files contain text in what you'd think are numeric columns. You would not know any of that at the outset of the project, only once you had loaded the data (the Data Understanding phase).
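A cheap first pass over the raw files can surface exactly these inconsistencies before you commit to a load design; a minimal sketch (the expected column count is an assumption you'd set per feed):

```python
import csv
from collections import Counter

def profile_csv(path, expected_cols):
    """Tally rows by column count so malformed files stand out."""
    shape = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            shape[len(row)] += 1
    # a healthy file has exactly one shape: the expected column count
    return {"ok": set(shape) == {expected_cols}, "shapes": dict(shape)}
```

Run it over all 118 files; any file whose `ok` comes back False needs a closer look before the ETL design is fixed.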

Waterfall for me is a non-starter for Data Science and Data Warehousing; the iterative nature of Agile wins hands down, and it also reflects the iterative nature of CRISP-DM and TDSP. For example, insights from the data – which may be data quality issues – may change the remit of the project, requiring the Business Understanding and Data Understanding phases to be revisited. You don't want to be too far down the track when issues are discovered; a project can be canned much earlier, and remember a lot of DS work may not amount to something that gives you benefit.

As you can probably guess we will be using Agile for this project, either Scrum or Kanban which we’ll discuss now.

Agile – Scrum v Kanban?

Scrum is a process framework for implementing Agile. Fixed-length "sprints", usually one or two weeks, are used to contain the workload, and planning is done around those sprints.

Kanban is a visual card-based system where the cards represent tasks/stories and are positioned on a board made up of columns and swim lanes – for example the columns To-do, Blocked, In-Progress, Review, Complete and Closed, with the swim lanes aligned to our process model phases, i.e. Business Understanding, Data Understanding etc. The swim lanes aren't strictly necessary; they just help you see where the stories sit.

Which flavour of Agile you use will depend entirely on your organisation, although my research suggests Kanban boards are better suited to Data Science projects because they offer more dynamism in task prioritisation – remember, with DS we will often have tasks created as we go through the process, and those often can't wait between sprints. Below are some links to background reading; for this project I will use a Kanban board.

Microsoft Visual Studio Team System: or for Scrum:

Team Data Science Process on Channel 9:

I will also be using Microsoft Visual Studio Team Services – Agile Tools for my project management. I have created a project and, using the Kanban board on "Stories", set up my columns and swim lanes: the columns To-Do, Blocker, In-Progress (split Doing/Done), Review (split Doing/Done), Complete and Closed, and the swim lanes Business Understanding, Data Understanding, Data Preparation, Modelling and Evaluation, with the default lane as Assignable. Stories contain one or more tasks, and it is the story that moves between columns. I did not make the columns match the process model phases because I am using stories to group the tasks to be done within each phase of the process workflow.


We can categorise the phases in both CRISP-DM and TDSP into two components: the first is mainly {planning, information gathering, project feasibility, data set availability}; the second is mainly {implementation of the design, data munging, statistics, programming, deployment}. The two categories need very different skill sets, with some cross-over. Consider this separation of phases when you create your Kanban stories and subsequent tasks. I've posted an IDEF-0 diagram showing the phase interaction; I've deliberately missed out most of the detail, i.e. the mechanisms/controls (the folk/tools doing the phases) and the inputs/outputs.

Roles required in the Project

First and foremost – and a lot of this follows Data Warehouse principles – you need a Project Sponsor: somebody in the business who believes in what you are doing. Remember, a lot of Data Science projects will either fail, fail to provide business benefit, or be put to one side because the business is only doing it because it's trendy and believes it ought to be. Do not treat a DS project like any other piece of development; treat it like building a Data Warehouse. By that I mean the profile needs to be raised within the business: make sure there is visibility, make sure you know the people in the business who can help you and with what, make sure you strike up a conversation and educate them on what DS is and what it isn't. Perhaps do an introduction to DS where you talk at a high level about the steps involved – CRISP-DM / TDSP – and in which phases you'll need their help and with which tasks. Be on good terms with the folk providing your data, and when there are issues tell them; be helpful – one of the side effects of the cleansing and data prep side of a DS project is highlighting issues with source systems.

The role of Project Manager is to keep track of the Kanban board, make sure the process is followed, and smooth out any politics around getting the source data or access to business resources for business and data domain knowledge. Business domain knowledge is knowledge of how the business runs, its process flows etc. – the Business Taxonomy flows from this. Data domain knowledge, on the other hand, is more related to a Database Designer/Developer: it's what the source data set structures mean – the Data Dictionary flows from this.

Other roles on the project will be a Data Scientist and a Data Engineer – very different roles: the Data Engineer gets and prepares the data, whereas the Data Scientist creates the various models and has the statistics background to understand sample populations, model verification etc.


I’ve briefly discussed the project structure, we are going to be using the Kanban Board and I’ll be using Visual Studio Team Services Agile Tools for managing that aspect of the project.

In the next article we will get into actually doing something: we will tackle the Business Understanding phase.

Data Science Project Example Walkthrough: Predictive Analytics on London Cycle Hire Data using the CRISP-DM and TDSP process models: #1 Backgrounder (Fri, 15 Sep 2017)

I'm embarking on a series of articles covering an end-to-end example of applying Data Science to a problem. We don't just throw data at a tool; like any other form of development work we follow a standard process. In the data space there are just a handful of process models to guide us through our project: CRISP-DM (Cross Industry Standard Process for Data Mining), SEMMA (Sample – Explore – Modify – Model – Assess), KDD (Knowledge Discovery and Data Mining) and finally Microsoft's own TDSP (Team Data Science Process). Because I want to use one of the fuller models I'll be using CRISP-DM and Microsoft's Team Data Science Process. TDSP appears to have been seeded from CRISP-DM; by looking at the two, hopefully we'll get a better understanding of how TDSP better suits today's environment.

What is Data Science?

Well, it's not taking your raw data, pushing it straight through an R, Python or Azure ML library and then predicting stuff; there are many parts to this, which is why it's good to follow CRISP-DM / TDSP – it keeps you honest.

Data Science is not just pushing data through R and Python machine learning libraries, or through Azure Machine Learning Studio, and putting a visualisation over the top. If you don't understand the domain you are modelling, aren't doing tasks such as data cleansing, can't reason about model choice, model validation and choosing samples correctly, and don't know that correlation does not equal causation (google that last one, it's really amusing), then you've a lot of learning to do. Data Science is a super-set of roles/techniques such as Domain Knowledge, Project Management, Machine Learning, Analytics, Visualisation, Data Mining and Business Analysis. Don't be fooled by the hype around Data Science – you can do Data Science with just Excel; the practice has been around for decades and is not tied to Big Data (OK, maybe Variety, because you will end up pulling in multiple data sets from different internal and third-party sources). Notice there's no mention of Hadoop, because Big Data doesn't require Hadoop. Applying Data Science requires domain experience, the ability to apply statistical techniques, and the ability to chop and refine data. You aren't restricted to R or Python; those languages are common because of their freely available libraries. It's unusual for one person to have all the skills required to be a Data Scientist, which is where you need a team of people.


Anyway, what are we doing in this series of articles? Adopting the CRISP-DM and TDSP process models to (hopefully) perform some Predictive Analytics on the Transport for London Cycle Usage data, throwing out some additional insights along the way. I'll be using a number of different data modelling techniques (relational and graph) – remember, we don't just throw the data in a database, we have to model it. The cycle data lends itself to some graph-type data processing (likely a mix of Cosmos DB, Neo4j and SQL Server 2017 Graph Data Types, for comparative purposes) as well as traditional relational (we can talk through good ways of holding the data and using SQL efficiently). I'll be using SSIS for ingesting the data, and possibly something else like Perl or Python; for the prediction bits we'll likely stick to R, F# and Python – as well as Azure ML of course!

Hopefully this will give you insight into how you’d go about doing this in your own company, it will give you a fully worked example of using CRISP-DM and TDSP.

Once we have the model we can talk about how we are going to implement it – perhaps write a little Android app – at which point we can bring in DevOps and AnalyticOps and discuss why your model isn't static and why you'll need to update and release it as you would any other code.

More than happy for people to get in touch and help me with this; perhaps take the data and do your own processing inside the CRISP-DM / TDSP framework, and if you write it up I'll add some summary text in this series linking out to your own.

Note: I’m doing this independently of Transport for London using their public data API under the general terms and conditions of use, see their site:

What is CRISP-DM (Cross Industry Process for Data Mining)?

Devised in the late nineties to provide a standard framework for Data Mining (the forerunner of Data Science – in fact Data Mining is one aspect of DS), CRISP-DM is naturally useful to us while working on Data Engineering and Data Science projects.

Wikipedia gives a lot of good links and a very good introductory description:

You will need the actual process model guide, but sadly the original website is no more; there are plenty of sites hosting the PDF, e.g.

What are the basics of CRISP-DM?

We've all no doubt been there: the user asks for 'x', you give them 'x' (which is actually 'y'), but they really wanted 'z'. Translating business requirements/understanding into a technical implementation isn't an art, it's a technique – some people are good at it, some aren't. With time pressures on the business user, gaining clarity is always difficult, so following a standard iterative framework will always ease that process.

A common understanding on both sides will improve the odds of a successful outcome, especially in Data Science where slight misunderstandings can mean the team going off in the wrong direction.

The first phase we come to in CRISP-DM is "Business Understanding", where we nail down the project objectives and requirements from the viewpoint of the business and then convert them into a Data Science problem definition and a preliminary plan. Remembering this is an iterative process, our first cut is: the business would like to improve the availability of cycles so that hirers don't visit an empty bike point; our task is therefore to predict the availability of bikes at a given bike point at specific times of day, with a view to a) providing the user with a visualisation of busy times and b) providing the business with information on stocking levels.

The next phase, "Data Understanding", looks at what datasets are available for us to use; remember, our model accuracy depends on having data suitable for the task defined in "Business Understanding". TfL provide the cycle usage data, which gives us details of each time a bike is taken out of a bike point, which bike point it is returned to, and the duration of the hire. There are a ton of other datasets that may help with our task – what things would affect cycle hire? Perhaps the weather, whether the bike point is near a station, travel problems, the day of the week (working/non-working days), seasonal periods etc. Data Understanding is about finding this data and making it available for the next phase. This is our play-about stage: for instance, I've already discovered that one of the many cycle usage CSV files is in a different format to the others, and that some files have start dates set to 1901 with an end_station_id of 0 – there are reasons for this and we need to understand what they are.
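Those sentinel values are easy to quarantine programmatically while you investigate their meaning; a sketch, assuming the column names start_date and end_station_id from the TfL files:

```python
import datetime

def split_suspect_rows(rows):
    """Separate usable rides from suspect ones: a 1901 sentinel
    start date or an end_station_id of 0 flags a row for review."""
    good, suspect = [], []
    for r in rows:
        bad = r["start_date"].year == 1901 or r["end_station_id"] == 0
        (suspect if bad else good).append(r)
    return good, suspect
```

Keeping the suspect rows (rather than silently dropping them) lets you go back to the data provider with concrete examples.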

"Data Preparation" requires us to cleanse the data and get it into a state ready to be used by our chosen algorithms – for instance we may need to convert some items from continuous to categorical data. In this stage we also need to start thinking about the training and validation samples; again, we need to understand our data, because it might be skewed. For instance, if 90% of the collected data is female and we are modelling a male-specific construct, then we'll want to sample from the male population rather than just picking 50,000 random rows from the dataset (you'd likely end up roughly 90% female).
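The sampling point can be sketched simply: sample each stratum at its own rate instead of drawing uniformly from the whole dataset (the strata and fractions here are illustrative):

```python
import random

def stratified_sample(rows, key, fractions, seed=42):
    """Keep each row with the sampling fraction of its stratum,
    instead of drawing uniformly from a skewed dataset."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions[key(r)]]
```

With a 90/10 split, drawing only from the minority stratum avoids the "you'll likely end up 90% female" problem described above.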

The "Modelling" phase is the modelling bit – which techniques to use against your data; then we have "Evaluation", which deals with verification of the model; and finally we have the "Deployment" phase.

I’ll talk about each phase as we do them.

What is TDSP (Team Data Science Process)?

Expanding on CRISP-DM, TDSP is a collection of process flows, tools and utilities to help not only you but your whole team provide the Data Science component of your Enterprise Data Platform. Microsoft have a Team Data Science Process site on GitHub which contains all the bits you need; start with the root page ( but the actual detail is in the

What are the basics of TDSP?

Reviewing the TDSP, CRISP-DM, KDD and SEMMA process models you will see the commonality: start with business understanding of the problem in hand, understand what data you have or may need, prepare that data – clean it and put it in a form suitable for each of the models you will try – then model it, evaluate it and finally deploy it. CRISP-DM and TDSP are both task based: the process model defines the tasks you should be doing and in which phases.

At this point if you’ve looked at you will see the phases match those of CRISP-DM.


Hopefully I've set the scene for what we are going to do over the coming weeks; feel free to interact with me in the comments or feedback to me tonyrogerson @ The next article will deal with Project Initiation and Business Understanding.

SQL Server – Making Backup Compression work with Transparent Data Encryption by using Page Compression (Fri, 15 Jul 2016)

Encrypted data compresses poorly, if at all, so the BACKUP WITH COMPRESSION feature is ineffective on a database encrypted with Transparent Data Encryption (TDE). This post covers a method of combining Page Compression with TDE to get the best of both worlds.

The Transparent Data Encryption (TDE) feature encrypts data at rest, i.e. the SQL Server Storage Engine encrypts on write to storage and decrypts on read – data resides in the buffer pool decrypted.

Page compression is a table-level feature that provides page-level dictionary compression plus row/column-level data type compression. Pages read from storage reside in the buffer pool in their compressed state and are only expanded when a query reads them, which gives better memory utilisation and reduces IO. Be aware that both encryption and Page Compression add to the CPU load on the box; the additional load will depend on your access patterns – basically you need to test and base your decision on that evidence. Without turning this into a discussion of storage tuning, you tend to find that Page Compression moves a query bottleneck away from storage and into CPU, simply because less data is being transferred to/from storage, so latency is dramatically reduced. Don't be put off if your box regularly consumes large amounts of CPU – it's better to have a query return in 10 seconds at 100% CPU than in 10 minutes at 10% CPU with storage the bottleneck!
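Before enabling Page Compression across the board you can get a sample-based estimate of the space saving per table with the built-in procedure (schema and table names are illustrative):

```sql
-- Estimate before you commit: sample-based estimate of what PAGE
-- compression would save for a given table/index.
EXEC sp_estimate_data_compression_savings
    @schema_name      = N'dbo',
    @object_name      = N'test',      -- illustrative table name
    @index_id         = NULL,         -- NULL = all indexes
    @partition_number = NULL,         -- NULL = all partitions
    @data_compression = N'PAGE';
```

The estimate is based on a sample of the table, so treat it as a guide and still verify CPU cost with your own workload.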

Why is data not held encrypted in the Buffer Pool? TDE encrypts the entire page on write to storage and decrypts it on load into memory. If pages were held encrypted in the Buffer Pool, the page header information would still need to be decrypted so that things like checkpoint and inter-page linkage work, and that would be a security risk. Page Compression does not compress page header information, which is why pages can reside in the Buffer Pool in their compressed state.

Coupling Page Compression with TDE gives you the benefit of encryption (because that phase is done on write/read to/from storage) and compression (the page is compressed before the TDE component in the Storage Engine encrypts it on write, and decrypted before it is decompressed on read) – basically your data is always decrypted by the time it comes to compressing/decompressing.

The table below shows space comparisons and timings between normal (no TDE or compression), Page Compression on its own, and TDE coupled with Page Compression. You will see that only data in the database is compressed – log writes are not compressed, so there is only a marginal improvement in the transaction log from using Page Compression, but you can also see that TDE has no adverse effect on Page Compression in the transaction log either.


If you use this approach then compress all the database tables with Page Compression; also, stop using the COMPRESSION option of BACKUP – it will save resources because SQL Server won't be trying to compress something that is already compressed!


Prepare the test database; we use FULL recovery to show the performance and space impact on the transaction log:


create database TEST_PAGECOMP
on primary (
  NAME = N'TEST_PAGECOMP',	-- logical file names inferred from the physical file names
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.S2016\MSSQL\DATA\TEST_PAGECOMP.mdf' , 
  SIZE = 11534336KB , FILEGROWTH = 65536KB )
log on (
  NAME = N'TEST_PAGECOMP_log',
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.S2016\MSSQL\DATA\TEST_PAGECOMP_log.ldf' , 
  SIZE = 10GB , FILEGROWTH = 65536KB )

alter database TEST_PAGECOMP set recovery FULL

backup database TEST_PAGECOMP to disk = 'd:\temp\INITIAL.bak' with init;


Test with SQL Server default of no-compression and no-TDE.


--	No compression
create table test (
	id int not null identity primary key clustered,

	spacer varchar(1024) not null
) ;

set nocount on;

declare @i int = 0;

begin tran;

while @i <= 5 * 1000000
begin
	insert test ( spacer ) values( replicate( ' a', 512 ) );

	set @i = @i + 1;

	if @i % 1000 = 0
	begin
		commit tran;
		begin tran;
	end
end


if @@TRANCOUNT > 0
	commit tran;

dbcc sqlperf(logspace) 
--	Size		Percent Full
-- 6407.992		99.49185

backup log TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_NOComp.trn'
--Processed 811215 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 3.
--BACKUP LOG successfully processed 811215 pages in 15.170 seconds (417.772 MB/sec).


backup database TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_NOComp.bak' with init;

--Processed 718128 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP' on file 1.
--Processed 4 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP DATABASE successfully processed 718132 pages in 12.157 seconds (461.495 MB/sec).

Test with Page Compression only.

--	Page will use dictionary so really strong compression
drop table test

create table test (
	id int not null identity primary key clustered,

	spacer varchar(1024) not null
) with ( data_compression = page );

set nocount on;

declare @i int = 0;

begin tran;

while @i <= 5 * 1000000
begin
	insert test ( spacer ) values( replicate( ' a', 512 ) );

	set @i = @i + 1;

	if @i % 1000 = 0
	begin
		commit tran;
		begin tran;
	end
end


if @@TRANCOUNT > 0
	commit tran;

backup log TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_Comp.trn' with init;
--Processed 706477 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP LOG successfully processed 706477 pages in 12.035 seconds (458.608 MB/sec).


backup database TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_Comp.bak' with init;
--Processed 9624 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP' on file 1.
--Processed 2 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP DATABASE successfully processed 9626 pages in 0.207 seconds (363.274 MB/sec).


Test with both Page Compression and TDE:

drop table test


USE master;

--	TDE setup: the certificate name and password are illustrative
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE CERTIFICATE TDE_Cert WITH SUBJECT = 'TDE demo certificate';

USE TEST_PAGECOMP;
CREATE DATABASE ENCRYPTION KEY
	WITH ALGORITHM = AES_256
	ENCRYPTION BY SERVER CERTIFICATE TDE_Cert;
ALTER DATABASE TEST_PAGECOMP SET ENCRYPTION ON;

/* The value 3 represents an encrypted state   
   on the database and transaction logs. */  
SELECT db_name(database_id) AS database_name, encryption_state
FROM sys.dm_database_encryption_keys  
WHERE encryption_state = 3;  

--	Now repeat the page compression version

create table test (
	id int not null identity primary key clustered,

	spacer varchar(1024) not null
) with ( data_compression = page );

set nocount on;

declare @i int = 0;

begin tran;

while @i <= 5 * 1000000
begin
	insert test ( spacer ) values( replicate( ' a', 512 ) );

	set @i = @i + 1;

	if @i % 1000 = 0
	begin
		commit tran;
		begin tran;
	end
end


if @@TRANCOUNT > 0
	commit tran;

backup log TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_CompTDE.trn' with init;
--Processed 722500 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP LOG successfully processed 722500 pages in 12.284 seconds (459.502 MB/sec).


backup database TEST_PAGECOMP to disk = 'd:\temp\TEST_PAGECOMP_CompTDE.bak' with init;
--Processed 55296 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP' on file 1.
--Processed 7 pages for database 'TEST_PAGECOMP', file 'TEST_PAGECOMP_log' on file 1.
--BACKUP DATABASE successfully processed 55303 pages in 1.118 seconds (386.450 MB/sec).



Tableau Kerberos Delegation to SQL Server / SSAS – Part 2 – Setting it up (Mon, 02 May 2016)

The ability to use Kerberos delegation against SQL Server and SQL Server Analysis Services (SSAS) was introduced in Tableau Server 8.3. We covered the theory in Tableau Kerberos Delegation to SQL Server / SSAS Part 1 – The Theory (Kerberos Tickets, Service Principal Names and Token Size) – make sure you have reviewed that before continuing, because this article just contains the instructions to get it all working.

Required information prior to starting

Make sure you gather all this information before starting: validate it and set up permissions first.

  1. A domain account to run Tableau Server; it doesn't require any permissions on SQL Server / SSAS because it will simply be delegating using the client's credentials (this demo uses tonydemo\svcTableau for Tableau Server and tonydemo\testuser for the local user). If possible use an account that doesn't have access to the SQL Server, for security reasons. The Tableau Server account must be on the same domain as the SQL Server / SSAS machine; users can be on any domain.
  2. Fully Qualified Domain Name to the SQL Server / SSAS machine (for this demo is
  3. The TCP port that the SQL Server / SSAS service is listening on – make sure this is a fixed port, which I'll cover later (for this demo it is 14331). I'll reiterate: use fixed ports, because it makes setting up the SPNs easier and, separately, from a security and governance perspective it's good practice! Also, check that SQL Server / SSAS is currently running on those ports – ask for actual evidence!
  4. A login that has the correct permissions on SQL Server / SSAS in order for you to create the Tableau Data Source the clients will use (for this demo I’ll use tonydemo\Administrator).
  5. A login to test everything is working, the login needs to have permission into Tableau Server and also SQL Server to run the query set up previously for the Data Source (for this demo I’ll use tonydemo\testuser).
  6. The Active Directory group you will use as part of Tableau single sign-on, mapped to a Tableau role. Note: this has nothing at all to do with delegation, as the permissions are set in SQL Server / SSAS – remember, Tableau is simply delegating using the user's credentials.
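As part of validating item 3, a quick cross-platform way to confirm the fixed port is actually reachable from the middle tier is a plain TCP probe (host and port are illustrative):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds -- a
    quick pre-flight check that the fixed SQL/SSAS port is reachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_open("poppy.tonydemo.net", 14331)` run from the middle-tier server tells you whether the firewall is letting the connection through before you go anywhere near Tableau.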

Test the Data Source

The connection originates from the middle-tier machine, i.e. the one running Tableau Server, but instead of the Tableau Server service account it uses the user's credentials (via the TGT delegated to the SQL Server/SSAS). Make sure a test user in the AD domain you are granting access to SQL Server / SSAS can actually log in to SQL Server / SSAS and run the query you want for your Data Source. Test the connection from the Tableau Server itself – the easiest way is to allow the test user remote desktop access to the middle-tier server and try connecting using the ODBC admin tool, or anything else that lets you test connectivity.

As I said earlier, I strongly recommend you use static ports. If you don't, then forget using TCP sockets for the middle tier to connect to SQL Server – you'll need to use Named Pipes, which I'm not going to cover, because for numerous reasons you ought to have static ports in production for SQL Server and SSAS.

I’ve added instruction below on how to configure static ports for SQL Server and SSAS.

Configuring SQL Server to listen on a static port

Using SQL Server Configuration Manager, select SQL Server Network Configuration – make sure you pick the correct instance name – and set the TCP Port under IPAll to the port you desire; we will use port 14331 for this SQL Server Database Engine instance. By using the port number directly you can stop and disable the SQL Browser because it is not required – unless, of course, other instances/users connect using the instance name rather than the port directly.

Make sure you restart SQL Server and, more importantly, that the firewall allows inbound connections to that port from the middle-tier server – it's only the middle-tier server (ESTER) that connects to the database machine (POPPY).


The SQL Server errorlog (run master..xp_readerrorlog through SSMS) will tell you which ports SQL Server is listening on; for my demo environment the output looks like this:

2016-04-30 14:52:08.110 Server Server is listening on [ ‘any’ <ipv6> 14331].
2016-04-30 14:52:08.110 Server Server is listening on [ ‘any’ <ipv4> 14331].
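An alternative to reading the errorlog: when you are connected over TCP, the connections DMV reports the port your own session came in on:

```sql
-- local_tcp_port is NULL for non-TCP (e.g. shared memory) connections
SELECT local_tcp_port
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;
```

Run this from the middle-tier server over TCP and it doubles as proof that the fixed port, not some dynamic one, is being used.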

Configuring SSAS to listen on a static port

Using SQL Server Management Studio (SSMS) select “Connect..” and choose “Analysis Services”, connect to the relevant SSAS instance.

Right click on the server name, choose New Query… XMLA, and run the piece of XMLA below, modified for your environment – change the ID, Name and the ServerProperty Port Value.

Now restart SSAS.

<Alter AllowCreate="true" ObjectExpansion="ObjectProperties"
    xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object />
  <ObjectDefinition>
    <Server xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <ID>YourInstanceID</ID><Name>YourInstanceName</Name>
      <ServerProperties><ServerProperty><Name>Port</Name><Value>2383</Value></ServerProperty></ServerProperties>
    </Server>
  </ObjectDefinition>
</Alter>
You can check which port SSAS is running on in SSMS by connecting to the SSAS instance, right-clicking, choosing Properties, and checking the "Running Value" for the Port setting – see below:


Make sure the PING’s work

The middle-tier server connects to the database, so remote desktop to the middle-tier server using an administrator account, ping the fully qualified domain name, then do a reverse ping and make sure the IP address and name return correctly; your output should look like that below:


Pinging [] with 32 bytes of data:
Reply from bytes=32 time<1ms TTL=128
Reply from bytes=32 time<1ms TTL=128

C:\>ping -a

Pinging POPPY [] with 32 bytes of data:
Reply from bytes=32 time<1ms TTL=128
Reply from bytes=32 time<1ms TTL=128

Add the SPN’s

When a multi-hop connection to SSAS is required you must use the instance name – you cannot connect to SSAS through the port directly; SQL Browser is used (over TCP 2382) to resolve the instance name to the port the client then talks to SSAS on. That requirement means you need to set up SPNs for SQL Browser as well. For SSAS, add SPNs for both the FQDN and the local machine name.

Add the SPNs that relate the Tableau Server service account (tonydemo\svcTableau) to the remote resources it is providing delegation services to. You will need a Domain Administrator to run the next bit; it can be run on any machine in the domain. Fire up a CMD prompt and run this:

C:\>setspn -S MSSQLSvc/ svcTableau
Checking domain DC=tonydemo,DC=net

Registering ServicePrincipalNames for CN=Tableau,CN=Users,DC=tonydemo,DC=net
Updated object

C:\>setspn -S MSOLAPSvc.3/poppy:SQL2008R2 svcTableau
Checking domain DC=tonydemo,DC=net

Registering ServicePrincipalNames for CN=Tableau,CN=Users,DC=tonydemo,DC=net
Updated object

C:\>setspn -S MSOLAPSvc.3/ svcTableau
Checking domain DC=tonydemo,DC=net

Registering ServicePrincipalNames for CN=Tableau,CN=Users,DC=tonydemo,DC=net
Updated object

C:\>setspn -S MSOLAPDisco.3/poppy svcTableau
Checking domain DC=tonydemo,DC=net

Registering ServicePrincipalNames for CN=Tableau,CN=Users,DC=tonydemo,DC=net
Updated object

C:\>setspn -S MSOLAPDisco.3/ svcTableau
Checking domain DC=tonydemo,DC=net

Registering ServicePrincipalNames for CN=Tableau,CN=Users,DC=tonydemo,DC=net
Updated object


For a SQL Database Engine instance you should have a single SPN; for an SSAS box you should have 4 SPNs – triple check you have done these correctly!

Active Directory Accounts and Delegation Configuration

Now the SPNs are set up, the “Service Class” will appear as a choice in the AD configuration for the machine and user. Again we need a Domain Administrator. There are two things to do here – first, give the machine that does the delegation, i.e. the one running Tableau Server (ESTER), delegation permission to the service classes it will use; this is done against the middle tier machine (ESTER) but using the services tied to the account Tableau Server runs under. If you remember, setting up the SPNs tied each service to the Tableau Server account (svcTableau) – this is where it starts coming together.

Middle Tier Machine

Open up Active Directory Users and Computers, in the Computers tree select the server running Tableau Server (ESTER), right click Properties and select the Delegation tab; we should secure by default, so restrict delegation to specified services only (termed constrained delegation). In the Add Services / Users or Computers dialog make sure you select the account you will run Tableau Server under, i.e. tonydemo\svcTableau.


Click OK, and if you have set up the SPNs and the above correctly you should now have a pop-up with the services listed below. Remember: MSOLAPDisco.3 is SQL Browser (port should be blank), MSOLAPSvc.3 is your SSAS (port should be your instance name and not a port number), and MSSQLSvc is the SQL Server Database Engine (port should be the port number).

You likely won’t see the HTTP one; that is actually added by Tableau but we haven’t completed that yet – I’ve structured this demo to do all the Tableau bits last. The important piece here is that the SQL Server / SSAS services are shown: choose “Select All” and click OK and you should have a screen like below; click “Apply” and “OK” and this piece is done. You are now ready to do the same for the account Tableau Server runs under.

Tableau Server Service Account

After creating the SPNs against tonydemo\svcTableau, the Delegation tab will appear when you manage the user through AD Users and Computers. If you have reached this part of the instructions and the Delegation tab doesn’t appear then you have either set up the SPNs against a different account, or you are waiting for AD replication to take place, i.e. you are looking at a different AD node to the one the SPNs have just been created in.

In Users, find the account you are going to run the Tableau Server under, pick the Delegation tab and add the MSSQLSvc, MSOLAPDisco.3 and MSOLAPSvc.3 Services as you did before for the Computer.


The account above should also be granted the “Act as part of the operating system” and “Impersonate a client after authentication” rights, so use the Local Security Policy on the middle tier server (ESTER) running Tableau Server and add those permissions (see below):

Tableau Server Configuration

Open “Configure Tableau Server” on the server and start by adding the AD account that you will run Tableau Server with – do not use a local account, it needs to be a domain account.

Set the “User Authentication” mode to “Use Active Directory”, specify the Active Directory domain name and check the “Automatic logon (Kerberos)” option.


The next part requires creating a script, running it and importing a “keytab” file; I recommend creating a directory on the middle tier machine running Tableau Server to store it all in – it makes life easier. Click the “Kerberos” tab, check “Enable Kerberos for single sign-on”, then click “Export Kerberos Configuration Script”, which creates a .BAT file that needs to be run by a Domain Administrator – we’ll discuss that next. Drop the file in the directory I recommended you create.


The .BAT file generated from Step 1 is shown below with comments stripped, the important pieces have been highlighted and need to be checked before the Domain Admin account executes it.

@echo off
setlocal EnableDelayedExpansion

set /p adpass= "Enter password for the Tableau Server Run As User (used in generating the keytab):"
set adpass=!adpass:"=\"!

echo Creating SPNs...
setspn -s HTTP/ester TONYDEMO\svcTableau
setspn -s HTTP/ TONYDEMO\svcTableau

echo Creating Keytab files in %CD%\keytabs
mkdir keytabs
ktpass /princ HTTP/ /pass !adpass! /ptype KRB5_NT_PRINCIPAL /out keytabs\kerberos.keytab

Get a Domain Administrator to log onto the Tableau Server machine, bring up a CMD prompt and run the .BAT file; it creates a keytab file using the Windows ktpass utility in a new “keytabs” sub-directory.

Import the keytab file, then click “Test Configuration” and you should get the message “SPNs are correctly configured: SPN Configuration Complete”.

You may need to stop and restart Tableau Server for the options to take effect.

Creating a Data Source for Clients to Use

  1. Fire up the Tableau Desktop and create a New Sheet.
  2. Select “Connect to Data”, select “Microsoft SQL Server” and enter the server name e.g.,14331 and use “Trusted Connection”.
  3. Give the Data Source a name, for example “Viewer Credentials Blog Demo” and select the Database and Table – set up your query bits, it should be a “Live Connection” – see example below where I have connected to the SQL Server machine we set up earlier, the “play” database and the “somedata” table.
  4. Click on “Sheet 1” which is where we can finalise our Data Source and publish it to Tableau Server with the “Viewer Credentials” setting that brings all this together. Select the “Server” option and “Publish data source” and select your data source as below, a login prompt will ask for the Tableau Server you want to connect to, select the server you set up earlier e.g.
  5. The publish data source to Tableau Server pop-up will appear; click “Edit” under “Authentication” and select “Viewer credentials” – this is the piece that forces Tableau Server to delegate the authentication of the user credentials.
  6. Once you have published successfully you can use the Data Source and Tableau Server will pass the local user credentials through and use those against SQL Server / SSAS – example below. Notice the login name is the local user and not the Tableau Server account tonydemo\svcTableau; also note the hostname is ESTER, which is the Tableau Server – in this test I access the Data Source from the user machine HAZEL logged in as tonydemo\testuser.

    For completeness this is a Profiler trace of an SSAS connection through delegation: the NTUserName is the login we used on HAZEL, which is our client, and the ClientHostName shows where the request came from – a reverse lookup confirms that IP address is indeed the middle tier machine ESTER.
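One more check worth doing from the SQL Server side: confirm the delegated connection really negotiated Kerberos rather than silently falling back to NTLM. A small sketch using the connection/session DMVs:

```sql
-- Run from a session that came through Tableau Server;
-- auth_scheme should read KERBEROS (NTLM means delegation will break).
SELECT s.login_name,   -- the delegated end user, not svcTableau
       s.host_name,    -- ESTER, the middle tier, not the client machine
       c.auth_scheme
FROM   sys.dm_exec_connections AS c
JOIN   sys.dm_exec_sessions AS s ON s.session_id = c.session_id
WHERE  c.session_id = @@SPID;
```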


As you can see from the above it works – fairly straightforward to set up, but there are a lot of moving parts to get right; I’m writing a follow-up post on diagnosing problems when setting this up.

If you do need help setting this up then give me a ping via email.


Tableau Server Help


Tableau Kerberos Delegation to SQL Server / SSAS – Part 1 – The Theory (Kerberos Tickets, Service Principal Names and Token Size) Sun, 01 May 2016 19:16:34 +0000 Active Directory is by default geared to a two tier Client/Server architecture: the client can only use resources either locally or on the server it directly connects to. Multi-tier Client/Server environments are more tricky because the user doesn’t authenticate directly with the target server – they may not even be able to connect to it because of firewall or routing constraints – so the middle tier needs to perform “Delegation”, that is to say the user gives authority for the middle tier to present itself to the third tier machine using the user’s credentials.

This article works through everything you need to know to get Tableau Server working with Kerberos delegation to a Microsoft SQL Server or Microsoft Analysis Services (SSAS) data source. I’ve split the post into three parts because it was getting rather long: this first part covers the theory – I recommend you study it first even if you understand Kerberos; the second part covers the actual set up instructions and the third will cover troubleshooting.

The diagram below shows what we are going to achieve, HAZEL uses the browser to connect to Tableau Server running on ESTER which gets data from the SQL Server POPPY using the logon credentials from HAZEL.

Tableau to SQL Server Kerberos Delegation - Environment


On successfully logging onto the domain the user gets an encrypted token called a “Ticket to get Tickets” or TGT (covered here: Kerberos Explained); the TGT is the user’s identity and contains all the domains and groups the user belongs to and the permissions granted.

When the user tries to authenticate with a resource, for instance SQL Server, the user sends the TGT to the server, which then knows which groups the user belongs to and what permissions they have – importantly, without having to query AD. Incidentally, that is why there is often a delay between granting or revoking a permission and it actually taking effect, and why logging off and back on fixes it – you get a new TGT! The ticket only contains SIDs (Security Identifiers) – names aren’t held locally; to resolve them the local machine has to look them up against AD, which is why when AD isn’t available you’ll see a string something like ‘S-1-5-23-23424234-23423423432-234234234-433’ instead of ‘Some Tableau AD group’.

Ticket Size Issues

A logged in user has a TGT which can be passed around for servers to check before allowing access to a resource such as SQL Server. As you probably realise the ticket contains quite a lot of information; the size of the ticket has nothing at all to do with delegation, which I’ll talk about shortly. The default maximum ticket size varies depending on the OS the various machines in the chain are using – for Windows XP/Windows Server 2000 it’s 8,000 bytes, from Windows 7/Windows Server 2008 it was 12,000 bytes and in Windows 8/Windows Server 2012 it’s 48,000 bytes. So if your desktop is running, say, Windows 8 but your server is running Windows Server 2008 then the client can pass a ticket of 48,000 bytes but your server by default can only handle 12,000.

Exceeding the maximum ticket size will cause authentication to fail and the user will get one of a number of error messages. Realistically the user needs to be in many hundreds of AD groups to hit this; researching the issue, it would appear every group takes around 40 bytes – this article covers it in detail: Problems with Kerberos authentication when a user belongs to many groups. All is not lost – make sure your middle tier server has MaxTokenSize set to 48,000, which looks to be the maximum for general use for desktop applications and browsers connecting to IIS.

Delegation (Multi-Hop)

Kerberos TGT path on Multi-hop Delegation

Step 1: The local users machine (HAZEL) contains a ticket (TGT) obtained at logon (step 1), the ticket contains all the groups and permissions they have.

Step 2: Local user machine sends the TGT to the middle server (ESTER), the middle server (ESTER) needs to access a remote resource (POPPY) but using the credentials from the local user on HAZEL.

Step 3: Middle server (ESTER) passes the TGT to the remote server (POPPY).

Step 4: Remote server (POPPY) authenticates the local user (HAZEL) and runs everything in the context of that local user; network traffic flows from POPPY to HAZEL through ESTER.

The middle server (ESTER) is acting as, and will appear to the remote server (POPPY) as, the local user – yes, you heard correctly – the middle server acts as the local user, so the resource running on the middle server that the local user has connected to has all the permissions the local user has. You need to trust that middle server! Active Directory in Windows secures by default, so by default the middle server will not be allowed to send the local user’s TGT to the remote server – it tries, but the ticket actually has nothing in it!

Delegation authority needs to be permissioned for the middle server “computer” as well as the logon that runs the service on the middle server that connects out to the remote resource. Before enabling that in AD we need to create a number of “Service Principal Names” (SPNs), which I’ll discuss next.

It’s worth pointing out that at no time does the client talk directly to the third tier server – all network communications route through the middle tier. In fact, if the middle tier machine is connecting to a SQL Server machine then you will see the hostname as ESTER, but the login credentials will be whatever the user logged onto HAZEL with.

Service Principal Names (SPN)

SPNs are used by Kerberos clients (the user on HAZEL in this case) to uniquely identify an instance of a service on a target server (in our case POPPY). The way I like to think about it is that it helps security: for SQL Server we create an SPN that binds to the login that will delegate for the user. In our demo the login originates from HAZEL, and the delegation occurs on ESTER through the login svcTableau (the service account running Tableau Server in my environment) against the service instance on POPPY.

Basically, to form the SPN we need a) the service account that runs Tableau Server, b) the fully qualified domain name of the server running SQL Server and c) the port SQL Server runs on, and possibly the instance name depending on how you connect.

The example SPN’s set up for this test environment are shown below (using SETSPN -L svcTableau):

C:\Windows\System32>setspn -L svcTableau
Registered ServicePrincipalNames for CN=Tableau,CN=Users,DC=tonydemo,DC=net:

The SPN structure is {Service Class}/{Host}:Port where :Port is optional.

{Service Class} for SQL Server

SQL Server Database Engine: MSSQLSvc
SQL Server Analysis Services: MSOLAPSvc.3
SQL Server Reporting Services: HTTP, because SSRS is accessed as a web service.
SQL Server Browser (only used when the process connects to SSAS using an instance name – see Microsoft KB SPN Required for SSAS): MSOLAPDisco.3

I would strongly recommend using fixed port numbers for named instances, otherwise you’ll have to cover off every port SQL Server might use when it restarts and dynamically assigns one – yes, that’s a lot! So just fix the port number in SQL Server Configuration Manager… Network Settings; I’ll cover that in Part 2.


None of this will work unless DNS is set up correctly, so make sure your fully qualified domain name pings properly, i.e. pinging it and doing a reverse ping on the IP return the correct box – we’ll cover that in Part 2 as we go through set up. You only need an entry for the fully qualified domain name and the port the instance is listening on, e.g. MSSQLSvc/


That covers the theory. In a nutshell: the user logs into the domain on their workstation and AD gives them back an encrypted TGT, which is held on their local machine; that TGT holds all the domains and AD groups they have permissions to. The user runs something on the middle tier (a worksheet requiring a Data Source, for instance); the middle tier connects to the remote resource by passing on the TGT from the user, and the third server now thinks the user is connecting directly. As part of the delegation process, checks are made that the middle tier machine and the login used to start the service on the middle tier are allowed to delegate for the service the user requires.

Part 2 covers how this is actually implemented using Tableau Server on the middle tier.


Windows Authentication Overview

Kerberos Survival Guide

Kerberos for the Busy Admin

Understanding Kerberos Double Hop

Quickly Explained: Service Principal Name: Registration, Duplication

IT Community Roadshow – Real-world stories and demos from local community experts – May 2016 Tue, 26 Apr 2016 19:45:23 +0000

The IT Community Roadshow is a series of free training events focusing on next-generation technologies including Microsoft Azure, Windows 10, and DevOps. The sessions will focus on real-world challenges, along with a comprehensive set of solutions and expert guidance.

What you’ll get:

These unique events are led by the community, and designed to offer all who attend the skills and real world knowledge to work with tomorrow’s technology. All sessions will feature live demos, and you may receive a free Microsoft Azure pass for some courses.

Hear about the latest technology from some of the best technical community experts in the country:

In just one free, half or full day technical training event, you’ll discover how to build a more flexible infrastructure with Open Source (OSS), Microsoft and other technologies to scale resources up and down as needed. Maximize security, manageability, and productivity across all devices and platforms and bring your enterprise datacenter into the cloud, without sacrificing security, control, reliability and scalability.

Upcoming events:

When          Title                                      City
May 6, 2016   What’s New in Windows 10 Enterprise        York
May 12, 2016  DevOps, Windows 10 & Enterprise Mobility   Reading
May 18, 2016  What’s New in Cloud Infrastructure         Edinburgh
Transaction Log Concepts – Part 2: Full Database Backups and Mirroring/Availability Groups Tue, 08 Mar 2016 17:55:45 +0000 In Transaction Log Concepts – Part 1: ACID, Write Ahead Logging, Structure of the Log (VLF’s) and the Checkpoint/Lazywriter/Eagerwriter process I talked about Write Ahead Logging and how the Checkpoint process interacts with the transaction log. To recap – the transaction log is an append-only log of all changes to the database; an incrementing Log Sequence Number (LSN) is used to track the order of changes and determine when the contents of the log are in sync with the contents of the data files – remember, data is written to the log first and then to the data files.

In this part we will look at how the Transaction Log behaves when using Full Database Backups and Database Mirroring (or Availability Groups).

Once you understand what a Log Sequence Number (LSN) is and how the various database features interact with the log you’ll have it nailed!

Full Database Backups

Keeping the backup process in its simplest terms: a database backup reads from the beginning to the end of the database, exporting the used portion into an external file.

Consider this: you have a highly active database, but it’s the same pieces of the database (extents) that are constantly modified. You kick a database backup off at 9pm and the full backup takes precisely 2 hours – how do you keep track of changes to the database while your backup is running, i.e. over those 2 hours?

You need to keep track of all changes to the database since the backup started – sound familiar? Remember, the transaction log keeps track of all changes to a database, so the backup process is now simple: you do a start-to-end read of the used portion of the database to an external file and then, once complete, you append the transaction log to the external backup file.

When the backup starts, the minimum LSN of the active portion of the transaction log is captured; from then on SQL Server cannot truncate past that log marker regardless of how often you back up the transaction log. You can – and should – still back up the transaction log, and the log written since the last transaction log backup LSN will be exported to the external file, but the VLF’s in the log will not be marked for reuse – you just end up with more and more active log even though you’ve backed it up.
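You can watch what is pinning the log while the full backup runs via sys.databases; while the backup is in flight log_reuse_wait_desc will typically report ACTIVE_BACKUP_OR_RESTORE:

```sql
-- What is currently preventing log truncation in each database?
SELECT name,
       log_reuse_wait_desc  -- e.g. NOTHING, LOG_BACKUP, ACTIVE_BACKUP_OR_RESTORE, ACTIVE_TRANSACTION
FROM   sys.databases;
```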

Example – the concept

Log File actual size = 1GiByte, Transaction log backup has just happened so all bar one of the VLF’s are inactive.

Database is constantly updated and creates 100MiByte of log data per minute, the database backup takes 1 hour, there is a log backup every 10 minutes.

Full Backup starts at LSN 1000.
Log backup 1 - last log backup LSN 1000, Full Backup start LSN 1000, used log is 1GiB, log backup = 1GiB
Log backup 2 - last log backup LSN 2000, Full Backup start LSN 1000, used log is 2GiB, log backup = 1GiB
Log backup 3 - last log backup LSN 3000, Full Backup start LSN 1000, used log is 3GiB, log backup = 1GiB
Log backup 4 - last log backup LSN 4000, Full Backup start LSN 1000, used log is 4GiB, log backup = 1GiB
Log backup 5 - last log backup LSN 5000, Full Backup start LSN 1000, used log is 5GiB, log backup = 1GiB
Log backup 6 - last log backup LSN 6000, Full Backup start LSN 1000, used log is 6GiB, log backup = 1GiB
Full backup completes, appends 6GiB of log to the external file.
Log backup 7 - last log backup LSN 7000, Full Backup start LSN 1000, used log is 1GiB - rest has been purged because of completion of the Database Backup.

As you can see, only the portion of the log used since the last log backup is backed up, so you don’t repeat work even though the log is growing.

Example – seen through SQL Server

The output below was created using a 23GiB copy of a TPC-H database build. You need to put the database into Full recovery and take a full backup prior to playing; the log is still backed up at the end of a full database backup when the database is in Simple recovery, it’s just easier to see the effect in Full recovery.
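To set the database up as described – a minimal sketch assuming the database is called TPCH and D:\TEMP exists (the base backup file name here is just an example):

```sql
-- Switch to Full recovery and take the base full backup
-- so the subsequent log backups have something to chain from.
ALTER DATABASE TPCH SET RECOVERY FULL;
BACKUP DATABASE TPCH TO DISK = 'D:\TEMP\TPCH_base.bak' WITH INIT;
```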

Connection A:

set nocount on;

declare @c_custkey bigint = ( select top 1 c_custkey from customer )

while 1 = 1
	update customer
		set c_comment = cast( newid() as varchar(118) )
	where c_custkey = @c_custkey	


Connection B:

set nocount on;

while 1 = 1
begin
	waitfor delay '00:00:10';

	dbcc sqlperf(logspace);

	backup log TPCH to disk = 'D:\TEMP\TPCH.trn' with init;

	dbcc sqlperf(logspace);

	print '';
	print '';
end


Connection C:

backup database TPCH to disk = 'D:\TEMP\TPCH.bak' with init


In the results below you can clearly see that a) the used portion of the log keeps growing despite successive transaction log backups, until the last log backup, which was taken after the full backup completed; and b) each log backup is approximately the same size – the complete log isn’t backed up over and over again.

If we tot up the number of log pages backed up (54,599 pages, which at 8 KiB per page is roughly 427 MiB) it is in the same ball park if we compare the full backup size with the activity below to one with no activity (49,312 pages ≈ 385 MiB).

Log Size (MB) Log Space Used (%) - Before LOG backup
------------- ------------------
361.0547      31.07892

 Processed 13295 pages for database 'TPCH', file 'tpch_log' on file 1.
 BACKUP LOG successfully processed 13295 pages in 0.501 seconds (207.306 MB/sec).

Log Size (MB) Log Space Used (%) - After LOG backup
------------- ------------------
361.0547      29.1829
Log Size (MB) Log Space Used (%) - Before LOG backup
------------- ------------------
361.0547      50.86552

 Processed 10499 pages for database 'TPCH', file 'tpch_log' on file 1.
 BACKUP LOG successfully processed 10499 pages in 0.230 seconds (356.600 MB/sec).

Log Size (MB) Log Space Used (%) - After LOG backup
------------- ------------------
361.0547      51.31126
Log Size (MB) Log Space Used (%) - Before LOG backup
------------- ------------------
361.0547      72.37045

 Processed 9991 pages for database 'TPCH', file 'tpch_log' on file 1.
 BACKUP LOG successfully processed 9991 pages in 0.267 seconds (292.319 MB/sec).

Log Size (MB) Log Space Used (%) - After LOG backup
------------- ------------------
361.0547      72.88692
Log Size (MB) Log Space Used (%) - Before LOG backup
------------- ------------------
361.0547      94.66149

 Processed 10331 pages for database 'TPCH', file 'tpch_log' on file 1.
 BACKUP LOG successfully processed 10331 pages in 0.278 seconds (290.320 MB/sec).

Log Size (MB) Log Space Used (%) - After LOG backup
------------- ------------------
361.0547      95.2349
Log Size (MB) Log Space Used (%) - Before LOG backup
------------- ------------------
431.0547      98.30132

 Processed 10483 pages for database 'TPCH', file 'tpch_log' on file 1.
 BACKUP LOG successfully processed 10483 pages in 0.199 seconds (411.537 MB/sec).

Log Size (MB) Log Space Used (%) - After LOG backup
------------- ------------------
431.0547      1.471115

Warning over Open Transactions

If you have a long-running transaction – perhaps a process has accidentally left a transaction open – then the log will not shrink down. You can back up the log, and the behaviour is as above in that only new entries are backed up, however the VLF’s will not free up, so your log will keep growing and growing until the open transaction is committed or rolled back. You can use DBCC SQLPERF(LOGSPACE) to see the size of the log, and DBCC OPENTRAN to see the oldest active transaction in a database.
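Putting those two commands together into a quick diagnostic (TPCH is just the demo database name from earlier):

```sql
-- How big/full is each transaction log?
DBCC SQLPERF(LOGSPACE);

-- Show the oldest active transaction in the database,
-- including its start time and SPID.
DBCC OPENTRAN('TPCH');
```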

Database Mirroring / Availability Groups

I’ll simply start by saying: ditto the above! It uses the same technique of tracking the LSN, but instead of a fixed point (i.e. the start of a backup) the minimum LSN follows the state of mirroring/availability group replication.

Both Mirroring and Availability Groups use the transaction log: the log is read sequentially and the information is compressed and streamed across to the mirror or secondary replicas.

Ever had an AG or Mirror disconnect? Notice that your transaction log just keeps growing and growing…

The LSN is not moved forward until the transaction is hardened off on the mirror/secondary replica – that means not just reaching the transaction log on the other machine, but actually being written to the data files (recovery), albeit the databases are permanently in a form of recovery mode.


When taking full backups you need to minimise the amount of transaction log activity – that means scheduling your index maintenance outside that window, scheduling batch jobs accordingly, etc.

When using mirroring or availability groups, size your logs with possible failure of the stream in mind; make sure your log has space to grow if you don’t pre-size your transaction log. If you size your log based on everything working fine then you are setting yourself up for failure – how big will your log grow if the network to your remote site is down for an hour?

Make sure transactions are short lived!

Transaction Log Concepts – Part 1: ACID, Write Ahead Logging, Structure of the Log (VLF’s) and the Checkpoint/Lazywriter/Eagerwriter process Tue, 19 Jan 2016 20:12:45 +0000 I’ve put together a number of posts that will hopefully both inform and put to bed some misconceptions around the use of a Transaction Log within a database context and how SQL Server makes use of it across a number of features such as normal day-to-day transactions, Backups, Checkpoints/Lazywriter/Eagerwriter, Mirroring, Internal structure of the transaction log, DELAYED_DURABILITY feature and In-Memory tables.

Part 1: ACID, Write Ahead Logging, Structure of the Log (VLF’s) and the Checkpoint/Lazywriter/Eagerwriter process.
Part 2: Mirroring/Availability Groups and Full Database Backups
Part 3: Database DELAYED_DURABILITY feature (in the planning)
Part 4: Why my log won’t shrink – reasons to understand Parts 1 to 3 (in the planning)
Part 5: In-Memory Tables

ACID (Atomicity, Consistency, Isolation, Durability) 

Databases store data, applications update data. What happens if in the middle of your update the server or your application crashes – what is your expected result when you revisit your data?

I would hope the answer to that question would be – I’d expect my data to be as it was before I started my update i.e. in a known state rather than either part updated or left in a corrupted state.

A database itself is not ACID compliant; data changes are applied inside transactions – a database product provides a means of applying the ACID rules through a transaction. SQL Server by default provides transactions that satisfy the ACID rules, but it also supplies a means of satisfying just the ACI rules – without Durability – e.g. DELAYED_DURABILITY and the SCHEMA_ONLY option for in-memory tables.

The transaction log is used to apply the Atomicity (all or nothing) and Durability (when it’s written it’s definitely written) rules in ACID, the next section on Write Ahead Logging (WAL) explains how.

For more information on ACID see:

Write Ahead Logging (WAL)

As a programming task, how would you resolve the following requirement: you have 50GB of data, 1GB of memory and your update will cause all 50GB to change; if the update fails then, according to the A and D rules in ACID, the original data (what is termed the known state) must be returned. E.g. 50% of the 50GB has been updated but the program crashes – how do you get back to the known state?

The transaction log is a file separate from the main data; it contains a “log” of updates – in SQL Server, any time you cause a page to be modified, data is written to the transaction log. Each database has its own transaction log file; it can have multiple log files, however they are written to one at a time, unlike data files which stream across using the proportional fill algorithm. The other thing to note is that when a transaction log is first built or auto-grows, the newly created piece of file needs to be zero initialised – unlike data files, which can be initialised using “instant file initialisation”, the entire new piece of the log file must be zeroed on the file system, which may take some time.

What do we mean by Write Ahead Logging?

Say you execute the command UPDATE mytable SET x = 10 WHERE key = 1234 – what happens in SQL Server when applying the A and D rules in ACID?

The page containing the existing data row is fetched into the Buffer Pool and a transaction start marker is written to the transaction log to indicate a transaction for this session has begun; the data is then modified in the Buffer Pool, the modification is written to the transaction log and on commit a transaction end marker is placed in the log. At this point there have been no writes to the data file – the modified data is physically on storage in the transaction log file and in memory in the Buffer Pool.
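If you want to see these log records for yourself there is the undocumented (and unsupported – test boxes only) fn_dblog function; a sketch that shows the begin/modify/commit markers around an UPDATE:

```sql
-- Undocumented: dump the most recent log records for the current database.
-- Around a committed UPDATE expect LOP_BEGIN_XACT, LOP_MODIFY_ROW and LOP_COMMIT_XACT.
SELECT TOP (20)
       [Current LSN],
       Operation,
       Context,
       [Transaction ID]
FROM   fn_dblog(NULL, NULL)
ORDER  BY [Current LSN] DESC;
```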

What happens if SQL Server now fails, i.e. before the committed data has had a chance to be written to the actual data files? When SQL Server starts up and the database goes through the recovery process, the transaction log is read sequentially, bringing the data files up-to-date: committed transactions are rolled forward and uncommitted transactions rolled back, leaving the database in a consistent state.

See Write ahead logging animation

Write ahead logging in SQL Server

Structure of the Transaction Log (VLFs)

When creating a database the log file is specified using the LOG ON section of CREATE DATABASE, when adding files it’s specified in the ADD LOG FILE section of ALTER DATABASE.
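For example (database name, file paths and sizes here are illustrative only):

```sql
-- Create a database with an explicit log file via LOG ON.
CREATE DATABASE Sales
ON PRIMARY
    (NAME = Sales_data, FILENAME = 'D:\Data\Sales.mdf', SIZE = 10GB)
LOG ON
    (NAME = Sales_log, FILENAME = 'E:\Log\Sales.ldf',
     SIZE = 2GB, FILEGROWTH = 512MB);

-- Add a second log file (remember: log files are written to one at a time).
ALTER DATABASE Sales
ADD LOG FILE
    (NAME = Sales_log2, FILENAME = 'E:\Log\Sales2.ldf', SIZE = 2GB);
```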

The physical log file is segmented into "Virtual Log Files" (VLFs); the size and number of the VLFs depend on the initial size and the auto-growth increment. Lots of small VLFs can slow database recovery as well as backup procedures, so make sure you size the log properly. VLFs matter because it is at VLF level that the active portion of the log is defined: VLFs in the active portion of the log cannot be truncated, so as your database receives more transactions the log will continue to grow in size until the VLFs in the active portion are dealt with by moving the Minimum LSN (Log Sequence Number) forward.
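You can inspect the VLF layout of a database directly. A hedged sketch – sys.dm_db_log_info is available from SQL Server 2016 SP2 onwards; on older versions DBCC LOGINFO gives similar information:

```sql
-- One row per VLF: offset, size, and whether it sits in the
-- active portion of the log (vlf_active = 1 cannot be truncated).
SELECT file_id, vlf_begin_offset, vlf_size_mb, vlf_active, vlf_status
FROM sys.dm_db_log_info(DB_ID());
```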

We now have a container in which to place our transactions. A transaction is made up of one or more operations; each transaction has a unique number and so does each transaction log entry – the latter we call the "Log Sequence Number" (LSN). The LSN is a key piece of information and critical to the consistency of data within the database; you will even find an LSN in the header of every data page.

Remember, the purpose of the transaction log is to provide a write ahead log so the Atomicity and Durability rules of ACID can be maintained. The Atomicity rule means that for UPDATEs both the old and new versions of the row need to be logged, because on recovery it may be necessary to roll the transaction back (use the old version) or roll it forward (use the new version).

For more information see: TechNet – SQL Server Transaction Log Architecture and Management

Checkpoint/Lazywriter/Eagerwriter process

I said earlier that when you perform an operation that modifies a page, e.g. an INSERT, the modification is made in the Buffer Pool (creating a dirty page or modifying an existing one) with a corresponding physical write to storage – but to the transaction log (VLF), rather than to the data file itself.

On database recovery, because we have the initial data files (not yet modified) and every page modification made against the database resides in the transaction log, we can bring the database back to a consistent state. This has a number of downsides. If you run out of free pages in the Buffer Pool because they are all "dirty", and you aren't modifying existing dirty pages, then you can't do any more page modifications. Also, the transaction log can become extremely large, meaning you may run out of storage, and recovery time will suffer because on start-up the database has to read that very large transaction log during the "recovery" process.

There is a restriction: the transaction log cannot be reduced in size, i.e. VLFs freed up, until dirty pages have been written to the data files on storage. You can back up the transaction log until you are blue in the face; the backup uses the LSN to determine where the previous backup finished, so it never backs up the same part of the log twice. For example, if you have a 50GB used portion of the log and have never done a log backup (this assumes Full recovery), the first log backup will be 50GB – the part of the log never backed up – and the second will only back up what is new, so if there have been no new modifications it will be very small. To be clear: if your used portion of the transaction log is consistently 50GB over 5 transaction log backups, it is not the case that the backup files will be 50GB, 50GB, 50GB, 50GB and 50GB; it will be more like 50GB, 10KB, 10KB, 10KB and 10KB (10KB is just to illustrate, not an actual value determined by SQL Server).
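The log backup chain described above looks like this in practice (database name and backup paths are illustrative):

```sql
-- First log backup: everything never previously backed up
-- (in the example above, 50GB).
BACKUP LOG Sales TO DISK = 'E:\Backup\Sales_log_1.trn';

-- Subsequent backups start from the LSN the previous backup
-- finished at, so with little new activity these files are tiny.
BACKUP LOG Sales TO DISK = 'E:\Backup\Sales_log_2.trn';
BACKUP LOG Sales TO DISK = 'E:\Backup\Sales_log_3.trn';
```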

Dirty pages are linked to the transaction log through the page header: each page header contains a min_LSN, which reflects the most recent LSN at which the data on that page exists within the transaction log (remember – WAL). That gives us a number we can use to determine which pages have been written to the data files, and therefore the minimum LSN in the transaction log that needs to remain active in order to provide database recovery. For example, say we make a modification that changes pages 1 – 128. WAL requires those modifications be written to the transaction log, so we now have 128 dirty pages (1 – 128) in the Buffer Pool and, for the purpose of this example, 128 operations in the transaction log at LSN 10001 – 10128. To provide database recovery we need LSN 10001 – 10128, because none of the dirty pages have been written to data file storage. Say we now write pages 1 – 64 to the data file: the minimum LSN can move forward to 10065, so we only need to keep LSN 10065 – 10128 to provide database recovery, because our data files already contain the data whose log operations sit at LSN 10001 – 10064. The VLFs holding LSN 10001 – 10064 can be marked inactive if they don't contain operations above LSN 10064, and the inactive VLFs can then be purged for reuse (requirements for backups aside – I will cover that in the next post).
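When the active portion of the log refuses to shrink, SQL Server will tell you what is pinning it. A quick check – the column is real; the database name is illustrative:

```sql
-- Values such as 'LOG_BACKUP' or 'ACTIVE_TRANSACTION' explain why
-- VLFs cannot yet be marked inactive and reused.
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'Sales';
```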


The checkpoint process looks for dirty pages in the Buffer Pool; each page header contains the min_LSN, and if the page's min_LSN is greater than the last checkpoint LSN (stored in the database boot page) then the page is written to data file storage and marked "clean". The pages are left in the Buffer Pool – the checkpoint's job is really to cut down the time taken should the database require recovery: the more recent the checkpoint, the fewer operations in the log to roll forward or roll back.
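You can issue a checkpoint manually and watch the dirty page count fall. A sketch using the real DMV sys.dm_os_buffer_descriptors, whose is_modified flag marks dirty pages:

```sql
-- Count dirty pages in the Buffer Pool for the current database.
SELECT COUNT(*) AS dirty_pages
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID() AND is_modified = 1;

-- Force a checkpoint: dirty pages are written to the data files
-- and marked clean, but remain in the Buffer Pool.
CHECKPOINT;

SELECT COUNT(*) AS dirty_pages_after
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID() AND is_modified = 1;
```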

See an animation of the overall concept of the Checkpoint process.


The lazywriter also writes dirty pages to storage, but unlike the checkpoint process it removes the pages from the Buffer Pool; it focuses on older, less used pages, with the goal of keeping a reasonable number of free pages available for pulling data in from storage or creating new data. The checkpoint is more aggressive at batching up dirty pages and writing them out to storage, although in more recent versions it is less problematic and doesn't entirely swamp the IO subsystem.


For minimally logged operations, for example BULK INSERT, the eagerwriter writes dirty pages to storage without waiting for the operation to complete, so that it doesn't starve the Buffer Pool of available buffers.


Write ahead logging provides a method of applying the Atomicity and Durability rules of ACID. Data is modified in the Buffer Pool and written to the transaction log; the transaction log contains details of every operation that modifies database pages (and thus anything in the database), so it can be used to recover the data files to a known state – rolling transactions forward or rolling them back.

