Merge – Incremental Soft Delete


Sometimes we may have situations where we only load a day’s worth of data into our data warehouse. We may also need to keep the destination data in line with the incoming data.

For instance, imagine we have a shopping basket filled with items one day, which we load into our data warehouse, but the next day (for whatever reason) two of the products are no longer valid. How can we remove these records without deleting all of the other values in our data warehouse for the other baskets that aren’t in the incoming data set?

The answer is with an incremental MERGE: 

--Declare temp tables to store data
DECLARE @Source TABLE
(
	PackageId INTEGER,
	JournalId INTEGER,
	Amount MONEY
);

DECLARE @Target TABLE
(
	PackageId INTEGER,
	JournalId INTEGER,
	Amount MONEY,
	IsDeleted BIT NOT NULL DEFAULT 0 --Soft-delete flag (the column name here is illustrative)
);

--Insert some dummy data
INSERT INTO @Source VALUES (1,1,10),(1,2,20),(1,3,30),(2,1,40),(2,2,50);

--Now merge it into the destination table.
MERGE INTO @Target AS TARGET
USING @Source AS SOURCE
	ON TARGET.PackageId = SOURCE.PackageId
	AND TARGET.JournalId = SOURCE.JournalId

--When we match on the merge key, then update the destination table with the new value
WHEN MATCHED THEN
	UPDATE SET
		TARGET.Amount = SOURCE.Amount

--When we don't match, then we're a new record, so insert the new record into the destination
WHEN NOT MATCHED BY TARGET THEN
	INSERT (PackageId, JournalId, Amount)
	VALUES (SOURCE.PackageId, SOURCE.JournalId, SOURCE.Amount);

--Now show our results.
SELECT * FROM @Source;
SELECT * FROM @Target;

--Now let's get rid of some data and carry out an update to show how we would handle a merge for
--an incremental load
DELETE FROM @Source WHERE PackageId = 1 OR (PackageId = 2 AND JournalId = 1);
UPDATE @Source SET Amount = Amount + 20;

--Use a CTE to limit the destination rows we want to match against, because if we've only loaded
--1 day's worth of values in from our source system, we could end up deleting lots of data by accident
WITH TARGET_CTE AS
(
	SELECT T.PackageId, T.JournalId, T.Amount, T.IsDeleted
	FROM @Source S
	INNER JOIN @Target T ON T.PackageId = S.PackageId
)
MERGE INTO TARGET_CTE AS TARGET
USING @Source AS SOURCE
	ON TARGET.PackageId = SOURCE.PackageId
	AND TARGET.JournalId = SOURCE.JournalId

--Again, update any existing records
WHEN MATCHED THEN
	UPDATE SET
		TARGET.Amount = SOURCE.Amount

--Again, insert any new records
WHEN NOT MATCHED BY TARGET THEN
	INSERT (PackageId, JournalId, Amount)
	VALUES (SOURCE.PackageId, SOURCE.JournalId, SOURCE.Amount)

--Here is the interesting part; we're going to soft delete any records from the
--destination table that do not exist in the incoming data set.
WHEN NOT MATCHED BY SOURCE THEN
	UPDATE SET TARGET.IsDeleted = 1;

--Show our final results. As you can see, in the second merge we only had 1 record for PackageId = 2;
--the other JournalId didn't exist in the source table, so it was soft deleted from the destination.
SELECT * FROM @Source;
SELECT * FROM @Target;

Go ahead and run the above code and see how it all works. Hopefully this will help you on your way.

Paul White gives a great example here at DBA.StackExchange


Data Type Changing in UNION


An interesting issue came up recently where a UNION was creating a rounding problem. One side of the UNION contained a field with data type NUMERIC(38,2) and the equivalent field on the other side was DECIMAL(24,9). One field had the bigger precision (the total number of digits the type can hold) and the other had the bigger scale (the number of digits to the right of the decimal point).

Let’s see it in action:
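As a minimal sketch (the variable names are just for illustration), imagine one value declared with the bigger precision and one with the bigger scale:

DECLARE @BigPrecision NUMERIC(38,2) = 1000.01;     --up to 36 digits to the left of the decimal point
DECLARE @BigScale     DECIMAL(24,9) = 0.012014441; --9 digits to the right of the decimal point

SELECT @BigPrecision AS Amount, 'NUMERIC(38,2)' AS DeclaredType;
SELECT @BigScale     AS Amount, 'DECIMAL(24,9)' AS DeclaredType;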
And when we union them together: 
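Carrying on with the two variables above, something like this shows the type the UNION produces (SQL_VARIANT_PROPERTY is only there to expose the precision and scale of the combined column):

SELECT u.Amount,
       SQL_VARIANT_PROPERTY(CONVERT(sql_variant, u.Amount), 'BaseType')  AS BaseType,
       SQL_VARIANT_PROPERTY(CONVERT(sql_variant, u.Amount), 'Precision') AS [Precision],
       SQL_VARIANT_PROPERTY(CONVERT(sql_variant, u.Amount), 'Scale')     AS Scale
FROM
(
	SELECT @BigPrecision AS Amount
	UNION ALL
	SELECT @BigScale
) AS u;

--The combined column comes back as NUMERIC(38,2): the scale has been squeezed so that the 36 possible
--digits to the left of the decimal point still fit, and 0.012014441 is rounded to 0.01.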
The answer lies in how SQL Server combines the two types: the result of the UNION can’t keep both the biggest precision and the biggest scale within the 38-digit limit, so SQL Server reduces the scale rather than risk truncating the digits to the left of the decimal point. After all, I’d be far more annoyed if I lost £1000 than 0.0120144414324p.

The solution to this problem is to cast explicitly to the same data type in both halves of the query, so that you know exactly which data type will come out of the other end.
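For example, a minimal sketch of the fix (DECIMAL(38,9) is just an illustrative choice; pick the precision and scale that actually fit your data):

SELECT CAST(@BigPrecision AS DECIMAL(38,9)) AS Amount
UNION ALL
SELECT CAST(@BigScale AS DECIMAL(38,9));

--Both sides now agree on the type, so nothing gets silently rounded on its way through the UNION.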

SQL Server – Find all non empty tables


So there are a couple of ways to do this: one is to go to every table in your database and check whether it contains data; the other is to go to the indexes and check the row counts. Here I’ll show you how to do both:

1: Hit every table (don’t do this in a production environment or your DBAs will go mental).

IF OBJECT_ID('tempdb..#Tables') IS NOT NULL
	DROP TABLE #Tables;

--Create some temp tables and declare variables
CREATE TABLE #Tables
(
	TableId INT IDENTITY(1,1),
	TableName VARCHAR(100),
	HasRows BIT,
	Processed BIT
);

DECLARE @Result TABLE
(
	Result INT
);

DECLARE @TableCount INT = 0, @i INT = 1;

DECLARE @CurrentTable NVARCHAR(150);

--Get the name of each table in the database wrapped in square brackets.
INSERT INTO #Tables (TableName, HasRows, Processed)
SELECT '[' + s.name + '].[' + o.name + ']', 0, 0
FROM sys.objects o
INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
WHERE o.type = 'U'
--AND o.name IN (N'Tbl_Name', N'Tbl_OtherTable');

SELECT @TableCount = COUNT(1) FROM #Tables;

--Loop through all of the tables and check if there are rows
WHILE @i <= @TableCount
BEGIN

	SELECT @CurrentTable = 'SELECT TOP 1 1 FROM ' + TableName FROM #Tables WHERE TableId = @i;

	--PRINT @CurrentTable;

	--Clear down the result table so the previous iteration can't leak into this one
	DELETE FROM @Result;

	INSERT INTO @Result EXEC sp_executesql @CurrentTable;

	--Populate the result table
	UPDATE t SET t.HasRows = r.Result
	FROM #Tables t
	INNER JOIN @Result r ON t.TableId = @i;

	SET @i = @i + 1;

END

--Get the rows we're interested in
SELECT TableName
FROM #Tables
WHERE HasRows = 1;


The other way is to hit the indexes and check the row counts:

SELECT DISTINCT s.name AS [Schema], o.name AS [Table]--, i.name AS [IndexName], p.[rows] AS [RowCount]
FROM sys.objects o
INNER JOIN sys.schemas s	ON s.schema_id = o.schema_id
INNER JOIN sys.indexes i	ON i.object_id = o.object_id
INNER JOIN sys.partitions p ON p.index_id = i.index_id
								AND p.object_id = i.object_id
WHERE o.type = 'U'
AND p.[rows] > 0


You can remove the DISTINCT and uncomment the [IndexName] and p.[rows] columns to get the count of rows in each index.



SSIS Cache Transform – Failed to prepare the cache for new data


Encountered this error today when using a cache transform to populate a lookup cache with some data that I’d look up against multiple times:


I hadn’t set up the cache transform to use a .caw file (more info here), so there were no permissions issues with writing to disk; all that should have been happening was writing to memory. Here’s what my SSIS package looked like when the error was received:


In the above we get the number of days’ worth of data to load and then load each day separately into the destination table (we load each day separately because we hit some external limits if we don’t). The issue here was that you can’t overwrite or append data to the cache, so all that was required was to move the ‘Load the lookup cache’ step outside of the for loop.


After the cache was moved out of the loop, it loaded successfully.


MSBuild – Missing values for the following SqlCmd variables


Today our automated build broke with a strange error:

Missing values for the following SqlCmd variables: DW

We have a couple of references to other databases in our build, specifically to deal with the warnings thrown by temporary cross-database references. One of them is master and the other is another database on the same server:


The build was failing due to the reference to the DW database. I’d removed the reference and added it back in to see if that made a difference, and checked that it matched what was in live. No dice; still broken.

What I’d overlooked was the project properties. When a new database reference is added it automatically comes with a new SQLCMD variable that you can set, which looks like this:


Having this variable defined was causing the build to try to use it, which wasn’t required. Simply removing the variable solved the problem, as our build server already has a copy of the referenced database.


Scripting DROP and CREATE SQL Server Agent Jobs


Something to watch out for..

When scripting a SQL Server Agent job as DROP and CREATE in SSMS, watch out: the generated script uses the job name to check whether the job exists, but the job_id to delete it. If you want your scripts to be re-runnable, make sure you change the generated script as per the below:

Before: notice that the @job_id is used. When you drop this job and recreate it, a new job_id will be generated, meaning you can’t run your script twice.

/****** Object:  Job [SSIS_Framework_Demand_Load_Full]    Script Date: 07/11/2016 15:49:17 ******/
IF  EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'SSIS_Framework_Demand_Load_Full')
EXEC msdb.dbo.sp_delete_job @job_id=N'd0370c2e-6fae-4717-8726-8f481a3b47f1', @delete_unused_schedule=1

Make sure you change your code to use the job_name:

/****** Object:  Job [SSIS_Framework_Demand_Load_Full]    Script Date: 07/11/2016 15:49:17 ******/
IF  EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'SSIS_Framework_Demand_Load_Full')
EXEC msdb.dbo.sp_delete_job @job_name=N'SSIS_Framework_Demand_Load_Full', @delete_unused_schedule=1
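If you want the whole drop-and-create script to be re-runnable, the pattern ends up looking something like this minimal sketch (the job name is the one from the example above, and the sp_add_job options are cut right down; a real generated script will also add job steps, schedules and notifications):

IF EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'SSIS_Framework_Demand_Load_Full')
	EXEC msdb.dbo.sp_delete_job @job_name = N'SSIS_Framework_Demand_Load_Full', @delete_unused_schedule = 1;

--Recreate the job (options trimmed for brevity)
EXEC msdb.dbo.sp_add_job @job_name = N'SSIS_Framework_Demand_Load_Full', @enabled = 1;

--Attach the job to the local server so the agent will actually run it
EXEC msdb.dbo.sp_add_jobserver @job_name = N'SSIS_Framework_Demand_Load_Full', @server_name = N'(LOCAL)';

Because everything is keyed on the job name rather than a generated job_id, you can run the script as many times as you like.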

Good luck!

Using Microsoft Azure Machine Learning


There are many new technologies arriving all the time, but unless you’ve been living under a rock, you’ll have already heard about the impact the cloud is having on day-to-day business.

One of the most exciting cloud offerings is the Microsoft Azure platform, based in data centres across Europe. Here you’ll be able to find everything from Linux virtual machines to web servers to HDInsight clusters.

I’ve had the opportunity to travel to the Microsoft data centre in Dublin, and it is as impressive as it is large. Moving to commodity hardware as the backbone of the platform allows normal users to gain access to massive clusters of compute power, priced on usage. Imagine asking your boss if you could purchase a 100-node cluster for your server room just to play with for a few days. The Microsoft Azure platform grants you that ability for a fraction of the cost.

One of the most powerful offerings however is Microsoft’s cloud based Machine Learning workspace. This offering allows users to make use of powerful machine learning algorithms to solve the kind of problems procurement have been wanting to solve for years, namely how can we quickly improve our forecasts and predict outcomes?

Given that I work for an airline, the easiest place to start is with our very own type of product: flights. One of the things we need to make sure of as a business is that our flights arrive on time and our customers arrive safely and happily at their destinations. With this in mind, is there any way we can use machine learning to predict flight delays? Of course there is. I’ll show you how below.

Step 1: Get your Azure Subscription

You can sign up for a free account here. This will allow you to test out Microsoft Azure and more specifically to this blog, Microsoft Azure Machine Learning.

Step 2: Sign in

Once you’ve signed into the portal you’ll see the dashboard.


Now go to Browse > Machine Learning


At the moment you’ll get taken back to the old Azure portal (for everything other than ML, always use the new one; it’s leaps and bounds ahead).

Step 3: Create a new experiment by clicking the big + sign that says NEW


You’ll be offered the screen below, and you can go ahead and click Quick Create to set up your new experiment. You’ll need to add a new workspace name and select a location (of the available options, West Europe is the closest to us here in Leeds). You will also need to create a new storage account to store your experiment data and the results of your experiments.


Azure will then go ahead and create you a new Azure Machine Learning workspace. Pretty simple, huh!

Step 4: Once your workspace has been created you can access it by clicking the Sign-in to ML Studio button:


Step 5: Create a new experiment

On the Experiments tab click the + NEW button and select Blank Experiment


Now you’ll be presented with a blank experiment, and now we can get started!


Pause to reflect

Let’s take a minute to explore the ML workspace. On the left of the canvas you’ll see Saved Datasets, Data Format Conversions, Feature Selection and many other areas; these contain the modules we can use during our experiment. Machine learning in the format we’re using consists of a small number of parts:

1: Acquire data – Here we get the data we want to use
2: Select Columns – Here we select which values in our data set are important
3: Split data – This is an especially important concept to understand. In this step we need to split our dataset into two parts: one for training the model and one for scoring the model.

4: Training Set: This set contains the thing we’re attempting to predict; in this case it will contain the columns used to do the predicting along with the value we’re trying to predict. It’s this set that the ML algorithm (we’ll learn about algorithms later) will use to work out which columns are important and how they affect the outcome. In our case we’re going to predict whether a flight is delayed by more than 15 minutes or not, i.e. delayed (1) or not delayed (0).

5: Scoring Set: In this set we don’t provide the outcome, i.e. whether the flight was delayed or not; rather, we let the algorithm determine this value. Because we already know the actual outcome, we can test the predicted value created by the model against the value we already know. This is how we score our model and work out how accurate it is.

Using a very common business convention, we’ll split our data 80/20: 80% for training the model and 20% for scoring it.

6: Algorithm Selection. This is where we pick which algorithm we’re going to use. There are a whole bunch to choose from, and you can use this cheat sheet to work out which one is best suited to your problem. Because we’re only predicting whether the flight was delayed (1) or not delayed (0), we’ll be using a classification algorithm.

!Remember! This is SCIENCE! This means we don’t have to get the algorithm selection correct the first time. It’s called an experiment for a reason, so run as many different algorithms as you can and note down all of the results until you find the best fit. Maybe what you’re trying to predict can’t be predicted; that’s fine, science is all about having a hypothesis and testing it to see whether it’s true or not.

7: Evaluate model. This is where we test how well the algorithm has predicted what we asked it to. We’ll learn more about this later.

Now let’s get started!

Step 6: Get your data

Just to make things so much simpler for us, Microsoft have included an On Time Performance dataset that we can use. You can find it by clicking Saved Datasets > Samples > Flight Delays Data


Drag this over to the first space on your experiment canvas.

Now you can right click on the circle at the bottom and select Visualize. This will show you the data and give you some distribution graphs. Note that the ArrDel15 column contains 1s and 0s. This is Boolean logic: 1 meaning true (delayed by more than 15 minutes) and 0 meaning false (not delayed).


Step 7: Select columns.

Here we need to go to the Manipulation menu and select ‘Select Columns in Dataset’

Next you’ll need to grab the small circle at the bottom of the Flight Delays Data dataset and drag down to connect it with the top circle of the Select Columns in Dataset transformation


Now click the Launch Column Selector button to pick the columns we want to use


Don’t select all of the columns, as selecting any of the delay columns will let the algorithm cheat by giving it the answers, i.e. if [ArrDelay] is over 15 then the chance of [ArrDel15] being 1 is 100%, and the algorithm will pick up on this. It also makes sense that there’s a higher chance of the flight having an arrival delay if it had a departure delay, so let’s exclude [DepDelay] and [DepDel15]. Of the delay columns, we only need to include [ArrDel15], as this is the column we’re trying to predict.



Step 8: Split the data

This is where we split our data into training and scoring datasets. From the menu on the left, select Data Transformation > Sample and Split


Now click on Split Data and you’ll see the properties pop up in Properties panel. Set the ‘Fraction of rows in the first output dataset’ to be 0.8, which is the 80% we mentioned before.

Step 9: Train the model

The Train Model transformation takes two inputs and has one output. The inputs are one side of the split data (this is our training set), and the algorithm we want to use.

From the menu select Train > Train Model, drag it onto your canvas and join the right-hand input to the output of the Split Data transformation. Then click on Train Model and, in the properties panel on the right, click ‘Launch column selector’. This is where we pick the field we want to predict, in our case [ArrDel15], which is the field that shows whether the flight was late arriving by more than 15 minutes.


Now from the menu select Initialize Model > Classification > Two Class Boosted Decision Tree, drag it onto the canvas and join it up to the left input of the Train Model transformation. Why I’ve chosen this algorithm is beyond the scope of this blog post and you can find more information here. For now you can test all of the Classification algorithms to see which one gives you the best fit.



Step 10: Score the model

This step will test how well our model has predicted our outcome.

From the menu select Machine Learning > Score > Score Model and drag it onto the canvas. Now we can connect the right hand output from the Split Data transformation up to the right hand input of the Score Model transform and the output of the Train Model up to the left hand input.


Step 11: Evaluate Model

This will show us the results of our experiment and whether we managed to predict our outcome. Remember this is an experiment and we may well find out we’re missing something or we need to find out more about our data, which is all fine.

From the menu select Machine Learning > Evaluate > Evaluate Model and connect it to the output of the Score Model transform.


Step 12: Run our experiment!!

Now we can run our experiment by clicking the run button.


How did we do? Right click on the output of the Evaluate Model and select Visualize to find out. Here you can see an output of how well the model was able to predict whether each flight would arrive within 15 minutes or not. Below are the results we’ve achieved.


If you pay attention to the bottom left graph you’ll see the mix of predictions and results that the algorithm achieved


True Positive: The number of records the algorithm predicted would be true (delayed) that were true (delayed).
False Positive: The number of records the algorithm predicted would be true (delayed) that were actually false (on time).
False Negative: The number of records the algorithm predicted would be false (on time) that were true (delayed).
True Negative: The number of records the algorithm predicted would be false (on time) that were false (on time).
Accuracy: This is the ratio of correctly predicted results, i.e. (True Positives + True Negatives) / the total number of records.
You can find out more information about Precision, Recall, and F1 Score here.

It’s obvious that we need more information to be able to accurately predict whether a flight will be delayed or not, but this gives a great introduction to how to use Azure Machine Learning. We can now go on to augment our dataset with other predictor values; maybe weather or strike information would help. The possibilities are out there, and it’s your job as a data scientist to find them! Good luck.