SQL Server – Find all non empty tables

Standard

So there are a couple of ways to do this: you can either go to every table in your database and check whether it contains data, or you can go to the indexes and check their row counts. Here I'll show you how to do both.

1: hit every table (don't do this in a production environment or your DBAs will go mental).

IF OBJECT_ID('tempdb..#Tables') IS NOT NULL
BEGIN
	DROP TABLE #Tables;
END

--Create some temp tables and declare variables
CREATE TABLE #Tables
(
TableId INTEGER IDENTITY(1,1),
TableName NVARCHAR(300), --wide enough for a bracketed [schema].[table] pair (sysname is 128 characters)
HasRows BIT,
Processed BIT
);


DECLARE @Result TABLE
(
Result INT
);

DECLARE @TableCount INT = 0, @i INT = 1;

DECLARE @CurrentTable NVARCHAR(400); --holds the dynamic SQL for the current table

--Get the name of each table in the database wrapped in square brackets.
INSERT INTO #Tables (TableName, HasRows, Processed)
SELECT '[' + s.name + '].[' + o.name + ']', 0, 0
FROM sys.objects o
INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
WHERE o.type = 'U'
--AND o.name IN (N'Tbl_Name','Tbl_OtherTable');

SELECT @TableCount = COUNT(1) FROM #Tables;

--Loop through all of the tables and check if there are rows
WHILE @i <= @TableCount
BEGIN

SELECT @CurrentTable = 'SELECT TOP 1 1 FROM ' + TableName FROM #Tables WHERE TableId = @i;

--PRINT @CurrentTable;

INSERT INTO @Result EXEC sp_executesql @CurrentTable;

--Populate the result table
UPDATE t SET t.HasRows = r.Result
FROM #Tables t
INNER JOIN @Result r ON t.TableId = @i; --only matches when the probe query returned a row

SET @i = @i + 1;
DELETE FROM @Result;

END

--Get the rows we're interested in
SELECT *
FROM #Tables
WHERE HasRows = 1;

The other way is to hit the indexes and check the row counts:

SELECT DISTINCT s.name [Schema], o.name [Table]--, i.name [IndexName], p.[rows] [RowCount]
FROM sys.objects o
INNER JOIN sys.schemas s	ON s.schema_id = o.schema_id
INNER JOIN sys.indexes i	ON i.object_id = o.object_id
INNER JOIN sys.partitions p ON p.index_id = i.index_id
								AND p.object_id = i.object_id
WHERE o.type = 'U'
AND p.[rows] > 0
ORDER BY s.name, o.name

You can remove the DISTINCT and uncomment i.name and p.rows to get the count of rows in each index.
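
For example, the per-index version would look like this (note that sys.partitions.[rows] is documented as approximate, so don't rely on it where you need an exact count):

SELECT s.name [Schema], o.name [Table], i.name [IndexName], p.[rows] [RowCount]
FROM sys.objects o
INNER JOIN sys.schemas s	ON s.schema_id = o.schema_id
INNER JOIN sys.indexes i	ON i.object_id = o.object_id
INNER JOIN sys.partitions p ON p.index_id = i.index_id
								AND p.object_id = i.object_id
WHERE o.type = 'U'
AND p.[rows] > 0
ORDER BY s.name, o.name;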


SSIS Cache Transform – Failed to prepare the cache for new data

Standard

Encountered this error today when using a Cache Transform to populate a lookup cache with some data that I needed to look up against multiple times:

ct1

I hadn't set up the Cache Transform to use a .caw file (more info here), so there were no permissions issues with writing to disk; all that should have been happening was a write to memory. Here's what my SSIS package looked like when the error was received:

ct2

In the above we get the number of days' worth of data to load and then load each day separately into the destination table (we load each day individually because we hit some external limits if we don't). The issue here was that you can't overwrite or append data to the cache. So all that was required was to move the 'Load the lookup cache' step outside of the for loop.

ct3

After the cache was moved out of the loop, it loaded successfully.


MSBuild – Missing values for the following SqlCmd variables

Standard

Today our automated build broke with a strange error:

Missing values for the following SqlCmd variables: DW

We have a couple of references to other databases in our build, specifically to deal with warnings thrown by temporary cross-database references. One of them is master and the other is another database on the same server:

refs1

The build was failing due to the reference to the DW database. I removed the reference and added it back in to see if that made a difference, and checked that it matched what was in live. No dice, still broken.

What I’d overlooked was the project properties. When a new database reference is added it automatically comes with a new SQLCMD variable that you can set, which looks like this:

refs2

Having this variable was causing the build to try to use it, which wasn't required. Simply removing the variable solved the problem, as our build server already has a copy of the referenced database.
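
If you do actually need the variable (for example, to point the reference at a differently named database per environment), the alternative to deleting it is to supply a value at deploy time instead. A sketch, assuming you publish the built dacpac with SqlPackage; the server, dacpac and database names here are placeholders:

REM placeholder names - swap in your own server, dacpac and target database
SqlPackage.exe /Action:Publish /SourceFile:MyProject.dacpac /TargetServerName:MyServer /TargetDatabaseName:MyDatabase /v:DW=MyDwDatabase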


Scripting DROP and CREATE SQL Server Agent Jobs

Standard

Something to watch out for...

When scripting a SQL Server Agent job as DROP and CREATE in SSMS, watch out: the generated script uses the job name to check whether the job exists, but the job_id to delete it. If you want your scripts to be re-runnable then make sure you change the generated script as per the below:

Before: Notice that @job_id is used. When you drop this job and recreate it, a new job_id will be generated, meaning you can't run your script twice.

/****** Object:  Job [SSIS_Framework_Demand_Load_Full]    Script Date: 07/11/2016 15:49:17 ******/
IF  EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'SSIS_Framework_Demand_Load_Full')
EXEC msdb.dbo.sp_delete_job @job_id=N'd0370c2e-6fae-4717-8726-8f481a3b47f1', @delete_unused_schedule=1
GO

Make sure you change your code to use the job_name:

/****** Object:  Job [SSIS_Framework_Demand_Load_Full]    Script Date: 07/11/2016 15:49:17 ******/
IF  EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'SSIS_Framework_Demand_Load_Full')
EXEC msdb.dbo.sp_delete_job @job_name=N'SSIS_Framework_Demand_Load_Full', @delete_unused_schedule=1
GO

Good luck!

Using Microsoft Azure Machine Learning

Standard

There are many new technologies arriving all the time, but unless you've been living under a rock, you'll have already heard about the impact the cloud is having on business day to day.

One of the most exciting cloud offerings is the Microsoft Azure platform, based in data centres across Europe. Here you'll be able to find everything from Linux virtual machines to web servers to HDInsight clusters.

I've had the opportunity to travel to the Microsoft data centre in Dublin, and it is as impressive as it is large. Moving to commodity hardware as the backbone of the platform allows normal users to gain access to massive clusters of compute power, priced on usage. Imagine asking your boss if you could purchase a 100-node cluster for your server room just to play with for a few days? The Microsoft Azure platform grants you that ability for a fraction of the cost.

One of the most powerful offerings, however, is Microsoft's cloud-based Machine Learning workspace. This offering allows users to make use of powerful machine learning algorithms to solve the kind of problems procurement teams have been wanting to solve for years, namely: how can we quickly improve our forecasts and predict outcomes?

Given that I work for an airline, the easiest place to start is with our very own type of product: flights. One of the things we need to make sure of as a business is that our flights arrive on time and our customers arrive safely and happily at their destinations without any delays. With this in mind, is there any way we can use machine learning to predict flight delays? Of course there is. I'll show you how below.

Step 1: Get your Azure Subscription

You can sign up for a free account here. This will allow you to test out Microsoft Azure and more specifically to this blog, Microsoft Azure Machine Learning.

Step 2: Sign in

Once you’ve signed into the portal you’ll see the dashboard.

ML

Now go to Browse > Machine Learning

ML2

At the moment you'll get taken back to the old Azure portal (for everything other than ML, always use the new one; it's leaps ahead)

Step 3: Create a new experiment by clicking the big + sign that says NEW

ML3

You'll be offered the below screen, where you can go ahead and click Quick Create to set up your new experiment. You'll need to add a new workspace name and select a location (of the options, West Europe is closest to us here in Leeds). You will also need to create a new storage account to store your experiment data and the results of your experiments.

ML4

Azure will then go ahead and create you a new Azure Machine Learning workspace. Pretty simple, huh!

Step 4: Once your workspace has been created you can access it by clicking the Sign-in to ML Studio button:

ML5

Step 5: Create a new experiment

On the Experiments tab click the + NEW button and select Blank Experiment

ML6

You'll now be presented with a blank experiment, and we can get started!

ML7

Pause to reflect

Let's take a minute to explore the ML workspace. On the left of the canvas you'll see Saved Datasets, Data Format Conversions, Feature Selection and many other areas; these contain the modules we can use during our experiment. Machine learning, in the format we're using, consists of a small number of parts:

1: Acquire data – Here we get the data we want to use.
2: Select Columns – Here we select which values in our data set are important.
3: Split data – This is an especially important concept to understand. In this step we need to split our dataset into two parts, one for training the model and one for scoring the model.

4: Training Set: This set contains the thing we're attempting to predict; it holds the columns used to do the predicting along with the value we're trying to predict. It's this set that the ML algorithm (we'll learn about algorithms later) will use to work out which columns are important and how they affect the outcome. In our case we're going to predict whether a flight is delayed by more than 15 minutes or not, i.e. delayed (1) or not delayed (0).

5: Scoring Set: In this set we don't provide the outcome, i.e. whether the flight was delayed or not; rather, we let the algorithm determine it. Because we already know the actual outcome, we can test the predicted value created by the model against the value we already know. This is how we score our model and work out how accurate it is.

Using a very common business concept, we'll split our data 80/20: 80% for training the model and 20% for scoring it.

6: Algorithm Selection. This is where we pick which algorithm we're going to use. There are a whole bunch to choose from, and you can use this cheat sheet to work out which one is best suited to your problem. Because we're only predicting whether the flight was delayed (1) or not delayed (0), we'll be using a Classification algorithm.

!Remember! This is SCIENCE! This means we don’t have to get the algorithm selection correct the first time. It’s called an experiment for a reason, so run as many different algorithms as you can and mark down all of the results until you find the best fit. Maybe what you’re trying to predict can’t be predicted, that’s fine, science is all about having a hypothesis and testing it to see whether it’s true or not.

7: Evaluate model. This is where we test how well the algorithm has predicted what we asked it to. We’ll learn more about this later.

Now let’s get started!

Step 6: Get your data

Just to make things so much simpler for us, Microsoft have included an On Time Performance dataset that we can use. You can find it by clicking Saved Datasets > Samples > Flight Delays Data

ML8

Drag this over to the first space on your experiment canvas.

Now you can right-click on the circle at the bottom and select Visualize. This will show you the data and give you some distribution graphs. Note that the ArrDel15 column contains 1s and 0s. This is boolean logic: 1 meaning true (delayed by more than 15 minutes) and 0 meaning false (not delayed).

ML21

Step 7: Select columns.

Here we need to go to the Manipulation menu and select ‘Select Columns in Dataset’

ML9

Next you'll need to grab the small circle at the bottom of the Flight Delays Data dataset and drag down to connect it to the top circle of the Select Columns in Dataset transformation.

ML15

Now click the Launch Column Selector button to pick the columns we want to use

ML11

Don't select all of the columns: selecting any of the delay columns will let the algorithm cheat by giving it the answers, i.e. if [ArrDelay] is over 15 then the chance of [ArrDel15] being 1 is 100%, and the algorithm will pick up on this. It also makes sense that there's a higher chance of the flight having an arrival delay if it had a departure delay, so let's exclude [DepDelay] and [DepDel15]. Of the delay columns we only need to include [ArrDel15], as this is the one we're trying to predict.

ML12


Step 8: Split the data

This is where we split our data into training and scoring datasets. From the menu on the left, select Data Transformation > Sample and Split

ML13

Now click on Split Data and you'll see its properties appear in the Properties panel. Set the 'Fraction of rows in the first output dataset' to 0.8, which is the 80% we mentioned before.

Step 9: Train the model

The Train Model transformation takes two inputs and has one output. The inputs are one side of the split data (this is our training set), and the algorithm we want to use.

From the menu select Train > Train Model, drag it onto your canvas and join the right-hand input to the output of the Split Data transformation. Then click on Train Model and, in the properties panel on the right, click 'Launch column selector'. This is where we pick the field we want to predict: in our case [ArrDel15], the field that shows whether the flight was late arriving by more than 15 minutes.

ML17

Now from the menu select Initialize Model > Classification > Two-Class Boosted Decision Tree, drag it onto the canvas and join it up to the left input of the Train Model transformation. Why I've chosen this algorithm is beyond the scope of this blog post; you can find more information here. For now, you can test all of the Classification algorithms to see which one gives you the best fit.


Step 10: Score the model

This step will test how well our model has predicted our outcome.

From the menu select Machine Learning > Score > Score Model and drag it onto the canvas. Now we can connect the right hand output from the Split Data transformation up to the right hand input of the Score Model transform and the output of the Train Model up to the left hand input.


Step 11: Evaluate Model

This will show us the results of our experiment and whether we managed to predict our outcome. Remember this is an experiment and we may well find out we’re missing something or we need to find out more about our data, which is all fine.

From the menu select Machine Learning > Evaluate > Evaluate Model and connect it to the output of the Score Model transform.

ML18

Step 12: Run our experiment!!

Now we can run our experiment by clicking the run button

ML10

How did we do? Right click on the output of the Evaluate Model and select Visualize to find out. Here you can see an output of how well the model was able to predict whether each flight would arrive within 15 minutes or not. Below are the results we’ve achieved.

ML19

If you pay attention to the bottom left graph you’ll see the mix of predictions and results that the algorithm achieved

ML20

True Positive: The number of records the algorithm predicted would be true (delayed) that were true (delayed).
False Positive: The number of records the algorithm predicted would be true (delayed) that were actually false (on time).
False Negative: The number of records the algorithm predicted would be false (on time) that were true (delayed).
True Negative: The number of records the algorithm predicted would be false (on time) that were false (on time).
Accuracy: The ratio of correctly predicted results, i.e. (True Positives + True Negatives) / the total number of records.
You can find out more information about Precision, Recall, and F1 Score here.
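
For reference, the standard definitions of these metrics, written in the same terms as the graph above (TP, FP, FN, TN), are:

Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1 Score  = 2 * (Precision * Recall) / (Precision + Recall)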

It's obvious that we need more information to be able to accurately predict whether a flight will be delayed or not, but this gives a great introduction to how to use Azure Machine Learning. We can now go on to augment our dataset with other predictor values; maybe weather or strike information might help. The possibilities are out there, and it's your job as a data scientist to find them! Good luck.


SSMS – Running scripts in SQLCMD Mode

Standard

According to MSDN, SQLCMD can be used to process Windows System commands and Transact-SQL statements in the same script.

One of the most useful things you can do is parameterise your script: the :setvar command defines a scripting variable that you can use to replace sections of your script, such as user names, wherever $(variable) appears. When the same script is run with the sqlcmd command-line utility, values can instead be passed in from the command line with the -v switch.

You can turn on SQLCMD Mode in SQL Server Management Studio by selecting SQLCMD Mode from the Query menu:

USE [BestDatabase]
GO

:setvar ENV INT

IF DATABASE_PRINCIPAL_ID('BestDomain\BD_Svc$(ENV)') IS NULL
BEGIN
	CREATE USER [BestDomain\BD_Svc$(ENV)] FROM LOGIN [BestDomain\BD_Svc$(ENV)];
END

GO

The above script sets the scripting variable ENV to the value INT, and every occurrence of $(ENV) is replaced with that value. As you can see, when executed, the script will add a database user for the login BestDomain\BD_SvcINT to the BestDatabase database.
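
You can also run the same script with the command-line sqlcmd utility and pass the value in with the -v switch. A sketch, assuming the script is saved as add_user.sql and you remove the :setvar line first (a :setvar inside the script would otherwise override the value passed on the command line); the server name and file name are placeholders:

REM MyServer and add_user.sql are placeholder names
sqlcmd -S MyServer -d BestDatabase -i add_user.sql -v ENV="INT"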

I'm sure you can imagine the rest of the possibilities!

Login failed for user ‘NT AUTHORITY\ANONYMOUS USER’

Standard

I needed to manually run a project deployment model SSIS package (on box A) that talks to a SQL Server instance on another machine (box B); the package runs a stored procedure and returns some results. Box A and box B are not linked servers, but the SQL Agent service account and the SQL Server service account from box A both have permission to execute the procedure on box B.

When I run the package manually through the SSIS project

Anonymous_User

it fails with the below error:

Anonymous_User2

What's going on? I've got permission to run the stored procedure on box B as well, but my user seems to be getting replaced with NT AUTHORITY\ANONYMOUS LOGON. The issue here is that this kind of double hop needs credential delegation to be turned on. You can do this either through a linked server or, as I did it, by creating a SQL Agent job to handle the delegation for me.

There's more learning to do here, as impersonation may be able to handle the delegation instead, but for the purposes of recording how an error message relates to a problem, I think this suffices for the moment and will allow further investigation.