Spark SQL – DateDiff

Standard

Spark SQL does implement a DateDiff function, however it appears to be slightly different from SQL Server’s version. The SparkSQL version returns the number of days before the EndDate the StartDate is.

For instance for StartDate = ‘2012-10-17’ and EndDate = ‘2012-10-19’ SparkSQL will return -2, so you need to specify your EndDate first and then your StartDate second:

SQL Server:

DECLARE @StartDate DATE = '20121017', @EndDate DATE = '20121019';

SELECT DATEDIFF(DAY,@StartDate, @EndDate) [DaysDifference];

Result:

DaysDifference
————–
2

Spark SQL: 

%sql
select PropertyKey, StartDate, EndDate ,datediff(to_date(EndDate), to_date(StartDate)) AS StopSaleDate
from stopsale
where to_date(StartDate) = "2012-10-17"

Result:

SparkSQLDateDiff

So firstly we need to cast our date string to String and then we use datediff to get the difference, but make sure you put EndDate first followed by StartDate.

Spark SQL working with Dates

Standard

Looking around I found it pretty difficult when working with dates in Spark SQL as there is no way to map a date field in a .csv straight to a date field in DataFrame. However once you’ve converted your DataFrame to an RDD (Resilient Distributed Dataset) you can select from the RDD using the to_date function.

I’ve set up a new SSIS package that uploads a number of text files into a Storage Container in Microsoft Azure using Azure Tools for Visual Studio

For instance if my CSVs look like this:

ProductId, QuantitySold, Date
1,100,2015-12-31
2,120,2015-12-31
3,199,2015-12-31

You would map to to a DataFrame as per the below Scala:

val productSaleText = sc.textFile("wasb://<StorageContainerName>@<StorageAccountName>.blob.core.windows.net/<FolderLocation>/<Filenname (optional)>")

case class ProductSale(ProductId: Integer, QuantitySold: Integer, Date: String)

val productsale = ProductSaleText.map(s => s.split(",")).map(
                s => ProductSale(s(0).toInt,
                    s(1).toInt,
                    s(2)
                    )
        ).toDF()

stopsale.registerTempTable("productsale")

Please note that when loading the prodcutSaleText variable, you can simply point to the directory that holds all of your text files and they will all be loaded in at the same, with the inference that they all have the same schema, and since I’m uploading the files myself I already know their structure.

Which would allow you to select from your newly created RDD like this (SparkSQL):

%sql
select ProductId, COUNT(*) AS ProductSaleCount
from productsale
where to_date(Date) >= "2015-12-17"
and to_date(Date) <= "2015-12-31"
group by ProductId

So not that difficult after all.

SSIS Script Task Won’t Debug

Standard

You’ve gone into the editor and added your breakpoints to your C# but when you run your package the script task won’t debug, what’s going on?

There are a couple of causes of this we’ve found.

Number 1: You have two or more script tasks and two or more of these have break points in them. Only set break points in one script task at a time.

Number 2: You need to set the Run64BitRuntime property of the package to false. Right-click on your project and select ‘Properties’, then go to the Debugging tab and change the below record to FalseSSIS_Debug

 

SSIS moving objects on canvas craziness

Standard

Ever had problems in SSIS when moving objects about where the object you’re using will zoom off the designer and almost into a parallel universe? I’m sure we’ve all been there and the only solution I’ve seen was to hold down Ctrl whilst using the arrow keys to put your objects where you want. Well there’s a better way..

Step 1: Click the ‘Fit view to window’ button:

SSISZoom

Step 2: Move your objects to the shape you want, if you have one outlier that is causing the problem, round it up and bring it back closer to the other objects:

SSISZoom2

Step 3: Select ALL of the objects on the canvas and press Ctrl + X or right-click and select ‘Cut’

Step 4: Press Ctrl+V or right-click and select ‘Paste’ to paste all of your object back onto the canvas, but this time back to where they belong.

SSISZoom3

Publishing To Tableau Server with PowerShell

Standard

Hi All!

Time for a useful little piece of PowerShell code for publishing your Tableau workbooks or data sets to a server. Tabcmd won’t connect to Tableau in PowerShell unless you provide a password in the script, so I’ve added in a little script courtesy of Jamie Thompson (which you can find here). This script will allow you to specify the folder containing the files you want to upload to your server, along with the site name and project and then it will publish them for you. You can change the -Include section to uploade other kinds of files to your Tableau server.

$username = Read-host "Please Enter Username: "
$response = Read-host "Please Enter Password: " -AsSecureString
$password = [Runtime.InteropServices.Marshal]::PtrToStringAuto([Runtime.InteropServices.Marshal]::SecureStringToBSTR($response))

$ReleaseFolder = "C:\Users\jim\Release"
$SiteName = "Samples"
$ProjectName = "default"
$publishFile = ""

ForEach($item in (Get-ChildItem -Path ($ReleaseFolder + "\*") -File -Include *.tdsx))
{

   $publishFile = """$item"""
   
   tabcmd publish $publishFile -t $SiteName -r $ProjectName -u $username -p $password
}

I hope you find it as useful as I have!