Spark SQL – DateDiff


Spark SQL does implement a DateDiff function, however it appears to be slightly different from SQL Server’s version. The SparkSQL version returns the number of days before the EndDate the StartDate is.

For instance for StartDate = ‘2012-10-17’ and EndDate = ‘2012-10-19’ SparkSQL will return -2, so you need to specify your EndDate first and then your StartDate second:

SQL Server:

DECLARE @StartDate DATE = '20121017', @EndDate DATE = '20121019';

SELECT DATEDIFF(DAY,@StartDate, @EndDate) [DaysDifference];



Spark SQL: 

select PropertyKey, StartDate, EndDate ,datediff(to_date(EndDate), to_date(StartDate)) AS StopSaleDate
from stopsale
where to_date(StartDate) = "2012-10-17"



So firstly we need to cast our date string to String and then we use datediff to get the difference, but make sure you put EndDate first followed by StartDate.

Spark SQL working with Dates


Looking around I found it pretty difficult when working with dates in Spark SQL as there is no way to map a date field in a .csv straight to a date field in DataFrame. However once you’ve converted your DataFrame to an RDD (Resilient Distributed Dataset) you can select from the RDD using the to_date function.

I’ve set up a new SSIS package that uploads a number of text files into a Storage Container in Microsoft Azure using Azure Tools for Visual Studio

For instance if my CSVs look like this:

ProductId, QuantitySold, Date

You would map to to a DataFrame as per the below Scala:

val productSaleText = sc.textFile("wasb://<StorageContainerName>@<StorageAccountName><FolderLocation>/<Filenname (optional)>")

case class ProductSale(ProductId: Integer, QuantitySold: Integer, Date: String)

val productsale = => s.split(",")).map(
                s => ProductSale(s(0).toInt,


Please note that when loading the prodcutSaleText variable, you can simply point to the directory that holds all of your text files and they will all be loaded in at the same, with the inference that they all have the same schema, and since I’m uploading the files myself I already know their structure.

Which would allow you to select from your newly created RDD like this (SparkSQL):

select ProductId, COUNT(*) AS ProductSaleCount
from productsale
where to_date(Date) >= "2015-12-17"
and to_date(Date) <= "2015-12-31"
group by ProductId

So not that difficult after all.