Spark SQL – Filter equal values


When filtering a DataFrame in Scala you need to use a triple = (i.e. ===), because === builds a Column expression, which is what filter expects (a complete runnable sketch appears at the end of this section).

So you must use:

ss.filter(ss("ProductKey") === 68325).show()

Because using

ss.filter(ss("ProductKey") == 68325).show()

will return the following error, since Scala's standard == compares the Column object itself with 68325 and yields a Boolean, which neither filter overload accepts:

<console>:25: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
              ss.filter(ss("ProductKey") == 68325).show()


ss.filter(ss("ProductKey") = 68325).show()

will return the following error as you cannot update a DataFrame

<console>:25: error: value update is not a member of org.apache.spark.sql.DataFrame
              ss.filter(ss("StopSaleOnPropertyKey") = 68325).show()



Spark SQL – DateDiff


Spark SQL does implement a DateDiff function, however its argument order is the reverse of SQL Server's: SQL Server's DATEDIFF(DAY, StartDate, EndDate) takes the start date first, whereas Spark SQL's datediff(endDate, startDate) takes the end date first and returns the number of days from StartDate to EndDate.

For instance, for StartDate = '2012-10-17' and EndDate = '2012-10-19', passing the dates in SQL Server order makes Spark SQL return -2 instead of 2, so you need to specify your EndDate first and your StartDate second:

SQL Server:

DECLARE @StartDate DATE = '20121017', @EndDate DATE = '20121019';

SELECT DATEDIFF(DAY,@StartDate, @EndDate) [DaysDifference];



Spark SQL: 

select PropertyKey, StartDate, EndDate, datediff(to_date(EndDate), to_date(StartDate)) AS DaysDifference
from stopsale
where to_date(StartDate) = '2012-10-17'



So first we need to cast our date strings to dates with to_date, and then we use datediff to get the difference, making sure to put EndDate first followed by StartDate.
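
To make the argument order concrete, here is a minimal Scala sketch (the SparkSession setup and sample dates are assumptions for illustration; datediff and to_date come from org.apache.spark.sql.functions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{datediff, to_date}

val spark = SparkSession.builder().appName("datediff-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Sample dates match the example above
val df = Seq(("2012-10-17", "2012-10-19")).toDF("StartDate", "EndDate")

df.select(
  datediff(to_date($"EndDate"), to_date($"StartDate")).as("EndFirst"),   // returns 2
  datediff(to_date($"StartDate"), to_date($"EndDate")).as("StartFirst")  // returns -2
).show()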