Spark SQL filter equal values

When filtering a DataFrame in Scala you need to use a triple equals (===), which is Spark's Column equality operator, rather than Scala's ordinary ==.

So you must use:

ss.filter(ss("ProductKey") === 68325).show()

Because using

ss.filter(ss("ProductKey") == 68325).show()

will return the following error, because Scala's == compares the Column with 68325 and evaluates to a Boolean, and filter has no overload that accepts a Boolean:

<console>:25: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
              ss.filter(ss("ProductKey") == 68325).show()

Also

ss.filter(ss("ProductKey") = 68325).show()

will return the following error, because Scala treats the single = as an update call and DataFrame has no update method:

<console>:25: error: value update is not a member of org.apache.spark.sql.DataFrame
              ss.filter(ss("StopSaleOnPropertyKey") = 68325).show()

 

Spark SQL – DateDiff

Spark SQL does implement a DateDiff function; however, it is slightly different from SQL Server's version. Spark's datediff(endDate, startDate) takes the end date as its first argument and returns the number of days from startDate to endDate.

For instance, for StartDate = '2012-10-17' and EndDate = '2012-10-19', passing StartDate first (as you would with SQL Server's DATEDIFF) makes Spark SQL return -2, so you need to specify your EndDate first and your StartDate second:

SQL Server:

DECLARE @StartDate DATE = '20121017', @EndDate DATE = '20121019';

SELECT DATEDIFF(DAY,@StartDate, @EndDate) [DaysDifference];

Result:

DaysDifference
--------------
2

Spark SQL: 

%sql
select PropertyKey, StartDate, EndDate, datediff(to_date(EndDate), to_date(StartDate)) AS StopSaleDate
from stopsale
where to_date(StartDate) = "2012-10-17"

Result:

[Screenshot: SparkSQLDateDiff]

So first we need to cast our date strings to dates with to_date, and then we use datediff to get the difference, but make sure you put EndDate first followed by StartDate.
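
If you are working in Scala rather than a %sql cell, the same calculation is available through the DataFrame functions API. Here is a minimal sketch (the SparkSession setup and the stopsale DataFrame are hypothetical stand-ins for the table used above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, datediff, to_date}

val spark = SparkSession.builder().appName("DateDiffExample").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the stopsale table used above
val stopsale = Seq((1, "2012-10-17", "2012-10-19")).toDF("PropertyKey", "StartDate", "EndDate")

stopsale
  .withColumn("DaysDifference",
    datediff(to_date(col("EndDate")), to_date(col("StartDate")))) // EndDate first, StartDate second
  .show()
// datediff(end, start) returns 2 here; swapping the arguments would return -2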