I would like to rewrite this from R to PySpark, any nice-looking suggestions?
array <- c(1, 2, 3)
dataset <- dplyr::filter(dataset, !(column %in% array))
In PySpark you can do it like this:
array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)
Or using the bitwise NOT operator ~:
dataframe.filter(~dataframe.column.isin(array))
Use the ~ operator, which negates the condition:
df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
Why use == False when we have ~ specifically for negation?
The * in isin(*array) was useless.
df_result = df[df.column_name.isin([1, 2, 3]) == False]
Slightly different syntax, using a set of dates:
toGetDates={'2017-11-09', '2017-11-11', '2017-11-12'}
df= df.filter(df['DATE'].isin(toGetDates) == False)
The * is not needed, so:
values = [1, 2, 3]  # avoid naming this "list", which shadows the built-in
dataframe.filter(~dataframe.column.isin(values))
You can also use .subtract().
Example:
from pyspark.sql.functions import col
df1 = df.filter(col("column").isin([1, 2, 3]))  # rows to exclude
df2 = df.subtract(df1)
This way, df2 contains every row of df that is not in df1.
You can also loop over the array and filter, though this builds one filter per element:
array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)
What is the * in *array?
dataframe.column is case sensitive! Alternatively, you can use the dictionary syntax dataframe[column], which is not :)
What the == operator is doing here is calling the overloaded __eq__ method on the Column result returned by dataframe.column.isin(*array). That's overloaded to return another Column result to test for equality with the other argument (in this case, False). The is operator tests for object identity, that is, whether the objects are actually the same place in memory. If you used is here, it would always fail, because the constant False doesn't ever live in the same memory location as a Column. Additionally, you can't overload is.
I used .isin(array) and it works just fine.
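The __eq__-overloading point is easy to demonstrate without Spark at all. Here's a toy class (my own invention, not Spark's Column) whose == returns a new object instead of a bool, the same pattern Column.__eq__ uses:

```python
class Expr:
    """Toy stand-in for a Column: == builds a new expression, not a bool."""
    def __init__(self, desc):
        self.desc = desc

    def __eq__(self, other):
        # Overloaded: returns a new Expr describing the comparison.
        return Expr(f"({self.desc} == {other!r})")

e = Expr("col.isin([1, 2, 3])")

# == goes through the overloaded __eq__ and yields another Expr...
result = e == False
print(type(result).__name__, result.desc)  # Expr (col.isin([1, 2, 3]) == False)

# ...while `is` compares object identity and cannot be overloaded.
print(e is False)  # False: an Expr object is never the constant False
```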