
Pyspark dataframe operator "IS NOT IN"

I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1,2,3)
dataset <- filter(dataset, !(column %in% array))

Ryan Widmaier

In pyspark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(*array) == False)

Or using the binary NOT operator:

dataframe.filter(~dataframe.column.isin(*array))

What is the job of the * in *array?
*variable is Python syntax for unpacking an iterable so that its elements are passed to the function as separate positional arguments, in order.
Note that dataframe.column is case sensitive! Alternatively, you can use the dictionary syntax dataframe["column"], which is not :)
@rjurney No. What the == operator is doing here is calling the overloaded __eq__ method on the Column result returned by dataframe.column.isin(*array). That's overloaded to return another column result to test for equality with the other argument (in this case, False). The is operator tests for object identity, that is, if the objects are actually the same place in memory. If you use is here, it would always fail because the constant False doesn't ever live in the same memory location as a Column. Additionally, you can't overload is.
List splatting with * does not make any difference here. You can just use isin(array) and it works just fine.
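To make the comments above concrete, here is a minimal, self-contained sketch (the sample data and the column name "column" are made up): isin accepts either a list or unpacked values, and == False works because Column overloads __eq__ to build another Column expression.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (5,)], ["column"])
array = [1, 2, 3]

# All three filters keep only the row with value 5:
df.filter(df["column"].isin(array) == False).show()   # == False builds a Column via __eq__
df.filter(df["column"].isin(*array) == False).show()  # *array unpacks the list; same result
df.filter(~df["column"].isin(array)).show()           # ~ negates via __invert__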
LaSul

Use the ~ operator, which means negation:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))

Everyone here, shouldn't this be the accepted answer? Why use the not-so-evident-to-understand == False when we have ~ specifically for negation?
Also, the * was unnecessary.
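For illustration, a minimal runnable sketch (the sample data and the column name column_name are made up) of the ~ form end to end:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (5,), (7,)], ["column_name"])

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
df_filtered.show()  # only the rows with values 5 and 7 remain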
approxiblue
df_result = df[df.column_name.isin([1, 2, 3]) == False]

Grant Shannon

Slightly different syntax, and a "date" data set:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)
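For comparison, a hedged sketch (the sample rows are made up) of the same exclusion written with ~ and the set of dates:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2017-11-09",), ("2017-11-10",), ("2017-11-12",)], ["DATE"]
)

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(~df['DATE'].isin(toGetDates))  # only the 2017-11-10 row remains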

Johnny M

* is not needed. So:

values = [1, 2, 3]  # avoid shadowing the built-in name list
dataframe.filter(~dataframe.column.isin(values))

user2321864

You can also use .subtract().

Example:

df1 = df.filter(df["column"].isin([1, 2, 3]))  # the rows you want to exclude
df2 = df.subtract(df1)

This way, df2 contains every row of df that is not in df1.
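A self-contained sketch of this approach applied to the original question (the sample data and the column name "column" are made up); note that subtract also removes duplicate rows, like SQL's EXCEPT DISTINCT:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (5,)], ["column"])

df1 = df.filter(df["column"].isin([1, 2, 3]))  # rows to exclude
df2 = df.subtract(df1)
df2.show()  # only the row with value 5 remains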


Shadowtrooper

You can also loop over the array and filter:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)

I wouldn't recommend this in Big Data applications... it means you need to go through the whole dataset three times, which is huge if you imagine you have a few terabytes to process.
No, because Spark internally optimizes this into a single filter pass.
Then it should be OK... until a breaking Spark update or a framework switch. Three lines instead of one, plus a hidden optimization, still doesn't seem like a good pattern to me. No offense, but I would still recommend avoiding it.
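A quick way to check that claim is to look at the physical plan (a minimal sketch; the sample data and column name are made up). Spark's Catalyst optimizer normally collapses the chained filters into a single Filter node:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (5,)], ["column"])

for i in [1, 2, 3]:
    df = df.filter(df["column"] != i)

df.explain()
# The plan typically shows one combined Filter such as:
# Filter (isnotnull(column#0) AND (NOT (column#0 = 1) AND (NOT (column#0 = 2) AND NOT (column#0 = 3))))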