
Filter Pyspark dataframe column with None value

I'm trying to filter a PySpark dataframe that has None as a row value:

df.select('dt_mvmt').distinct().collect()

[Row(dt_mvmt=u'2016-03-27'),
 Row(dt_mvmt=u'2016-03-28'),
 Row(dt_mvmt=u'2016-03-29'),
 Row(dt_mvmt=None),
 Row(dt_mvmt=u'2016-03-30'),
 Row(dt_mvmt=u'2016-03-31')]

and I can filter correctly with a string value:

df[df.dt_mvmt == '2016-03-31']
# some results here

but this fails:

df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0

But there are definitely values in each category. What's going on?

You actually want to filter rows with null values, not a column with None values. The title could be misleading.
In a nutshell, a comparison involving null (or None, in this case) never evaluates to true: under SQL semantics it evaluates to NULL (unknown), which a filter treats as false. In particular, the comparison (null = null) evaluates to NULL, not true, so both == None and != None keep nothing.
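For illustration, a minimal sketch of that behaviour, assuming the question's df and an active Spark session:

from pyspark.sql.functions import col

# The comparison itself evaluates to NULL for every row, so a filter keeps nothing in either direction
df.select((col("dt_mvmt") == None).alias("eq_none")).show()

df.filter(col("dt_mvmt") == None).count()    # 0 -- a NULL condition is never treated as true
df.filter(col("dt_mvmt").isNull()).count()   # > 0 -- counts the rows that are actually null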

zero323

You can use Column.isNull / Column.isNotNull:

from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())

df.where(col("dt_mvmt").isNotNull())

If you simply want to drop NULL values, you can use na.drop with the subset argument:

df.na.drop(subset=["dt_mvmt"])

Equality-based comparisons with NULL won't work because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## |         null|
## +-------------+


sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## |               null|
## +-------------------+

The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
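As a quick sketch of that equivalence (assuming the question's df and from pyspark.sql.functions import col):

df.filter("dt_mvmt IS NULL").count()           # SQL expression string
df.filter(col("dt_mvmt").isNull()).count()     # same result via the Column method

df.filter("dt_mvmt IS NOT NULL").count()       # SQL expression string
df.filter(col("dt_mvmt").isNotNull()).count()  # same result via the Column method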


Awesome, thanks. I thought that these filters on PySpark dataframes would be more "pythonic", but alas, they're not. I'm thinking of asking the devs about this.
Actually it is quite Pythonic. You should never check __eq__ with None ;) And the is operator wouldn't work because it doesn't behave the same way.
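A short sketch of why is can't be used here: df.dt_mvmt is a Column object, so an is check runs as a plain Python identity test on the driver, not as a row-wise condition.

df.dt_mvmt is None      # False -- identity check on the Column object itself, says nothing about the data
df.dt_mvmt.isNull()     # a Column expression that Spark evaluates for each row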
Strangely this only works for string columns... It seems like df.filter("dt_mvmt is not NULL") handles both.
Anthony

Try just using the isNotNull function.

df.filter(df.dt_mvmt.isNotNull()).count()

timctran

To obtain entries whose values in the dt_mvmt column are not null we have

df.filter("dt_mvmt is not NULL")

and for entries which are null we have

df.filter("dt_mvmt is NULL")

Neeraj Bhadani

There are multiple ways you can remove/filter the null values from a column in DataFrame.

Let's create a simple DataFrame with the code below:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

date = ['2016-03-27','2016-03-28','2016-03-29', None, '2016-03-30','2016-03-31']
df = spark.createDataFrame(date, StringType())

Now you can try one of the approaches below to filter out the null values.

# Approach - 1
df.filter("value is not null").show()

# Approach - 2
df.filter(col("value").isNotNull()).show()

# Approach - 3
df.filter(df["value"].isNotNull()).show()

# Approach - 4
df.filter(df.value.isNotNull()).show()

# Approach - 5
df.na.drop(subset=["value"]).show()

# Approach - 6
df.dropna(subset=["value"]).show()

# Note: You can also use the where function instead of filter, as shown below.
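For example, a sketch of the where variant mentioned in the note (where is an alias of filter, so the same conditions work unchanged):

df.where("value is not null").show()
df.where(col("value").isNotNull()).show()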

You can also check the section "Working with NULL Values" on my blog for more information.

I hope it helps.


Purushothaman Srikanth

isNull() / isNotNull() will return the rows in which dt_mvmt is null or not null, respectively.

method_1 = df.filter(df['dt_mvmt'].isNotNull()).count()
method_2 = df.filter(df.dt_mvmt.isNotNull()).count()

Both will return the same result.


Suraj Rao

If the column contains None:

COLUMN_OLD_VALUE
----------------
None
1
None
100
20
------------------

Create a temp table on the data frame, then query it:

sqlContext.sql("select * from tempTable where column_old_value='None' ").show()

So use: column_old_value = 'None'
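A minimal sketch of that flow, assuming the nulls in the source data were loaded as the literal string 'None' (if the column holds real NULLs instead, use IS NULL):

df.registerTempTable("tempTable")
sqlContext.sql("select * from tempTable where column_old_value = 'None'").show()

# If the column actually contains NULLs rather than the string 'None':
sqlContext.sql("select * from tempTable where column_old_value is null").show()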


Paul Roub

If you want to stick with the Pandas syntax, this worked for me.

df = df[df.dt_mvmt.isNotNull()]

Purushothaman Srikanth

None/Null is an object of the class NoneType in PySpark/Python, so the comparisons below will not work, as you are trying to compare a NoneType object with a string object.

Wrong way of filtering:

df[df.dt_mvmt == None].count()

0

df[df.dt_mvmt != None].count()

0

Correct way:

df = df.where(col("dt_mvmt").isNotNull())

This returns all records where dt_mvmt is not None/Null.


Swaminathan Meenakshisundaram

PySpark provides various filtering options based on arithmetic, logical and other conditions. The presence of NULL values can hamper further processing, so removing them or statistically imputing them are both options.

The following code can be considered:

# Dataset is df
# Column name is dt_mvmt
# Before filtering make sure you have the right count of the dataset
df.count() # Some number

# Filter here
df = df.filter(df.dt_mvmt.isNotNull())

# Re-check the count to confirm NULL rows were removed (important when dealing with a large dataset)
df.count() # Count should be reduced if NULL values were present
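Alternatively, if statistically imputing the NULLs (as mentioned above) is preferred over removing them, a minimal sketch using na.fill on the original df (before the filter step above); the placeholder date is an arbitrary choice for illustration:

# Replace NULLs in dt_mvmt with a placeholder instead of dropping the rows
df_filled = df.na.fill({"dt_mvmt": "1900-01-01"})
df_filled.count()  # same count as before; no rows are lost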

AS Mackay

If you want to filter out records having a None value in a column, then see the example below:

df=spark.createDataFrame([[123,"abc"],[234,"fre"],[345,None]],["a","b"])

Now filter out null value records:

df=df.filter(df.b.isNotNull())

df.show()

If you want to remove those records from the DF, then see below:

df1=df.na.drop(subset=['b'])

df1.show()