How to replace all Null values of a dataframe in Pyspark

I have a DataFrame in PySpark with more than 300 columns. Some of these columns contain null values.

For example:

Column_1 column_2
null     null
null     null
234      null
125      124
365      187
and so on

When I want to do a sum of column_1 I am getting a Null as a result, instead of 724.

Now I want to replace the nulls in all columns of the DataFrame with empty space, so that when I sum these columns I get a numerical value instead of null.

How can we achieve that in PySpark?


Mariusz

You can use df.na.fill to replace nulls with zeros, for example:

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df.na.fill(0).show()
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  0|
+---+

Dugini Vijay

You can use the fillna() function.

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df = df.fillna({'col':'4'})
>>> df.show()

or df.fillna({'col':'4'}).show()

+---+
|col|
+---+
|  1|
|  2|
|  3|
|  4|
+---+

I prefer this function because you can specify which columns to fill, and because you can assign the result to the same or another DataFrame.
Danny Varod

With fillna there are three options.

Documentation:

def fillna(self, value, subset=None):
    """Replace null values, alias for ``na.fill()``.
    :func:`DataFrame.fillna` and :func:`DataFrameNaFunctions.fill` are aliases of each other.

    :param value: int, long, float, string, bool or dict.
        Value to replace null values with.
        If the value is a dict, then `subset` is ignored and `value` must be a mapping
        from column name (string) to replacement value. The replacement value must be
        an int, long, float, boolean, or string.
    :param subset: optional list of column names to consider.
        Columns specified in subset that do not have matching data type are ignored.
        For example, if `value` is a string, and subset contains a non-string column,
        then the non-string column is simply ignored.

So you can:

Fill all columns with the same value: df.fillna(value)
Pass a dictionary of column --> value: df.fillna(dict_of_col_to_value)
Pass a list of columns to fill with the same value: df.fillna(value, subset=list_of_cols)

fillna() is an alias for na.fill() so they are the same.