
How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that?

PS: I want to check if it's empty so that I only save the DataFrame if it's not empty


zero323

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with Python equivalent:

len(df.head(1)) == 0  # or: not df.head(1)
len(df.take(1)) == 0  # or: not df.take(1)

Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

def first(): T = head()
def head(): T = head(1).head

head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.

take(n) is also equivalent to head(n)...

def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException exception when the DataFrame is empty.

df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty

I know this is an older question so hopefully it will help someone using a newer version of Spark.
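
For PySpark users who only want to save a non-empty DataFrame (as in the question), the head(1) check can be wrapped in a small guard. This is a minimal sketch, not an official API: is_empty, save_if_not_empty and FakeFrame are hypothetical names, and FakeFrame is only a stand-in that mimics DataFrame.head(n) so the snippet runs without a Spark session.

```python
class FakeFrame:
    """Hypothetical stand-in for a Spark DataFrame (demonstration only)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def head(self, n):
        # Mimics DataFrame.head(n): a list holding at most n rows.
        return self._rows[:n]


def is_empty(df):
    """Return True when df has no rows, fetching at most one row."""
    return len(df.head(1)) == 0


def save_if_not_empty(df, save):
    """Call save(df) only when df has rows; return whether it was called."""
    if is_empty(df):
        return False
    save(df)
    return True


if __name__ == "__main__":
    saved = []
    print(save_if_not_empty(FakeFrame([]), saved.append))          # False
    print(save_if_not_empty(FakeFrame([("a", 1)]), saved.append))  # True
```

With a real DataFrame, the save callback could be something like lambda d: d.write.parquet(path), where path is whatever output location you use.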


For those using pyspark: isEmpty is not a thing. Do len(df.head(1)) > 0 instead.
Why is this better than df.rdd.isEmpty?
df.head(1).isEmpty is taking a huge amount of time. Is there any other optimized solution for this?
Hey @Rakesh Sabbani, if df.head(1) is taking a large amount of time, it's probably because your df's execution plan is doing something complicated that prevents Spark from taking shortcuts. For example, if you are just reading from parquet files, df = spark.read.parquet(...), I'm pretty sure Spark will only read one file partition. But if your df is doing other things like aggregations, you may be inadvertently forcing Spark to read and process a large portion, if not all, of your source data.
Just reporting my experience of what to AVOID: I was using df.limit(1).count() naively. On big datasets it takes much more time than the examples reported by @hulin003, which are almost instantaneous.
zero323

I would say to just grab the underlying RDD. In Scala:

df.rdd.isEmpty

in Python:

df.rdd.isEmpty()

That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?


This is surprisingly slower than df.count() == 0 in my case
Isn't converting to rdd a heavy task?
Not really. RDD's still are the underpinning of everything Spark for the most part.
Don't convert the df to RDD. It slows down the process. If you convert, it will convert the whole DF to RDD and then check if it's empty. If the DF has millions of rows, the conversion to RDD alone takes a lot of time.
.rdd slows the process down a lot
vahlala

I had the same question, and I tested 3 main solutions:

1. (df != null) && (df.count > 0)
2. df.head(1).isEmpty() as @hulin003 suggests
3. df.rdd.isEmpty() as @Justin Pihony suggests

Of course all 3 work. However, in terms of performance, here is what I found when executing these methods on the same DF on my machine, in terms of execution time:

1. it takes ~9366ms
2. it takes ~5607ms
3. it takes ~1921ms

Therefore I think that the best solution is df.rdd.isEmpty(), as @Justin Pihony suggests.
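
Timings like these could be collected with a small harness. The sketch below is only an assumption about how one might measure, not how the numbers above were obtained; time_ms is a hypothetical helper built on the standard library.

```python
import time


def time_ms(fn, repeats=3):
    """Return the best wall-clock time of fn() in milliseconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


if __name__ == "__main__":
    # Stand-in workload; with PySpark you might pass lambda: df.rdd.isEmpty()
    print(time_ms(lambda: sum(range(100_000))))
```

Keep in mind that caching and the complexity of the DataFrame's plan can change the ranking between runs, which may explain why the comments report different results.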


out of curiosity... what size DataFrames was this tested with?
I've tested 10 million rows... and got the same time as for df.count() or df.rdd.isEmpty()
Beryllium

Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:

def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

Note that a DataFrame is no longer a class in Scala, it's just a type alias (probably changed with Spark 2.0):

type DataFrame = Dataset[Row]

isEmpty is slower than df.head(1).isEmpty
@Sandeep540 Really? Benchmark? Your proposal instantiates at least one row. The Spark implementation just transports a number. head() is using limit() as well; the groupBy() is not really doing anything, it is only required to get a RelationalGroupedDataset, which in turn provides count(). So that should not be significantly slower. It is probably faster in the case of a data set which contains a lot of columns (possibly denormalized nested data). Anyway, you have to type less :-)
Beware: I am using .option("mode", "DROPMALFORMED") and df.isEmpty returned false whereas df.head(1).isEmpty returned the correct result of true because... all of the rows were malformed (someone upstream changed the schema on me).
Aakil Fernandes

You can take advantage of the head() (or first()) functions to see if the DataFrame has at least one row. If so, it is not empty.


If the dataframe is empty, it throws "java.util.NoSuchElementException: next on empty iterator" ; [Spark 1.3.1]
Ram Ghadiyaram

If you do df.count > 0, Spark takes the counts of all partitions across all executors and adds them up at the driver. This takes a while when you are dealing with millions of rows.
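
The difference in work can be pictured with a toy model in plain Python. This is not Spark code and makes no claim about Spark's real scheduler; toy_count and toy_take_one are hypothetical names, and partitions are modeled as simple lists.

```python
def toy_count(partitions):
    """Toy model of count(): every row of every partition is visited."""
    rows_visited = 0
    for partition in partitions:
        for _ in partition:
            rows_visited += 1
    # The count and the number of rows visited are the same number.
    return rows_visited


def toy_take_one(partitions):
    """Toy model of take(1): stop as soon as one row is found.

    Returns (rows, partitions_opened).
    """
    partitions_opened = 0
    for partition in partitions:
        partitions_opened += 1
        for row in partition:
            return [row], partitions_opened
    return [], partitions_opened


if __name__ == "__main__":
    partitions = [[], [1, 2, 3], [4, 5], [6]]
    print(toy_count(partitions))     # 6 rows visited
    print(toy_take_one(partitions))  # ([1], 2): stopped in the second partition
```

Note that when there is no data at all, the take-one model still has to open every partition before it can answer, so the emptiness check is cheapest when rows exist.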

The best way to do this is to perform df.take(1) and check if it's null. This will throw java.util.NoSuchElementException, so it is better to put a try around df.take(1).

The dataframe returns an error when take(1) is done, instead of an empty row. I have highlighted the specific code lines where it throws the error.

https://i.stack.imgur.com/tBnaN.png


if you run this on a massive dataframe with millions of records that count method is going to take some time.
using df.take(1) when the df is empty results in getting back an empty ROW which cannot be compared with null
i'm using first() instead of take(1) in a try/catch block and it works
@LetsPlayYahtzee I have updated the answer with same run and picture that shows error. take(1) returns Array[Row]. And when Array doesn't have any values, by default it gives ArrayOutOfBounds. So I don't think it gives an empty Row. I would say to observe this and change the vote.
Abdennacer Lachiheb

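For Java users, you can use this on a Dataset:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public boolean isDatasetEmpty(Dataset<Row> ds) {
    boolean isEmpty;
    try {
        // head(1) fetches at most one row
        isEmpty = ((Row[]) ds.head(1)).length == 0;
    } catch (Exception e) {
        // e.g. ds is null: treat it as empty
        return true;
    }
    return isEmpty;
}

This checks all possible scenarios (empty, null).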


Shaido

In Scala you can use implicits to add the methods isEmpty() and nonEmpty() to the DataFrame API, which will make the code a bit nicer to read.

import org.apache.spark.sql.DataFrame

object DataFrameExtensions {
  implicit def extendedDataFrame(dataFrame: DataFrame): ExtendedDataFrame =
    new ExtendedDataFrame(dataFrame)

  class ExtendedDataFrame(dataFrame: DataFrame) {
    def isEmpty(): Boolean = dataFrame.head(1).isEmpty // Any implementation can be used
    def nonEmpty(): Boolean = !isEmpty()
  }
}

Here, other methods can be added as well. To use the implicit conversion, use import DataFrameExtensions._ in the file you want to use the extended functionality. Afterwards, the methods can be used directly as so:

val df: DataFrame = ...
if (df.isEmpty) {
  // Do something
}

Adelholzener

If you are using Pyspark, you could also do:

len(df.head(1)) > 0

Bose

On PySpark, you can also use bool(df.head(1)) to obtain a True or False value.

It returns False if the dataframe contains no rows


ZygD

PySpark 3.3.0+ / Spark 2.4.0+ (Scala)

df.isEmpty()
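
If the same code has to run on PySpark versions both before and after 3.3.0, one option is to use isEmpty() only when it is available. This is a sketch under that assumption: is_empty_compat, OldFrame and NewFrame are hypothetical names, and the two classes are stand-ins that let the snippet run without Spark.

```python
def is_empty_compat(df):
    """Use df.isEmpty() when the method exists, else fall back to head(1)."""
    native = getattr(df, "isEmpty", None)
    if callable(native):
        return bool(native())
    return len(df.head(1)) == 0


class OldFrame:
    """Hypothetical stand-in for a DataFrame without isEmpty()."""

    def __init__(self, rows):
        self._rows = list(rows)

    def head(self, n):
        # Mimics DataFrame.head(n): a list holding at most n rows.
        return self._rows[:n]


class NewFrame(OldFrame):
    """Hypothetical stand-in for a DataFrame that provides isEmpty()."""

    def isEmpty(self):
        return not self._rows


if __name__ == "__main__":
    print(is_empty_compat(OldFrame([])))     # True, via head(1)
    print(is_empty_compat(NewFrame([1, 2]))) # False, via isEmpty()
```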

'DataFrame' object has no attribute 'isEmpty'. Spark 3.0
In PySpark, it's introduced only from version 3.3.0
Shekhar Koirala

I found that in some cases:

>>> print(type(df))
<class 'pyspark.sql.dataframe.DataFrame'>

>>> df.take(1).isEmpty
'list' object has no attribute 'isEmpty'

The same happens with "length", or if you replace take() with head().

[Solution] For this issue we can use the following, which returns False when the DataFrame is empty:

>>> df.limit(1).count() > 0
False

Arya McCarthy
df1.take(1).length > 0

The take method returns the array of rows, so if the array size is equal to zero, there are no records in df.


Joy Jedidja Ndjama

Let's suppose we have the following empty dataframe:

df = spark.sql("show tables").limit(0)

If you are using Spark 2.1 with pyspark, you can check whether this dataframe has any rows with either of the following expressions (both evaluate to False for an empty dataframe):

df.count() > 0

Or

bool(df.head(1))

Stephen Rauch

You can do it like:

val df = sqlContext.emptyDataFrame
if( df.eq(sqlContext.emptyDataFrame) )
    println("empty df ")
else 
    println("normal df")

won't it require the schema of two dataframes (sqlContext.emptyDataFrame & df) to be same in order to ever return true?
This won't work. eq is inherited from AnyRef and tests whether the argument (that) is a reference to the receiver object (this).
Jordan Morris

dataframe.limit(1).count > 0

This also triggers a job, but since we are selecting a single record, the time consumption could be much lower even with billions of records.

From: https://medium.com/checking-emptiness-in-distributed-objects/count-vs-isempty-surprised-to-see-the-impact-fa70c0246ee0


All these are bad options taking almost equal time
@PushpendraJaiswal yes, and in a world of bad options, we should choose the best bad option