
Sort in descending order in PySpark

I'm using PySpark (Python 2.7.9 / Spark 1.3.1) and have a DataFrame, GroupObject, which I need to filter and sort in descending order. I'm trying to achieve this with the following piece of code.

group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)

But it throws the following error.

sort() got an unexpected keyword argument 'ascending'

zero323

In PySpark 1.3 the sort method doesn't take an ascending parameter. You can use the desc method on a Column instead:

from pyspark.sql.functions import col

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(col("count").desc()))

or the desc function:

from pyspark.sql.functions import desc

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count")))

Both methods can be used with Spark >= 1.3 (including Spark 2.x).


Henrique Florencio

Use orderBy:

df.orderBy('column_name', ascending=False)

Complete answer:

group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)

http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html


Justin Lange

By far the most convenient way is this:

df.orderBy(df.column_name.desc())

It doesn't require any special imports.


Credit to Daniel Haviv a Solutions Architect at Databricks who showed me this way.
By far the best answer here.
This should be the accepted answer instead. Much simpler, and it doesn't rely on packages (which perhaps weren't available at the time).
I really like this answer, but it didn't work for me with count in Spark 3.0.0. I think that's because count is a method rather than a column: TypeError: Invalid argument, not a string or column: of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Narendra Maru

You can also use groupBy and orderBy, as follows:

from pyspark.sql.functions import desc

dataFrameWay = df.groupBy("firstName").count().withColumnRenamed("count", "distinct_name").sort(desc("count"))

Why are you first renaming the column and then using the old name for sorting? Renaming isn't even part of the question asked.
@Sheldore I am renaming the column as a performance optimization; when working with aggregation queries, it's difficult for Spark to maintain the metadata for the newly added column.
Prabhath Kota

In PySpark 2.4.4:

1) group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)

2) from pyspark.sql.functions import desc
   group_by_dataframe.count().filter("`count` >= 10").orderBy('count').sort(desc('count'))

Option 1) needs no import and is short and easy to read, so I prefer 1) over 2).


Why are you using both orderBy and sort in the same answer in 2)?
Aramis NSR

RDD.sortBy(keyfunc, ascending=True, numPartitions=None)

An example:

# Split lines into words, count each word, then sort by count descending.
words = rdd2.flatMap(lambda line: line.split(" "))
counter = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

print(counter.sortBy(lambda a: a[1], ascending=False).take(10))
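For comparison, the same word-count-then-sort-descending pipeline can be sketched in plain Python without Spark; the lines list below is a hypothetical stand-in for the contents of rdd2:

```python
from collections import Counter

# Hypothetical sample data standing in for rdd2's lines.
lines = ["spark makes big data simple", "big data big wins"]

# flatMap equivalent: split every line into words.
words = [w for line in lines for w in line.split(" ")]

# map + reduceByKey equivalent: count occurrences of each word.
counter = Counter(words)

# sortBy(..., ascending=False).take(10) equivalent: sort pairs by count,
# descending, and keep the first ten.
top10 = sorted(counter.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)
```

This is only an illustration of the sorting logic; the RDD version above is what distributes the work across a cluster.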