pyspark 数据框过滤器或包含基于列表

apache-spark filter pyspark apache-spark-sql

我正在尝试使用列表过滤 pyspark 中的数据框。我想根据列表进行过滤或仅包含列表中具有值的那些记录。我下面的代码不起作用：

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

给出以下错误：ValueError：无法将列转换为布尔值：请使用 '&' 表示 'and'，'|'构建 DataFrame 布尔表达式时，为 'or'，'~' 为 'not'。

Skippy le Grand Gourou

它所说的是“df.score in l”无法评估，因为 df.score 为您提供了一列，并且未在该列类型上定义“in”使用“isin”

代码应该是这样的：

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records by scores by list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records with these scores in list l
df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

请注意 where() is an alias for filter()，因此两者可以互换。

您将如何使用广播变量作为列表而不是常规 python 列表来做到这一点？当我尝试这样做时，我得到一个“广播”对象没有属性“_get_object_id”错误。

@flyingmeatball 我认为您可以使用 broadcast_variable_name.value 访问列表

如果您想使用广播，那么这是要走的路：l_bc = sc.broadcast(l) 后跟 df.where(df.score.isin(l_bc.value))

如果您尝试根据列值列表过滤数据框，这可能会有所帮助：stackoverflow.com/a/66228314/530399

Vzzarr

根据@user3133475 的回答，也可以像这样从 F.col() 调用 isin() 函数：

import pyspark.sql.functions as F


l = [10,18,20]
df.filter(F.col("score").isin(l))

bantmen

对于大型数据帧，我发现 join 实现比 where 快得多：

def filter_spark_dataframe_by_list(df, column_name, filter_list):
    """ Returns subset of df where df[column_name] is in filter_list """
    spark = SparkSession.builder.getOrCreate()
    filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
    return df.join(filter_df, df[column_name] == filter_df["value"])

pyspark 数据框过滤器或包含基于列表

关注公众号

想领先一步获取最新的外包任务吗？

相似问题

平台

支持

联系我们