我正在尝试使用列表过滤 pyspark 中的数据框。我想根据列表进行过滤或仅包含列表中具有值的那些记录。我下面的代码不起作用:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [10,18,20]
# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)
# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
给出以下错误:ValueError:无法将列转换为布尔值:请使用 '&' 表示 'and','|'构建 DataFrame 布尔表达式时,为 'or','~' 为 'not'。
它所说的是“df.score in l”无法评估,因为 df.score 为您提供了一列,并且未在该列类型上定义“in”使用“isin”
代码应该是这样的:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [10,18,20]
# filter out records by scores by list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)
# include only records with these scores in list l
df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
请注意 where()
is an alias for filter()
,因此两者可以互换。
根据@user3133475 的回答,也可以像这样从 F.col()
调用 isin()
函数:
import pyspark.sql.functions as F
l = [10,18,20]
df.filter(F.col("score").isin(l))
对于大型数据帧,我发现 join
实现比 where
快得多:
def filter_spark_dataframe_by_list(df, column_name, filter_list):
""" Returns subset of df where df[column_name] is in filter_list """
spark = SparkSession.builder.getOrCreate()
filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
return df.join(filter_df, df[column_name] == filter_df["value"])
不定期副业成功案例分享
l_bc = sc.broadcast(l)
后跟df.where(df.score.isin(l_bc.value))