Filter a DataFrame when a value matches part of a string in PySpark

python apache-spark pyspark apache-spark-sql

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a predetermined string, e.g. 'google.com'.

I tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

But this raises

TypeError: 'Column' object is not callable

How do I filter my df properly? Many thanks in advance!

mrsrinivas

Spark 2.2 and later

df.filter(df.location.contains('google.com'))

(Spark 2.2 documentation link)

Spark 2.1 and earlier

You can use plain SQL in filter

df.filter("location like '%google.com%'")

or the DataFrame column methods

df.filter(df.location.like('%google.com%'))

(Spark 2.1 documentation link)

Hi Srinivas, what if we have to check for two words, say google.com and amazon.com, using like? How could we do that?

@cph_bon: There are many ways to do it. SQL: df.filter("location like '%google.com%' AND location like '%amazon.com%'") or DataFrame: df.filter("location like '%google.com%'").filter("location like '%amazon.com%'")
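Note that both forms in the reply above keep only rows whose location contains both substrings. To keep rows that match either one, the conditions would be joined with OR instead. A minimal sketch, assuming a hypothetical helper like_any_expr (not part of PySpark) and substrings that contain no single quotes or LIKE wildcards (% or _):

```python
def like_any_expr(column, substrings):
    """Build a SQL expression that is true when `column` contains any substring.

    Hypothetical helper, not a PySpark API. Assumes the substrings contain
    no single quotes and no LIKE wildcards (% or _).
    """
    if not substrings:
        raise ValueError("at least one substring is required")
    return " OR ".join(
        "{} like '%{}%'".format(column, s) for s in substrings
    )


expr = like_any_expr("location", ["google.com", "amazon.com"])
print(expr)

# With a DataFrame `df` that has a string column `location` (assumed):
# df.filter(expr)
```

Replacing " OR " with " AND " gives the same behaviour as the chained filters in the reply.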

@mrsrinivas, what if we want to search for like 'ID' in all columns? For example, the DataFrame has columns that all contain the word 'ID'.

For multiple substrings, use rlike with a join like so: df.filter(F.col("yourcol").rlike('|'.join(substrings))) where substrings is a list of substrings, e.g. substrings = ['google.com', 'amazon.com']
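One caveat with the rlike approach: the pattern is a regular expression, so the '.' in 'google.com' matches any character and would also match, say, 'googleXcom'. A sketch that escapes each substring first. It assumes the escapes produced by Python's re.escape are also valid in the Java regex syntax that rlike uses, which holds for simple cases such as '.', but treat it as an assumption for unusual characters. build_any_pattern and filter_any are hypothetical names:

```python
import re


def build_any_pattern(substrings):
    """Join literal substrings into one alternation regex, escaping metacharacters."""
    return "|".join(re.escape(s) for s in substrings)


def filter_any(df, column, substrings):
    """Keep rows of `df` where `column` contains at least one of `substrings`."""
    # Imported here so that build_any_pattern can be used without Spark installed.
    from pyspark.sql import functions as F

    return df.filter(F.col(column).rlike(build_any_pattern(substrings)))


pattern = build_any_pattern(["google.com", "amazon.com"])
print(pattern)
```

With a DataFrame df that has a string column location (both assumed), the call would be filter_any(df, "location", ["google.com", "amazon.com"]).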

pault

pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.

df.where(df.location.contains('google.com'))

How can I pass multiple strings to .contains()?
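contains() takes a single value, so one option is to build one condition per substring and combine them with | (OR). A sketch, where any_of and contains_any are hypothetical helper names and the DataFrame and column are assumed:

```python
from functools import reduce
import operator


def any_of(conditions):
    """OR together a non-empty sequence of conditions using the | operator."""
    return reduce(operator.or_, conditions)


def contains_any(df, column, substrings):
    """Keep rows of `df` where `column` contains at least one of `substrings`."""
    # Imported here so that any_of can be tried without Spark installed.
    from pyspark.sql import functions as F

    return df.filter(any_of([F.col(column).contains(s) for s in substrings]))
```

Usage would look like contains_any(df, "location", ["google.com", "amazon.com"]). Swapping operator.or_ for operator.and_ would instead require every substring to be present.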

caffreyd

When filtering a DataFrame with string values, I find that the pyspark.sql.functions lower and upper come in handy if your data could have column entries like "foo" and "Foo":

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
