I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'.
I have tried:
import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)
but this throws a
TypeError: 'Column' object is not callable
How do I go about filtering my df properly? Many thanks in advance!
Spark 2.2 onwards
df.filter(df.location.contains('google.com'))
(see the Spark 2.2 documentation)
Spark 2.1 and before
You can use plain SQL in filter:
df.filter("location like '%google.com%'")
or DataFrame column methods:
df.filter(df.location.like('%google.com%'))
(see the Spark 2.1 documentation)
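As a minimal, self-contained sketch of both approaches (the sample URLs here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: two URLs, one containing 'google.com'
df = spark.createDataFrame(
    [('https://google.com/search?q=spark',), ('https://example.org/page',)],
    ['location'],
)

# Spark 2.2 onwards: Column.contains
df.filter(df.location.contains('google.com')).show(truncate=False)

# Spark 2.1 and before: SQL LIKE inside filter
df.filter("location like '%google.com%'").show(truncate=False)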
pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above:
df.where(df.location.contains('google.com'))
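Note that where is an alias for filter, so df.filter(df.location.contains('google.com')) behaves identically.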
When filtering a DataFrame with string values, I find that the pyspark.sql.functions lower and upper come in handy if your data could have column entries like "foo" and "Foo":
import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
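For instance, with a hypothetical source_df holding both cased variants, a quick sketch (reusing the spark session and sql_fun import from above):

source_df = spark.createDataFrame(
    [('Foo bar',), ('foo baz',), ('qux',)],
    ['col_name'],
)

# lower() normalizes the case before the substring check,
# so both 'Foo bar' and 'foo baz' survive the filter
result = source_df.filter(sql_fun.lower(source_df.col_name).contains('foo'))
result.show()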
What if we want to keep rows whose location contains both google.com and amazon.com, using like? How can we do so?
You can combine the conditions in one SQL expression, df.filter("location like '%google.com%' AND location like '%amazon.com%'"), or chain DataFrame filters: df.filter("location like '%google.com%'").filter("location like '%amazon.com%'")
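The same AND can also be written with Column boolean operators; a sketch, assuming the same location column as above:

# & is element-wise boolean AND on Columns; each contains() already
# returns a boolean Column, so the two conditions combine directly
both = df.filter(
    df.location.contains('google.com') & df.location.contains('amazon.com')
)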
To match any of several substrings at once, you can use rlike with a regex alternation:
import pyspark.sql.functions as F
df.filter(F.col("yourcol").rlike('|'.join(substrings)))
where substrings is a list of substrings, e.g. substrings = ['google.com', 'amazon.com']
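One caveat: rlike treats each substring as a regular expression, so the unescaped dots in 'google.com' will match any character. A sketch that escapes them for literal matching:

import re
import pyspark.sql.functions as F

substrings = ['google.com', 'amazon.com']

# re.escape turns each substring into a literal pattern before joining with '|'
pattern = '|'.join(re.escape(s) for s in substrings)
df.filter(F.col('yourcol').rlike(pattern))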