
Filter df when values match part of a string in PySpark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e., filter for) all rows where the URL stored in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a

TypeError: 'Column' object is not callable

How do I go about filtering my df properly? Many thanks in advance!


mrsrinivas

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))

Spark 2.2 documentation link

Spark 2.1 and before

You can use plain SQL in filter:

df.filter("location like '%google.com%'")

or the DataFrame column methods:

df.filter(df.location.like('%google.com%'))

Spark 2.1 documentation link
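For reference, a minimal self-contained sketch of both variants (the local SparkSession and the two-row sample DataFrame are hypothetical, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# hypothetical sample data with a 'location' column
df = spark.createDataFrame(
    [("https://www.google.com/search?q=spark",),
     ("https://www.amazon.com/dp/B000000",)],
    ["location"],
)

# Spark 2.2+: Column.contains
df.filter(df.location.contains('google.com')).show()

# any version: SQL LIKE in a string expression, or Column.like
df.filter("location like '%google.com%'").show()
df.filter(df.location.like('%google.com%')).show()

All three filters keep only the google.com row.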


Hi Srinivas, what if we had to check for two words, let's say google.com and amazon.com using like? How can we do so?
@cph_bon: There are many ways to do it. SQL df.filter("location like '%google.com%' AND location like '%amazon.com%'") or DataFrame df.filter("location like '%google.com%'").filter("location like '%amazon.com%'")
@mrsrinivas, what if we want to search "like 'ID'" in all columns? For instance, the dataframe should have all columns that include the word "ID".
For multiple substrings use rlike with a join like so: df.filter(F.col("yourcol").rlike('|'.join(substrings))) where substrings is a list of substrings like substrings = ['google.com','amazon.com']
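A short sketch of that rlike approach, reusing the hypothetical df from the sketch above and the substrings list from the comment:

import pyspark.sql.functions as F

substrings = ['google.com', 'amazon.com']

# rlike treats its argument as a regular expression, so joining the
# substrings with '|' keeps rows that match ANY of them.
# Caveat: '.' is a regex wildcard; escape the substrings (e.g. with re.escape)
# if literal matching matters.
df.filter(F.col('location').rlike('|'.join(substrings))).show()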
pault

pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.

df.where(df.location.contains('google.com'))
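As a quick sketch (the sample DataFrame is made up, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("https://www.google.com/maps",), ("https://example.org",)],
    ["location"],
)

# DataFrame.where is an alias for DataFrame.filter, so this is equivalent
# to df.filter(df.location.contains('google.com'))
df.where(df.location.contains('google.com')).show()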

How to give more than one string in .contains()?
caffreyd

When filtering a DataFrame with string values, I find that the pyspark.sql.functions lower and upper functions come in handy if your data could have column entries like "foo" and "Foo":

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
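For example, with a hypothetical mixed-case source_df, both "Foo bar" and "foo baz" survive the filter:

from pyspark.sql import SparkSession
import pyspark.sql.functions as sql_fun

spark = SparkSession.builder.master("local[*]").getOrCreate()

# hypothetical mixed-case data
source_df = spark.createDataFrame(
    [("Foo bar",), ("foo baz",), ("qux",)], ["col_name"]
)

# lower() normalizes the column before the (case-sensitive) contains check
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
result.show()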