
Cannot find col function in pyspark

In pyspark 1.6.2, I can import the col function with

from pyspark.sql.functions import col

but when I try to look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?


Community

It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods.
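
A quick runtime check (assuming pyspark is installed; the exact version shouldn't matter) shows that the name is an ordinary module attribute, it is just created dynamically:

import pyspark.sql.functions as F

print(callable(F.col))   # True: col exists at runtime
print("col" in dir(F))   # True: it is a regular attribute of the module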

If you check the source carefully you'll find col listed among the other _functions. That dictionary is then iterated over, and _create_function is used to generate the wrappers. Each generated function is assigned directly to its corresponding name in globals().

Finally __all__, which defines a list of items exported from the module, just exports all globals excluding ones contained in the blacklist.
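
A simplified sketch of that pattern (illustrative only; the real functions.py adds version decorators and the actual py4j call to the JVM):

# Name -> docstring mapping, analogous to the _functions dictionaries in functions.py
_functions = {
    "col": "Returns a Column based on the given column name.",
    "lit": "Creates a Column of literal value.",
}

def _create_function(name, doc=""):
    """Create a wrapper bound to the given name (the JVM call is stubbed out here)."""
    def _(col):
        # Real pyspark forwards to the JVM function of the same name via py4j.
        raise NotImplementedError("illustrative stub")
    _.__name__ = name
    _.__doc__ = doc
    return _

# Generate the wrappers and bind each one to its name in the module globals
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

# Export every public name that was just generated (the real source also filters a blacklist)
__all__ = [k for k in globals() if not k.startswith("_")]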

If this mechanism is still not clear, you can create a toy example:

Create a Python module called foo.py with the following content:

# Creates a function assigned to the name foo
globals()["foo"] = lambda x: "foo {0}".format(x)

# Exports all entries from globals which start with foo
__all__ = [x for x in globals() if x.startswith("foo")]

Place it somewhere on the Python path (for example in the working directory).

Import foo:

from foo import foo
foo(1)

An undesired side effect of this metaprogramming approach is that the generated functions might not be recognized by tools that depend purely on static code analysis. This is not a critical issue and can be safely ignored during development.

Depending on the IDE, installing type annotations might resolve the problem (see for example zero323/pyspark-stubs#172).


Vincent Claes

In PyCharm, the col function and others are flagged as "not found":

https://i.stack.imgur.com/3dlSQ.png

A workaround is to import the functions module and call the col function from there.

for example:

from pyspark.sql import functions as F
df.select(F.col("my_column"))

For me this still gives the Cannot find reference 'col' in 'functions.py' warning...
@Nexaspx is your pyspark version the same as your spark version? have you set the SPARK_HOME environment variable?
In Pylint this seems not to help. But in Flake8, which is better, it does.
Dmytro

As of VS Code 1.26.1, this can be solved by modifying the python.linting.pylintArgs setting:

"python.linting.pylintArgs": [
        "--generated-members=pyspark.*",
        "--extension-pkg-whitelist=pyspark",
        "--ignored-modules=pyspark.sql.functions"
    ]

That issue was explained on GitHub: https://github.com/DonJayamanne/pythonVSCode/issues/1418#issuecomment-411506443


the link doesn't exist anymore, js
This is what I needed! I added the option to my setup.cfg as ignored-modules = pyspark.sql.functions
Thomas

As explained above, pyspark generates some of its functions on the fly, which means that most IDEs cannot detect them properly. However, there is a Python package, pyspark-stubs, that includes a collection of stub files to improve type hints, static error detection, code completion, and so on. Just install it with

pip install pyspark-stubs==x.x.x

(where x.x.x has to be replaced with your pyspark version, 2.3.0 in my case for instance), and col and the other functions will be detected, without changing anything in your code, in most IDEs (PyCharm, Visual Studio Code, Atom, Jupyter Notebook, ...).


For pyspark 2.4.5 I had to use pyspark-stubs 2.4.0.
$ pip freeze | grep pyspark shows pyspark==2.4.4 and pyspark-stubs==2.4.0. I installed 2.4.0, but it's still not working.
bekce

To import all pyspark functions directly:

from pyspark.sql.functions import *
...
col('my_column')

You may also want to use an alias in order to avoid function shadowing:

from pyspark.sql import functions as f
...
f.col('my_column')

Perfect. That's all I needed. Some of these other answers are solving string theory. It's just a basic import! :)
import * is generally discouraged as it can lead to unknown imports or overwrites.
Mihai Chelaru

I ran into a similar problem trying to set up a PySpark development environment with Eclipse and PyDev. PySpark uses a dynamic namespace. To get it to work, I needed to add PySpark to "Forced Builtins" as shown below.

https://i.stack.imgur.com/EyjzM.png


In Eclipse, if you still see the error marker, right-click on the project -> PyDev -> Code Analysis. This should take away the marker.
hlongmore

As pointed out by @zero323, there are several spark functions that have wrappers generated at runtime by adding them to the globals dict and then adding those names to __all__. As pointed out by @vincent-claes, referencing the functions through the module path (as F, or as something else; I prefer something more descriptive) can make it so the imports don't show an error in PyCharm. However, as @nexaspx alluded to in a comment on that answer, that shifts the warning to the usage line(s). As mentioned by @thomas, pyspark-stubs can be installed to improve the situation.

But if, for some reason, adding that package is not an option (maybe you are using a Docker image for your environment and can't add it to the image right now), or it isn't working, here is my workaround: first, add an import for just the generated wrapper with an alias, then disable the inspection for just that import. This allows all the usages to still have inspections for other functions in the same statement, reduces the warning points to just one, and then ignores that one warning.

from pyspark.sql import functions as pyspark_functions
# noinspection PyUnresolvedReferences
from pyspark.sql.functions import col as pyspark_col
# ...
pyspark_functions.round(...)
pyspark_col(...)

If you have several imports, group them like so to have just one noinspection:

# noinspection PyUnresolvedReferences
from pyspark.sql.functions import (
    col as pyspark_col, count as pyspark_count, expr as pyspark_expr,
    floor as pyspark_floor, log1p as pyspark_log1p, upper as pyspark_upper,
)

(this is how PyCharm formatted it when I used the Reformat File command).

While we're on the subject of how to import pyspark.sql.functions, I recommend not importing the individual functions from pyspark.sql.functions, to avoid shadowing Python builtins, which can lead to obscure errors, as @SARose states.
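
As a minimal sketch of what that shadowing looks like (round and sum really are names in pyspark.sql.functions; the snippet only shows the name collision, not any specific runtime error):

# Importing individual names shadows the Python builtins of the same name
from pyspark.sql.functions import round, sum   # round() and sum() now refer to the pyspark wrappers

# Plain-Python calls such as round(2.675, 2) or sum([1, 2, 3]) no longer behave
# like the builtins, which can surface as confusing errors far from the import.

# Importing the module under an alias keeps the builtins intact
from pyspark.sql import functions as F
# F.round(...) and F.sum(...) work on Columns; round(...) and sum(...) stay built in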