
How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4

I've installed OpenJDK 13.0.1, python 3.8 and spark 2.4.4. The instructions for testing the install are to run .\bin\pyspark from the root of the spark installation. I'm not sure if I missed a step in the spark installation, like setting some environment variable, but I can't find any further detailed instructions.

I can run the python interpreter on my machine, so I'm confident that it is installed correctly and running "java -version" gives me the expected response, so I don't think the problem is with either of those.

I get a stack trace of errors from cloudpickle.py:

Traceback (most recent call last):
  File "C:\software\spark-2.4.4-bin-hadoop2.7\bin\..\python\pyspark\shell.py", line 31, in <module>
    from pyspark import SparkConf
  File "C:\software\spark-2.4.4-bin-hadoop2.7\python\pyspark\__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "C:\software\spark-2.4.4-bin-hadoop2.7\python\pyspark\context.py", line 31, in <module>
    from pyspark import accumulators
  File "C:\software\spark-2.4.4-bin-hadoop2.7\python\pyspark\accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "C:\software\spark-2.4.4-bin-hadoop2.7\python\pyspark\serializers.py", line 71, in <module>
    from pyspark import cloudpickle
  File "C:\software\spark-2.4.4-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "C:\software\spark-2.4.4-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

John

This is happening because you're using python 3.8. The latest pip release of pyspark (pyspark 2.4.4 at time of writing) doesn't support python 3.8. Downgrade to python 3.7 for now, and you should be fine.
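For example, a minimal smoke test along these lines can confirm that a 3.7 interpreter fixes it. This is only a sketch: it assumes pyspark 2.4.x is importable from that interpreter (e.g. via pip install pyspark==2.4.4) and that Java is installed; PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the standard variables Spark reads to pick the worker and driver interpreters.

# Minimal smoke test -- must itself be run with Python 3.7 or older,
# because importing pyspark 2.4.x already fails inside cloudpickle on 3.8.
import os
import sys

assert sys.version_info < (3, 8), "use Python 3.7 or older with pyspark 2.4.x"

# Point Spark's driver and workers at this same interpreter.
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)
os.environ.setdefault("PYSPARK_DRIVER_PYTHON", sys.executable)

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("py37-smoke-test"))
print(sc.parallelize(range(10)).sum())  # prints 45 if the install works
sc.stop()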


I can confirm pyspark 2.4.4 is working for me with python 3.7.5.
Can confirm that a fresh conda environment with python 3.7.0 works! Thanks.
Here is the link to the issue tracker bug: issues.apache.org/jira/browse/SPARK-29536 and the github pull request: github.com/apache/spark/pull/26194. The fix will be part of pyspark 3.0. On March 30, 2020, v3.0.0-rc1 was released: github.com/apache/spark/releases. Hopefully, v3.0.0 will be coming soon.
I use Spark version 2.4.4 and it gives the same problem with conda python 3.7.0.
I use spark 2.4.6, and installing python 3.7.8 on ubuntu 20.04 using this solved the problem.
Ani Menon

It's a python and pyspark version mismatch, as John rightly pointed out. For a newer python version you can try:

pip install --upgrade pyspark

That will update the package, if a newer one is available. If this doesn't help, then you might have to downgrade to a compatible version of python.

The pyspark package doc clearly states:

NOTE: If you are using this with a Spark standalone cluster you must ensure that the version (including minor version) matches or you may experience odd errors.
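As a quick sanity check for that mismatch, a sketch like the following compares the interpreter version against the installed pyspark distribution. It deliberately avoids importing pyspark, since on Python 3.8 with pyspark 2.4.x the import itself is what fails:

# Version sanity check without importing pyspark.
import sys
import pkg_resources  # from setuptools; importlib.metadata also works on 3.8+

pyspark_version = pkg_resources.get_distribution("pyspark").version
print("python :", ".".join(map(str, sys.version_info[:3])))
print("pyspark:", pyspark_version)

# With a standalone cluster, pyspark_version should also match the cluster's
# Spark version (including the minor version), per the note above.
if pyspark_version.startswith("2.") and sys.version_info >= (3, 8):
    print("mismatch: pyspark 2.x does not support Python 3.8+")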


ei-grad

As a dirty workaround, one can replace _cell_set_template_code with the Python3-only implementation suggested by the docstring of the _make_cell_set_template_code function:

Notes
-----
In Python 3, we could use an easier function:

.. code-block:: python

   def f():
       cell = None

       def _stub(value):
           nonlocal cell
           cell = value

       return _stub

    _cell_set_template_code = f()

Here is a patch for spark v2.4.5: https://gist.github.com/ei-grad/d311d0f34b60ebef96841a3a39103622

Apply it by:

git apply <(curl https://gist.githubusercontent.com/ei-grad/d311d0f34b60ebef96841a3a39103622/raw)

This fixes the problem with ./bin/pyspark, but ./bin/spark-submit uses the bundled pyspark.zip with its own copy of cloudpickle.py. And even if it were fixed there, it still wouldn't work, failing with the same error while unpickling some object in pyspark/serializers.py.

But it looks like Python 3.8 support has already arrived in spark v3.0.0-preview2, so one can try it. Or stick to Python 3.7, like the accepted answer suggests.
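For background (my reading of the Python 3.8 changes, not something the gist states): types.CodeType gained an extra posonlyargcount parameter in 3.8, so code written against the 3.7 constructor ends up passing the bytecode (bytes) where an integer is now expected, which is exactly the TypeError above. A rough illustration:

# Rough illustration for Python 3.8-3.10: calling CodeType with the 3.7-era
# argument order shifts everything by one slot, so co_code (bytes) lands
# where an int is expected. (The constructor changed again in 3.11.)
import sys
import types

def sample():
    return 42

c = sample.__code__
if (3, 8) <= sys.version_info < (3, 11):
    try:
        types.CodeType(
            c.co_argcount, c.co_kwonlyargcount, c.co_nlocals, c.co_stacksize,
            c.co_flags, c.co_code, c.co_consts, c.co_names, c.co_varnames,
            c.co_filename, c.co_name, c.co_firstlineno, c.co_lnotab,
            c.co_freevars, c.co_cellvars,
        )
    except TypeError as exc:
        print(exc)  # an integer is required (got type bytes)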


Paul

Make sure to use the right versions of Java, Python and Spark. I got the same error caused by an outdated Spark version (Spark 2.4.7).

Downloading the latest Spark 3.0.1, together with Python 3.8 (as part of Anaconda3 2020.07) and Java JDK 8, solved the problem for me!
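A small sketch to sanity-check that combination on your own machine (it only assumes java is on the PATH and that a pyspark 3.x is installed):

# Print the three versions mentioned above: Python, pyspark, and Java.
import subprocess
import sys

import pyspark

print("python :", ".".join(map(str, sys.version_info[:3])))  # e.g. 3.8.x
print("pyspark:", pyspark.__version__)                        # e.g. 3.0.1
try:
    java = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("java   :", java.stderr.splitlines()[0])  # java -version writes to stderr
except FileNotFoundError:
    print("java   : not found on PATH")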


Same issue here. The problem was solved by upgrading from PySpark 2.4.4 to 3.0.1.
Jelmer

The problem with python 3.8 has been resolved in the most recent versions. I got this error because my scikit-learn version was very outdated; running

pip install scikit-learn --upgrade

solved the problem.


mohamed_abdullah

Try installing the latest version of pyinstaller, which is compatible with python 3.8, using this command:

pip install https://github.com/pyinstaller/pyinstaller/archive/develop.tar.gz

reference:
https://github.com/pyinstaller/pyinstaller/issues/4265


I did this and pyspark still gives the same error.
Same here. It seems like this is a different issue, even if it's the same error message. OP's problem happens in pyspark\cloudpickle.py. The PyInstaller problem happens in PyInstaller\building\utils.py.