I'm new to Apache Spark, and I have apparently installed apache-spark successfully with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/
Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
I would like to start experimenting in order to learn more about MLlib. However, I use PyCharm to write Python scripts. The problem is that when I try to import pyspark in PyCharm, it cannot find the module. I tried adding the path to PyCharm as follows:
https://i.stack.imgur.com/SCMrY.png
Then from a blog I tried this:
import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"
# Append pyspark to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
I still cannot use PySpark from PyCharm. Any idea how to "link" PyCharm with apache-pyspark?
Update:
Then I looked up the apache-spark and Python paths in order to set the environment variables in PyCharm:
apache-spark path:
user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb
python path:
user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *
Then with the above information I tried to set the environment variables as follows:
https://i.stack.imgur.com/TOsDo.png
Any idea of how to correctly link Pycharm with pyspark?
When I run a Python script with the above configuration, I get this exception:
/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark
UPDATE: Then I tried the configurations proposed by @zero323:
Configuration 1:
/usr/local/Cellar/apache-spark/1.5.1/
https://i.stack.imgur.com/i9dZu.png
out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt NOTICE libexec/
INSTALL_RECEIPT.json README.md
LICENSE bin/
Configuration 2:
/usr/local/Cellar/apache-spark/1.5.1/libexec
https://i.stack.imgur.com/Bq2YP.png
out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/ bin/ data/ examples/ python/
RELEASE conf/ ec2/ lib/ sbin/
With PySpark package (Spark 2.2.0 and later)
With SPARK-1267 being merged you should be able to simplify the process by pip
installing Spark in the environment you use for PyCharm development.
1. Go to File -> Settings -> Project Interpreter.
2. Click on the install button and search for PySpark.
3. Click on the install package button.
Manually with user provided Spark installation
Create Run configuration:
1. Go to Run -> Edit configurations.
2. Add a new Python configuration.
3. Set Script path so it points to the script you want to execute.
4. Edit the Environment variables field so it contains at least:
   - SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
   - PYTHONPATH - it should contain $SPARK_HOME/python and optionally $SPARK_HOME/python/lib/py4j-some-version.src.zip if not available otherwise. some-version should match the Py4J version used by a given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3, 0.10.7 - 2.4)
5. Apply the settings.
Add PySpark library to the interpreter path (required for code completion):
1. Go to File -> Settings -> Project Interpreter.
2. Open settings for the interpreter you want to use with Spark.
3. Edit the interpreter paths so they contain the path to $SPARK_HOME/python (and Py4J if required).
4. Save the settings.
Optionally
Install or add to path type annotations matching installed Spark version to get better completion and static error detection (Disclaimer - I am an author of the project).
Finally
Use newly created configuration to run your script.
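The two environment variables for the Run configuration can also be computed rather than typed by hand. The following is only a minimal sketch: PY4J_BY_SPARK restates the version pairs listed above, and run_config_env is a hypothetical helper name, not part of Spark or PyCharm.

```python
import os

# Py4J version bundled with each Spark release line, as listed in the answer above.
PY4J_BY_SPARK = {
    "1.5": "0.8.2.1",
    "1.6": "0.9",
    "2.0": "0.10.3",
    "2.1": "0.10.4",
    "2.2": "0.10.4",
    "2.3": "0.10.6",
    "2.4": "0.10.7",
}


def run_config_env(spark_home, spark_version):
    """Return SPARK_HOME and PYTHONPATH values for a PyCharm Run configuration."""
    release_line = ".".join(spark_version.split(".")[:2])
    py4j_version = PY4J_BY_SPARK[release_line]  # KeyError for releases not in the table
    python_dir = os.path.join(spark_home, "python")
    py4j_zip = os.path.join(python_dir, "lib", "py4j-{}-src.zip".format(py4j_version))
    return {
        "SPARK_HOME": spark_home,
        "PYTHONPATH": os.pathsep.join([python_dir, py4j_zip]),
    }
```

Calling run_config_env with your own Spark directory and version gives the two values to paste into the Environment variables field.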
Here's how I solved this on mac osx.
1. brew install apache-spark
2. Add this to ~/.bash_profile:
   export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
   export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
   export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
   export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
3. Add pyspark and py4j to content root (use the correct Spark version):
   /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
   /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip
https://i.stack.imgur.com/8coKV.png
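One caveat with `ls ... | sort | tail -1` is that it sorts version names as text, so a release such as 1.10.0 would sort before 1.6.1. As a rough sketch (the helper names are made up for illustration, and the default Cellar path is the one used in this answer), the same lookup can be done in Python with a numeric comparison:

```python
import os
import re


def version_key(name):
    """Turn '1.6.1' (or Homebrew's '2.7.10_2') into a tuple of ints for comparison."""
    return tuple(int(number) for number in re.findall(r"\d+", name))


def latest_version(names):
    """Pick the numerically highest version from a list of directory names."""
    return max(names, key=version_key)


def homebrew_spark_home(cellar="/usr/local/Cellar/apache-spark"):
    """Return the libexec directory of the newest Spark found under the Cellar."""
    return os.path.join(cellar, latest_version(os.listdir(cellar)), "libexec")
```

If the Cellar directory is empty, max raises ValueError, which is probably the clearest signal that Spark is not installed there.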
I added $SPARK_HOME/python in the interpreter classpath and added the Environment variables, and it works as expected.
"Add pyspark and py4j to content root (use the correct Spark version)" helped me in code completion. How did you get it done by changing Project Interpreter?
Here is the setup that works for me (Win7 64bit, PyCharm2017.3CE)
Set up Intellisense:
1. Click File -> Settings -> Project: -> Project Interpreter.
2. Click the gear icon to the right of the Project Interpreter dropdown.
3. Click More... from the context menu.
4. Choose the interpreter, then click the "Show Paths" icon (bottom right).
5. Click the + icon to add the following paths:
   \python\lib\py4j-0.9-src.zip
   \bin\python\lib\pyspark.zip
6. Click OK, OK, OK.
Go ahead and test your new intellisense capabilities.
Configure pyspark in pycharm (windows)
File menu - settings - project interpreter - (gearshape) - more - (treebelowfunnel) - (+) - [add python folder from spark installation and then py4j-*.zip] - click ok
Ensure SPARK_HOME is set in the Windows environment; PyCharm will pick it up from there. To confirm:
Run menu - edit configurations - environment variables - [...] - show
Optionally set SPARK_CONF_DIR in environment variables.
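Another way to confirm what PyCharm actually passes on is to print the variables from a script started with the Run configuration. This is just a sketch; describe_spark_env is a made-up helper name and only looks at the two variables mentioned above.

```python
import os


def describe_spark_env(environ=None):
    """Report whether SPARK_HOME and SPARK_CONF_DIR are set and point to real directories."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in ("SPARK_HOME", "SPARK_CONF_DIR"):
        value = environ.get(name)
        if value is None:
            report[name] = "not set"
        elif not os.path.isdir(value):
            report[name] = "set to {!r}, but that directory does not exist".format(value)
        else:
            report[name] = value
    return report
```

Running print(describe_spark_env()) inside PyCharm and again in a terminal shows whether both see the same settings.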
I used the following page as a reference and was able to get pyspark/Spark 1.6.1 (installed via homebrew) imported in PyCharm 5.
http://renien.com/blog/accessing-pyspark-pycharm/
import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.6.1"
# Append pyspark to Python Path
sys.path.append("/usr/local/Cellar/apache-spark/1.6.1/libexec/python")
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
With the above, pyspark loads, but I get a gateway error when I try to create a SparkContext. There's some issue with Spark from Homebrew, so I just grabbed Spark from the Spark website (download the "Pre-built for Hadoop 2.6 and later" package) and pointed to the spark and py4j directories under that. Here's the code in PyCharm that works!
import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6"
# Need to Explicitly point to python3 if you are using Python 3.x
os.environ['PYSPARK_PYTHON']="/usr/local/Cellar/python3/3.5.1/bin/python3"
#You might need to enter your local IP
#os.environ['SPARK_LOCAL_IP']="192.168.2.138"
#Path for pyspark and py4j
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
sc = SparkContext('local')
words = sc.parallelize(["scala","java","hadoop","spark","akka"])
print(words.count())
I had a lot of help from these instructions, which helped me troubleshoot in PyDev and then get it working in PyCharm - https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/
I'm sure somebody has spent a few hours bashing their head against their monitor trying to get this working, so hopefully this helps save their sanity!
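The script above hard-codes the py4j version in the path. A possible variation, sketched here with a hypothetical helper name (add_spark_to_path), looks the zip up with glob so the same code keeps working after a Spark upgrade. It assumes the usual python/lib layout of a Spark download.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Set SPARK_HOME and append Spark's python dir and py4j zip(s) to sys.path."""
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    added = []
    for entry in [python_dir] + py4j_zips:
        if entry not in sys.path:
            sys.path.append(entry)
            added.append(entry)
    os.environ["SPARK_HOME"] = spark_home
    return added
```

Call it before the first `from pyspark import ...` line; the returned list shows which entries were newly added.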
I use conda to manage my Python packages. So all I did in a terminal outside PyCharm was:
conda install pyspark
or, if you want an earlier version, say 2.2.0, then do:
conda install pyspark=2.2.0
This automatically pulls in py4j as well. PyCharm then no longer complained about import pyspark... and code completion also worked. Note my PyCharm project was already configured to use the Python interpreter that comes with Anaconda.
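To double-check that the interpreter selected in PyCharm is the one that received the package, something like the following can be run from the project. It is a small sketch using only the standard library; the function names are invented for this example.

```python
import importlib.util
import sys


def is_importable(module_name):
    """True if the current interpreter can locate the module, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def interpreter_report(modules=("pyspark", "py4j")):
    """Show which interpreter is running and which of the modules it can find."""
    return {
        "executable": sys.executable,
        "modules": {name: is_importable(name) for name in modules},
    }
```

If print(interpreter_report()) shows a different executable than the conda environment you installed into, the project interpreter setting is the first thing to check.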
The simplest way is to install PySpark through project interpreter.
1. Go to File - Settings - Project - Project Interpreter.
2. Click on the + icon at the top right.
3. Search for PySpark and any other packages you want to install.
4. Finally, click install package.
5. It's done!
Check out this video.
Assume your spark python directory is: /home/user/spark/python
Assume your Py4j source is: /home/user/spark/python/lib/py4j-0.9-src.zip
Basically you add the spark python directory and the py4j directory within that to the interpreter paths. I don't have enough reputation to post a screenshot or I would.
In the video, the user creates a virtual environment within PyCharm itself. However, you can make the virtual environment outside of PyCharm or activate a pre-existing virtual environment, then start PyCharm with it and add those paths to the virtual environment interpreter paths from within PyCharm.
I used other methods to add Spark via bash environment variables, which works great outside of PyCharm, but for some reason they weren't recognized within PyCharm. This method worked perfectly.
You also need to create a SparkContext object at the beginning of your script. I note this because using the interactive pyspark console via the command line automatically creates the context for you, whereas in PyCharm, you need to take care of that yourself; syntax would be: sc = SparkContext()
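As a rough sketch of what "taking care of it yourself" can look like in a standalone script: the function names, the app name and the local[*] master below are assumptions for illustration. pyspark is imported inside make_context so the module itself loads even where Spark is not yet on the path.

```python
def make_context(app_name="pycharm-demo", master="local[*]"):
    """Create the SparkContext that the pyspark shell would otherwise provide as sc."""
    from pyspark import SparkConf, SparkContext  # requires pyspark on the interpreter path

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)


def count_items(sc, items):
    """Distribute a local collection and count its elements."""
    return sc.parallelize(items).count()


def main():
    sc = make_context()
    try:
        print(count_items(sc, ["scala", "java", "hadoop", "spark", "akka"]))
    finally:
        sc.stop()  # release the context so the next run can create a new one
```

Calling main() at the bottom of the script runs the job; stopping the context in finally avoids leftover contexts when a run fails halfway.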
You need to set up PYTHONPATH and SPARK_HOME before you launch the IDE or Python.
On Windows, edit the environment variables and add the Spark python directory and py4j to PYTHONPATH:
PYTHONPATH=%PYTHONPATH%;{py4j};{spark python}
On Unix (note that the separator is a colon, not a semicolon):
export PYTHONPATH=${PYTHONPATH}:{py4j}:{spark/python}
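The list separator differs between platforms (`;` on Windows, `:` on Unix-like systems), and Python exposes the right one as os.pathsep. A minimal sketch, with build_pythonpath being a hypothetical helper and the arguments standing in for the {py4j} and {spark python} placeholders above:

```python
import os


def build_pythonpath(py4j_zip, spark_python, existing=""):
    """Join the entries with the separator of the platform the code runs on."""
    entries = [py4j_zip, spark_python]
    if existing:
        entries.append(existing)
    return os.pathsep.join(entries)
```

The result can be assigned to os.environ["PYTHONPATH"] for child processes, or printed and pasted into the system settings.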
I used pycharm to link python and spark. I had Java and Spark pre-installed in my pc.
These are the steps I followed
1. Create a new project.
2. In Settings for New Project, I selected Python3.7(venv) as my Python. This is the python.exe file present in the venv folder inside my new project. You can use any Python available on your PC.
3. In Settings --> Project Structure --> Add Content_Root, I added two zip files from Spark as directories:
   C:\Users\USER\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip
   C:\Users\USER\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\pyspark.zip
4. Create a Python file inside the new project. Then go to Edit Configurations (in the dropdown at the upper right) and select Environment Variables.
5. I used the environment variables below and they worked fine for me:
   PYTHONUNBUFFERED 1
   JAVA_HOME C:\Program Files\Java\jre1.8.0_251
   PYSPARK_PYTHON C:\Users\USER\PycharmProjects\pyspark\venv\Scripts\python.exe
   SPARK_HOME C:\Users\USER\spark-3.0.0-preview2-bin-hadoop2.7
   HADOOP_HOME C:\Users\USER\winutils
   You may also want to download winutils.exe and place it in C:\Users\USER\winutils\bin.
6. Give the same environment variables inside Edit Configurations --> Templates.
7. Go to Settings --> Project Interpreter --> import pyspark.
8. Run your first pyspark program!
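Since a misplaced winutils.exe is easy to overlook, a quick check from Python may help. This is only a sketch; find_winutils is an invented name, and it assumes the HADOOP_HOME\bin layout described in the steps above.

```python
import os


def find_winutils(environ=None):
    """Return the path to winutils.exe under HADOOP_HOME\\bin, or None if it is missing."""
    environ = os.environ if environ is None else environ
    hadoop_home = environ.get("HADOOP_HOME")
    if not hadoop_home:
        return None
    candidate = os.path.join(hadoop_home, "bin", "winutils.exe")
    return candidate if os.path.isfile(candidate) else None
```

A None result from find_winutils() inside the Run configuration means either HADOOP_HOME is not passed through or the executable is not in its bin folder.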
From the documentation:
To run Spark applications in Python, use the bin/spark-submit script located in the Spark directory. This script will load Spark’s Java/Scala libraries and allow you to submit applications to a cluster. You can also use bin/pyspark to launch an interactive Python shell.
You are invoking your script directly with the CPython interpreter, which I think is causing problems.
Try running your script with:
"${SPARK_HOME}"/bin/spark-submit test_1.py
If that works, you should be able to get it working in PyCharm by setting the project's interpreter to spark-submit.
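If you would rather trigger spark-submit from a small Python launcher (for example as the script of a Run configuration), a sketch could look like this. The helper names are made up; the only assumption about Spark is the bin/spark-submit script quoted from the documentation above, with options placed before the application file.

```python
import os
import subprocess


def spark_submit_command(script, spark_home=None, extra_args=()):
    """Build the argument list for running a script through bin/spark-submit."""
    spark_home = spark_home or os.environ["SPARK_HOME"]
    launcher = os.path.join(spark_home, "bin", "spark-submit")
    # spark-submit options go before the application file.
    return [launcher] + list(extra_args) + [script]


def run_with_spark_submit(script, **kwargs):
    """Run the script via spark-submit and return its exit code."""
    return subprocess.call(spark_submit_command(script, **kwargs))
```

For example, run_with_spark_submit("test_1.py", extra_args=("--master", "local[2]")) would run the question's script locally, provided SPARK_HOME is set.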
I followed the tutorials on-line and added the env variables to .bashrc:
# add pyspark to python
export SPARK_HOME=/home/lolo/spark-1.6.1
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
I then copied the values of SPARK_HOME and PYTHONPATH into PyCharm:
(srz-reco)lolo@K:~$ echo $SPARK_HOME
/home/lolo/spark-1.6.1
(srz-reco)lolo@K:~$ echo $PYTHONPATH
/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/python/lib/py4j-0.8.2.1-src.zip:/python/:
Then I copied it to Run/Debug Configurations -> Environment variables of the script.
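The PYTHONPATH printed above contains the same entries several times, which tends to happen when .bashrc is sourced more than once. Before pasting the value into PyCharm it can be cleaned up; the following is a small sketch with a hypothetical helper name that keeps the first occurrence of each entry.

```python
import os


def dedupe_path(value, sep=os.pathsep):
    """Drop empty and repeated entries from a path list, preserving order."""
    seen = set()
    kept = []
    for entry in value.split(sep):
        if entry and entry not in seen:
            seen.add(entry)
            kept.append(entry)
    return sep.join(kept)
```

Applying dedupe_path(os.environ.get("PYTHONPATH", "")) gives a shorter value that is easier to read in the Run/Debug dialog.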
This tutorial from pyspark_xray, a tool that enables debugging pyspark code on PyCharm, can answer your question. It covers both Windows and Mac.
Preparation
Open a command line and run the java command; if you get an error, download and install Java (version 1.8.0_221 as of April 2020)
If you don't have it, download and install PyCharm Community edition (version 2020.1 as of April 2020)
If you don't have it, download and install Anaconda Python 3.7 runtime
Download and install the latest Spark pre-built for Apache Hadoop (spark-2.4.5-bin-hadoop2.7 as of April 2020, 200+MB size) locally
  Windows:
    - if you don't have an unzip tool, please download and install 7zip, a free tool to zip/unzip files
    - extract the contents of the spark tgz file to the c:\spark-x.x.x-bin-hadoopx.x folder
    - follow the steps in this tutorial: install winutils.exe into the c:\spark-x.x.x-bin-hadoopx.x\bin folder; without this executable, you will run into an error when writing engine output
  Mac:
    - extract the contents of the spark tgz file to the /Users/[USERNAME]/spark-x.x.x-bin-hadoopx.x folder
install pyspark by pip install pyspark or conda install pyspark
Run Configuration
You run a Spark application on a cluster from the command line by issuing the spark-submit command, which submits a Spark job to the cluster. But from PyCharm or another IDE on a local laptop or PC, spark-submit cannot be used to kick off a Spark job. Instead, follow these steps to set up a Run Configuration of pyspark_xray's demo_app in PyCharm:
Set Environment Variables:
  - set HADOOP_HOME value to C:\spark-2.4.5-bin-hadoop2.7
  - set SPARK_HOME value to C:\spark-2.4.5-bin-hadoop2.7
Use GitHub Desktop or other git tools to clone pyspark_xray from GitHub
PyCharm > Open pyspark_xray as project
Open PyCharm > Run > Edit Configurations > Defaults > Python and enter the following values:
  - Environment variables (Windows): PYTHONUNBUFFERED=1;PYSPARK_PYTHON=python;PYTHONPATH=$SPARK_HOME/python;PYSPARK_SUBMIT_ARGS=pyspark-shell;
Open PyCharm > Run > Edit Configurations, create a new Python configuration, and point the script to the path of driver.py of pyspark_xray > demo_app
Go to Project Structure:
Option 1: File -> Settings -> Project: -> Project Structure
Option 2: PyCharm -> Preferences -> Project: -> Project Structure
Add Content Root: all ZIP files from $SPARK_HOME/python/lib
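The semicolon-separated Environment variables value given above can be turned into a dictionary, which is handy for checking what a configuration will set. The sketch below is an illustration only: parse_env_string is an invented name, and because this sketch does not assume that PyCharm expands $SPARK_HOME inside a value, it substitutes such references from a mapping you pass in.

```python
from string import Template


def parse_env_string(text, base_env=None):
    """Parse 'NAME=value;NAME=value;' and expand $NAME references from base_env."""
    base_env = base_env or {}
    env = {}
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate the trailing semicolon
        name, _, value = pair.partition("=")
        # safe_substitute leaves unknown $NAME references untouched
        env[name] = Template(value).safe_substitute(base_env)
    return env
```

Passing {"SPARK_HOME": ...} as base_env shows the PYTHONPATH the configuration is meant to produce.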
For recent Spark and Python versions on macOS, such as the following:
SPARK_VERSION=3.1.1
PY4J=0.10.9
PYTHON=3.8.12
Add the environment variables below for SPARK_HOME, PYTHONPATH and PYENV_ROOT to ~/.bash_profile. In addition, SPARK_HOME and PYENV_ROOT are added to PATH.
export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
export SPARK_HOME=/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYENV_ROOT=/usr/local/opt/pyenv
export PATH=$PYENV_ROOT/bin:$PATH
if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
fi
Under Project -> Preferences -> Python Interpreter, add the PyEnv Python as a new Interpreter and use it instead of the default interpreter.
Under Add Python Interpreter, go to Virtual Environment -> Existing Environment and select /usr/local/opt/pyenv/versions/3.8.12/bin/python as the Python interpreter for the PySpark project.
In the Python code, add the code block below at the beginning (NOTE: pyspark, findspark and py4j need to be installed as packages beforehand)
import findspark
# Call init() before importing pyspark so the given Spark installation is put on sys.path first.
findspark.init("/usr/local/Cellar/apache-spark/3.1.1/libexec")
from pyspark import SparkContext
The easiest way is:
Go to the site-packages folder of your Anaconda/Python installation and copy the pyspark and pyspark.egg-info folders there.
Restart PyCharm to update the index. The two folders mentioned above are in the spark/python folder of your Spark installation. This way you'll also get code completion suggestions from PyCharm.
The site-packages folder can be easily found in your Python installation. In Anaconda it's under anaconda/lib/pythonx.x/site-packages
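If you are unsure where site-packages is, Python can tell you, and the copy itself can be scripted. This is a sketch under the assumption that the pyspark folder sits in the python directory of the Spark installation, as described above; copy_pyspark is a made-up name, and it copies only the pyspark folder, not pyspark.egg-info.

```python
import os
import shutil
import sysconfig


def copy_pyspark(spark_home, target_dir=None):
    """Copy <spark_home>/python/pyspark into site-packages (or target_dir)."""
    target_dir = target_dir or sysconfig.get_paths()["purelib"]
    source = os.path.join(spark_home, "python", "pyspark")
    destination = os.path.join(target_dir, "pyspark")
    shutil.copytree(source, destination)  # fails if the destination already exists
    return destination
```

Note that pyspark still needs py4j to be importable at run time, so this mainly helps with code completion unless py4j is installed as well.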
I tried to add the pyspark module via the Project Interpreter menu, but that was not enough... There are a number of system environment variables that need to be set, like SPARK_HOME and a path to /hadoop/bin/winutils.exe, in order to read local data files. You also need to be using the correct versions of Python, JRE and JDK, all available in the system environment variables and PATH. After googling a lot, the instructions in these videos worked.
spark-defaults.conf) or through submit args - same as with Jupyter notebook. Submit args can be defined in PyCharm's Environment variables, instead of code, if you prefer this option.