
PySpark: java.lang.OutOfMemoryError: Java heap space

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It's running on only one machine. In my process, I want to collect a huge amount of data, as in the code below:

train_dataRDD = (train.map(getTagsAndText)
                 .filter(lambda x: x[-1] != [])
                 # tuple-unpacking lambdas are Python 2 only, so index into the (x, text, tags) record
                 .flatMap(lambda rec: [(tag, (rec[0], rec[1])) for tag in rec[2]])
                 .groupByKey()
                 .mapValues(list))

When I do

training_data =  train_dataRDD.collectAsMap()

It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses its connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.

It looks like the heap space is too small. How can I set a larger limit?

EDIT:

Things that I tried before running: sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')

I changed the Spark options as per the documentation here (Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html

It says that I can avoid OOMs by setting the spark.executor.memory option. I did that, but it does not seem to be working.

@bcaceiro: I see a lot of Spark options being set in the post. I don't use Scala. I am using IPython. Do you know if I can set those options from within the shell?
@bcaceiro: Updated the question with the suggestion from the post that you directed me to. It seems like there is some problem with the JVM.

pg2455

After trying out loads of configuration parameters, I found that only one needs to be changed to get more heap space: spark.driver.memory.

sudo vim $SPARK_HOME/conf/spark-defaults.conf
# uncomment the spark.driver.memory line and set it to suit your workload. I changed it to the value below
spark.driver.memory 15g
# press Esc, then type :wq! and press Enter to save and exit vim

Close your existing Spark application and rerun it. You should not encounter this error again. :)


Can you change this conf value from the actual script (i.e. set('spark.driver.memory','15g'))?
I tried doing it but was not successful. I think it needs a restart with the new global parameters.
From docs: spark.driver.memory "Amount of memory to use for the driver process, i.e. where SparkContext is initialized. (e.g. 1g, 2g). Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file."
I was running the Spark code using SBT run from IDEA SBT Console, the fix for me was to add -Xmx4096M -d64 to the java VM parameters that get passed on the SBT Console launch. This is under Other settings -> SBT.
Spark keeps evolving. So you might have to look into its documentation and find out the configuration parameters that correlate to the memory allocation.
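For anyone who cannot edit spark-defaults.conf, the docs quote above also mentions the --driver-memory command line option. A minimal sketch of passing that option from plain Python or IPython follows. It assumes the PySpark launcher reads the PYSPARK_SUBMIT_ARGS environment variable when it starts the JVM, and the helper name set_driver_memory_for_pyspark is made up for illustration. It overwrites any value already in that variable.

```python
import os


def set_driver_memory_for_pyspark(memory="15g"):
    """Ask the PySpark launcher to start the driver JVM with `memory` of heap.

    Hypothetical helper. It must run before the first SparkContext or
    SparkSession is created in this Python process; it cannot resize a
    JVM that is already running.
    """
    # The trailing "pyspark-shell" token is expected by the launcher.
    args = "--driver-memory {} pyspark-shell".format(memory)
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


set_driver_memory_for_pyspark("15g")

# Only after this point create the context, for example:
# from pyspark import SparkContext
# sc = SparkContext("local[*]", "my-app")
```

If a SparkContext already exists in the notebook, restart the kernel first so that a fresh JVM is launched with the new option.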
louis_guitton

If you're looking for a way to set this from within the script or a Jupyter notebook, you can do:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()
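
One caveat, as far as I can tell: getOrCreate() hands back an already-running session if there is one, and a running driver JVM cannot be resized, so the setting probably only takes effect when this is the first session in the process. A rough way to check what the JVM actually got is to ask it directly. The sketch below relies on the private _jvm attribute of the SparkContext, which is an assumption about PySpark internals and may change between versions; driver_max_heap_gib is a made-up name.

```python
def driver_max_heap_gib(spark):
    """Return the driver JVM's maximum heap in GiB, as reported by the JVM.

    Hypothetical helper: `spark` is a SparkSession, and `_jvm` is a private
    Py4J handle, so treat this as a debugging aid rather than a stable API.
    """
    runtime = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
    return runtime.maxMemory() / (1024 ** 3)


# Usage, assuming `spark` is the session built above:
# print(round(driver_max_heap_gib(spark), 1))
```

The number reported by the JVM may come out somewhat below the configured figure; a value near the default instead of near 15 suggests the setting was ignored and the kernel needs a restart.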

Francesco Boi

I had the same problem with pyspark (installed with brew). In my case it was installed on the path /usr/local/Cellar/apache-spark.

The only configuration file I had was in apache-spark/2.4.0/libexec/python//test_coverage/conf/spark-defaults.conf.

As suggested here I created the file spark-defaults.conf in the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended to it the line spark.driver.memory 12g.
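
If you would rather script that step than edit the file by hand, here is a minimal sketch. The function name set_spark_default is hypothetical, and it only understands the simple "key value" line format used above, not key=value lines. It creates the file (and its directory) if missing, drops any uncommented line for the same key, and appends the new setting.

```python
import os


def set_spark_default(conf_path, key, value):
    """Set `key value` in a spark-defaults.conf style file, creating it if needed."""
    lines = []
    if os.path.exists(conf_path):
        with open(conf_path) as f:
            lines = f.read().splitlines()
    # Keep comments and other settings; drop an existing uncommented entry for `key`.
    kept = [ln for ln in lines if not (ln.split() and ln.split()[0] == key)]
    kept.append("{} {}".format(key, value))
    os.makedirs(os.path.dirname(conf_path) or ".", exist_ok=True)
    with open(conf_path, "w") as f:
        f.write("\n".join(kept) + "\n")


# Example, assuming SPARK_HOME points at your Spark install:
# conf = os.path.join(os.environ["SPARK_HOME"], "conf", "spark-defaults.conf")
# set_spark_default(conf, "spark.driver.memory", "12g")
```

As with the manual edit, restart the Spark application afterwards so the driver JVM picks up the new value.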