
How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with a column of String type, and I want to change that column to Double type in PySpark.

This is what I did:

toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))

Is this the right way to do it? I get an error when running Logistic Regression on the result, and I wonder whether this conversion is the cause.



There is no need for a UDF here. Column already provides a cast method that accepts a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations may be supported as well) correspond to the simpleString value. So for atomic types:

from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:

types.ArrayType(types.IntegerType()).simpleString()   
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'

Using the col function also works: from pyspark.sql.functions import col; changedTypedf = joindf.withColumn("label", col("show").cast(DoubleType()))
What are the possible values of the cast() argument (the "string" syntax)?
I can't believe how terse the Spark docs are on the valid strings for the data type. The closest reference I could find was this: docs.tibco.com/pub/sfire-analyst/7.7.1/doc/html/en-US/… .
How to convert multiple columns in one go?
How do I change nullable to false?
ZygD

To preserve the column name and avoid adding an extra column, use the same name as the input column:

from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

Thanks, I was looking for how to retain the original column name.
Is there a list somewhere of the short string data types Spark will recognize?
This solution also works splendidly in a loop, e.g.: from pyspark.sql.types import IntegerType; for ftr in ftr_list: df = df.withColumn(f, df[f].cast(IntegerType()))
@Quetzalcoatl Your code is wrong. What is f? Where are you using ftr?
Yeah, thanks -- 'f' should be 'ftr'. Others likely figured that out.
ZygD

The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about that), so the earlier answers did not cover it.

We can refer to the column in a Spark statement with the col("column_name") function:

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

Thank you! Using 'double' is more elegant than DoubleType() which may also need to be imported.
ZygD

PySpark version:

from pyspark.sql.types import IntegerType

df = <source data>
df.printSchema()

# Change the column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()

Abhishek Choudhary

The solution was simple: convert the value with float() inside the UDF:

toDoublefunc = UserDefinedFunction(lambda x: float(x),DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))