Convert pyspark string to date format

python apache-spark pyspark apache-spark-sql

I have a PySpark dataframe with a string column in the format MM-dd-yyyy, and I am trying to convert it into a date column.

I tried:

df.select(to_date(df.STRING_COLUMN).alias('new_date')).show()

And I get a column of nulls. Can anyone help?

Unless you're using one of the TimeSeriesRDD add-ons (see the Spark 2016 conference for some discussion; there are two I know of, but both are still in development), there aren't a lot of great tools for time series. Accordingly, I've found there's rarely a reason to bother converting strings to datetime objects if your goal is various types of groupBy or resampling operations. Just perform them on the string columns.

The analysis will use little to no groupBy; it consists of longitudinal studies of medical records. Being able to manipulate the dates is therefore important.

Possible duplicate of Why I get null results from date_format() PySpark function?

Reza S

Update (1/10/2018):

For Spark 2.2+ the best way to do this is probably using the to_date or to_timestamp functions, which both support the format argument. From the docs:

>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

Original Answer (for Spark < 2.2)

It is possible (preferable?) to do this without a udf:

from pyspark.sql.functions import unix_timestamp, from_unixtime

df = spark.createDataFrame(
    [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)], 
    ['date_str']
)

df2 = df.select(
    'date_str', 
    from_unixtime(unix_timestamp('date_str', 'MM/dd/yyyy')).alias('date')
)

print(df2)
#DataFrame[date_str: string, date: string]

df2.show(truncate=False)
#+----------+-------------------+
#|date_str  |date               |
#+----------+-------------------+
#|11/25/1991|1991-11-25 00:00:00|
#|11/24/1991|1991-11-24 00:00:00|
#|11/30/1991|1991-11-30 00:00:00|
#+----------+-------------------+

This is the correct answer. Using a udf for this will destroy your performance.

from pyspark.sql.functions import from_unixtime, unix_timestamp

Note that you can find a Java date format reference here: docs.oracle.com/javase/6/docs/api/java/text/…
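One thing worth keeping in mind when moving between the Java-style patterns used by unix_timestamp/to_date and Python's strptime directives: the letter case for months and minutes is reversed. In Java patterns MM is month and mm is minute, while in strptime %m is month and %M is minute. The following is only a minimal sketch of a translator for a handful of pattern letters. JAVA_TO_STRPTIME and java_pattern_to_strptime are hypothetical names made up for this illustration (they are not part of PySpark or the standard library), and anything outside the listed tokens is rejected rather than guessed.

```python
from datetime import datetime

# Hypothetical lookup table; it covers only a few common pattern letters.
JAVA_TO_STRPTIME = {
    "yyyy": "%Y",  # four-digit year
    "MM": "%m",    # month (upper case in Java patterns)
    "dd": "%d",    # day of month
    "HH": "%H",    # hour, 0-23
    "mm": "%M",    # minute (lower case in Java patterns)
    "ss": "%S",    # second
}


def java_pattern_to_strptime(pattern):
    """Translate a simple Java-style pattern into a strptime format string."""
    out = []
    i = 0
    while i < len(pattern):
        for token, directive in JAVA_TO_STRPTIME.items():
            if pattern.startswith(token, i):
                out.append(directive)
                i += len(token)
                break
        else:
            if pattern[i].isalpha():
                # Unknown pattern letter: refuse instead of guessing.
                raise ValueError("unsupported pattern letter: " + pattern[i])
            # Separators such as '-', '/', ':' and spaces are copied unchanged.
            out.append(pattern[i])
            i += 1
    return "".join(out)


fmt = java_pattern_to_strptime("MM-dd-yyyy")
parsed = datetime.strptime("11-25-1991", fmt)
print(fmt, parsed.date())
```

This can be handy for sanity-checking a pattern against a few sample strings in plain Python before running it on a cluster; it is not a substitute for the Java reference linked above.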

Also note that to_date() with the format argument requires Spark 2.2+. to_date existed before 2.2, but the format option did not.

TL;DR: df = df.withColumn("ResultColumn", to_timestamp(col("OriginalDateCol"), 'yyyy-MM-dd HH:mm:ss'))

Siddhant Saraf

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType



# Creation of a dummy dataframe:
df1 = sqlContext.createDataFrame([("11/25/1991","11/24/1991","11/30/1991"), 
                            ("11/25/1391","11/24/1992","11/30/1992")], schema=['first', 'second', 'third'])

# Define a user-defined function (udf)
# that converts the string cell into a date:
func = udf(lambda x: datetime.strptime(x, '%m/%d/%Y'), DateType())

df = df1.withColumn('test', func(col('first')))

df.show()

df.printSchema()

Here is the output:

+----------+----------+----------+----------+
|     first|    second|     third|      test|
+----------+----------+----------+----------+
|11/25/1991|11/24/1991|11/30/1991|1991-01-25|
|11/25/1391|11/24/1992|11/30/1992|1391-01-17|
+----------+----------+----------+----------+

root
 |-- first: string (nullable = true)
 |-- second: string (nullable = true)
 |-- third: string (nullable = true)
 |-- test: date (nullable = true)

A udf shouldn't be necessary here, but the built-ins for handling this are atrocious. This is what I would do for now too.

Why don't the dates in the test column match the first column? Yes, it is now of date type, but the days and months don't match. Is there a reason?

The test column gives incorrect date values. This is not the right answer.
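One plausible explanation for the 1991-01-25 in the first row, which nothing in this thread confirms, is that the output was produced with '%M' (minute) rather than '%m' (month) in the format string. In strptime, '%M' would consume the 11 as minutes and leave the month at its default of 1. A minimal sketch in plain Python, with no Spark involved, shows the difference; it does not try to explain the second row.

```python
from datetime import datetime

raw = "11/25/1991"

# '%m' is the month directive: 11 is read as November.
with_month = datetime.strptime(raw, "%m/%d/%Y")

# '%M' is the minute directive: 11 is read as minutes and the month
# silently falls back to its default value of 1 (January).
with_minute = datetime.strptime(raw, "%M/%d/%Y")

print(with_month.date())   # 1991-11-25
print(with_minute.date())  # 1991-01-25
```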

Any solution with a UDF is not an answer, barely a workaround. I don't think there are many use cases that you can't cover by combining PSF and .transform() itself.

Frank

The strptime() approach does not work for me. I found a cleaner solution, using cast:

from pyspark.sql.types import DateType
spark_df1 = spark_df.withColumn("record_date",spark_df['order_submitted_date'].cast(DateType()))
#below is the result
spark_df1.select('order_submitted_date','record_date').show(10,False)

+---------------------+-----------+
|order_submitted_date |record_date|
+---------------------+-----------+
|2015-08-19 12:54:16.0|2015-08-19 |
|2016-04-14 13:55:50.0|2016-04-14 |
|2013-10-11 18:23:36.0|2013-10-11 |
|2015-08-19 20:18:55.0|2015-08-19 |
|2015-08-20 12:07:40.0|2015-08-20 |
|2013-10-11 21:24:12.0|2013-10-11 |
|2013-10-11 23:29:28.0|2013-10-11 |
|2015-08-20 16:59:35.0|2015-08-20 |
|2015-08-20 17:32:03.0|2015-08-20 |
|2016-04-13 16:56:21.0|2016-04-13 |
+---------------------+-----------+

Thanks, this approach worked for me! In case someone wants to convert a string like 2008-08-01T14:45:37Z to a timestamp instead of date, df = df.withColumn("CreationDate",df['CreationDate'].cast(TimestampType())) works well... (Spark 2.2.0)

I tried this option, among many, from AWS Glue PySpark; it works like a charm!

This works if the date is already in an acceptable format (yyyy-MM-dd). In the OP's case, the dates are in MM-dd-yyyy format, so this method would return null.

Ruthger Righart

In the accepted answer's update you don't see the example for the to_date function, so another solution using it would be:

from pyspark.sql import functions as F

df = df.withColumn(
    'new_date',
    F.to_date(
        F.unix_timestamp('STRINGCOLUMN', 'MM-dd-yyyy').cast('timestamp')
    )
)

Doing a simple to_date() does not work; this is the correct answer.

Santosh kumar Manda

There may not be many answers showing this, so I am sharing my code in case it helps someone:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("Python Spark SQL basic example")\
    .config("spark.some.config.option", "some-value").getOrCreate()


df = spark.createDataFrame([('2019-06-22',)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt'))
print(df1)
df1.show()

output

DataFrame[dt: date]
+----------+
|        dt|
+----------+
|2019-06-22|
+----------+

The code above converts to a date; if you want to convert to a datetime, use to_timestamp instead. Let me know if you have any questions.

Vishwajeet Pol

Try this:

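from pyspark.sql.functions import from_unixtime, unix_timestamp

df = spark.createDataFrame([('2018-07-27 10:30:00',)], ['Date_col'])
# Keep the result of select(); DataFrames are immutable, so df itself is unchanged.
df2 = df.select(from_unixtime(unix_timestamp(df.Date_col, 'yyyy-MM-dd HH:mm:ss')).alias('dt_col'))
df2.show()
+-------------------+
|             dt_col|
+-------------------+
|2018-07-27 10:30:00|
+-------------------+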

You might consider elaborating on how your answer improves upon what's already been provided and accepted.
