I have a date pyspark dataframe with a string column in the format of MM-dd-yyyy
and I am attempting to convert this into a date column.
I tried:
df.select(to_date(df.STRING_COLUMN).alias('new_date')).show()
And I get a string of nulls. Can anyone help?
groupBy
or resampling operations. Just perform them on the string columns.
groupBy
but rather longitudinal studies of medical records. Therefore being able to manipulate the date is important
Update (1/10/2018):
For Spark 2.2+ the best way to do this is probably using the to_date
or to_timestamp
functions, which both support the format
argument. From the docs:
>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]
Original Answer (for Spark < 2.2)
It is possible (preferrable?) to do this without a udf:
from pyspark.sql.functions import unix_timestamp, from_unixtime
df = spark.createDataFrame(
[("11/25/1991",), ("11/24/1991",), ("11/30/1991",)],
['date_str']
)
df2 = df.select(
'date_str',
from_unixtime(unix_timestamp('date_str', 'MM/dd/yyy')).alias('date')
)
print(df2)
#DataFrame[date_str: string, date: timestamp]
df2.show(truncate=False)
#+----------+-------------------+
#|date_str |date |
#+----------+-------------------+
#|11/25/1991|1991-11-25 00:00:00|
#|11/24/1991|1991-11-24 00:00:00|
#|11/30/1991|1991-11-30 00:00:00|
#+----------+-------------------+
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType
# Creation of a dummy dataframe:
df1 = sqlContext.createDataFrame([("11/25/1991","11/24/1991","11/30/1991"),
("11/25/1391","11/24/1992","11/30/1992")], schema=['first', 'second', 'third'])
# Setting an user define function:
# This function converts the string cell into a date:
func = udf (lambda x: datetime.strptime(x, '%m/%d/%Y'), DateType())
df = df1.withColumn('test', func(col('first')))
df.show()
df.printSchema()
Here is the output:
+----------+----------+----------+----------+
| first| second| third| test|
+----------+----------+----------+----------+
|11/25/1991|11/24/1991|11/30/1991|1991-01-25|
|11/25/1391|11/24/1992|11/30/1992|1391-01-17|
+----------+----------+----------+----------+
root
|-- first: string (nullable = true)
|-- second: string (nullable = true)
|-- third: string (nullable = true)
|-- test: date (nullable = true)
udf
shouldn't be necessary here, but the built ins for handling this are atrocious. This is what I would do for now too.
The strptime() approach does not work for me. I get another cleaner solution, using cast:
from pyspark.sql.types import DateType
spark_df1 = spark_df.withColumn("record_date",spark_df['order_submitted_date'].cast(DateType()))
#below is the result
spark_df1.select('order_submitted_date','record_date').show(10,False)
+---------------------+-----------+
|order_submitted_date |record_date|
+---------------------+-----------+
|2015-08-19 12:54:16.0|2015-08-19 |
|2016-04-14 13:55:50.0|2016-04-14 |
|2013-10-11 18:23:36.0|2013-10-11 |
|2015-08-19 20:18:55.0|2015-08-19 |
|2015-08-20 12:07:40.0|2015-08-20 |
|2013-10-11 21:24:12.0|2013-10-11 |
|2013-10-11 23:29:28.0|2013-10-11 |
|2015-08-20 16:59:35.0|2015-08-20 |
|2015-08-20 17:32:03.0|2015-08-20 |
|2016-04-13 16:56:21.0|2016-04-13 |
2008-08-01T14:45:37Z
to a timestamp instead of date, df = df.withColumn("CreationDate",df['CreationDate'].cast(TimestampType()))
works well... (Spark 2.2.0)
null
using this method.
In the accepted answer's update you don't see the example for the to_date
function, so another solution using it would be:
from pyspark.sql import functions as F
df = df.withColumn(
'new_date',
F.to_date(
F.unix_timestamp('STRINGCOLUMN', 'MM-dd-yyyy').cast('timestamp')))
possibly not so many answers so thinking to share my code which can help someone
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
spark = SparkSession.builder.appName("Python Spark SQL basic example")\
.config("spark.some.config.option", "some-value").getOrCreate()
df = spark.createDataFrame([('2019-06-22',)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt'))
print df1
print df1.show()
output
DataFrame[dt: date]
+----------+
| dt|
+----------+
|2019-06-22|
+----------+
the above code to convert to date if you want to convert datetime then use to_timestamp. let me know if you have any doubt.
Try this:
df = spark.createDataFrame([('2018-07-27 10:30:00',)], ['Date_col'])
df.select(from_unixtime(unix_timestamp(df.Date_col, 'yyyy-MM-dd HH:mm:ss')).alias('dt_col'))
df.show()
+-------------------+
| Date_col|
+-------------------+
|2018-07-27 10:30:00|
+-------------------+
Success story sharing
to_date()
with the format argument is spark 2.2+.to_date
existed before 2.2, but the format option did not existdf = df.withColumn("ResultColumn", to_timestamp(col("OriginalDateCol"), 'yyyy-MM-dd HH:mm:ss'))