pyspark - convert a column with 2 time formats to a common time format

nnt7mjpx asked on 2021-07-09 in Spark

The dat column holds timestamps in two different formats. I'm trying to convert these multiple string date formats into a single format.

from pyspark.sql.types import StructType, StructField, StringType

# Sample data

data1 = [("host1","cpu","2020-03-23 07:30:20"),
       ("host2","memory","1616131516"),
       ("host3","disk","2020-03-23 08:50:00"),
       ("host4","memory","1816131316"),
         ]

# Defining Schema

schema1 = StructType([
    StructField("hostname", StringType(), True),
    StructField("kpi", StringType(), True),
    StructField("dat", StringType(), True),
])

# Creating dataframe

df = spark.createDataFrame(data=data1, schema=schema1)
df.printSchema()
df.show(truncate=False)

root
 |-- hostname: string (nullable = true)
 |-- kpi: string (nullable = true)
 |-- dat: string (nullable = true)

+--------+------+-------------------+
|hostname|kpi   |dat                |
+--------+------+-------------------+
|host1   |cpu   |2020-03-23 07:30:20|
|host2   |memory|1616131516         |
|host3   |disk  |2020-03-23 08:50:00|
|host4   |memory|1816131316         |
+--------+------+-------------------+

I have code that converts only the unixtime format. I need to convert both formats in the dat column to the desired format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" in a single expression, because I'm working with streaming data.

from pyspark.sql.functions import from_unixtime

df1 = df.withColumn('datetime', from_unixtime(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
df1.show(truncate=False)
+--------+------+-------------------+------------------------+
|hostname|kpi   |dat                |datetime                |
+--------+------+-------------------+------------------------+
|host1   |cpu   |2020-03-23 07:30:20|null                    |
|host2   |memory|1616131516         |2021-03-19T05:25:16.000Z|
|host3   |disk  |2020-03-23 08:50:00|null                    |
|host4   |memory|1816131316         |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+

The DataFrame I want is:

+--------+------+-------------------+------------------------+
|hostname|kpi   |dat                |datetime                |
+--------+------+-------------------+------------------------+
|host1   |cpu   |2020-03-23 07:30:20|2020-03-23T07:30:20.000Z|
|host2   |memory|1616131516         |2021-03-19T05:25:16.000Z|
|host3   |disk  |2020-03-23 08:50:00|2020-03-23T08:50:00.000Z|
|host4   |memory|1816131316         |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+

7gs2gvoe 1#

You can use date_format to convert the other "standard" date format into the desired format, and coalesce to combine it with your existing from_unixtime conversion.

import pyspark.sql.functions as F

df1 = df.withColumn(
    'datetime',
    F.coalesce(
        # parses epoch-seconds strings; returns null for 'yyyy-MM-dd HH:mm:ss' values
        F.from_unixtime(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),
        # formats 'yyyy-MM-dd HH:mm:ss' values; returns null for epoch strings
        F.date_format(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    )
)

df1.show(truncate=False)
+--------+------+-------------------+------------------------+
|hostname|kpi   |dat                |datetime                |
+--------+------+-------------------+------------------------+
|host1   |cpu   |2020-03-23 07:30:20|2020-03-23T07:30:20.000Z|
|host2   |memory|1616131516         |2021-03-19T05:25:16.000Z|
|host3   |disk  |2020-03-23 08:50:00|2020-03-23T08:50:00.000Z|
|host4   |memory|1816131316         |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+
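
If you would rather branch explicitly than rely on coalesce's null fallback, here is a minimal alternative sketch (not part of the accepted answer; it assumes every epoch value is a digit-only string):

import pyspark.sql.functions as F

fmt = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"  # target format

df2 = df.withColumn(
    'datetime',
    # digit-only strings are treated as epoch seconds (assumed convention),
    # everything else as a 'yyyy-MM-dd HH:mm:ss' timestamp string
    F.when(F.col('dat').rlike(r'^\d+$'), F.from_unixtime('dat', fmt))
     .otherwise(F.date_format('dat', fmt))
)

Both versions are a single withColumn expression, so either should also work on a streaming DataFrame.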
