Get last Monday in Spark

Posted by tyky79it on 2021-05-27 in Spark


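I am using Spark 2.0 with the Python API.
I have a DataFrame with a column of type DateType(). I would like to add a column to the DataFrame containing the most recent Monday.
I can do it like this: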

import pyspark.sql.functions
import pyspark.sql.types

reg_schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField('AccountCreationDate', pyspark.sql.types.DateType(), True),
    pyspark.sql.types.StructField('UserId', pyspark.sql.types.LongType(), True)
])
reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)
reg = reg.withColumn('monday',
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate,'E') == 'Mon',
        reg.AccountCreationDate).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate,'E') == 'Tue',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 1)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Wed',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 2)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Thu',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 3)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Fri',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 4)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Sat',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 5)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Sun',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 6))
        )))))))

However, this seems like a lot of code for something that should be fairly simple. Is there a more concise way?

python apache-spark pyspark apache-spark-sql pyspark-sql

Source: https://stackoverflow.com/questions/40271814/get-last-monday-in-spark

2 Answers


nr9pn0ug1#

I found that PySpark's trunc function works as well.

import datetime

import pyspark.sql.functions as f

df = spark.createDataFrame([
    (datetime.date(2020, 10, 27), ),
    (datetime.date(2020, 12, 21), ),
    (datetime.date(2020, 10, 13), ),
    (datetime.date(2020, 11, 11), ),
], ["date_col"])
df = df.withColumn("first_day_of_week", f.trunc("date_col", "week"))
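
To sanity-check the values expected in first_day_of_week without a Spark session, here is a minimal plain-Python sketch. It assumes that truncating to the week means "the Monday on or before the date", and the helper name monday_on_or_before is hypothetical (it is not part of PySpark).

```python
import datetime


def monday_on_or_before(d: datetime.date) -> datetime.date:
    # date.weekday() is 0 for Monday through 6 for Sunday, so subtracting
    # that many days lands on the Monday of the same week.
    return d - datetime.timedelta(days=d.weekday())


# Same sample dates as in the DataFrame above.
sample_dates = [
    datetime.date(2020, 10, 27),
    datetime.date(2020, 12, 21),
    datetime.date(2020, 10, 13),
    datetime.date(2020, 11, 11),
]
expected_mondays = [monday_on_or_before(d) for d in sample_dates]

for d, monday in zip(sample_dates, expected_mondays):
    print(d, "->", monday)
```

Under that assumption, 2020-12-21 (already a Monday) maps to itself, while the other three dates map back to the Monday earlier in the same week.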

Posted 2021-05-27

8qgya5xd2#

You can find the next matching day with next_day and then subtract a week. The required functions can be imported as follows:

from pyspark.sql.functions import next_day, date_sub

and the helper defined as:

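def previous_day(date, dayOfWeek):
    # Use the dayOfWeek argument rather than a hard-coded "monday".
    # next_day gives the first matching day after `date`; going back
    # 7 days gives the matching day on or before `date`.
    return date_sub(next_day(date, dayOfWeek), 7)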

Finally, an example:

from pyspark.sql.functions import to_date

df = sc.parallelize([
    ("2016-10-26", )
]).toDF(["date"]).withColumn("date", to_date("date"))

df.withColumn("last_monday", previous_day("date", "monday")).show()

with the result:

+----------+-----------+
|      date|last_monday|
+----------+-----------+
|2016-10-26| 2016-10-24|
+----------+-----------+
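
One detail worth checking is what happens when the input date is itself a Monday. The sketch below emulates the next_day-then-subtract-7 logic in plain Python, assuming next_day returns the first matching date strictly after the input. The names next_day_py and previous_day_py are hypothetical stand-ins, not PySpark functions.

```python
import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def next_day_py(d: datetime.date, day_of_week: str) -> datetime.date:
    # First date strictly after d that falls on day_of_week
    # (assumed to mirror Spark's next_day semantics).
    target = WEEKDAYS.index(day_of_week.lower())
    delta = (target - d.weekday() - 1) % 7 + 1  # always between 1 and 7
    return d + datetime.timedelta(days=delta)


def previous_day_py(d: datetime.date, day_of_week: str) -> datetime.date:
    # Going back one week from the next match gives the match on or before d.
    return next_day_py(d, day_of_week) - datetime.timedelta(days=7)


wednesday_result = previous_day_py(datetime.date(2016, 10, 26), "monday")
monday_result = previous_day_py(datetime.date(2016, 10, 24), "monday")
print(wednesday_result, monday_result)
```

Under that assumption, a Wednesday maps back to the Monday of the same week and a Monday maps to itself, which matches the behavior of the long when/otherwise chain in the question.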

Posted 2021-05-27
