PySpark - unable to get quarter and week from a date column

jyztefdp asked on 2021-05-16 in Spark

I have a PySpark DataFrame that looks like this:

+--------+----------+---------+----------+-----------+--------------------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|
+--------+----------+---------+----------+-----------+--------------------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|
+--------+----------+---------+----------+-----------+--------------------+

I want to create two columns: one with the quarter and another with the week number. Here is what I did, following the documentation for weekofyear and quarter:

from pyspark.sql import functions as F
sales_table = sales_table.withColumn("week_year", F.date_format(F.to_date("date", "yyyy-mm-dd"),
                                                                F.weekofyear("d")))
sales_table = sales_table.withColumn("quarter", F.date_format(F.to_date("date", "yyyy-mm-dd"),
                                                              F.quarter("d")))
sales_table.show(10)

Here is the error:

Column is not iterable
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 945, in date_format
    return Column(sc._jvm.functions.date_format(_to_java_column(date), format))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
    args_command, temp_args = self._build_args(*args)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1260, in _build_args
    (new_args, temp_args) = self._get_args(args)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1247, in _get_args
    temp_arg = converter.convert(arg, self.gateway_client)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 510, in convert
    for element in object:
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 353, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable

How can I create and append these two columns?
Is there a better or more efficient way to create them, without having to convert the date column to yyyy-mm-dd format each time, ideally creating both columns in one command?


von4xj4u 1#

You can use the functions directly on the date string column. (Your error occurs because the second argument of date_format must be a format string, but F.weekofyear("d") returns a Column, which py4j then tries to iterate, hence "Column is not iterable".)

df = df.select(
    '*',
    F.weekofyear('date').alias('week_year'),  # ISO week number of the year
    F.quarter('date').alias('quarter')        # quarter of the year (1-4)
)
df.show()

+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|week_year|quarter|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|        3|      1|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|       35|      3|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|       34|      3|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|        9|      1|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|       48|      4|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|       16|      2|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|       10|      1|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|       43|      4|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|        4|      1|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|       36|      3|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
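
The same two columns can also be added with SQL expression strings via selectExpr; a minimal sketch, equivalent to the select above:

df = df.selectExpr(
    '*',
    'weekofyear(date) as week_year',  # same function, SQL expression form
    'quarter(date) as quarter'
)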

iszxjhcz 2#

You don't have to use date_format here. Since your date column is already in yyyy-MM-dd format, you can use weekofyear and quarter directly on the date column.
Example:

df.show()
# +----------+
# |      date|
# +----------+
# |2015-01-14|
# +----------+

from pyspark.sql import functions as F

df.withColumn("week_year", F.weekofyear(F.col("date"))) \
  .withColumn("quarter", F.quarter(F.col("date"))) \
  .show()
# +----------+---------+-------+
# |      date|week_year|quarter|
# +----------+---------+-------+
# |2015-01-14|        3|      1|
# +----------+---------+-------+
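
If the date column were not already in yyyy-MM-dd format, you could parse it with to_date first and then apply the same functions. A minimal sketch, assuming (hypothetically) strings like "14/01/2015" in dd/MM/yyyy format:

from pyspark.sql import functions as F

# Assumed input format dd/MM/yyyy; adjust the format string to your data.
df = df.withColumn("parsed_date", F.to_date(F.col("date"), "dd/MM/yyyy")) \
       .withColumn("week_year", F.weekofyear("parsed_date")) \
       .withColumn("quarter", F.quarter("parsed_date"))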
