如何使Spark编号列关键字计数

csbfibhn  于 2021-05-17  发布在  Spark
关注(0)|答案(1)|浏览(397)

我有带时间关键字库的Dataframe。

我想在下面的条件下按关键字编号。
条件1。分为时间间隔,例如10分钟间隔(可更改)。
条件2。间隔(10min)内每个关键字的编号。
使编号列由.withcolumn()使用
怎么做??

这是我的示例代码。

class DataSetTest {
String date;
String keyword;

public DataSetTest(String date, String keyword) {
    this.date = date;
    this.keyword = keyword;
}}

DataSetTest ro1 = new DataSetTest("2020.11.26 11:30:00", "apple");
DataSetTest ro2 = new DataSetTest("2020.11.26 11:31:00", "apple");
DataSetTest ro3 = new DataSetTest("2020.11.26 11:32:00", "apple");
DataSetTest ro4 = new DataSetTest("2020.11.26 11:34:00", "banana");
DataSetTest ro5 = new DataSetTest("2020.11.26 11:35:00", "car");
DataSetTest ro6 = new DataSetTest("2020.11.26 11:37:00", "car");
DataSetTest ro7 = new DataSetTest("2020.11.26 11:39:00", "apple");
DataSetTest ro8 = new DataSetTest("2020.11.26 11:40:00", "banana");
DataSetTest ro9 = new DataSetTest("2020.11.26 11:42:00", "car");
DataSetTest ro10 = new DataSetTest("2020.11.26 11:43:00", "banana");
DataSetTest ro11= new DataSetTest("2020.11.26 11:45:00", "apple");
DataSetTest ro12 = new DataSetTest("2020.11.26 11:46:00", "apple");
DataSetTest ro13 = new DataSetTest("2020.11.26 11:50:00", "apple");
DataSetTest ro14= new DataSetTest("2020.11.26 11:51:00", "banana");
DataSetTest ro15= new DataSetTest("2020.11.26 11:52:00", "car");
DataSetTest ro16= new DataSetTest("2020.11.26 11:54:00", "car");
DataSetTest ro17= new DataSetTest("2020.11.26 11:58:00", "banana");
Dataset<Row> df = session.getSqlContext().createDataFrame(Arrays.asList(ro1, ro2, ro3, ro4, ro5, ro6, ro7, ro8, ro9, ro10, ro11, ro12, ro13, ro14, ro15, ro16, ro17), DataSetTest.class);

WindowSpec windowSpec = Window.orderBy("keyword");
dataset.withColumn("index", row_number().over(windowSpec));
zrfyljdw

zrfyljdw1#

对于基本间隔,可以使用date\u trunc函数:

SELECT time_truncated, time, keyword, ROW_NUMBER() OVER(PARTITION BY time_truncated, keyword ORDER BY time) AS RN
FROM
(
  SELECT *, date_trunc('HOUR', time) AS time_truncated
  FROM table
)t

对于“10分钟”这样的时间间隔,您需要自定义日期截断公式:

date_trunc('minute', from_unixtime(to_unix_timestamp(time) - minute(time) % 10 * 60)) as time_truncated

相关问题