hive SQL:查询具有年/月/日分区的时间段的数据

6ovsh4lw  于 8个月前  发布在  Hive
关注(0)|答案(3)|浏览(92)

我有一个AWS S3数据湖的parquet文件结构如下:

s3://bucket/device/table_x/year=2000/month=01/day=02/xyz.parquet

我的目标是使用AWS Athena来查询数据,以便在Grafana Jmeter 板中显示。我面临的挑战是,为了创建任何时间段的动态面板,同时也利用我的分区,我需要找到一种方法来将我的数据限制在我的WHERE部分中的相关时间段内-但要以一种跨年,月和日的方式来实现-而不必根据查询构建SQL语句。
我现在最好的建议是在查询下面,这应该可以工作-但它很复杂。是否有关于这种声明的建议最佳做法?

SELECT
    Count(a1) as AVG_a1                 
FROM
    tbl_11111111_a
WHERE
    (
        -- Same year, same month
        (year = 'START_YEAR' AND month = 'START_MONTH' AND day BETWEEN 'START_DAY' AND 'END_DAY')
        OR
        -- Same year, different months
        (year = 'START_YEAR' AND month = 'START_MONTH' AND day >= 'START_DAY')
        OR
        (year = 'START_YEAR' AND month > 'START_MONTH' AND month < 'END_MONTH' AND day BETWEEN '01' AND '31')
        OR
        (year = 'START_YEAR' AND month = 'END_MONTH' AND day <= 'END_DAY')
        OR
        -- Different years
        (year > 'START_YEAR' AND year < 'END_YEAR')
        OR
        (year = 'END_YEAR' AND month < 'END_MONTH' AND day BETWEEN '01' AND '31')
        OR
        (year = 'END_YEAR' AND month = 'END_MONTH' AND day <= 'END_DAY')
    )
    AND
    t BETWEEN TIMESTAMP 'START_YEAR-START_MONTH-START_DAY 00:00:00' AND TIMESTAMP 'END_YEAR-END_MONTH-END_DAY 00:00:00'
mv1qrgav

mv1qrgav1#

我最终使用了this AWS examplethis blog article中描述的方法。
具体来说,我按照最初列出的那样设置我的S3结构,除了我删除了hive符号(这可以保留,但我相信在这种情况下投影日期格式会变得更加混乱):

s3://bucket/device/table_x/2000/01/02/xyz.parquet

有了这个,我设置我的表属性使用分区投影与以下设置:

"projection.enabled" = "true",
 "projection.date_created.type" = "date",
 "projection.date_created.format" = "yyyy/MM/dd",
 "projection.date_created.range" = "2000/01/01,NOW",
 "projection.date_created.interval" = "1",
 "projection.date_created.interval.unit" = "DAYS",
 "storage.location.template" = "s3://bucket/device/table_x/${date_created}/"

在我的Table Schema中,我将JSON更新为以下内容:

[
  {
    "Name": "t",
    "Type": "timestamp",
    "Comment": ""
  },
  {
    "Name": "a1",
    "Type": "double",
    "Comment": ""
  },
  {
    "Name": "a2",
    "Type": "double",
    "Comment": ""
  },
  {
    "Name": "date_created",
    "Type": "string",
    "Comment": "",
    "PartitionKey": "Partition (0)"
  }
]

在进行了这些更新之后,我现在可以使用pilcrow提出的更简单的分区结构来查询数据:

SELECT
    Count(a1) as AVG_a1                 
FROM
    tbl_11111111_a
WHERE
    date_created BETWEEN '2000/01/01' AND '2000/01/02'

这里的重点是,我能够保留一个子文件夹嵌套的日期结构(我更喜欢用于非Athena数据库的目的-但仍然保留简单的SQL查询WHERE语句用于我的Athena查询,同时使用有效的分区投影。
还请注意,我相信pilcrow在分析我原来的年/月/日SQL查询效率低下方面是正确的,因为我发现当切换到与单日5分钟间隔相关的查询的新结构时,速度提高了66%。

iszxjhcz

iszxjhcz2#

我有两个选择。
首先,理想情况下,repartition使用单个“YYYYMMDD”属性分区,而不是嵌套。对于更自然的查询,这是一个很好的实践,例如,“... WHERE ymd_partition BETWEEN 'START_YMD' AND 'END_YMD'“,没有所有那些布尔扭曲。
否则,继续使用当前的方法,它有一个可行的想法,但为了效率而收紧它。
我们希望根据输入范围应用三种情景(同一年/月、同一年、不同年)中的一种且仅一种;然而,正如所写,我们根据所有记录来评估每个场景的每个条件,而不管输入范围如何。(例如,[2023-01-31, 2023-02-1]的输入范围将检查 * 从1月起的每个parquet文件 *,因为倒数第二个OR条件,因此意味着我们依赖t BETWEEN...条件作为昂贵的支持。
解决方案不会是漂亮的。

WHERE

-- apply only when input range has same year, same month
-- 
(start_yr = end_yr AND start_mo = end_mo
  AND "year" = start_yr
  AND "month" = start_mo
  AND "day" BETWEEN start_day AND end_day)

OR
-- apply only when input range same year, different months
--
(start_yr = end_yr AND start_mo != end_mo
  AND "year" = start_yr
  AND ( ("month" = start_mo AND "day" >= start_day)
          OR
        -- "day BETWEEN 1 AND 31" is superfluous here
        ("month" > start_mo AND "month" < end_mo)
          OR
        ("month" = end_mo AND "day" <= end_day) ) )

OR
-- apply only when input range has different years
--
(start_yr != end_yr
  AND
( ("year" = start_yr AND ( ("month" = start_mo AND "day" >= start_day)
                              OR
                           ("month" > start_mo) ) )
     OR
  ("year" > start_yr AND "year" < end_yr)
     OR
  ("year" = end_yr AND ( ("month" < end_mo )
                             OR
                         ("month" = end_mo AND "day" <= end_day) ) ) ) )

我可能漏掉了一个括号。
如果需要,可以重新添加t BETWEEN逻辑。

1mrurvl1

1mrurvl13#

我现在最好的建议是在查询下面,这应该可以工作-但它很复杂。是否有关于这种声明的建议最佳做法?
我认为您还可以使用CASE表达式将数据过滤到START_YEARSTART_MONTHSTART_DAYEND_YEAREND_MONTHEND_DAY参数指定的时间段。

SELECT
    Count(a1) as AVG_a1                 
FROM
    tbl_11111111_a
WHERE
    CASE
        WHEN year = START_YEAR AND month = START_MONTH AND day BETWEEN START_DAY AND END_DAY THEN 1
        WHEN year = START_YEAR AND month = START_MONTH AND day >= START_DAY THEN 1
        WHEN year = START_YEAR AND month > START_MONTH AND month < END_MONTH AND day BETWEEN '01' AND '31' THEN 1
        WHEN year = START_YEAR AND month = END_MONTH AND day <= END_DAY THEN 1
        WHEN year > START_YEAR AND year < END_YEAR THEN 1
        WHEN year = END_YEAR AND month < END_MONTH AND day BETWEEN '01' AND '31' THEN 1
        WHEN year = END_YEAR AND month = END_MONTH AND day <= END_DAY THEN 1
        ELSE 0
    END = 1
    AND
    t BETWEEN TIMESTAMP CONCAT(START_YEAR, '-', START_MONTH, '-', START_DAY, ' 00:00:00') AND TIMESTAMP CONCAT(END_YEAR, '-', END_MONTH, '-', END_DAY, ' 00:00:00')

相关问题