在athena中为特定的日志格式创建表

bxgwgixi  于 2021-05-27  发布在  Hadoop
关注(0)|答案(1)|浏览(342)

需要帮助为以下日志格式创建架构。我使用athena来查询一堆具有这种格式的日志,但是我在为这种特定的日志格式创建表时遇到了一个问题。我尝试了几次不同的尝试,但都失败了。下面也是create table语句的最新尝试,但也不断失败。

CREATE TABLE "fastly_logs" (
  "some_num" int,
  "timestamp1" timestamp,
  "cache_value" string,
  "some_fastly" string,  
  "log_info" struct <
                request_ipv4: string,    
                request_https: int,
                request: string,
                request_host: string,
                request_uri: string,
                http_code: int,
                response: int,
                req_header_size: int,
                req_body_size: int,
                resp_header_size: int,
                resp_body_size: int,
                content_type: string,
                pop: string,
                device_type: string,
                referrer: string,
                request_user_agent: string,
                info_state: string,
                hits: int,
                tls_version: string,
                >
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '^\D(\d{3})\D (\d{4}-\d{2}-\d{2}T.*Z) (cache-\D+{3}+\d{4,10}) (Prod-fastly\D+\d{6}\D) {(.*)}'
)
LOCATION 's3://prod-fastly/'
TBLPROPERTIES ('has_encrypted_data'='true')

日志格式:

<134>2020-02-10T16:04:22Z cache-dca17767 Prod-fastly[476113]: {"timestamp":"Mon, 10 Feb 2020 16:04:22 GMT", "request_ipv4":"111.111.111.111", "request_https":"true", "tls_version":"TLSv1.2", "request":"POST", "request_host":"test.example.com", "vp_stem":"www", "request_uri":"/this/ajax-test", "http_code":200, "response":"OK", "req_header_size":1215, "req_body_size":54, "resp_header_size":916, "resp_body_size":16, "content_type":"application/json", "pop":"DCA", "device_type":"", "referrer":"https://test.example.copm/some/path", "request_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36", "info_state":"PASS", "hits":0 }
64jmpszr

64jmpszr1#

如果您知道哪些字符是可能的,我建议您不要使用\d。首先,在日期之前没有空格,所以这马上就不匹配了。应该是
<(\d+)> ^\D(\d{3})\D 你的约会看起来不错,但我不知道是什么形式 timestamp 期望 (\d{4}-\d{2}-\d{2}T.*Z) 缓存看起来很好,但是值似乎是十六进制的,所以应该使用 cache-[0-9a-fA-F]{1,13} ,例如 (cache-\D+{3}+\d{4,10}) 同样,不要使用\d。最好是去
Prod-fastly[\d+] (Prod-fastly\D+\d{6}\D) 其余的都是json对象,是的,但是您不应该尝试解析它,所以它的类型必须是字符串。只是 (.*)$ 会做的;不要用花括号 {(.*)} 我建议在字段之间留出一个以上的空间
总共 ^<(\d+)>(\d{4}-\d{2}-\d{2}T.*Z)\s+(cache-[0-9a-fA-F]{1,13})\s+(Prod-fastly\[\d+\])\s+(.*)$ 您需要创建这个原始的纯文本表,然后可以使用struct创建另一个表,并在insert/select语句期间解析其中的json对象

CREATE TABLE "fastly_logs" (
  "some_num" int,
  "timestamp1" timestamp,
  "cache_value" string,
  "some_fastly" string,  
  "remainder" string)

那就试试 SELECT json extract(remainder, "$.request_host") FROM fastly_logs;

相关问题