Including the path of each object when streaming JSON with jq


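I would like to extract specific JSON objects from huge (100 GB+) JSON files by specifying a JSON lookup. (My ultimate goal is a path-type string for the lookup, but I am more than happy to hand-process that string into a jq filter; the jq filter for streaming is the part I am stuck on.) Ideally, I would like to use only jq.
My files are formatted as follows:
File 1: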

{
   "data": {
      "<userId1>": {
         "things": {
            "<thingId1>": {
               "subfield1": "blah1",
               "subfield3": "foobar"
            },
            "<thingId2>": {
               "subfield2": "blah2"
            }
         }
      },
      "<userId2>": {
         "things": {
            "<thingId1>": {
               "subfield4": "blah3"
            },
            "<thingId2>": {
               "subfield3": "blah4"
            }
         }
      }
   },
   "users": {
      "<userId1>": {
         "name": "user1",
         "email": "[email protected]"
      },
      "<userId2>": {
         "name": "user2"
      },
      "<userId3>": {
         "email": "[email protected]"
      }
   }
}

File 2 (a simpler list of keyed objects):

{
   "<key1>": {
      "subfield1": "blah1",
      "subfield3": "foobar"
   },
   "<key2>": {
      "subfield2": "blah2"
   }
}


File 3 (a list of deeply keyed objects):

{
   "<key1>": {
      "<subKey1A>": {
         "<deeperKey1A>": {
            "subfield1": "blah1",
            "subfield3": "foobar"
         },
         "<deeperKey2A>": {
            "subfield1": "blah1",
            "subfield4": "foobar4"
         }
      },
      "<subKey2A>": {
         "<deeperKey2A>": {
            "subfield4": "foobar4"
         }
      }
   },
   "<key2>": {
      "<subKey1B>": {
         "<deeperKey2B>": {
            "subfield1": "blah1",
            "subfield4": "foobar4"
         }
      },
      "<subKey2B>": {
         "<deeperKey1B>": {
            "subfield1": "blah1",
            "subfield3": "foobar"
         },
         "<deeperKey3B>": {
            "subfield4": "foobar6"
         }
      }
   }
}


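In all of the cases above, any <blah> key is an arbitrary ID.
I would like to produce the output shown below for each file. Each "search" is the kind of lookup I want to achieve: read a $<name> path segment as "this matches anything" (using * might be clearer, but I like that the name helps identify what the component is when reading it). I have also included the specific jq filter for each case that I have already got working.
File 1, for a data/$userId/things/$thingId search: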

{ "path": ["data", "<userId1>", "things", "<thingId1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["data", "<userId1>", "things", "<thingId2>"], "value": { "subfield2": "blah2" } }
{ "path": ["data", "<userId2>", "things", "<thingId1>"], "value": { "subfield4": "blah3" } }
{ "path": ["data", "<userId2>", "things", "<thingId2>"], "value": { "subfield3": "blah4" } }


jq filter:

foreach (

  ( inputs | select(has(1) and ((.[0] | length) >= 4) and .[0][0] == "data" and .[0][2] == "things")
  | (first | [rindex("things")] | max | values) as $p
  | [.[0][$p+2:], .[0][:$p+2], .[1]]), [[]]

) as [$vpath, $path, $value] ([];

  if [first.path, $path] | unique[1]
  then [{$path}, first] else .[:1] end
  | first.value |= setpath($vpath; $value);

  .[1] | values
)


File 1, for a users/$userId search:

{ "path": ["users", "<userId1>"], "value": { "name": "user1", "email": "[email protected]" } }
{ "path": ["users", "<userId2>"], "value": { "name": "user2" } }
{ "path": ["users", "<userId3>"], "value": { "email": "[email protected]" } }


jq filter:

foreach (

  ( inputs | select(has(1) and ((.[0] | length) >= 1) and .[0][0] == "users")
  | (first | [rindex("users")] | max | values) as $p
  | [.[0][$p+2:], .[0][:$p+2], .[1]]), [[]]

) as [$vpath, $path, $value] ([];

  if [first.path, $path] | unique[1]
  then [{$path}, first] else .[:1] end
  | first.value |= setpath($vpath; $value);

  .[1] | values
)


File 2, for a $key search:

{ "path": ["<key1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key2>"], "value": { "subfield2": "blah2" } }


I stumbled upon a partial solution while tinkering with the filter above:

foreach (

  ( inputs | select(has(1) and ((.[0] | length) >= 1))
  | (first | [-1] | max | values) as $p
  | [.[0][$p+2:], .[0][:$p+2], .[1]]), [[]]

) as [$vpath, $path, $value] ([];

  if [first.path, $path] | unique[1]
  then [{$path}, first] else .[:1] end
  | first.value |= setpath($vpath; $value);

  .[1] | values
)


This almost works, but it outputs duplicate entries (as many as there are keys in the child object):

{ "path": ["<key1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key2>"], "value": { "subfield2": "blah2" } }


I really do not understand why replacing [rindex("some-key")] with [-1] produces this output!
File 3, for a $key/$subKey/$deeperKey search:

{ "path": ["<key1>", "<subKey1A>", "<deeperKey1A>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key1>", "<subKey1A>", "<deeperKey2A>"], "value": { "subfield1": "blah1", "subfield4": "foobar4" } }
{ "path": ["<key1>", "<subKey2A>", "<deeperKey2A>"], "value": { "subfield4": "foobar4" } }
{ "path": ["<key2>", "<subKey1B>", "<deeperKey2B>"], "value": { "subfield1": "blah1", "subfield4": "foobar4" } }
{ "path": ["<key2>", "<subKey2B>", "<deeperKey1B>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key2>", "<subKey2B>", "<deeperKey3B>"], "value": { "subfield4": "foobar6" } }


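I have managed to get this partly working with the filter above, but the only path it prints is the first key (<key1>), and the value is the object still nested under its parent keys ({ "<subKey1A>": { "<deeperKey1A>": { "subfield1": "blah1", "subfield3": "foobar" } } }).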

Notes

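This could be classed as a duplicate of my original question. That question received a great answer, but I still need more flexibility, and I did not want to change it (I already have once) and strip the meaning from that excellent answer.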

Answer #1 (eeq64g8w)

Let's first suppose the input JSON is small enough that we do not have to use the streaming parser.
For simplicity, let's also assume that the query is given as a JSON array, with "*" acting as a wildcard.
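Since the question starts from a path-type string such as data/$userId/things/$thingId, the conversion to this array form can be sketched in jq itself. This is only an illustration: the function name path_to_query is hypothetical, and it assumes that segments are separated by "/" and that any segment beginning with "$" means "match anything".

```shell
# Hypothetical helper (not part of the original answer):
# turn a path-type string into the wildcard array form used below.
path_to_query() {
  # $p is a jq variable here; the single quotes keep the shell from expanding it.
  jq -c -n --arg p "$1" \
    '$p | split("/") | map(if startswith("$") then "*" else . end)'
}

path_to_query 'data/$userId/things/$thingId'
# ["data","*","things","*"]
```

The resulting array can be passed to jq with --argjson, as in the command line at the end of this answer.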
Next, let's define a helper function:

# matchesQuery($array) evaluates to true iff the input is an array that matches the array $array
# in the sense that there is elementwise matching of as many elements as there are in $array,
# it being understood that "*" in $array is like a wildcard.
def matchesQuery($array):
  . as $in
  | type == "array"
    and length >= ($array|length)
    and all(range(0; $array|length); . as $i | $array[$i] | IN("*", $in[$i]));
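To make the matching rule concrete, here is a small check of matchesQuery against a few hand-written paths. The IDs u1 and t1 are made-up sample values, and the shell function name matches_demo is only for illustration. Note that a path longer than the query still matches, because only the first ($array|length) elements are compared.

```shell
matches_demo() {
  jq -n -c '
    def matchesQuery($array):
      . as $in
      | type == "array"
        and length >= ($array|length)
        and all(range(0; $array|length); . as $i | $array[$i] | IN("*", $in[$i]));

    ["data", "*", "things", "*"] as $q
    | ( ["data", "u1", "things", "t1"],               # exact length: matches
        ["data", "u1", "things", "t1", "subfield1"],  # longer path: prefix matches
        ["data", "u1", "things"],                     # too short: no match
        ["users", "u1", "name", "x"] )                # first element differs: no match
    | {path: ., matches: matchesQuery($q)}
  '
}

matches_demo
```

The prefix behaviour is what lets the streaming version below keep every leaf event that lies underneath a matching object.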

The following filter can then be used to pose a query:

def query($query):
  . as $in
  | ($query|length) as $ql
  | paths
  | select(length == $ql and matchesQuery($query)) as $path
  | ($in | getpath($path)) as $value
  | {$path, $value} ;


For example, running query( ["data", "*", "things", "*"] ) against the first file produces the desired output.
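As a quick, non-streaming check, the two definitions can be run against a cut-down version of the first file. The input here is an assumed miniature sample (IDs u1, t1, t2 and the fields a and b are invented), and run_query is just a wrapper name for this sketch.

```shell
run_query() {
  jq -c '
    def matchesQuery($array):
      . as $in
      | type == "array"
        and length >= ($array|length)
        and all(range(0; $array|length); . as $i | $array[$i] | IN("*", $in[$i]));

    def query($query):
      . as $in
      | ($query|length) as $ql
      | paths
      | select(length == $ql and matchesQuery($query)) as $path
      | ($in | getpath($path)) as $value
      | {$path, $value} ;

    query(["data", "*", "things", "*"])
  '
}

printf '%s' '{"data":{"u1":{"things":{"t1":{"a":1},"t2":{"b":2}}}},"users":{"u1":{"name":"user1"}}}' \
  | run_query
# {"path":["data","u1","things","t1"],"value":{"a":1}}
# {"path":["data","u1","things","t2"],"value":{"b":2}}
```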
Next, we can adapt the above for use with jq's streaming parser by wrapping query/1 as follows:

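# streamQuery($query) filters the event stream before rebuilding JSON:
# leaf events ([path, value]) are kept only if their path matches $query;
# closing events ([path]) pass through so that fromstream can finish.
def streamQuery($query):
  fromstream(inputs
    | if length == 2
      then select(.[0] | matchesQuery($query))
      else .
      end )
  | query($query);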
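The length == 2 test relies on the shape of the events that jq --stream emits: a leaf value arrives as [path, value], while the end of an object or array arrives as a one-element [path]. A tiny sketch shows both kinds (show_events is only a wrapper name for this example):

```shell
show_events() {
  jq -c --stream .
}

printf '%s' '{"a":{"b":1,"c":2}}' | show_events
# [["a","b"],1]
# [["a","c"],2]
# [["a","c"]]
# [["a"]]
```

The last two lines are closing events: the first closes the inner object and the second closes the top-level one, which is the event that makes fromstream emit its result.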


Finally, let's add a "main" program

streamQuery( $query )


so that we can specify the query on the command line:

jq -n --stream --argjson query '["data", "*", "things", "*"]' -c -f streamQuery.jq  file1.json
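Putting the pieces together, the following sketch writes streamQuery.jq and a miniature file1.json into a temporary directory and runs the command above. The sample data (u1, t1, t2, a, b) is invented for the demonstration; only the jq definitions come from this answer.

```shell
workdir=$(mktemp -d)

# The three definitions plus the "main" program, as described above.
cat > "$workdir/streamQuery.jq" <<'EOF'
def matchesQuery($array):
  . as $in
  | type == "array"
    and length >= ($array|length)
    and all(range(0; $array|length); . as $i | $array[$i] | IN("*", $in[$i]));

def query($query):
  . as $in
  | ($query|length) as $ql
  | paths
  | select(length == $ql and matchesQuery($query)) as $path
  | ($in | getpath($path)) as $value
  | {$path, $value} ;

def streamQuery($query):
  fromstream(inputs
    | if length == 2
      then select(.[0] | matchesQuery($query))
      else .
      end )
  | query($query);

streamQuery( $query )
EOF

# Assumed miniature stand-in for the first file.
cat > "$workdir/file1.json" <<'EOF'
{"data":{"u1":{"things":{"t1":{"a":1},"t2":{"b":2}}}},"users":{"u1":{"name":"user1"}}}
EOF

out=$(jq -n --stream --argjson query '["data", "*", "things", "*"]' -c \
        -f "$workdir/streamQuery.jq" "$workdir/file1.json")
printf '%s\n' "$out"
# {"path":["data","u1","things","t1"],"value":{"a":1}}
# {"path":["data","u1","things","t2"],"value":{"b":2}}

rm -rf "$workdir"
```

One thing to keep in mind for very large inputs: as written, fromstream emits only when the top-level closing event arrives, so all matching subtrees are held in memory together before query runs. That is fine when the matches are a small part of the file, but it is not fully incremental.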
