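I want to extract specific JSON objects from a huge (100 GB+) JSON file by specifying a JSON lookup. (My end goal is a path-style string for the lookup, but I'm happy to turn that string into a jq filter myself; it's the jq filter for streaming that I'm stuck on.) Ideally I'd like to use only jq.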
My files are formatted as follows:
File 1:
{
"data": {
"<userId1>": {
"things": {
"<thingId1>": {
"subfield1": "blah1",
"subfield3": "foobar"
},
"<thingId2>": {
"subfield2": "blah2"
}
}
},
"<userId2>": {
"things": {
"<thingId1>": {
"subfield4": "blah3"
},
"<thingId2>": {
"subfield3": "blah4"
}
}
}
},
"users": {
"<userId1>": {
"name": "user1",
"email": "[email protected]"
},
"<userId2>": {
"name": "user2"
},
"<userId3>": {
"email": "[email protected]"
}
}
}
File 2 (a simpler list of keyed objects):
{
"<key1>": {
"subfield1": "blah1",
"subfield3": "foobar"
},
"<key2>": {
"subfield2": "blah2"
}
}
File 3 (a list of deeply keyed objects):
{
"<key1>": {
"<subKey1A>": {
"<deeperKey1A>": {
"subfield1": "blah1",
"subfield3": "foobar"
},
"<deeperKey2A>": {
"subfield1": "blah1",
"subfield4": "foobar4"
}
},
"<subKey2A>": {
"<deeperKey2A>": {
"subfield4": "foobar4"
}
}
},
"<key2>": {
"<subKey1B>": {
"<deeperKey2B>": {
"subfield1": "blah1",
"subfield4": "foobar4"
}
},
"<subKey2B>": {
"<deeperKey1B>": {
"subfield1": "blah1",
"subfield3": "foobar"
},
"<deeperKey3B>": {
"subfield4": "foobar6"
}
}
}
}
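In all of the cases above, any <blah> key is an arbitrary ID.
I want to output the following for each file. The "search" is the kind of lookup I want to implement; read a $<> path segment as "this matches anything" (maybe * would be clearer, but I like that $ helps identify what is a jq component when reading it). I've also included the specific jq filter I've got working for each case.
File 1, for the data/$userId/things/$thingId search: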
{ "path": ["data", "<userId1>", "things", "<thingId1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["data", "<userId1>", "things", "<thingId2>"], "value": { "subfield2": "blah2" } }
{ "path": ["data", "<userId2>", "things", "<thingId1>"], "value": { "subfield4": "blah3" } }
{ "path": ["data", "<userId2>", "things", "<thingId2>"], "value": { "subfield3": "blah4" } }
jq filter:
foreach (
( inputs | select(has(1) and ((.[0] | length) >= 4) and .[0][0] == "data" and .[0][2] == "things")
| (first | [rindex("things")] | max | values) as $p
| [.[0][$p+2:], .[0][:$p+2], .[1]]), [[]]
) as [$vpath, $path, $value] ([];
if [first.path, $path] | unique[1]
then [{$path}, first] else .[:1] end
| first.value |= setpath($vpath; $value);
.[1] | values
)
File 1, for the users/$userId search:
{ "path": ["users", "<userId1>"], "value": { "name": "user1", "email": "[email protected]" } }
{ "path": ["users", "<userId2>"], "value": { "name": "user2" } }
{ "path": ["users", "<userId3>"], "value": { "email": "[email protected]" } }
jq filter:
foreach (
( inputs | select(has(1) and ((.[0] | length) >= 1) and .[0][0] == "users")
| (first | [rindex("users")] | max | values) as $p
| [.[0][$p+2:], .[0][:$p+2], .[1]]), [[]]
) as [$vpath, $path, $value] ([];
if [first.path, $path] | unique[1]
then [{$path}, first] else .[:1] end
| first.value |= setpath($vpath; $value);
.[1] | values
)
File 2, for the $key search:
{ "path": ["<key1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key2>"], "value": { "subfield2": "blah2" } }
I stumbled upon a partial solution while blindly tweaking the filters above:
foreach (
( inputs | select(has(1) and ((.[0] | length) >= 1))
| (first | [-1] | max | values) as $p
| [.[0][$p+2:], .[0][:$p+2], .[1]]), [[]]
) as [$vpath, $path, $value] ([];
if [first.path, $path] | unique[1]
then [{$path}, first] else .[:1] end
| first.value |= setpath($vpath; $value);
.[1] | values
)
This almost works, but it outputs duplicate entries (as many as there are keys in the child object):
{ "path": ["<key1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key1>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key2>"], "value": { "subfield2": "blah2" } }
I really don't understand why replacing [rindex("some-key")] with [-1] produces this output!
File 3, for the $key/$subKey/$deeperKey search:
{ "path": ["<key1>", "<subKey1A>", "<deeperKey1A>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key1>", "<subKey1A>", "<deeperKey2A>"], "value": { "subfield1": "blah1", "subfield4": "foobar4" } }
{ "path": ["<key1>", "<subKey2A>", "<deeperKey2A>"], "value": { "subfield4": "foobar4" } }
{ "path": ["<key2>", "<subKey1B>", "<deeperKey2B>"], "value": { "subfield1": "blah1", "subfield4": "foobar4" } }
{ "path": ["<key2>", "<subKey2B>", "<deeperKey1B>"], "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": ["<key2>", "<subKey2B>", "<deeperKey3B>"], "value": { "subfield4": "foobar6" } }
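I've managed to get this working again with the filter above, but the only path I can get printed is the first key (<key1>), and value is the object nested under its parent path ({ "<subKey1A>": { "<deeperKey1A>": { "subfield1": "blah1", "subfield3": "foobar" } } }).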
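Note
This could be classed as a duplicate of my original question. That one got a great answer, but I still need more flexibility, and I don't want to change that question (I already have once) and strip the meaning from an excellent answer.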
1 Answer
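Let's first assume that the input JSON is small enough that we don't have to use the streaming parser.
For simplicity, let's also assume that the query is expressed as a JSON array, with "*" as the wildcard.
Next, let's define a helper function:
The following filter can then be used to pose a query: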
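The helper and the filter referred to above are not reproduced here. As a stand-in, the following is a minimal sketch of how they might look, not the answer's original code. The helper name `matches` is hypothetical, `query/1` is assumed to emit `{path, value}` objects as in the question, and the sample input is a cut-down version of the "users" part of File 1.

```shell
#!/bin/sh
# Sketch only: "matches" is a hypothetical helper name.
program='
# True if the input path has the same length as $q and agrees with it
# at every position where $q is not the wildcard "*".
def matches($q):
  . as $p
  | ($p | length) == ($q | length)
    and all(range(0; $q | length); $q[.] == "*" or $q[.] == $p[.]);

# Emit {path, value} for every path in the input that matches $q.
def query($q):
  paths as $p
  | select($p | matches($q))
  | {path: $p, value: getpath($p)};

query(["users", "*"])
'

# Non-streaming run: the whole document is parsed into memory first.
result=$(jq -c "$program" <<'EOF'
{"users": {"u1": {"name": "user1"}, "u2": {"name": "user2"}}}
EOF
)
printf '%s\n' "$result"
```

One limitation of this sketch is that a key that is literally "*" cannot be told apart from the wildcard.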
For example, running
query( ["data", "*", "things", "*"] )
on the first file produces the desired output. Next, we can adapt the above for use with jq's streaming parser by wrapping query/1 as follows:
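The streaming wrapper itself is not reproduced here. One possible way to do it, offered as a sketch under assumptions rather than the answer's own code, is to keep only the events whose path prefix matches the query, fold that prefix into a single JSON-encoded key, and let `fromstream` rebuild one object at a time. The names `matches` and `stream_query` are hypothetical, and the sample input is a small stand-in for File 1.

```shell
#!/bin/sh
# Sketch only: "matches" and "stream_query" are hypothetical names.
program='
def matches($q):
  . as $p
  | ($p | length) == ($q | length)
    and all(range(0; $q | length); $q[.] == "*" or $q[.] == $p[.]);

# Reads streaming events from inputs, so jq must be run with -n --stream.
def stream_query($q):
  ($q | length) as $n
  | fromstream(
      inputs
      # keep events strictly below a path whose first $n keys match $q
      | select((.[0] | length) > $n and (.[0][:$n] | matches($q)))
      # fold the matched prefix into one JSON-encoded key
      | (.[0] |= ([.[:$n] | tojson] + .[$n:]))
      # after the closing event of a matched object, add a synthetic
      # top-level closing event so that fromstream emits it
      | ., (select(length == 1 and (.[0] | length) == 2) | [.[0][:1]])
    )
  | to_entries[]
  | {path: (.key | fromjson), value: .value};

stream_query(["data", "*", "things", "*"])
'

result=$(jq -n -c --stream "$program" <<'EOF'
{"data": {"u1": {"things": {"t1": {"a": 1, "b": {"c": 2}}, "t2": {"d": 3}}}}, "users": {"u1": {"name": "x"}}}
EOF
)
printf '%s\n' "$result"
```

Only one matched object is held in memory at a time. In this sketch, scalars and empty containers sitting exactly at a matching path are skipped, because their events carry no deeper path.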
Finally, let's add a "main" program
so that we can specify the query on the command line:
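The main program and its invocation are not reproduced here, so the following is only a sketch of one way to pass the query on the command line. It assumes the query arrives as a path string via `--arg query`, and that any segment starting with `$` (the notation used in the question) is treated as the "*" wildcard. The names `matches`, `stream_query` and `parse_query` are hypothetical.

```shell
#!/bin/sh
# Sketch only: all three function names are hypothetical.
main='
def matches($q):
  . as $p
  | ($p | length) == ($q | length)
    and all(range(0; $q | length); $q[.] == "*" or $q[.] == $p[.]);

def stream_query($q):
  ($q | length) as $n
  | fromstream(
      inputs
      | select((.[0] | length) > $n and (.[0][:$n] | matches($q)))
      | (.[0] |= ([.[:$n] | tojson] + .[$n:]))
      | ., (select(length == 1 and (.[0] | length) == 2) | [.[0][:1]])
    )
  | to_entries[]
  | {path: (.key | fromjson), value: .value};

# Turn a string such as users/$userId into ["users", "*"].
def parse_query:
  split("/") | map(if startswith("$") then "*" else . end);

stream_query($query | parse_query)
'

# The query string is single-quoted so the shell does not expand $userId.
result=$(jq -n -c --stream --arg query 'users/$userId' "$main" <<'EOF'
{"data": {}, "users": {"u1": {"name": "user1"}, "u3": {"email": "e"}}}
EOF
)
printf '%s\n' "$result"
```

For a real file, the here-document would be replaced by the file name as the last argument to jq.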