Shell script takes a long time to add index lines to JSON files and push them to Elasticsearch

i86rm4rw asked 8 months ago in ElasticSearch

The shell script takes a very long time to add the index action lines to the JSON files.
We have a large number of JSON files (containing all the required data) to which we add bulk-index action lines before pushing them to Elasticsearch. Each file has 200k records. Adding the action lines to one file takes about 30 minutes, while pushing it to Elasticsearch takes about 20 seconds, which is fine. The 30-minute indexing step is far too slow, because we have about 5k files (roughly 80 GB of data) to process.
We are also not shell-scripting experts.
Can anyone help make this indexing logic faster? 30 minutes per file seems very long.
We read each line of the JSON file and emit an index action line for it:

"{\"index\":{\"_id\":$guid,\"_index\" : \"demoindex\", \"_type\" : \"usage\"}}\n$row\n";

Here is the code:

#!/bin/bash

FEED_DIR="/apps/elasticsearch/demo"
FEED_TMP_DIR="/apps/elasticsearch/demo/temp"
FEED_ARCHIVE="/apps/elasticsearch/demo/archive"
FEEDER_LOG="/apps/log/feeder/demofeeder.log"

SCRPT_HOME="/apps/bin/elasticsearch"

LOCKFILE=$SCRPT_HOME/system/demofeeder.lock

# Skip if another version of Feeder2 is still executing
if test -f "$LOCKFILE"; then
    echo "demofeeder.sh is still running."
    echo "$LOCKFILE exists."
    echo "Exiting !!"
    exit 1;
fi

echo "Creating lock file: $LOCKFILE" > $FEEDER_LOG
touch $LOCKFILE

echo -e "\n Setting Replica to 0 before indexing!\n" >> $FEEDER_LOG
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/demoindex/_settings' -d '{ "number_of_replicas" : 0 } }'

echo "Starting demoindex Feeding Data at "$(date -u)"." >> $FEEDER_LOG
echo "Starting demoindex Feeding Data at "$(date -u)"."

echo "Clearing Temp directory" >> $FEEDER_LOG
rm -f $FEED_TMP_DIR/*.json

cd "$FEED_DIR" || exit 1
for feed in *.json
do
  echo "Parsing $feed"

  # This per-line loop is the bottleneck: it forks an echo and an awk
  # process for every record (400k+ forks for a 200k-record file).
  while IFS= read -r line; do
    guid=$(echo "$line" | awk -F '"' '{print $4}')
    echo "{\"index\": {\"_index\": \"demoindex\", \"_id\": \"$guid\", \"_type\": \"usage\"}}" >> "$FEED_TMP_DIR/$feed"
    echo "$line" >> "$FEED_TMP_DIR/$feed"
  done < "$feed"

  echo "Loading parsed files into elasticsearch: $FEED_TMP_DIR/$feed"
  curl -s -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/demoindex/usage/_bulk?' --data-binary "@$FEED_TMP_DIR/$feed" >> $FEEDER_LOG

  echo "Deleting parsed json: "$FEED_TMP_DIR/$feed" (not enabled)"
  #rm "$FEED_TMP_DIR/$feed"

  echo "Moving $feed to Archive Folder (not enabled)"
  #mv $FEED_DIR/$feed $FEED_ARCHIVE/$feed
done
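
# A possible addition, not in the original script: the replica count was
# set to 0 before indexing but is never restored. If the index normally
# runs with replicas, a matching call here would undo the temporary
# setting ("1" is an assumed normal value).
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/demoindex/_settings' -d '{ "number_of_replicas" : 1 }'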

echo "Data Feeding demoindex Completed at "$(date -u)"." >> $FEEDER_LOG
echo "Removing lock file: $LOCKFILE" >> $FEEDER_LOG
rm -f $LOCKFILE
echo "Data Feeding demoindex Completed at "$(date -u)"."

ffx8fchx1#

AFAIU from your code, for each line of every file you take a value from the fourth column, build a string from it, and then print that new string followed by the original line, right? That can be done with a simple awk:

$ ls *.txt
file1.txt  file2.txt  file3.txt  file4.txt

$ cat *.txt
1 2 3 v1 5 6 7 8 9 10
1 2 3 v2 5 6 7 8 9 10
1 2 3 v3 5 6 7 8 9 10
1 2 3 v4 5 6 7 8 9 10

$ awk '{print "some text", $4; print}' *.txt
some text v1
1 2 3 v1 5 6 7 8 9 10
some text v2
1 2 3 v2 5 6 7 8 9 10
some text v3
1 2 3 v3 5 6 7 8 9 10
some text v4
1 2 3 v4 5 6 7 8 9 10

A file with 200k lines:

$ wc -l f1
200000 f1

$ time awk '{print "some text", $4; print}' f1
...
some text v1
1 2 3 v1 5 6 7 8 9 10
some text v1
1 2 3 v1 5 6 7 8 9 10

real    0m1,345s
user    0m0,242s
sys     0m0,680s
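
Applied to the script in the question, the whole while-read loop (and its two process forks per record) could be replaced by a single awk invocation per file — a sketch, assuming the GUID is still the fourth double-quote-delimited field, exactly as in the original awk call:

awk -F'"' '{
  printf "{\"index\":{\"_index\":\"demoindex\",\"_id\":\"%s\",\"_type\":\"usage\"}}\n%s\n", $4, $0
}' "$feed" > "$FEED_TMP_DIR/$feed"

This rewrites the file in one pass with one process, so based on the timing above the 30-minute step should drop to seconds.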
