hadoop: Using Flume to retrieve data from Twitter and store it in HDFS in JSON format

3lxsmp7m  posted on 2021-05-29  in  Hadoop

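I am trying to use Flume to retrieve data from Twitter and store it in HDFS in JSON format. The data is being loaded into HDFS, but not as JSON.
Here are a few lines from the HDFS file that was stored from Twitter: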

Objavro.schema\E4
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}\00\E0D\C9H\B8$\DCb,C\8A5y\D1n\CE$733267766577356800\00\96\00Zumaran \00\C6C.A.B//C.A.H
Wsp:351 220-1251
Fb:Ramiro Pedernera✌
Insta:Ramiropedernera
Snapp:ramipedernera12\00\B2\9E\00\B2(\00(DIVI^Lista RAMIRO P.\00RamiPedernera12\00(2016-05-19T17:37:13Z\00tGaray culiadaso me metió una patada en la frente ??\00\00\00\00\00\00\A8<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>\00\E0D\C9H\B8$\DCb,C\8A5y\D1n
Objavro.schema\E4

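Because this is not JSON, I cannot process it by creating a table in Hive and loading this data into it. Please help me load the Twitter data into Hadoop HDFS in JSON format.
This is the command I used: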

bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

And here is my twitter.conf:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =********
TwitterAgent.sources.Twitter.consumerSecret =*************
TwitterAgent.sources.Twitter.accessToken =****************
TwitterAgent.sources.Twitter.accessTokenSecret =*****************
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser_/twitter-cool
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = json
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sources.Twitter.handler = org.apache.flume.source.http.JSONHandler
wh6knrhe  #1

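By default, the events produced by Flume's TwitterSource are in Avro format. To change this, you would have to modify the TwitterSource source code so that it emits tweets in their raw format (JSON). Fortunately, Cloudera has already done this here: https://github.com/cloudera/cdh-twitter-example
All you have to do is follow the steps in the README to install the library for the new Twitter source, and change TwitterAgent.sources.Twitter.type in the Flume configuration file to com.cloudera.flume.source.TwitterSource. The same project contains an example configuration file.
Hope this helps.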
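To confirm which format a given output file is actually in, you can copy it out of HDFS and inspect its first bytes. The sketch below is only an illustration: the function name `classify_flume_output` is made up, and it assumes that the JSON variant is written as one JSON object per line. What is not an assumption is the Avro check: every Avro object container file starts with the four bytes `Obj` followed by 0x01, which is the `Objavro.schema` prefix visible in the dump in the question.

```python
import json

# First four bytes of every Avro object container file.
AVRO_MAGIC = b"Obj\x01"


def classify_flume_output(data: bytes) -> str:
    """Return 'avro', 'json-lines', or 'unknown' for the raw bytes of one file."""
    if data.startswith(AVRO_MAGIC):
        return "avro"
    # Assumption: JSON output holds one tweet (a JSON object) per line.
    lines = [ln for ln in data.split(b"\n") if ln.strip()]
    try:
        if lines and all(isinstance(json.loads(ln), dict) for ln in lines):
            return "json-lines"
    except ValueError:
        # Raised for both invalid JSON and invalid UTF-8.
        pass
    return "unknown"
```

For example, after fetching a file with `hdfs dfs -get`, read it with `open(path, "rb").read()` and pass the bytes to the function. A result of `avro` means the stock TwitterSource is still in use.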

nkkqxpd9  #2

To switch from Avro format to JSON format, follow these steps:
Change the source type in the configuration file from

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

to

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

`com.cloudera.flume.source.TwitterSource` is a custom class that writes the records to HDFS in JSON format.
Get this class from https://github.com/cloudera/cdh-twitter-example: download the flume-sources folder to your local machine and build a jar file from it.
To build the Flume source jar:

$  `cd flume-sources`
$  `mvn package`
$  `cd ..`

This generates a file named `flume-sources-1.0-SNAPSHOT.jar` in the target directory.
Add the jar to the Flume classpath
Copy `flume-sources-1.0-SNAPSHOT.jar` to `/usr/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar` and also to `/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar`. If these directories do not exist, create them with:

sudo mkdir -p /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/

sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
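The directory creation and the copy can also be scripted. This is a rough sketch, not part of the Cloudera project: `install_jar` and its `root` parameter are hypothetical names introduced here so the copy can be tried against a scratch directory first, and the two plugin paths are simply the ones listed in this answer.

```python
import shutil
from pathlib import Path

# Plugin directories named in this answer, relative to the filesystem root.
PLUGIN_DIRS = (
    "usr/lib/flume-ng/plugins.d/twitter-streaming/lib",
    "var/lib/flume-ng/plugins.d/twitter-streaming/lib",
)


def install_jar(jar: Path, root: Path = Path("/")) -> list:
    """Copy the jar into each plugin directory under root, creating them if missing."""
    installed = []
    for rel in PLUGIN_DIRS:
        dest_dir = root / rel
        # Same effect as `mkdir -p`: create parents, ignore existing directories.
        dest_dir.mkdir(parents=True, exist_ok=True)
        installed.append(Path(shutil.copy2(jar, dest_dir / jar.name)))
    return installed
```

Writing under `/usr/lib` and `/var/lib` normally requires root privileges, so a real run would look like `install_jar(Path("flume-sources/target/flume-sources-1.0-SNAPSHOT.jar"))` executed with sudo. The `flume-sources/target/` location is an assumption based on Maven's default output directory.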

For more information, see Analyzing Twitter Data Using CDH.
Hope this helps!
