Twitter stream delivers multiple tweets with the same id

tyg4sfes  posted on 2021-06-02 in Hadoop

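I use this pipeline to collect tweets. I tried to analyse the collected tweets with some scripts of my own and found that I had received multiple tweets with the same id. I looked into hdfs://user/flume/tweets and saw that these duplicate tweets are already in the stored files, so this is not a Hive or Oozie problem.
It might be a Flume problem. I changed a few of the Flume parameters: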

# GitHub default: 1000
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
# GitHub default: 10000
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100000

TwitterAgent.channels.MemChannel.type = memory
# GitHub default: 10000
TwitterAgent.channels.MemChannel.capacity = 100000
# GitHub default: 100
TwitterAgent.channels.MemChannel.transactionCapacity = 10000

Or does Twitter itself publish these tweets more than once, so that this is not a Hadoop problem at all?
Update 1
Here is my Flume configuration:

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
 TwitterAgent.sources.Twitter.channels = MemChannel
 TwitterAgent.sources.Twitter.consumerKey = MyKey
 TwitterAgent.sources.Twitter.consumerSecret = MyKey
 TwitterAgent.sources.Twitter.accessToken = MyKey
 TwitterAgent.sources.Twitter.accessTokenSecret = MyKey
 TwitterAgent.sources.Twitter.keywords = hadoop, big-data , big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

 TwitterAgent.sinks.HDFS.channel = MemChannel
 TwitterAgent.sinks.HDFS.type = hdfs
 TwitterAgent.sinks.HDFS.hdfs.path = hdfs://rh-hadoop-master:8020/user/flume/tweets/%Y/%m/%d/%H/
 TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
 TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
 TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
 TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
 TwitterAgent.sinks.HDFS.hdfs.rollCount = 100000

 TwitterAgent.channels.MemChannel.type = memory
 TwitterAgent.channels.MemChannel.capacity = 100000
 TwitterAgent.channels.MemChannel.transactionCapacity = 10000

Here is an example of the duplicated lines:

{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":539321584226680833,"in_reply_to_user_id_str":null,"timestamp_ms":"1417419260447","in_reply_to_status_id":null,"created_at":"Mon Dec 01 07:34:20 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"Testing Engineer, Hyderabad / Secunderabad, 2 - 5 Year Exp,Software Test Engineer , &amp;#x22;Big Data&amp;#x22;... http://t.co/DAK1ilWhM5","contributors":null,"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"http://bit.ly/1ttBxPY","indices":[116,138],"display_url":"bit.ly/1ttBxPY","url":"http://t.co/DAK1ilWhM5"}],"hashtags":[{"text":"x22","indices":[89,93]},{"text":"x22","indices":[107,111]}],"user_mentions":[]},"source":"<a href=\"http://monsterindia.com\" rel=\"nofollow\">IT jobs, India<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"539321584226680833","user":{"location":"India","default_profile":false,"profile_background_tile":false,"statuses_count":63546,"lang":"en","profile_link_color":"0084B4","id":123537533,"following":null,"protected":false,"favourites_count":0,"profile_text_color":"333333","verified":false,"description":"Get latest job opportunities in Indian IT industry","contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name":"IT Jobs, India","profile_background_color":"C0DEED","created_at":"Tue Mar 16 11:48:44 +0000 
2010","default_profile_image":false,"followers_count":1245,"profile_image_url_https":"https://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":0,"profile_sidebar_fill_color":"DDEEF6","screen_name":"tech_career","id_str":"123537533","profile_image_url":"http://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","listed_count":43,"is_translator":false}}
{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":539321584226680833,"in_reply_to_user_id_str":null,"timestamp_ms":"1417419260447","in_reply_to_status_id":null,"created_at":"Mon Dec 01 07:34:20 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"Testing Engineer, Hyderabad / Secunderabad, 2 - 5 Year Exp,Software Test Engineer , &amp;#x22;Big Data&amp;#x22;... http://t.co/DAK1ilWhM5","contributors":null,"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"http://bit.ly/1ttBxPY","indices":[116,138],"display_url":"bit.ly/1ttBxPY","url":"http://t.co/DAK1ilWhM5"}],"hashtags":[{"text":"x22","indices":[89,93]},{"text":"x22","indices":[107,111]}],"user_mentions":[]},"source":"<a href=\"http://monsterindia.com\" rel=\"nofollow\">IT jobs, India<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"539321584226680833","user":{"location":"India","default_profile":false,"profile_background_tile":false,"statuses_count":63546,"lang":"en","profile_link_color":"0084B4","id":123537533,"following":null,"protected":false,"favourites_count":0,"profile_text_color":"333333","verified":false,"description":"Get latest job opportunities in Indian IT industry","contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name":"IT Jobs, India","profile_background_color":"C0DEED","created_at":"Tue Mar 16 11:48:44 +0000 
2010","default_profile_image":false,"followers_count":1245,"profile_image_url_https":"https://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":0,"profile_sidebar_fill_color":"DDEEF6","screen_name":"tech_career","id_str":"123537533","profile_image_url":"http://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","listed_count":43,"is_translator":false}}
mw3dktmi  #1

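Flume does not add any kind of id to the data it is going to store. The same goes for HDFS: it does not add an id when storing the data. The two simply work together to move the generated data and store it.
If you have tweets stored with the same id, it is because you are receiving the data with those ids, or because you are interpreting the data in the wrong way.
That being said, maybe you could add some examples by editing the question.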
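One way to see which of these two cases applies is to group the stored lines by tweet id and compare the copies. The following is only a minimal Python sketch, not part of the pipeline from the question. The function names `find_duplicate_ids` and `classify` are made up for illustration, and the sketch assumes one JSON object per line with a top-level `id_str` field, as in the sample above. Lines that do not parse as a complete JSON object are skipped.

```python
import json
from collections import defaultdict


def find_duplicate_ids(lines):
    """Group raw JSON lines by top-level tweet id.

    Returns a dict mapping id_str -> list of raw lines, containing only
    the ids that occur more than once.
    """
    by_id = defaultdict(list)
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            tweet = json.loads(raw)
        except json.JSONDecodeError:
            # Not a complete JSON object (for example a record split
            # across two lines); skip it instead of failing.
            continue
        if not isinstance(tweet, dict) or "id_str" not in tweet:
            continue
        # Use the top-level id_str, not the nested user id.
        by_id[tweet["id_str"]].append(raw)
    return {tid: raws for tid, raws in by_id.items() if len(raws) > 1}


def classify(duplicates):
    """Label each duplicated id.

    'identical' means every stored copy is the same string;
    'different' means at least two copies differ in content.
    """
    return {
        tid: "identical" if len(set(raws)) == 1 else "different"
        for tid, raws in duplicates.items()
    }


def report(lines):
    """Print one line per duplicated id: id, number of copies, label."""
    duplicates = find_duplicate_ids(lines)
    for tid, kind in classify(duplicates).items():
        print(tid, len(duplicates[tid]), kind)
```

To try it on real data, the stored files can be copied out of HDFS (for example with `hdfs dfs -cat`) and the resulting lines passed to `report`. If the copies come back as identical, the same record was probably received or written more than once; if they differ, it is worth checking that the script reads the tweet's own id rather than another field such as the nested user id.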
