
Posted by x33g5p2x on 2020-09-30 in Flume


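  1. Flume is a distributed, reliable service for efficiently collecting, aggregating, and moving large volumes of log data. It is usually deployed on Unix-like systems.
  2. Flume is built on a streaming architecture; it is fault-tolerant, flexible, and simple.
  3. Flume and Kafka are used for real-time data collection, Spark and Storm for real-time data processing, and Impala for real-time queries.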


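  1. Source (data source)

A Source collects data and is where the data stream originates. It then passes the stream on to a Channel, which is somewhat similar to a Channel in Java IO.

  1. Channel (pipe)


  1. Sink (destination)


  1. Event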
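
The roles listed above can be pictured with a toy model. This is only a sketch in plain Java and not the Flume API: the class name `PipelineSketch` and its methods are hypothetical, and a bounded queue stands in for the Channel that sits between a Source and a Sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model of Source -> Channel -> Sink; hypothetical names, not Flume classes.
final class PipelineSketch {
    // The channel is a bounded buffer between the source and the sink.
    private final BlockingQueue<String> channel;

    PipelineSketch(int capacity) {
        this.channel = new ArrayBlockingQueue<>(capacity);
    }

    // Source side: returns false when the channel is already full.
    boolean put(String eventBody) {
        return channel.offer(eventBody);
    }

    // Sink side: removes at most batchSize events, oldest first.
    List<String> drain(int batchSize) {
        List<String> batch = new ArrayList<>();
        channel.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        PipelineSketch p = new PipelineSketch(3);
        p.put("a");
        p.put("b");
        System.out.println(p.drain(10)); // [a, b]
    }
}
```

The capacity argument plays the role that `capacity` plays for the memory channels in the configurations later in this post, and the batch size loosely corresponds to `transactionCapacity`.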





  1. Check JAVA_HOME

Output: /opt/module/jdk1.8.0_144

  1. Install Flume
tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /opt/module/
  1. Rename the template file
mv flume-env.sh.template flume-env.sh
  1. Setting to change in flume-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144


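5.1 Case 1: Monitoring port data

  1. Goal: Flume listens on a port from one console while a second console sends messages, and the listening side displays them in real time

  2. Step-by-step implementation:

  3. Install the telnet tool

yum -y install telnet
  1. Create the Flume agent configuration file flume-telnet.conf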


a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = bigdata111
a1.sources.r1.port = 44445

# Define the sink
a1.sinks.k1.type = logger

# Define the memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  1. Check whether port 44445 is already in use
netstat -tunlp | grep 44445
  1. Start the agent with this configuration file
/opt/module/flume-1.8.0/bin/flume-ng agent \
--conf /opt/module/flume-1.8.0/conf/ \
--name a1 \
--conf-file /opt/module/flume-1.8.0/jobconf/flume-telnet.conf
  1. Use telnet to send content to port 44445 on the host
telnet bigdata111 44445

5.2 Case 2: Reading a local file into HDFS in real time

  1. Create the file flume-hdfs.conf
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/Andy
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://bigdata111:9000/flume/%Y%m%d/%H
a2.sinks.k2.hdfs.filePrefix = logs-
a2.sinks.k2.hdfs.round = true
a2.sinks.k2.hdfs.roundValue = 1
a2.sinks.k2.hdfs.roundUnit = hour
a2.sinks.k2.hdfs.useLocalTimeStamp = true
a2.sinks.k2.hdfs.batchSize = 1000
a2.sinks.k2.hdfs.fileType = DataStream
a2.sinks.k2.hdfs.rollInterval = 600
a2.sinks.k2.hdfs.rollSize = 134217700
a2.sinks.k2.hdfs.rollCount = 0
a2.sinks.k2.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
  1. Run the agent with this configuration
/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a2 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-hdfs.conf
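
The `hdfs.path` above ends in `%Y%m%d/%H`, and the `round`, `roundValue`, and `roundUnit` settings round the event time down to the hour before those escapes are filled in. The following is a minimal sketch of that idea in plain Java, not Flume's own code; the class `HdfsPathSketch` and its `expand` method are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Illustrative only: round a timestamp down to the hour, then format it
// the way %Y%m%d/%H would appear in the target directory name.
final class HdfsPathSketch {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMdd/HH");

    static String expand(String basePath, LocalDateTime eventTime) {
        // roundValue = 1, roundUnit = hour: drop minutes and seconds.
        LocalDateTime rounded = eventTime.truncatedTo(ChronoUnit.HOURS);
        return basePath + "/" + rounded.format(FMT);
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2020, 9, 30, 14, 37, 5);
        // prints hdfs://bigdata111:9000/flume/20200930/14
        System.out.println(expand("hdfs://bigdata111:9000/flume", t));
    }
}
```

Every event that arrives within the same hour therefore lands in the same directory.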

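5.3 Case 3: Reading files from a directory into HDFS in real time

  1. Goal: use Flume to watch all files in a directory

  2. Step-by-step implementation:

  3. Create the configuration file flume-dir.conf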

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume1.8.0/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://bigdata111:9000/flume/%H
a3.sinks.k3.hdfs.filePrefix = upload-
a3.sinks.k3.hdfs.round = true
a3.sinks.k3.hdfs.roundValue = 1
a3.sinks.k3.hdfs.roundUnit = hour
a3.sinks.k3.hdfs.useLocalTimeStamp = true
a3.sinks.k3.hdfs.batchSize = 100
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.rollInterval = 600
a3.sinks.k3.hdfs.rollSize = 134217700
a3.sinks.k3.hdfs.rollCount = 0
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
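  1. Test: after running the command below, try adding files to the upload directory
/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a3 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-dir.conf
  1. Notes on using the Spooling Directory Source

  2. Do not create a file in the monitored directory and then keep modifying it

  3. Files that have been fully ingested are renamed with the .COMPLETED suffix

  4. The monitored directory is scanned for file changes every 500 milliseconds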
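
The file-selection rules from the spooldir configuration can be sketched in a few lines of plain Java. This is not the Flume implementation; `SpoolFilterSketch` and `shouldIngest` are hypothetical names, and the sketch assumes the ignore pattern is meant to skip file names that end in `.tmp`.

```java
import java.util.regex.Pattern;

// Illustrative only: decide whether a file name in the spool directory
// would be picked up, given fileSuffix and ignorePattern.
final class SpoolFilterSketch {
    // Assumed intent of ignorePattern: names without spaces that end in .tmp
    private static final Pattern IGNORE = Pattern.compile("([^ ]*\\.tmp)");
    private static final String COMPLETED_SUFFIX = ".COMPLETED";

    static boolean shouldIngest(String fileName) {
        // Files already marked as done are never read again.
        if (fileName.endsWith(COMPLETED_SUFFIX)) {
            return false;
        }
        // Files whose whole name matches the ignore pattern are skipped.
        return !IGNORE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldIngest("app.log"));           // true
        System.out.println(shouldIngest("app.tmp"));           // false
        System.out.println(shouldIngest("app.log.COMPLETED")); // false
    }
}
```

A common way to follow the first note is to write a file under a `.tmp` name elsewhere and move it into the directory only once it is complete.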

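5.4 Case 4: Passing data between Flume agents: one agent with multiple channels and sinks

  1. Goal: flume1 monitors a file for changes and passes the new content to flume2, which stores it in HDFS. At the same time flume1 passes the same content to flume3, which writes it to the local file system

  2. Step-by-step implementation

  3. Create flume1.conf, which monitors a file for changes and defines two channels and two sinks that feed flume2 and flume3 respectively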

# 1.agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to every channel
a1.sources.r1.selector.type = replicating

# 2.source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.shell = /bin/bash -c

# 3.sink1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141

# sink2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = bigdata111
a1.sinks.k2.port = 4142

# 4.channel-1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 4.channel-2
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
  1. Create flume2.conf, which receives events from flume1 and defines one channel and one sink that writes the data to HDFS:
# 1 agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# 2 source
a2.sources.r1.type = avro
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 4141

# 3 sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume2/%H
a2.sinks.k1.hdfs.filePrefix = flume2-
a2.sinks.k1.hdfs.round = true
a2.sinks.k1.hdfs.roundValue = 1
a2.sinks.k1.hdfs.roundUnit = hour
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.hdfs.batchSize = 100
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.rollInterval = 600
a2.sinks.k1.hdfs.rollSize = 134217700
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.minBlockReplicas = 1

# 4 channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

#5 Bind 
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
  1. Create flume3.conf, which receives events from flume1 and defines one channel and one sink that writes the data to a local directory:
#1 agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# 2 source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4142

#3 sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/flume3

# 4 channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# 5 Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
  1. Note: the local output directory must already exist; if it does not, it is not created automatically.

  2. Test: start the agents in the order flume2, flume3, flume1, then modify the monitored file and observe the result:

$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume2.conf

$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume3.conf

$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume1.conf
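
The `replicating` selector in flume1.conf means that every event from the source is copied into each configured channel, so both downstream agents see the same data. A minimal sketch of that behaviour in plain Java follows; it is not Flume code, and `ReplicatingSketch` and `replicate` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Illustrative only: one incoming event is placed on every channel.
final class ReplicatingSketch {
    static void replicate(String eventBody, List<Queue<String>> channels) {
        for (Queue<String> channel : channels) {
            channel.add(eventBody);
        }
    }

    public static void main(String[] args) {
        Queue<String> c1 = new ArrayDeque<>(); // would feed the sink for flume2
        Queue<String> c2 = new ArrayDeque<>(); // would feed the sink for flume3
        replicate("line", Arrays.asList(c1, c2));
        System.out.println(c1 + " " + c2); // [line] [line]
    }
}
```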

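5.5 Case 5: Passing data between Flume agents: consolidating multiple agents into one

  1. Goal: flume11 monitors a log file (the configuration below tails /opt/Andy), flume22 monitors the data stream on a port, both send their data to flume33, and flume33 writes the combined data to HDFS.

  2. Step-by-step implementation:

  3. Create flume11.conf, which monitors the file and sinks the data to flume33: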

# 1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# 2 source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.shell = /bin/bash -c

# 3 sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141

# 4 channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 5. Bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  1. Create flume22.conf, which monitors the data stream on port 44444 and sinks the data to flume33:
# 1 agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# 2 source
a2.sources.r1.type = netcat
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 44444

#3 sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = bigdata111
a2.sinks.k1.port = 4141

# 4 channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# 5 Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
  1. Create flume33.conf, which receives the data streams sent by flume11 and flume22 and sinks the merged data to HDFS:
# 1 agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# 2 source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4141

# 3 sink
a3.sinks.k1.type = hdfs
a3.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume3/%H
a3.sinks.k1.hdfs.filePrefix = flume3-
a3.sinks.k1.hdfs.round = true
a3.sinks.k1.hdfs.roundValue = 1
a3.sinks.k1.hdfs.roundUnit = hour
a3.sinks.k1.hdfs.useLocalTimeStamp = true
a3.sinks.k1.hdfs.batchSize = 100
a3.sinks.k1.hdfs.fileType = DataStream
a3.sinks.k1.hdfs.rollInterval = 600
a3.sinks.k1.hdfs.rollSize = 134217700
a3.sinks.k1.hdfs.rollCount = 0
a3.sinks.k1.hdfs.minBlockReplicas = 1

# 4 channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# 5 Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
  1. Test: start the agents in the order flume33, flume22, flume11, then generate data and observe the result:
$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume33.conf
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume22.conf
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume11.conf
  1. Send data
  • Run telnet bigdata111 44444 and send 5555555
  • Append 666666 to /opt/Andy


6.1 Timestamp interceptor


# 1. Define the agent name and the names of the source, channel, and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1

a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /opt/module/flume-1.8.0/upload

a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100

a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume-interceptors/%H
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream

a4.sinks.k1.hdfs.rollCount = 0
a4.sinks.k1.hdfs.rollSize = 134217728
a4.sinks.k1.hdfs.rollInterval = 60

a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1


/opt/module/flume-1.8.0/bin/flume-ng agent -n a4 \
-f /opt/module/flume-1.8.0/jobconf/flume-interceptors.conf \
-c /opt/module/flume-1.8.0/conf
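
The timestamp interceptor adds a `timestamp` header (epoch milliseconds) to each event, which is what lets the HDFS sink resolve `%H` in the path without `useLocalTimeStamp`. Here is a rough sketch of that behaviour in plain Java. It is not the Flume class: `TimestampSketch`, `addTimestamp`, and the explicit `nowMillis` parameter are assumptions made so the example is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: stamp an event's headers with the processing time.
final class TimestampSketch {
    static Map<String, String> addTimestamp(Map<String, String> headers,
                                            long nowMillis,
                                            boolean preserveExisting) {
        // Keep a timestamp that is already present when asked to.
        if (preserveExisting && headers.containsKey("timestamp")) {
            return headers;
        }
        headers.put("timestamp", Long.toString(nowMillis));
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        addTimestamp(headers, System.currentTimeMillis(), false);
        System.out.println(headers.containsKey("timestamp")); // true
    }
}
```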

6.2 Host interceptor


a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host

a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = agentHost

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flumehost/%H
a1.sinks.k1.hdfs.filePrefix = Andy_%{agentHost}
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


bin/flume-ng agent -c conf/ -f jobconf/host.conf -n a1 -Dflume.root.logger=INFO,console
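
In this configuration the host interceptor stores the host name in the `agentHost` header, and `hdfs.filePrefix = Andy_%{agentHost}` then refers to that header. The substitution step can be sketched as follows. This is plain Java rather than Flume's own code; `HeaderEscapeSketch` and `expand` are hypothetical names, and replacing a missing header with an empty string is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: fill %{headerName} placeholders from event headers.
final class HeaderEscapeSketch {
    private static final Pattern HEADER_REF = Pattern.compile("%\\{([^}]+)\\}");

    static String expand(String template, Map<String, String> headers) {
        Matcher m = HEADER_REF.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption: an unknown header expands to the empty string.
            String value = headers.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("agentHost", "bigdata111");
        System.out.println(expand("Andy_%{agentHost}", headers)); // Andy_bigdata111
    }
}
```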



6.3 UUID interceptor


a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.prefix = UUID_

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


# bin/flume-ng agent -c conf/ -f jobconf/uuid.conf -n a1 -Dflume.root.logger=INFO,console
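
With `prefix = UUID_` and `preserveExisting = true`, each event receives a header whose value is the prefix followed by a random UUID, unless that header is already set. A minimal sketch of the logic in plain Java is shown below. It is not the morphline interceptor itself; `UuidSketch` and `tag` are hypothetical names, and the header name `id` used in the example is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative only: tag an event's headers with a prefixed random UUID.
final class UuidSketch {
    static void tag(Map<String, String> headers, String headerName,
                    String prefix, boolean preserveExisting) {
        // Leave an existing identifier untouched when asked to.
        if (preserveExisting && headers.containsKey(headerName)) {
            return;
        }
        headers.put(headerName, prefix + UUID.randomUUID());
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        tag(headers, "id", "UUID_", true);
        System.out.println(headers.get("id")); // UUID_ followed by a random UUID
    }
}
```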

6.4 Search-and-replace interceptor


#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = search_replace

a1.sources.r1.interceptors.i1.searchPattern = [0-9]+
a1.sources.r1.interceptors.i1.replaceString = itstar
a1.sources.r1.interceptors.i1.charset = UTF-8

#3 sink
a1.sinks.k1.type = logger

#4 channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#5 bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


# bin/flume-ng agent -c conf/ -f jobconf/search.conf -n a1 -Dflume.root.logger=INFO,console
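
The effect of `searchPattern = [0-9]+` and `replaceString = itstar` on an event body can be tried out with plain Java regular expressions. The sketch below is only an approximation of what the interceptor does; `SearchReplaceSketch` and `rewrite` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Illustrative only: replace every run of digits in the body with "itstar".
final class SearchReplaceSketch {
    private static final Pattern SEARCH = Pattern.compile("[0-9]+");

    static byte[] rewrite(byte[] body) {
        // Decode with the configured charset (UTF-8), replace, and re-encode.
        String text = new String(body, StandardCharsets.UTF_8);
        String replaced = SEARCH.matcher(text).replaceAll("itstar");
        return replaced.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = "order 123 shipped 45".getBytes(StandardCharsets.UTF_8);
        // prints: order itstar shipped itstar
        System.out.println(new String(rewrite(in), StandardCharsets.UTF_8));
    }
}
```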

6.5 Regex filtering interceptor


#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^A.*
a1.sources.r1.interceptors.i1.excludeEvents = true

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# bin/flume-ng agent -c conf/ -f jobconf/filter.conf -n a1 -Dflume.root.logger=INFO,console
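
With `regex = ^A.*` and `excludeEvents = true`, lines that start with `A` are dropped and everything else passes through. The sketch below reproduces that decision in plain Java so the pattern can be tested in isolation. It is not the Flume class; `RegexFilterSketch` and `filter` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: keep or drop event bodies based on a regex match.
final class RegexFilterSketch {
    static List<String> filter(List<String> bodies, String regex, boolean excludeEvents) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String body : bodies) {
            boolean matched = pattern.matcher(body).find();
            // excludeEvents = true keeps non-matching bodies; false keeps matching ones.
            if (matched != excludeEvents) {
                kept.add(body);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Apple", "banana", "Avocado");
        System.out.println(filter(lines, "^A.*", true)); // [banana]
    }
}
```

Setting `excludeEvents` to false inverts the result, so only the lines starting with `A` would remain.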

6.6 Regex extractor interceptor


#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = hostname is (.*?) ip is (.*)
a1.sources.r1.interceptors.i1.serializers = s1 s2
a1.sources.r1.interceptors.i1.serializers.s1.name = cookieid
a1.sources.r1.interceptors.i1.serializers.s2.name = ip

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1



# bin/flume-ng agent -c conf/ -f jobconf/extractor.conf -n a1 -Dflume.root.logger=INFO,console
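
The regex extractor copies each capture group into a header named by the matching serializer, so group 1 becomes `cookieid` and group 2 becomes `ip` in this configuration. A rough sketch of that mapping in plain Java follows. It is not the Flume implementation; `RegexExtractorSketch` and `extract` are hypothetical names, and the sample line is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: map regex capture groups to header names in order.
final class RegexExtractorSketch {
    private static final Pattern REGEX = Pattern.compile("hostname is (.*?) ip is (.*)");
    private static final String[] HEADER_NAMES = {"cookieid", "ip"};

    static Map<String, String> extract(String body) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher m = REGEX.matcher(body);
        if (m.find()) {
            // Group g feeds the g-th serializer name.
            for (int g = 1; g <= m.groupCount() && g <= HEADER_NAMES.length; g++) {
                headers.put(HEADER_NAMES[g - 1], m.group(g));
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        // prints {cookieid=bigdata111, ip=192.168.1.111}
        System.out.println(extract("hostname is bigdata111 ip is 192.168.1.111"));
    }
}
```

A body that does not match the pattern simply gets no extra headers.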



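  1. pom.xml
  <!-- Flume core dependency -->
<!-- packaging plugin -->
<!-- compiler plugin -->
  1. Implement the custom interceptor
// Package name inferred from the interceptor type used in the agent configuration
package ToUpCase;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.ArrayList;
import java.util.List;

public class MyInterceptor implements Interceptor {
    @Override
    public void initialize() {
    }

    @Override
    public void close() {
    }

    /**
     * Intercepts an event on its way from the source to the channel.
     * @param event the event to process
     * @return the event after business processing
     */
    @Override
    public Event intercept(Event event) {
        // Get the body bytes from the event
        byte[] arr = event.getBody();
        // Convert the body to upper case
        event.setBody(new String(arr).toUpperCase().getBytes());
        // Hand the modified event back
        return event;
    }

    // Process a batch of events by applying the single-event method to each
    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> list = new ArrayList<>();
        for (Event event : events) {
            list.add(intercept(event));
        }
        return list;
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        // Read properties from the configuration file (none are needed here)
        @Override
        public void configure(Context context) {
        }
    }
}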

Build a jar with Maven, create a jar directory under the Flume directory (mkdir jar), and upload the jar into that directory.

  1. Flume configuration file


a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = ToUpCase.MyInterceptor$Builder
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /ToUpCase1
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type to generate; the default is SequenceFile, while DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  1. Run command:
bin/flume-ng agent -c conf/ -n a1 -f jar/ToUpCase.conf -C jar/Flume_Andy-1.0-SNAPSHOT.jar -Dflume.root.logger=DEBUG,console
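
The HDFS sink settings `hdfs.rollInterval`, `hdfs.rollSize`, and `hdfs.rollCount` appear in most of the configurations in this post, and a value of 0 turns the corresponding trigger off. The sketch below shows one way to reason about when a file is rolled. It is plain Java and not the sink's real code; `RollPolicySketch` and `shouldRoll` are hypothetical names.

```java
// Illustrative only: a file is rolled as soon as any enabled threshold is reached.
final class RollPolicySketch {
    private final long rollIntervalSeconds;
    private final long rollSizeBytes;
    private final long rollCount;

    RollPolicySketch(long rollIntervalSeconds, long rollSizeBytes, long rollCount) {
        this.rollIntervalSeconds = rollIntervalSeconds;
        this.rollSizeBytes = rollSizeBytes;
        this.rollCount = rollCount;
    }

    boolean shouldRoll(long openSeconds, long bytesWritten, long eventsWritten) {
        // A threshold of 0 means that trigger is disabled.
        boolean byTime = rollIntervalSeconds > 0 && openSeconds >= rollIntervalSeconds;
        boolean bySize = rollSizeBytes > 0 && bytesWritten >= rollSizeBytes;
        boolean byCount = rollCount > 0 && eventsWritten >= rollCount;
        return byTime || bySize || byCount;
    }

    public static void main(String[] args) {
        // Values from the custom-interceptor configuration: 3 s, 20 bytes, 5 events.
        RollPolicySketch small = new RollPolicySketch(3, 20, 5);
        System.out.println(small.shouldRoll(1, 10, 5)); // true, the event count was reached
    }
}
```

With the values used in the earlier cases (600 seconds, 134217700 bytes, count 0), only time and size can trigger a roll, which keeps the number of small files on HDFS down.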
