Sqoop file import: how can I control which data goes into each file split with a defined number of mappers?

Posted by czfnxgou on 2021-05-29 in Hadoop

mysql> SELECT * FROM employee;

+-------+--------------+--------+
| empno | empname      | salary |
+-------+--------------+--------+
|   101 | Ram          |   5000 |
|   102 | Hari         |   7000 |
|   104 | Vamshi       |   7000 |
|   103 | Revathy      |   7000 |
|   105 | Jaya         |   9000 |
|   106 | Suresh       |   8000 |
|   107 | Ramesh       |   9000 |
|   108 | Prasana      |  10000 |
|   109 | Ramsamy      |  20000 |
|   110 | Singaram     |  30000 |
|   200 | ramanathan   |  30000 |
|   201 | Victor       |  33000 |
|   202 | Naveen       |  33000 |
|   203 | Karthik      |  33000 |
|   204 | Karthikeyan  |  33000 |
|   205 | Somasundaram |  43000 |
|   301 | Test1        |  50000 |
|   302 | Test2        |  60000 |
|   303 | Test3        |  70000 |
+-------+--------------+--------+

Command in Sqoop

sqoop import --connect jdbc:mysql://<hostname>/test --username <username> --password <password> \
  --table employee --direct --verbose --split-by salary

With the command above, Sqoop takes MIN(salary) and MAX(salary) and writes the table to HDFS with 10 records in the first file, 3 records in the second file, 3 records in the third file, and 3 records in the last file.

15/07/03 17:32:37 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`salary`), MAX(`salary`) FROM employee
15/07/03 17:32:37 DEBUG db.IntegerSplitter: Splits: [5,000 to 70,000] into 4 parts
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 5,000
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 21,250
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 37,500
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 53,750
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 70,000
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 5000' and upper bound '`salary` < 21250'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 21250' and upper bound '`salary` < 37500'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 37500' and upper bound '`salary` < 53750'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 53750' and upper bound '`salary` <= 70000'
15/07/03 17:32:37 INFO mapreduce.JobSubmitter: number of splits:4

I would like to know how it decides the number of records in each file. Is this customizable?

ubby3x7f #1

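The salary ranges from 5000 to 70000 (i.e. min 5000, max 70000). This range is divided into four equal-width splits. The four comes from Sqoop's default of four map tasks, which can be changed with -m / --num-mappers.

(70000 - 5000) / 4 = 16250

Therefore: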

split 1: from 5000 to 21250 (= 5000 + 16250)
split 2: from 21250 to 37500 (= 21250 + 16250)
split 3: from 37500 to 53750 (= 37500 + 16250)
split 4: from 53750 to 70000 (= 53750 + 16250)
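
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of equal-width splitting, not Sqoop's actual IntegerSplitter code: the function names split_points and count_per_split are made up for this example, and Sqoop's rounding may differ when the range does not divide evenly. The bound handling (lower bound inclusive, upper bound exclusive except for the last split) follows the DEBUG log in the question, and the salaries are copied from the table in the question.

```python
def split_points(lo, hi, num_splits):
    """Return num_splits + 1 integer boundaries spaced evenly from lo to hi."""
    return [lo + (hi - lo) * i // num_splits for i in range(num_splits + 1)]


def count_per_split(values, points):
    """Count how many values fall into each range defined by the boundaries.

    Each range is lower <= v < upper, except the last one, which also
    includes its upper bound (as in the log: `salary` <= 70000).
    """
    counts = [0] * (len(points) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_range = points[i] <= v < points[i + 1]
            if in_range or (i == last and v == points[i + 1]):
                counts[i] += 1
                break
    return counts


# Salaries from the employee table in the question, in the order shown.
SALARIES = [5000, 7000, 7000, 7000, 9000, 8000, 9000, 10000, 20000, 30000,
            30000, 33000, 33000, 33000, 33000, 43000, 50000, 60000, 70000]

points = split_points(min(SALARIES), max(SALARIES), 4)
print(points)  # [5000, 21250, 37500, 53750, 70000]
print(count_per_split(SALARIES, points))
```

Because the ranges are equal in width rather than equal in row count, the files hold different numbers of records whenever the salaries cluster in one part of the range. Calling split_points with a different num_splits shows how the boundaries move when the number of mappers changes.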
