Hive: is there a way to customize HiveInputFormat?

yyyllmsg · posted 2021-05-30 in Hadoop
Follow (0) | Answers (1) | Views (306)

Here is the scenario: three folders sit in HDFS, with files as follows:

/root/20140901/part-0
/root/20140901/part-1
/root/20140901/part-2
/root/20140902/part-0
/root/20140902/part-1
/root/20140902/part-2
/root/20140903/part-0
/root/20140903/part-1
/root/20140903/part-2

After creating a Hive table with the DDL below, I run the query `select * from hive_combine_test where rdm > 50000;`, and it launches 9 mappers, the same as the number of files in HDFS.

CREATE EXTERNAL table hive_combine_test
(id string, 
rdm string)
PARTITIONED BY (dateid string)
row format delimited fields terminated by '\t'
stored as textfile;

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140901')
location '/root/20140901';

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140902')
location '/root/20140902';

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140903')
location '/root/20140903';
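As a sanity check after the ALTER statements above, the registered partitions can be listed with a standard Hive command:

```sql
-- should list dateid=20140901, dateid=20140902, dateid=20140903
SHOW PARTITIONS hive_combine_test;
```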

But what I want is to combine all the part files with the same index together (all the part-0 files in one split, and likewise part-1 and part-2), so that only three mappers are launched.
I tried extending org.apache.hadoop.hive.ql.io.HiveInputFormat to test whether a custom JudHiveInputFormat would work:

public class JudHiveInputFormat<K extends WritableComparable, V extends Writable>
                    extends HiveInputFormat<WritableComparable, Writable> {

}
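The grouping being asked for can be sketched in plain Java (a hypothetical illustration of the desired split layout, not actual Hive API code): bucket the nine HDFS paths by their part-N file name, so that each bucket would correspond to one combined split, i.e. one mapper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration of the intended split combining:
// bucket paths by their trailing "part-N" name, one bucket per mapper.
public class PartGrouper {
    public static Map<String, List<String>> groupByPart(List<String> paths) {
        return paths.stream().collect(Collectors.groupingBy(
                p -> p.substring(p.lastIndexOf('/') + 1), // "part-0", "part-1", ...
                TreeMap::new,
                Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/root/20140901/part-0", "/root/20140901/part-1", "/root/20140901/part-2",
                "/root/20140902/part-0", "/root/20140902/part-1", "/root/20140902/part-2",
                "/root/20140903/part-0", "/root/20140903/part-1", "/root/20140903/part-2");
        Map<String, List<String>> buckets = groupByPart(paths);
        System.out.println(buckets.size() + " buckets: " + buckets.keySet());
        // 3 buckets: [part-0, part-1, part-2]
    }
}
```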

But when I load it into Hive, it throws an exception:

hive> add jar /my_path/jud_udf.jar;
hive> set hive.input.format=com.judking.hive.inputformat.JudHiveInputFormat;
hive> select * from hive_combine_test where rdm > 50000;

java.lang.RuntimeException: com.judking.hive.inputformat.JudCombineHiveInputFormat
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:290)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1472)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1239)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1057)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:880)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:870)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Could anyone give me a hint? Thanks!

Answer 1 (siotufzp)

As far as I know, to use a custom input/output format in Hive you need to mention it in the CREATE TABLE statement, something like this:

CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT '<your input format class name>' OUTPUTFORMAT '<your output format class name>';

Since you only need the input format, your CREATE TABLE statement would look like this (note that the class name must be fully qualified):

CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'com.judking.hive.inputformat.JudHiveInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Why do you need to mention the output format class as well? Once you override the input format, Hive expects an explicit output class too, so here we tell Hive to use its default output format class.
Maybe you can give it a try.
Hope it helps!
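For the narrower goal of launching fewer mappers, Hive also ships a built-in org.apache.hadoop.hive.ql.io.CombineHiveInputFormat that packs small files into larger splits. A hedged sketch (the split-size property name varies by Hadoop version, e.g. mapred.max.split.size on older releases, and the files are packed by size and locality rather than by part index):

```sql
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- cap each combined split at 256 MB (older releases: mapred.max.split.size)
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
SELECT * FROM hive_combine_test WHERE rdm > 50000;
```

This reduces the mapper count without any custom code, though it does not guarantee exactly one mapper per part-N group.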
