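I am trying to run about 30 Oozie workflows, each of which has the following actions:
One shell action: identifies the last updated record in a set of 10 Hive tables
10 forked Sqoop actions: query the RDBMS for updated records and load them into the relevant tables
10 forked Hive actions: merge the newly sqooped data with the corresponding Hive tables.
Here is the shell action part of the workflow XML file: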
<start to="shell-node_SET10_wf_hive_tables"/>
<action name="shell-node_SET10_wf_hive_tables">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>LAST_VALUE_10.sh</exec>
<argument>${hive_table1}</argument>
<argument>${hive_table2}</argument>
<argument>${hive_table3}</argument>
<argument>${hive_table4}</argument>
<argument>${hive_table5}</argument>
<argument>${hive_table6}</argument>
<argument>${hive_table7}</argument>
<argument>${hive_table8}</argument>
<argument>${hive_table9}</argument>
<argument>${hive_table10}</argument>
<file>${last_value_script_path}#LAST_VALUE_10.sh</file>
<file>${keytabpath}/${keytabaccount}#${keytabaccount}</file>
<capture-output/>
</shell>
<ok to="forking"/>
<error to="failed-notification-email_SET10_SHELL"/>
</action>
The contents of the LAST_VALUE_10.sh bash script called in the shell action above are as follows:
#!/bin/bash
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp1=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $1"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp2=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $2"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp3=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $3"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp4=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $4"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp5=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $5"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp6=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $6"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp7=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $7"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp8=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $8"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp9=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $9"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp10=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from ${10}"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
[ "$last_val_temp1" = "NULL" ] && last_val_temp_1=0 || last_val_temp_1=$last_val_temp1
[ "$last_val_temp2" = "NULL" ] && last_val_temp_2=0 || last_val_temp_2=$last_val_temp2
[ "$last_val_temp3" = "NULL" ] && last_val_temp_3=0 || last_val_temp_3=$last_val_temp3
[ "$last_val_temp4" = "NULL" ] && last_val_temp_4=0 || last_val_temp_4=$last_val_temp4
[ "$last_val_temp5" = "NULL" ] && last_val_temp_5=0 || last_val_temp_5=$last_val_temp5
[ "$last_val_temp6" = "NULL" ] && last_val_temp_6=0 || last_val_temp_6=$last_val_temp6
[ "$last_val_temp7" = "NULL" ] && last_val_temp_7=0 || last_val_temp_7=$last_val_temp7
[ "$last_val_temp8" = "NULL" ] && last_val_temp_8=0 || last_val_temp_8=$last_val_temp8
[ "$last_val_temp9" = "NULL" ] && last_val_temp_9=0 || last_val_temp_9=$last_val_temp9
[ "${last_val_temp10}" = "NULL" ] && last_val_temp_10=0 || last_val_temp_10=${last_val_temp10}
printf "last_val_1=$last_val_temp_1\nlast_val_2=$last_val_temp_2\nlast_val_3=$last_val_temp_3\nlast_val_4=$last_val_temp_4\nlast_val_5=$last_val_temp_5\nlast_val_6=$last_val_temp_6\nlast_val_7=$last_val_temp_7\nlast_val_8=$last_val_temp_8\nlast_val_9=$last_val_temp_9\nlast_val_10=${last_val_temp_10}"
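The NULL handling and the final printf in the script above can be written more compactly. The following is only a sketch of that same logic, not a change to the script in use: the function names `normalize_last_value` and `emit_last_values` are hypothetical, and the sketch assumes the raw beeline results are passed in table order. As in the original, only the literal string NULL is mapped to 0; an empty result is passed through unchanged.

```shell
#!/bin/bash
# Hypothetical helper: map a raw query result to a last value.
# Only the literal string NULL becomes 0, matching the original script.
normalize_last_value() {
  local raw="$1"
  if [ "$raw" = "NULL" ]; then
    printf '0'
  else
    printf '%s' "$raw"
  fi
}

# Hypothetical helper: print one last_val_N=value line per argument,
# in the key=value form that the original printf writes to stdout.
emit_last_values() {
  local i=1 raw
  for raw in "$@"; do
    printf 'last_val_%d=%s\n' "$i" "$(normalize_last_value "$raw")"
    i=$((i + 1))
  done
}
```

It would be called as `emit_last_values "$last_val_temp1" "$last_val_temp2"` and so on, once all ten queries have run.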
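All 30 of my workflows follow the same format, and each imports 10 tables. Each workflow has its own unique last_value script, and I have copied the keytab file 30 times so that each uniquely named keytab file is used by exactly one workflow. I have scheduled them with my Oozie coordinators to run once a day, with a 15-minute delay between workflows.
I see that some workflows fail at random every day with the same error below. A workflow that fails today succeeds on its next run, but may fail again at random after 10 days or so. Every day one or two workflows fail that had been running successfully for several days, always with the same error.
Error: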
[main] INFO com.unraveldata.agent.ResourceCollector - Unravel Sensor 4.5.1.1rc0013/1.3.11.3 initializing.
./LAST_VALUE_10.sh: line 3: kinit: command not found
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
scan complete in 2ms
Connecting to jdbc:hive2://ntd001:10000/hadoop_instance_1;principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM
20/10/08 02:31:12 [main]: ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:168)
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.hive.beeline.DatabaseConnection.connect(DatabaseConnection.java:146)
at org.apache.hive.beeline.DatabaseConnection.getConnection(DatabaseConnection.java:211)
at org.apache.hive.beeline.Commands.connect(Commands.java:1529)
at org.apache.hive.beeline.Commands.connect(Commands.java:1424)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:52)
at org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1139)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1178)
at org.apache.hive.beeline.BeeLine.initArgs(BeeLine.java:818)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:898)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:518)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
... 35 more
Unknown HS2 problem when communicating with Thrift server.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://ntd001:10000/hadoop_instance_1;principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM: GSS initiate failed (state=08S01,code=0)
No current connection
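I am unable to find the cause of these random failures and need help identifying and fixing it. I have tried several approaches, such as forking the shell action within each workflow with bash scripts that each query only one Hive table, and so on, but could not resolve it.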
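The log reports `kinit: command not found` just before the Kerberos failure. One way to narrow this down might be to record which host ran the action and whether kinit is on PATH there. The following is a minimal diagnostic sketch, not a confirmed fix: the function name `check_command` is hypothetical, and it assumes that the failures depend on which cluster node the shell action happens to run on. Messages go to stderr so that the key=value lines captured from stdout stay unchanged.

```shell
#!/bin/bash
# Hypothetical diagnostic: report the host and whether a command is on PATH.
# Output goes to stderr so it does not mix with the captured key=value lines.
check_command() {
  local cmd="$1"
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found $cmd at $(command -v "$cmd") on $(hostname)" >&2
    return 0
  fi
  echo "missing $cmd on $(hostname); PATH=$PATH" >&2
  return 1
}

# Possible use at the top of the script: stop early rather than
# continuing to beeline without a ticket.
# check_command kinit || exit 1
```

If the failing runs all report the same hosts, that would point to a node-level difference in PATH or installed packages rather than to the workflow itself.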