Batch creation of Hive tables using Python's subprocess module in a for loop

Posted by sg2wtvxw on 2021-06-26 in Hive


I have an inputList containing the names of the tables to create. The list can hold up to 2000 names.
For example:

inputList = ['model_0001', 'model_0002', 'model_0003', ..., 'model_1000']

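I am creating many Hive tables with the Python code below, which uses the subprocess module. Right now I have to monitor the process and manually change the list slice passed to the for loop (see the 0:20 in the code). The table creation processes are submitted to the Hive cluster and run in parallel. I would like a single parameter that controls how many table creation processes are submitted in parallel, so that the code can run without any intervention.
I would also like the next job to start as soon as one of the 20 submitted jobs completes, so that basically only 20 jobs are running at any given time.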

createTablecmd2 = "CREATE TABLE {tableName}_modified AS SELECT k.{tableName}, col2  FROM  {tableName} as d  left outer join table3 as k on d.col4 = k.col4 "

## Creating tables from 1 to 20, inputList[0] corresponds to model_0001 and inputList[19] corresponds to model_0020

for currTable in inputList[0:20]:    
    sqlstmt = createTablecmd2.format(tableName = currTable)
    cmd3 = "hive -e '{stmt}'".format(stmt = sqlstmt)
    print "submitting command", cmd3
    %time result = subprocess.Popen(cmd3, shell = True , stdout=subprocess.PIPE, stdin=subprocess.PIPE)

Hive python subprocess

Source: https://stackoverflow.com/questions/42632431/batch-creation-of-hive-tables-using-pythons-subprocess-module-in-a-for-loop

2 Answers


relj7zay1#

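The key question is whether you are running on YARN or standalone. If you are running on YARN, take a look at the cluster manager; your cluster may be oversubscribed. If it is enabled, check the job history server for the details of the jobs. It will tell you what is happening. Please add more details about your environment.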

2021-06-26

aelbi1ox2#

This is how I solved 95% of the problem, using the .wait() method of the subprocess object.

inputList = ['model_0001', 'model_0002', 'model_0003', ..., 'model_1000']

import subprocess
batchsize = 20

createTablecmd2 = "CREATE TABLE {tableName}_modified AS SELECT k.{tableName}, col2  FROM  {tableName} as d  left outer join table3 as k on d.col4 = k.col4 "

# Python 2 code, run in IPython/Jupyter (%time is an IPython magic).
for currTable in xrange(0, len(inputList),batchsize):
    batch = inputList[currTable:currTable+batchsize]
    for i in batch:
        sqlstmt = createTablecmd2.format(tableName = i)
        cmd3 = "hive -e 'use dbname2; {stmt}'".format(stmt = sqlstmt)
        print "submitting command", cmd3
        %time result = subprocess.Popen(cmd3, shell = True,stdout=subprocess.PIPE, stdin=subprocess.PIPE)
    # Blocks only on the last process started in this batch.
    result.wait()

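After the inner for loop finishes, the script waits for the subprocess to complete and then kicks off another batch of Hive jobs. This isn't 100% what I want, but it serves my current purpose. It is a synchronous solution. For example, if the batch consists of model_id_0020, ..., model_id_0040, it will wait for the last process, i.e. model_id_0040, to complete and then kick off the next batch of Hadoop/Hive jobs. But suppose model_id_0040 (the last one in the current batch) finishes before 3-4 of the other jobs: the next batch is kicked off anyway. So in that case it can be asynchronous, and more than one batch size of jobs may be running at once.
As we all know, Hadoop/Hive jobs can finish at different times, although in my particular case every job is exactly the same process and replicates the same table structure.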
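For the "start the next job as soon as one finishes" behaviour the question asks for, one possible approach is a thread pool whose size is the parallelism limit. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above. The helper names build_hive_command, run_one and run_bounded are made up for illustration, and the code assumes that hive is on the PATH and that the database is dbname2, as in the answer's code.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Same statement as in the answer above, split over two lines.
CREATE_TABLE_SQL = (
    "CREATE TABLE {tableName}_modified AS SELECT k.{tableName}, col2 "
    "FROM {tableName} AS d LEFT OUTER JOIN table3 AS k ON d.col4 = k.col4"
)


def build_hive_command(table_name, database="dbname2"):
    """Return the argv list for one `hive -e` call (hypothetical helper).

    Passing a list instead of a shell string avoids shell quoting issues.
    """
    sql = CREATE_TABLE_SQL.format(tableName=table_name)
    return ["hive", "-e", "use {db}; {stmt}".format(db=database, stmt=sql)]


def run_one(argv):
    """Run one command to completion and return its exit code."""
    # Output is discarded so a chatty child cannot block on a full pipe.
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


def run_bounded(commands, max_parallel=20):
    """Run all commands with at most `max_parallel` running at once.

    Each worker thread blocks on one child process; when that child exits,
    the thread picks up the next queued command. Exit codes are returned
    in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))
```

With the inputList from the question, a call such as run_bounded([build_hive_command(t) for t in inputList], max_parallel=20) would keep up to 20 hive processes alive until the list is exhausted. The threads only wait on child processes, so the single max_parallel parameter is what limits the load on the cluster.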

2021-06-26
