Enable GPU resources (CUDA) on DC/OS

Posted by i5desfxk on 2021-06-26 in Mesos

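I have a cluster with GPU nodes (NVIDIA) and DC/OS 1.8 deployed on it. I would like to schedule jobs (batch and Spark) on the GPU nodes using GPU isolation. DC/OS 1.8 is based on Mesos 1.0.1, which supports GPU isolation.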

mesos gpu dcos

Source: https://stackoverflow.com/questions/40346321/enable-gpu-resources-cuda-on-dc-os

2 answers

s5a0g9ez1#

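Unfortunately, DC/OS 1.8 does not officially support GPUs. Experimental GPU support will arrive in the next release, as described here: https://github.com/dcos/dcos/pull/766
In that release, only Marathon will be able to officially launch GPU services. Metronome (i.e. batch jobs) will not.
As for Spark, the Spark version bundled with Universe probably does not have Mesos GPU support built in yet. Support is coming soon in Spark itself, though: https://github.com/apache/spark/pull/14644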

2021-06-26

mqkwyuun2#

To support GPU resources in a DC/OS cluster, perform the following steps:

1. Configure the Mesos agent on the GPU nodes:

1.1. Stop dcos-mesos-slave.service:

    systemctl stop dcos-mesos-slave.service

1.2. Add the following parameters to the /var/lib/dcos/mesos-slave-common file:

    # a comma separated list of GPUs (id), as determined by running nvidia-smi on the host where the agent is to be launched
    MESOS_NVIDIA_GPU_DEVICES="0,1"
    # value of the gpus resource must match the number of ids above
    MESOS_RESOURCES= [ {"name":"ports","type":"RANGES","ranges": {"range": [{"begin": 1025, "end": 2180},{"begin": 2182, "end": 3887},{"begin": 3889, "end": 5049},{"begin": 5052, "end": 8079},{"begin": 8082, "end": 8180},{"begin": 8182, "end": 32000}]}} ,{"name": "gpus","type": "SCALAR","scalar": {"value": 2}}]
    MESOS_ISOLATION=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume,cgroups/devices,gpu/nvidia

1.3. Start dcos-mesos-slave.service:

    systemctl start dcos-mesos-slave.service

2. Enable the GPU resources feature in the Mesos frameworks:

2.1. The Marathon framework should be started with the option --enable_features "gpu_resources"

2.2. The Aurora scheduler should be started with the option -allow_gpu_resource

Note. Any host running a Mesos agent with NVIDIA GPU support must have a valid NVIDIA kernel driver installed. It is also strongly recommended to install the corresponding user-level libraries and tools that ship as part of the NVIDIA CUDA toolkit. Many jobs that use NVIDIA GPUs rely on CUDA, and leaving it out will severely limit the types of GPU-aware jobs that can run on Mesos.
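Step 1.2 requires the gpus scalar to match the number of IDs in MESOS_NVIDIA_GPU_DEVICES, which is easy to get wrong when editing by hand. The following is a minimal sketch of one way to derive both values from a single ID list. The function names gpu_count, gpu_devices_line and gpu_resource_json are hypothetical helpers, not part of DC/OS or Mesos, and the printed gpus entry still has to be merged by hand into the existing MESOS_RESOURCES array next to the ports ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of DC/OS or Mesos): keep the GPU device
# list and the "gpus" scalar consistent by deriving both from one ID list.

# Count the IDs in a comma-separated list such as "0,1".
gpu_count() {
  local ids="$1"
  local IFS=','
  local -a parts
  read -r -a parts <<< "$ids"
  echo "${#parts[@]}"
}

# Print the MESOS_NVIDIA_GPU_DEVICES line for the agent config file.
gpu_devices_line() {
  printf 'MESOS_NVIDIA_GPU_DEVICES="%s"\n' "$1"
}

# Print the "gpus" entry to place inside the MESOS_RESOURCES JSON array.
# The scalar value is computed from the ID list, so the two cannot drift.
gpu_resource_json() {
  local count
  count="$(gpu_count "$1")"
  printf '{"name":"gpus","type":"SCALAR","scalar":{"value":%s}}\n' "$count"
}
```

For example, gpu_devices_line "0,1" and gpu_resource_json "0,1" print the device line and a gpus entry with value 2, matching the configuration in step 1.2. On a host with the NVIDIA driver installed, the ID list could likely be taken from nvidia-smi --query-gpu=index --format=csv,noheader instead of being typed in.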

2021-06-26
