I am trying to query GPU usage metrics for GKE pods.
Here is what I did to test:
1. I created a GKE cluster with two node pools: one with two CPU-only nodes, and one with a single node that has an NVIDIA Tesla T4 GPU. All nodes run Container-Optimized OS.
2. As described in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, I ran:
   kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
3. I ran kubectl create -f dcgm-exporter.yaml with the following manifest:
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        # resources:
        #   limits:
        #     nvidia.com/gpu: "1"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
4. The pod is scheduled only on the GPU node, but it crashes with the following error:
time="2020-11-21T04:27:21Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-11-21T04:27:21Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
5. If I uncomment resources: limits: nvidia.com/gpu: "1", the pod runs successfully. However, I do not want this pod to occupy a GPU; it should only observe them.
How can I run dcgm-exporter without allocating a GPU to it? I also tried Ubuntu nodes, but that failed as well.
2 Answers

flvtvl50 #1
It works with these changes:
1. Set privileged: true in the container's securityContext.
2. Add a volume mount named "nvidia-install-dir-host".
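The two changes above could look roughly like the fragment below, meant to be merged into spec.template.spec of the DaemonSet from the question. This is only a sketch: the answer names the volume "nvidia-install-dir-host" but its definition did not survive, so the host path /home/kubernetes/bin/nvidia and the container path /usr/local/nvidia are assumptions based on where the GKE driver installer usually places the driver on Container-Optimized OS. Check them against your nodes before relying on this.

```yaml
# Fragment of spec.template.spec in dcgm-exporter.yaml (not a complete manifest).
containers:
- name: "dcgm-exporter"
  image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
  securityContext:
    runAsNonRoot: false
    runAsUser: 0
    privileged: true              # change 1: replaces the SYS_ADMIN capability
  volumeMounts:
  - name: "pod-gpu-resources"
    readOnly: true
    mountPath: "/var/lib/kubelet/pod-resources"
  - name: "nvidia-install-dir-host"   # change 2: expose the host driver files
    mountPath: "/usr/local/nvidia"    # assumed container path
volumes:
- name: "pod-gpu-resources"
  hostPath:
    path: "/var/lib/kubelet/pod-resources"
- name: "nvidia-install-dir-host"
  hostPath:
    path: "/home/kubernetes/bin/nvidia"   # assumed host path on COS nodes
```

With this in place the nvidia.com/gpu limit stays commented out, so the exporter should not consume a schedulable GPU.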

up9lanfz #2
I installed dcgm-exporter with Helm today, using my own values.
I don't think it is necessary to allocate a GPU to dcgm-exporter, and I recorded my steps here.