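Kubernetes CRI analysis: how kubelet deletes a pod through CRI. Kubernetes has three functional interfaces: the Container Network Interface (CNI), the Container Runtime Interface (CRI), and the Container Storage Interface (CSI). This post analyzes how kubelet calls CRI to delete a pod.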

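Related posts:
kubernetes/k8s CRI analysis: the container runtime interface
kubernetes/k8s CRI analysis: how kubelet creates a pod
kubernetes/k8s CSI analysis: the container storage interface
kubernetes/k8s CNI analysis: the container network interface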

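The earlier posts first introduced CRI and then analyzed the CRI-related kubelet source code in four parts: the CRI-related kubelet startup flags, the CRI-related interfaces and structs, CRI initialization, and how kubelet calls CRI to create a pod. If you have not read them yet, see the related posts listed above.

Here again is the CRI architecture diagram from the earlier posts.

This post analyzes how kubelet calls CRI to delete a pod.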

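CRI-related source code analysis in kubelet

The analysis of kubelet's CRI source code consists of the following parts:
(1) kubelet CRI-related startup flags;
(2) kubelet CRI-related interfaces and structs;
(3) kubelet CRI initialization;
(4) how kubelet calls CRI to create a pod;
(5) how kubelet calls CRI to delete a pod.

The previous two posts covered the first four parts. This post covers the fifth: how kubelet calls CRI to delete a pod.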

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

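5. How kubelet calls CRI to delete a pod

Call flow for deleting a pod through CRI

The call flow is illustrated below using kubelet with dockershim as the example.

kubelet calls dockershim to stop containers; dockershim in turn calls docker to stop the containers and calls CNI to tear down the pod network.

Figure 1: kubelet dockershim pod deletion call diagram

dockershim is the CRI shim built into kubelet. The pod deletion call flow of other, remote CRI shims is essentially the same as that of dockershim. They simply call a different container engine to operate on containers, and it is still the CRI shim that calls CNI to tear down the pod network.

The detailed source code analysis follows.

Start with the KillPod method of kubeGenericRuntimeManager, which is where the CRI calls for deleting a pod are triggered.

The code of this method also shows that kubelet deletes a pod as follows:
(1) first stop all containers that belong to the pod;
(2) then stop the pod sandbox container.

Note: this only stops the containers. Removing them is left to kubelet's garbage collection (GC).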

// pkg/kubelet/kuberuntime/kuberuntime_manager.go
// KillPod kills all the containers of a pod. Pod may be nil, running pod must not be.
// gracePeriodOverride if specified allows the caller to override the pod default grace period.
// only hard kill paths are allowed to specify a gracePeriodOverride in the kubelet in order to not corrupt user data.
// it is useful when doing SIGKILL for hard eviction scenarios, or max grace period during soft eviction scenarios.
func (m *kubeGenericRuntimeManager) KillPod(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) error {
	err := m.killPodWithSyncResult(pod, runningPod, gracePeriodOverride)
	return err.Error()
}

// killPodWithSyncResult kills a runningPod and returns SyncResult.
// Note: The pod passed in could be *nil* when kubelet restarted.
func (m *kubeGenericRuntimeManager) killPodWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {
	killContainerResults := m.killContainersWithSyncResult(pod, runningPod, gracePeriodOverride)
	for _, containerResult := range killContainerResults {
		result.AddSyncResult(containerResult)
	}

	// stop sandbox, the sandbox will be removed in GarbageCollect
	killSandboxResult := kubecontainer.NewSyncResult(kubecontainer.KillPodSandbox, runningPod.ID)
	result.AddSyncResult(killSandboxResult)
	// Stop all sandboxes belongs to same pod
	for _, podSandbox := range runningPod.Sandboxes {
		if err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {
			killSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())
			klog.Errorf("Failed to stop sandbox %q", podSandbox.ID)
		}
	}

	return
}
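The ordering described above (application containers first, sandboxes second, no removal) can be illustrated with a minimal sketch. Everything here is hypothetical and exists only for illustration: the runtime interface, the recorder type, and killPod are not kubelet types. Unlike the real code, which stops containers concurrently, this sketch stops them one by one to keep the ordering easy to see.

```go
package main

import "fmt"

// runtime is a hypothetical, trimmed-down stand-in for the CRI runtime service.
type runtime interface {
	StopContainer(id string) error
	StopPodSandbox(id string) error
}

// recorder logs the order of calls instead of talking to a real runtime.
type recorder struct{ calls []string }

func (r *recorder) StopContainer(id string) error {
	r.calls = append(r.calls, "stop-container:"+id)
	return nil
}

func (r *recorder) StopPodSandbox(id string) error {
	r.calls = append(r.calls, "stop-sandbox:"+id)
	return nil
}

// killPod stops every container first and the sandboxes afterwards.
// Nothing is removed here; removal is assumed to be done by a separate
// garbage collector.
func killPod(rt runtime, containers, sandboxes []string) []error {
	var errs []error
	for _, c := range containers {
		if err := rt.StopContainer(c); err != nil {
			errs = append(errs, err)
		}
	}
	for _, s := range sandboxes {
		if err := rt.StopPodSandbox(s); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	rec := &recorder{}
	errs := killPod(rec, []string{"app", "sidecar"}, []string{"sandbox-0"})
	fmt.Println(rec.calls, len(errs))
}
```

Note that a failure to stop one container does not prevent the sandbox from being stopped in this sketch; errors are collected and returned together, which loosely mirrors how the sync results are accumulated in the excerpt above.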

5.1 m.killContainersWithSyncResult

Purpose of m.killContainersWithSyncResult: stop all containers that belong to the pod.

Main logic: start one goroutine per container and call m.killContainer in each to stop that container.

// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainersWithSyncResult kills all pod's containers with sync results.
func (m *kubeGenericRuntimeManager) killContainersWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (syncResults []*kubecontainer.SyncResult) {
	containerResults := make(chan *kubecontainer.SyncResult, len(runningPod.Containers))
	wg := sync.WaitGroup{}

	wg.Add(len(runningPod.Containers))
	for _, container := range runningPod.Containers {
		go func(container *kubecontainer.Container) {
			defer utilruntime.HandleCrash()
			defer wg.Done()

			killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
			if err := m.killContainer(pod, container.ID, container.Name, "", gracePeriodOverride); err != nil {
				killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
			}
			containerResults <- killContainerResult
		}(container)
	}
	wg.Wait()
	close(containerResults)

	for containerResult := range containerResults {
		syncResults = append(syncResults, containerResult)
	}
	return
}

5.1.1 m.killContainer

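The m.killContainer method mainly calls m.runtimeService.StopContainer.

runtimeService is a RemoteRuntimeService. It implements the RuntimeService interface, the client side of the CRI shim's container runtime interface, and holds the client used to communicate with the CRI shim's container runtime server. Calling m.runtimeService.StopContainer therefore amounts to calling the StopContainer method on the CRI shim server, which stops the container.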

// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainer kills a container through the following steps:
// * Run the pre-stop lifecycle hooks (if applicable).
// * Stop the container.
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
	...

	klog.V(2).Infof("Killing container %q with %d second grace period", containerID.String(), gracePeriod)

	err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
	if err != nil {
		klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
	} else {
		klog.V(3).Infof("Container %q exited normally", containerID.String())
	}

	m.containerRefManager.ClearRef(containerID)

	return err
}

m.runtimeService.StopContainer

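The m.runtimeService.StopContainer method calls r.runtimeClient.StopContainer. In other words, it uses the CRI shim client to ask the CRI shim server to stop the container.

At this point the CRI-related calls inside kubelet have all been covered. Next we move into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to analyze how the container is stopped.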

// pkg/kubelet/remote/remote_runtime.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {
	// Use timeout + default timeout (2 minutes) as timeout to leave extra time
	// for SIGKILL container and request latency.
	t := r.timeout + time.Duration(timeout)*time.Second
	ctx, cancel := getContextWithTimeout(t)
	defer cancel()

	r.logReduction.ClearID(containerID)
	_, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{
		ContainerId: containerID,
		Timeout:     timeout,
	})
	if err != nil {
		klog.Errorf("StopContainer %q from runtime service failed: %v", containerID, err)
		return err
	}

	return nil
}

5.1.2 r.runtimeClient.StopContainer

Next, using dockershim as the example, we move into the CRI shim to analyze how the container is stopped.

The r.runtimeClient.StopContainer call made by kubelet above lands in dockershim's StopContainer method.

// pkg/kubelet/dockershim/docker_container.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (ds *dockerService) StopContainer(_ context.Context, r *runtimeapi.StopContainerRequest) (*runtimeapi.StopContainerResponse, error) {
	err := ds.client.StopContainer(r.ContainerId, time.Duration(r.Timeout)*time.Second)
	if err != nil {
		return nil, err
	}
	return &runtimeapi.StopContainerResponse{}, nil
}

ds.client.StopContainer

This mainly calls d.client.ContainerStop.

// pkg/kubelet/dockershim/libdocker/kube_docker_client.go
// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
	ctx, cancel := d.getCustomTimeoutContext(timeout)
	defer cancel()
	err := d.client.ContainerStop(ctx, id, &timeout)
	if ctxErr := contextError(ctx); ctxErr != nil {
		return ctxErr
	}
	return err
}

d.client.ContainerStop

It builds the request parameters and sends an HTTP request to the docker URL for stopping a container.

// vendor/github.com/docker/docker/client/container_stop.go
// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
	query := url.Values{}
	if timeout != nil {
		query.Set("t", timetypes.DurationToSecondsString(*timeout))
	}
	resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
	ensureReaderClosed(resp)
	return err
}

5.2 m.runtimeService.StopPodSandbox

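In m.runtimeService.StopPodSandbox, runtimeService is again a RemoteRuntimeService. It implements the RuntimeService interface, the client side of the CRI shim's container runtime interface, and holds the client used to communicate with the CRI shim's container runtime server. Calling m.runtimeService.StopPodSandbox therefore amounts to calling the StopPodSandbox method on the CRI shim server, which stops the pod sandbox.

At this point the CRI-related calls inside kubelet have all been covered. Next we move into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to analyze how the pod sandbox is stopped.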

// pkg/kubelet/remote/remote_runtime.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be forced to termination.
func (r *RemoteRuntimeService) StopPodSandbox(podSandBoxID string) error {
	ctx, cancel := getContextWithTimeout(r.timeout)
	defer cancel()

	_, err := r.runtimeClient.StopPodSandbox(ctx, &runtimeapi.StopPodSandboxRequest{
		PodSandboxId: podSandBoxID,
	})
	if err != nil {
		klog.Errorf("StopPodSandbox %q from runtime service failed: %v", podSandBoxID, err)
		return err
	}

	return nil
}

5.2.1 r.runtimeClient.StopPodSandbox

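Next, using dockershim as the example, we move into the CRI shim to analyze how the pod sandbox is stopped.

The r.runtimeClient.StopPodSandbox call made by kubelet above lands in dockershim's StopPodSandbox method.

Stopping a pod sandbox takes two main steps:
(1) call ds.network.TearDownPod to tear down the pod network;
(2) call ds.client.StopContainer to stop the pod sandbox container.

Note that stopping the pod sandbox only counts as successful when both steps succeed. The order in which the two steps succeed does not matter.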

// pkg/kubelet/dockershim/docker_sandbox.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
	var namespace, name string
	var hostNetwork bool

	podSandboxID := r.PodSandboxId
	resp := &runtimeapi.StopPodSandboxResponse{}

	// Try to retrieve minimal sandbox information from docker daemon or sandbox checkpoint.
	inspectResult, metadata, statusErr := ds.getPodSandboxDetails(podSandboxID)
	if statusErr == nil {
		namespace = metadata.Namespace
		name = metadata.Name
		hostNetwork = (networkNamespaceMode(inspectResult) == runtimeapi.NamespaceMode_NODE)
	} else {
		checkpoint := NewPodSandboxCheckpoint("", "", &CheckpointData{})
		checkpointErr := ds.checkpointManager.GetCheckpoint(podSandboxID, checkpoint)

		// Proceed if both sandbox container and checkpoint could not be found. This means that following
		// actions will only have sandbox ID and not have pod namespace and name information.
		// Return error if encounter any unexpected error.
		if checkpointErr != nil {
			if checkpointErr != errors.ErrCheckpointNotFound {
				err := ds.checkpointManager.RemoveCheckpoint(podSandboxID)
				if err != nil {
					klog.Errorf("Failed to delete corrupt checkpoint for sandbox %q: %v", podSandboxID, err)
				}
			}
			if libdocker.IsContainerNotFoundError(statusErr) {
				klog.Warningf("Both sandbox container and checkpoint for id %q could not be found. "+
					"Proceed without further sandbox information.", podSandboxID)
			} else {
				return nil, utilerrors.NewAggregate([]error{
					fmt.Errorf("failed to get checkpoint for sandbox %q: %v", podSandboxID, checkpointErr),
					fmt.Errorf("failed to get sandbox status: %v", statusErr)})
			}
		} else {
			_, name, namespace, _, hostNetwork = checkpoint.GetData()
		}
	}

	// WARNING: The following operations made the following assumption:
	// 1. kubelet will retry on any error returned by StopPodSandbox.
	// 2. tearing down network and stopping sandbox container can succeed in any sequence.
	// This depends on the implementation detail of network plugin and proper error handling.
	// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
	// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
	// since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
	// effort clean up and will not return error.
	errList := []error{}
	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}

	if len(errList) == 0 {
		return resp, nil
	}

	// TODO: Stop all running containers in the sandbox.
	return nil, utilerrors.NewAggregate(errList)
}

ds.client.StopContainer

This mainly calls d.client.ContainerStop.

// pkg/kubelet/dockershim/libdocker/kube_docker_client.go
// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
	ctx, cancel := d.getCustomTimeoutContext(timeout)
	defer cancel()
	err := d.client.ContainerStop(ctx, id, &timeout)
	if ctxErr := contextError(ctx); ctxErr != nil {
		return ctxErr
	}
	return err
}

d.client.ContainerStop

It builds the request parameters and sends an HTTP request to the docker URL for stopping a container, which stops the pod sandbox container.

// vendor/github.com/docker/docker/client/container_stop.go
// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
	query := url.Values{}
	if timeout != nil {
		query.Set("t", timetypes.DurationToSecondsString(*timeout))
	}
	resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
	ensureReaderClosed(resp)
	return err
}

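Summary

CRI architecture diagram

Below CRI there are two kinds of container runtime implementations:
(1) dockershim, built into kubelet, which supports the Docker container engine and CNI network plugins (including kubenet). The dockershim code lives inside kubelet and is invoked by kubelet. It starts its own server to act as the CRI shim and exposes a gRPC server to kubelet;
(2) external container runtimes, which support container engines such as rkt and containerd.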

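How kubelet calls CRI to delete a pod

kubelet deletes a pod as follows:
(1) first stop all containers that belong to the pod;
(2) then stop the pod sandbox container (which includes tearing down the pod network).

Note: this only stops the containers. Removing them is left to kubelet's garbage collection (GC).

Call flow for deleting a pod through CRI

The call flow is illustrated using kubelet with dockershim as the example.

kubelet calls dockershim to stop containers; dockershim in turn calls docker to stop the containers and calls CNI to tear down the pod network.

Figure 1: kubelet dockershim pod deletion call diagram

dockershim is the CRI shim built into kubelet. The pod deletion call flow of other, remote CRI shims is essentially the same as that of dockershim. They simply call a different container engine to operate on containers, and it is still the CRI shim that calls CNI to tear down the pod network.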
