K8sGPT Source Code Analysis


Recap

The previous post used k8sgpt to analyze a pod. This post walks through the source code behind that pod diagnosis.

As before, we start by creating a pod with an invalid image:

 kubectl create deployment demo3 --image=httddd --port=80

In a new terminal window, start the local model:

OLLAMA_HOST=0.0.0.0 ollama start

Clone the latest code with git clone. I have forked the official main branch into my own repository:

https://github.com/jaychenthinkfast/k8sgpt.git

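To avoid Go API incompatibilities, the Go version used to run the project locally should match the one used upstream; Go 1.22 is used here. From the project directory, run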

go mod tidy

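to update the dependencies, then run the code to diagnose the pod. So that cached results do not hide the latest prompt when we inspect it in the code later, it is best to pass the no-cache flag -c: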

~ go run main.go analyze --explain --filter=Pod -n=default  -c

Diagnosis result

 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (1/1, 50 it/hr)           
AI Provider: localai

0: Pod default/demo3-5754546d78-4frd5(Deployment/demo3)
- Error: Back-off pulling image "httddd"
Error: Unable to pull image "httddd" due to invalid URL.

Solution: 
1. Check if the URL is correct and valid.
2. Verify that the image exists on the registry.
3. If using a private registry, ensure credentials are set correctly.
4. Try running docker pull httddd to test the image directly.

PodAnalyzer

Call chain

cmd/analyze/analyze.go: entry point of the diagnosis

  pkg/analysis/analysis.go: entry point of the analysis

    pkg/analyzer/pod.go: the Pod analyzer
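
The chain above (command, then analysis, then a per-resource analyzer) can be pictured as a lookup from filter name to analyzer. The following is only a rough sketch of that idea, not the upstream code: the Analyzer interface, the registry map, runAnalysis, and the stub podAnalyzer are all hypothetical names, and the stub returns a fixed failure instead of querying a cluster.

```go
package main

import "fmt"

// Failure and Result are simplified stand-ins for what an analyzer reports.
type Failure struct{ Text string }

type Result struct {
	Kind  string
	Name  string
	Error []Failure
}

// Analyzer is a hypothetical interface: one implementation per resource kind.
type Analyzer interface {
	Analyze(namespace string) ([]Result, error)
}

// podAnalyzer is a stub; it does not talk to a cluster.
type podAnalyzer struct{}

func (podAnalyzer) Analyze(namespace string) ([]Result, error) {
	return []Result{{
		Kind:  "Pod",
		Name:  namespace + "/demo3",
		Error: []Failure{{Text: `Back-off pulling image "httddd"`}},
	}}, nil
}

// registry maps a filter name (as passed to --filter) to its analyzer.
var registry = map[string]Analyzer{"Pod": podAnalyzer{}}

// runAnalysis runs only the analyzers selected by filters.
func runAnalysis(namespace string, filters []string) ([]Result, error) {
	var results []Result
	for _, f := range filters {
		a, ok := registry[f]
		if !ok {
			return nil, fmt.Errorf("unknown filter %q", f)
		}
		r, err := a.Analyze(namespace)
		if err != nil {
			return nil, err
		}
		results = append(results, r...)
	}
	return results, nil
}

func main() {
	results, err := runAnalysis("default", []string{"Pod"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range results {
		fmt.Printf("%s %s: %s\n", r.Kind, r.Name, r.Error[0].Text)
	}
}
```

With this shape, supporting another resource kind would only mean adding one more entry to the map.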

Pod analyzer logic

The analyzer's main job is to collect error messages, which later serve as the input for the GPT analysis.

Fetch the pods in the given namespace:

pkg/analyzer/pod.go:37

list, err := a.Client.GetClient().CoreV1().Pods(a.Namespace).List(a.Context, metav1.ListOptions{})

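Collect error messages from Pending pods. Despite its name, the loop variable containerStatus iterates over pod.Status.Conditions; a failure is recorded only when the PodScheduled condition has the reason Unschedulable and a non-empty message.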

pkg/analyzer/pod.go:47

// Check for pending pods
        if pod.Status.Phase == "Pending" {
            // Check through container status to check for crashes
            for _, containerStatus := range pod.Status.Conditions {
                if containerStatus.Type == v1.PodScheduled && containerStatus.Reason == "Unschedulable" {
                    if containerStatus.Message != "" {
                        failures = append(failures, common.Failure{
                            Text:      containerStatus.Message,
                            Sensitive: []common.Sensitive{},
                        })
                    }
                }
            }
        }
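
To make the condition check easier to try out without a cluster, here is a minimal sketch of the same logic on plain structs. The Pod, PodCondition, and Failure types and the pendingFailures helper are simplified stand-ins introduced for illustration (they are not the Kubernetes API types), and the scheduler message in main is an invented example.

```go
package main

import "fmt"

// PodCondition and Pod are reduced stand-ins for the Kubernetes types.
type PodCondition struct {
	Type    string
	Reason  string
	Message string
}

type Pod struct {
	Phase      string
	Conditions []PodCondition
}

type Failure struct{ Text string }

// pendingFailures returns the message of every Unschedulable PodScheduled
// condition of a Pending pod, skipping conditions with an empty message.
func pendingFailures(pod Pod) []Failure {
	var failures []Failure
	if pod.Phase != "Pending" {
		return failures
	}
	for _, cond := range pod.Conditions {
		if cond.Type == "PodScheduled" && cond.Reason == "Unschedulable" && cond.Message != "" {
			failures = append(failures, Failure{Text: cond.Message})
		}
	}
	return failures
}

func main() {
	pod := Pod{
		Phase: "Pending",
		Conditions: []PodCondition{
			// Invented message, for illustration only.
			{Type: "PodScheduled", Reason: "Unschedulable", Message: "0/1 nodes are available: 1 Insufficient cpu."},
		},
	}
	for _, f := range pendingFailures(pod) {
		fmt.Println(f.Text)
	}
}
```

A pod that is Pending for another reason, such as a failing image pull, has no Unschedulable condition, so this check adds nothing for it; its error has to come from the container statuses instead.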

Collect errors from the init containers

pkg/analyzer/pod.go:62

failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.InitContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)

Collect errors from the regular containers

pkg/analyzer/pod.go:65

failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.ContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)
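
The body of analyzeContainerStatusFailures is not shown here, so the following is only a guess at its core idea, consistent with the diagnosis output earlier: a container stuck in a waiting state with an error reason contributes its message as a failure. The types, the waitingFailures helper, and the errorReasons set are assumptions made for this sketch; the real function may check more cases.

```go
package main

import "fmt"

// Reduced stand-ins for the Kubernetes container status types.
type ContainerStateWaiting struct {
	Reason  string
	Message string
}

type ContainerStatus struct {
	Name    string
	Waiting *ContainerStateWaiting // nil when the container is not waiting
}

type Failure struct{ Text string }

// errorReasons is an illustrative, non-exhaustive set of waiting reasons
// that this sketch treats as failures.
var errorReasons = map[string]bool{
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
	"CrashLoopBackOff": true,
}

// waitingFailures collects the message of every container that is waiting
// for one of the reasons above.
func waitingFailures(statuses []ContainerStatus) []Failure {
	var failures []Failure
	for _, s := range statuses {
		if s.Waiting == nil {
			continue
		}
		if errorReasons[s.Waiting.Reason] && s.Waiting.Message != "" {
			failures = append(failures, Failure{Text: s.Waiting.Message})
		}
	}
	return failures
}

func main() {
	statuses := []ContainerStatus{{
		Name:    "httddd",
		Waiting: &ContainerStateWaiting{Reason: "ImagePullBackOff", Message: `Back-off pulling image "httddd"`},
	}}
	for _, f := range waitingFailures(statuses) {
		fmt.Println(f.Text)
	}
}
```

Because the same helper works on any slice of statuses, it can be applied to init containers and regular containers alike, which matches the two calls above.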

Build the prompt from the error message and the prompt template, then send it to the model for analysis

pkg/analysis/analysis.go:344

Let's print the template and the prompt to see what they look like (see the commit for details):

    fmt.Println("promptTmpl", promptTmpl)
    fmt.Println("prompt", prompt)

Output

promptTmpl Simplify the following Kubernetes error message delimited by triple dashes written in --- %s --- language; --- %s ---.
        Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
        Error: {Explain error here}
        Solution: {Step by step solution here}

prompt Simplify the following Kubernetes error message delimited by triple dashes written in --- english --- language; --- Back-off pulling image "httddd" ---.
        Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
        Error: {Explain error here}
        Solution: {Step by step solution here}

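As the output shows, the prompt states the response language and carries the error message. It also asks the model for a step-by-step solution and fixes both the output format and a character limit.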

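At this point, if no local or remote model is available, or the local hardware is weak and the model takes a long time to respond, printing the prompt is enough: it can be pasted by hand into any hosted large model, domestic or international. We can also extend the prompt to ask for output in Chinese and comment out the model call in the code (see the commit for details):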

// Process template.
    promptTmpl += " 请用中文输出"
    prompt := fmt.Sprintf(strings.TrimSpace(promptTmpl), a.Language, inputKey)
    fmt.Println("promptTmpl", promptTmpl)
    fmt.Println("prompt", prompt)
    //response, err := a.AIClient.GetCompletion(a.Context, prompt)
    //if err != nil {
    //  return "", err
    //}
    //
    //if err = a.Cache.Store(cacheKey, base64.StdEncoding.EncodeToString([]byte(response))); err != nil {
    //  color.Red("error storing value to cache; value won't be cached: %v", err)
    //}
    //return response, nil
    return "", nil
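
The commented-out lines show that a model response is base64-encoded before being stored in the cache. As a rough illustration of why the -c flag matters when inspecting prompts, here is a sketch of a cached completion path. Everything in it is hypothetical (memoryCache, explain, the noCache parameter, and the fake model function), and the upstream cache handling may differ.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// memoryCache is a hypothetical in-memory stand-in for the real cache.
type memoryCache struct{ data map[string]string }

func newMemoryCache() *memoryCache { return &memoryCache{data: map[string]string{}} }

func (c *memoryCache) Store(key, value string) error { c.data[key] = value; return nil }

func (c *memoryCache) Load(key string) (string, bool) {
	v, ok := c.data[key]
	return v, ok
}

// explain returns a cached answer when one exists and noCache is false;
// otherwise it calls complete and stores the base64-encoded response.
func explain(cache *memoryCache, noCache bool, key, prompt string, complete func(string) (string, error)) (string, error) {
	if !noCache {
		if enc, ok := cache.Load(key); ok {
			if raw, err := base64.StdEncoding.DecodeString(enc); err == nil {
				return string(raw), nil
			}
		}
	}
	resp, err := complete(prompt)
	if err != nil {
		return "", err
	}
	if err := cache.Store(key, base64.StdEncoding.EncodeToString([]byte(resp))); err != nil {
		fmt.Println("value won't be cached:", err)
	}
	return resp, nil
}

func main() {
	cache := newMemoryCache()
	calls := 0
	fakeModel := func(prompt string) (string, error) {
		calls++
		return "Error: ... Solution: ...", nil
	}
	for i := 0; i < 2; i++ {
		if _, err := explain(cache, false, "pod-key", "prompt text", fakeModel); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("model calls with cache enabled:", calls)
}
```

In this sketch a cache hit skips the model call entirely, so any code placed next to that call, such as a debug print, would not run again until the cache is bypassed.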

Run the diagnosis again:

~ go run main.go analyze --explain --filter=Pod -n=default  -c

The new prompt is printed:

prompt Simplify the following Kubernetes error message delimited by triple dashes written in --- english --- language; --- Back-off pulling image "httddd" ---.
        Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
        Error: {Explain error here}
        Solution: {Step by step solution here}
         请用中文输出

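Model comparison

Each model below received the prompt above, including the final request to answer in Chinese.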

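ERNIE Bot (文心一言)

Error: Unable to pull the image "httddd"; this may be caused by a wrong image name, a nonexistent image, or a network problem.
Solution: 1. Check that the image name is correct (it should be something like "httpd" rather than "httddd"). 2. Confirm that the image has been pushed to the registry. 3. Check the Kubernetes cluster's network settings to make sure the image registry is reachable. 4. Redeploy the Pod to retry the pull.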

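Tongyi Qianwen (通义千问)

Error: Kubernetes ran into a problem while trying to pull the "httddd" image, causing the operation to back off.
Solution: 1. Check that the image name is correct and free of the "ttt" error. 2. Confirm that the Docker image registry is reachable. 3. Pull the image manually as a test: docker pull {correct image name}. 4. If the manual pull succeeds, restart the Pod.

Error: Kubernetes cannot pull the "httddd" image from the registry, possibly because the image name is wrong or the registry is unreachable.
Solution: 1. Verify and correct the "httddd" in the image name. 2. Make sure the network can reach the image registry. 3. Try pulling the image with the docker command to verify. 4. If the pull succeeds, redeploy the Pod.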

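Tencent Yuanbao (腾讯元宝)

Error: Kubernetes ran into a problem while trying to pull the image named "httddd", so the task cannot continue.

Solution:

Confirm that the image name is correct. Check that the image name in the deployment manifest or command matches the actual image name.
Check that the image exists. Make sure the required image is already present in the image registry.
Make sure the cluster is authorized to access the registry. Check that the cluster has permission to access and pull the image.
Check the network connection. Make sure the cluster can reach the external network to pull the image.
Inspect the Kubernetes logs. Use the kubectl describe pod command to view the Pod's detailed status and find the specific cause.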

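ChatGPT

Error: The image pull failed with the error Back-off pulling image "httddd".
Solution:

Check that the image name or path is correct.
Make sure the network connection is working and the image registry is reachable.
Check that the credentials are correct, for example if a username and password are required.
If it is a private registry, make sure the Kubernetes cluster has permission to access it.
Redeploy the Pod, or delete it manually and recreate it.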