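We deployed Ceph on our company's Kubernetes cluster to provide object storage and file storage, following the official Rook documentation step by step. Testing after the deployment showed that file storage volumes could not be expanded. This post records the troubleshooting process and the workaround.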
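Symptom
The Ceph cluster was deployed with Rook. Once it was up, we tested CephFS volume expansion by editing the PVC and increasing its capacity, and the expansion failed. The csi-resizer log looks like this: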
    # kubectl -n rook-ceph logs --tail 10 csi-cephfsplugin-provisioner-55db45f9db-nbk82 -c csi-resizer
    E0322 02:52:13.924110 1 controller.go:272] Error syncing PVC: resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
    I0322 02:52:13.924170 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"176dd919-6a5c-46c5-9a48-82f642d34ead", APIVersion:"v1", ResourceVersion:"442928655", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
    I0322 02:52:13.924208 1 controller.go:281] Started PVC processing "rook-ceph/cephfs-pvc"
    I0322 02:52:13.928205 1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerExpandVolume
    I0322 02:52:13.928231 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":2147483648},"volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_id":"0001-0009-rook-ceph-0000000000000001-c512defd-87b4-11eb-bcbd-26ec2e544591"}
    I0322 02:52:13.928325 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"176dd919-6a5c-46c5-9a48-82f642d34ead", APIVersion:"v1", ResourceVersion:"442928655", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-176dd919-6a5c-46c5-9a48-82f642d34ead
    I0322 02:52:13.929327 1 connection.go:185] GRPC response: {}
    I0322 02:52:13.929370 1 connection.go:186] GRPC error: rpc error: code = InvalidArgument desc = provided secret is empty
    E0322 02:52:13.929430 1 controller.go:272] Error syncing PVC: resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
    I0322 02:52:13.929502 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"176dd919-6a5c-46c5-9a48-82f642d34ead", APIVersion:"v1", ResourceVersion:"442928655", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
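Locating the problem
In the log, the gRPC error is reported from connection.go, and the final "resize volume ... failed" message is logged from controller.go. The code that produces that message is: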
    // resizeVolume resize the volume to request size, and update PV's capacity if succeeded.
    func (ctrl *resizeController) resizeVolume(
    	pvc *v1.PersistentVolumeClaim,
    	pv *v1.PersistentVolume) (resource.Quantity, bool, error) {

    	// before trying expansion we will remove the PVC from map
    	// that tracks PVCs which can't be expanded when in-use. If
    	// pvc indeed can not be expanded when in-use then it will be added
    	// back when expansion fails with in-use error.
    	ctrl.usedPVCs.removePVCWithInUseError(pvc)
    	requestSize := pvc.Spec.Resources.Requests[v1.ResourceStorage]

    	newSize, fsResizeRequired, err := ctrl.resizer.Resize(pv, requestSize)
    	if err != nil {
    		// if this error was a in-use error then it must be tracked so as we don't retry without
    		// first verifying if volume is in-use
    		if inUseError(err) {
    			ctrl.usedPVCs.addPVCWithInUseError(pvc)
    		}
    		return newSize, fsResizeRequired, fmt.Errorf("resize volume %q by resizer %q failed: %v", pv.Name, ctrl.name, err)
    	}
    	klog.V(4).Infof("Resize volume succeeded for volume %q, start to update PV's capacity", pv.Name)

    	err = ctrl.updatePVCapacity(pv, newSize)
    	if err != nil {
    		return newSize, fsResizeRequired, err
    	}
    	klog.V(4).Infof("Update capacity of PV %q to %s succeeded", pv.Name, newSize.String())

    	return newSize, fsResizeRequired, nil
    }
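The resizeVolume method updates the size of the PV. As the code shows, the error is returned by the call to ctrl.resizer.Resize. Looking further into the Resize method, and taking the "provided secret is empty" message into account, the problem comes from the following part: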
    func (r *csiResizer) Resize(pv *v1.PersistentVolume, requestSize resource.Quantity) (resource.Quantity, bool, error) {
    	// ···

    	var secrets map[string]string
    	secreRef := source.ControllerExpandSecretRef
    	if secreRef != nil {
    		var err error
    		secrets, err = getCredentials(r.k8sClient, secreRef)
    		if err != nil {
    			return oldSize, false, err
    		}
    	}

    	// ···
    }
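The logic here is: read the ControllerExpandSecretRef field from the PV. Judging by its name, this field identifies the Secret to use when expanding the volume. When the field is nil, getCredentials is never called, the secrets map stays empty, and the expand request fails with "provided secret is empty". The SecretReference struct is defined as follows: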
    type SecretReference struct {
    	// Name is unique within a namespace to reference a secret resource.
    	// +optional
    	Name string `json:"name,omitempty" protobuf:"bytes,1,opt,name=name"`
    	// Namespace defines the space within which the secret name must be unique.
    	// +optional
    	Namespace string `json:"namespace,omitempty"`
    }
The SecretReference struct holds the Namespace and Name of a Secret. With the relevant code located, the next step is to find out why the PV's ControllerExpandSecretRef field carries no Secret information.
Inspecting the PV shows that it has no ControllerExpandSecretRef field at all, although it does have a similar nodeStageSecretRef field.
Resolution
The first attempt was to patch the PV definition and add the missing controllerExpandSecretRef field (a SecretReference).
    # kubectl patch pv pvc-176dd919-6a5c-46c5-9a48-82f642d34ead -o jsonpath='{.spec.volumeName}' --patch '{"spec":{"csi":{"controllerExpandSecretRef":{"name":"rook-csi-cephfs-provisioner","namespace":"rook-ceph"}}}}'
This achieved nothing: the command reported no error, but the change did not take effect.
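One way to confirm whether such a patch actually stuck is to read the PV back as JSON and check for the field. The sketch below assumes the JSON comes from something like kubectl get pv NAME -o json. The persistentVolume type is a hypothetical, trimmed-down mirror of the PV object that keeps only the fields discussed in this post, and hasExpandSecret is a made-up helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// secretRef mirrors the name/namespace pair of a SecretReference.
type secretRef struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

// persistentVolume is a hypothetical, trimmed-down view of a PV object.
type persistentVolume struct {
	Spec struct {
		CSI *struct {
			Driver                    string     `json:"driver"`
			NodeStageSecretRef        *secretRef `json:"nodeStageSecretRef"`
			ControllerExpandSecretRef *secretRef `json:"controllerExpandSecretRef"`
		} `json:"csi"`
	} `json:"spec"`
}

// hasExpandSecret reports whether the PV JSON carries spec.csi.controllerExpandSecretRef.
func hasExpandSecret(pvJSON []byte) (bool, error) {
	var pv persistentVolume
	if err := json.Unmarshal(pvJSON, &pv); err != nil {
		return false, err
	}
	return pv.Spec.CSI != nil && pv.Spec.CSI.ControllerExpandSecretRef != nil, nil
}

func main() {
	// Shortened sample shaped like the PV in this post (only nodeStageSecretRef is set).
	sample := []byte(`{"spec":{"csi":{
		"driver":"rook-ceph.cephfs.csi.ceph.com",
		"nodeStageSecretRef":{"name":"rook-csi-cephfs-node","namespace":"rook-ceph"}}}}`)

	ok, err := hasExpandSecret(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("controllerExpandSecretRef present:", ok)
}
```

Checking the stored object rather than the patch command's output avoids being misled by a command that exits cleanly without changing anything.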
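Some posts online mention that kubelet needs to be started with --feature-gates "ExpandCSIVolumes=true". This feature gate is only enabled by default from Kubernetes 1.16 onward and has to be turned on manually in earlier versions. Our cluster happens to be on 1.15, but enabling the gate made no difference in our tests.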
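The interaction between the version default and an explicit --feature-gates value can be sketched as follows. This is an illustration only, not Kubernetes' own feature-gate implementation: expandCSIVolumesEnabled is a made-up helper, it only considers 1.x minor versions, and the "on by default from 1.16" rule is taken from the paragraph above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandCSIVolumesEnabled reports whether ExpandCSIVolumes would be on for a
// Kubernetes 1.<minor> component started with the given --feature-gates value.
// An explicit setting in featureGates overrides the version default.
func expandCSIVolumesEnabled(minor int, featureGates string) (bool, error) {
	enabled := minor >= 16 // default assumed from the post: on from 1.16

	for _, kv := range strings.Split(featureGates, ",") {
		kv = strings.TrimSpace(kv)
		if kv == "" {
			continue
		}
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return false, fmt.Errorf("malformed feature gate %q", kv)
		}
		if strings.TrimSpace(parts[0]) != "ExpandCSIVolumes" {
			continue
		}
		v, err := strconv.ParseBool(strings.TrimSpace(parts[1]))
		if err != nil {
			return false, fmt.Errorf("invalid value for ExpandCSIVolumes: %v", err)
		}
		enabled = v
	}
	return enabled, nil
}

func main() {
	for _, c := range []struct {
		minor int
		gates string
	}{
		{15, ""},
		{15, "ExpandCSIVolumes=true"},
		{16, ""},
	} {
		on, err := expandCSIVolumesEnabled(c.minor, c.gates)
		fmt.Printf("1.%d gates=%q enabled=%v err=%v\n", c.minor, c.gates, on, err)
	}
}
```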
    spec:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 3Gi
      claimRef:
        apiVersion: v1
        kind: PersistentVolumeClaim
        name: cephfs-pvc
        namespace: rook-ceph
        resourceVersion: "438654796"
        uid: 176dd919-6a5c-46c5-9a48-82f642d34ead
      csi:
        driver: rook-ceph.cephfs.csi.ceph.com
        nodeStageSecretRef:
          name: rook-csi-cephfs-node
          namespace: rook-ceph
        volumeAttributes:
          clusterID: rook-ceph
          fsName: myfs
          pool: myfs-data0
          storage.kubernetes.io/csiProvisionerIdentity: 1615980051642-8081-rook-ceph.cephfs.csi.ceph.com
          subvolumeName: csi-vol-c512defd-87b4-11eb-bcbd-26ec2e544591
        volumeHandle: 0001-0009-rook-ceph-0000000000000001-c512defd-87b4-11eb-bcbd-26ec2e544591
      persistentVolumeReclaimPolicy: Retain
      storageClassName: rook-cephfs
      volumeMode: Filesystem
With none of these approaches working, we finally decided to modify the code and build our own image. The change is in csi-resizer.go: instead of reading the SecretReference from the PV, the reference is defined directly in the code.

    secreRef := &v1.SecretReference{
    	Namespace: "rook-ceph",
    	Name:      "rook-csi-cephfs-provisioner",
    }

    secrets, err := getCredentials(r.k8sClient, secreRef)
    if err != nil {
    	return oldSize, false, err
    }
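A slightly less invasive variant of the same workaround would keep honouring a reference that the PV does carry and only fall back to the hardcoded one when it is missing. This is not what we built; it is a sketch under assumptions, with effectiveSecretRef and fallbackRef as hypothetical names and a local SecretReference type standing in for v1.SecretReference.

```go
package main

import "fmt"

// SecretReference stands in for v1.SecretReference (name and namespace only).
type SecretReference struct {
	Name      string
	Namespace string
}

// fallbackRef is the hardcoded reference used in this post's workaround.
var fallbackRef = SecretReference{
	Name:      "rook-csi-cephfs-provisioner",
	Namespace: "rook-ceph",
}

// effectiveSecretRef returns the PV's own reference when it is usable and the
// hardcoded fallback otherwise, so PVs that carry the field keep working as before.
func effectiveSecretRef(fromPV *SecretReference) *SecretReference {
	if fromPV != nil && fromPV.Name != "" {
		return fromPV
	}
	ref := fallbackRef // copy, so callers cannot mutate the package-level default
	return &ref
}

func main() {
	fmt.Printf("%+v\n", *effectiveSecretRef(nil))
	fmt.Printf("%+v\n", *effectiveSecretRef(&SecretReference{Name: "custom", Namespace: "other"}))
}
```

Either way the Secret name and namespace are baked into the binary, so the caveat about not running this in production still applies.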
After the change, we built a new image and swapped it in, and volume expansion then worked in testing. The log of a successful expansion looks like this:

    I0322 02:26:46.445966 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"6d947adc-99b8-491d-9918-2df162b9e18c", APIVersion:"v1", ResourceVersion:"11769690", FieldPath:""}): type: 'Normal' reason: 'VolumeResizeSuccessful' Resize volume succeeded
    I0322 02:27:55.123523 1 controller.go:281] Started PVC processing "rook-ceph/cephfs-pvc"
    I0322 02:27:55.129411 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"6d947adc-99b8-491d-9918-2df162b9e18c", APIVersion:"v1", ResourceVersion:"11769899", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-6d947adc-99b8-491d-9918-2df162b9e18c
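Note: this is only a temporary workaround. It is fine for experimenting in a test environment, but it must never be used in production.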
Summary
The root cause of this problem is that the CSI provisioner plugin requires Kubernetes 1.16 or later, while our cluster runs 1.15.