夜下听雨

Record and summarize; live and learn!

CephFS volumes cannot be expanded in a rook-ceph cluster

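We deployed Ceph on our company's k8s cluster to provide object storage and file storage, following the official Rook documentation step by step. After the deployment, testing showed that file storage volumes could not be expanded. This post records the troubleshooting process and the workaround.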

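Symptoms

The Ceph cluster was deployed with Rook. After a successful deployment we tested CephFS volume expansion by editing the PVC and increasing its requested capacity, and the expansion failed. The csi-resizer log shows: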

# kubectl -n rook-ceph logs --tail 10 csi-cephfsplugin-provisioner-55db45f9db-nbk82 -c csi-resizer
E0322 02:52:13.924110 1 controller.go:272] Error syncing PVC: resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
I0322 02:52:13.924170 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"176dd919-6a5c-46c5-9a48-82f642d34ead", APIVersion:"v1", ResourceVersion:"442928655", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
I0322 02:52:13.924208 1 controller.go:281] Started PVC processing "rook-ceph/cephfs-pvc"
I0322 02:52:13.928205 1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerExpandVolume
I0322 02:52:13.928231 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":2147483648},"volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_id":"0001-0009-rook-ceph-0000000000000001-c512defd-87b4-11eb-bcbd-26ec2e544591"}
I0322 02:52:13.928325 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"176dd919-6a5c-46c5-9a48-82f642d34ead", APIVersion:"v1", ResourceVersion:"442928655", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-176dd919-6a5c-46c5-9a48-82f642d34ead
I0322 02:52:13.929327 1 connection.go:185] GRPC response: {}
I0322 02:52:13.929370 1 connection.go:186] GRPC error: rpc error: code = InvalidArgument desc = provided secret is empty
E0322 02:52:13.929430 1 controller.go:272] Error syncing PVC: resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
I0322 02:52:13.929502 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"176dd919-6a5c-46c5-9a48-82f642d34ead", APIVersion:"v1", ResourceVersion:"442928655", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume "pvc-176dd919-6a5c-46c5-9a48-82f642d34ead" by resizer "rook-ceph.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty
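
The gRPC request in the log is worth a closer look. The following is a minimal Go sketch (not part of any sidecar) that decodes the logged request body and reports the requested size and whether a `secrets` key was present at all. The `expandRequest` type and the `inspect` helper are names made up for this illustration; the `secrets` field name comes from the CSI `ControllerExpandVolumeRequest` message. As far as I understand, the sidecar masks secret values in its logs rather than dropping the key, so a missing key suggests no credentials were sent, but treat that as an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expandRequest mirrors only the fields of the logged ControllerExpandVolume
// request that matter here (hypothetical helper type for this sketch).
// Secrets is kept as raw JSON so that both a real map and a masked
// placeholder string decode without error; it stays nil if the key is absent.
type expandRequest struct {
	CapacityRange struct {
		RequiredBytes int64 `json:"required_bytes"`
	} `json:"capacity_range"`
	VolumeID string          `json:"volume_id"`
	Secrets  json.RawMessage `json:"secrets"`
}

// loggedRequest is the request body copied from the csi-resizer log above.
const loggedRequest = `{"capacity_range":{"required_bytes":2147483648},"volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_id":"0001-0009-rook-ceph-0000000000000001-c512defd-87b4-11eb-bcbd-26ec2e544591"}`

// inspect decodes a logged request and returns the requested size in GiB
// and whether the request carried a "secrets" key.
func inspect(raw string) (gib float64, hasSecrets bool, err error) {
	var req expandRequest
	if err := json.Unmarshal([]byte(raw), &req); err != nil {
		return 0, false, err
	}
	gib = float64(req.CapacityRange.RequiredBytes) / (1 << 30)
	return gib, len(req.Secrets) > 0, nil
}

func main() {
	gib, hasSecrets, err := inspect(loggedRequest)
	if err != nil {
		panic(err)
	}
	fmt.Printf("requested %.0f GiB, secrets key present: %v\n", gib, hasSecrets)
}
```

For the request above this reports a 2 GiB target with no `secrets` key, which is consistent with the driver answering "provided secret is empty".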

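Locating the problem

In the log, the gRPC error is reported from connection.go and the "Error syncing PVC" message from controller.go. The code that produces that message is: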

// resizeVolume resize the volume to request size, and update PV's capacity if succeeded.
func (ctrl *resizeController) resizeVolume(
    pvc *v1.PersistentVolumeClaim,
    pv *v1.PersistentVolume) (resource.Quantity, bool, error) {

    // before trying expansion we will remove the PVC from map
    // that tracks PVCs which can't be expanded when in-use. If
    // pvc indeed can not be expanded when in-use then it will be added
    // back when expansion fails with in-use error.
    ctrl.usedPVCs.removePVCWithInUseError(pvc)

    requestSize := pvc.Spec.Resources.Requests[v1.ResourceStorage]

    newSize, fsResizeRequired, err := ctrl.resizer.Resize(pv, requestSize)

    if err != nil {
        // if this error was a in-use error then it must be tracked so as we don't retry without
        // first verifying if volume is in-use
        if inUseError(err) {
            ctrl.usedPVCs.addPVCWithInUseError(pvc)
        }
        return newSize, fsResizeRequired, fmt.Errorf("resize volume %q by resizer %q failed: %v", pv.Name, ctrl.name, err)
    }
    klog.V(4).Infof("Resize volume succeeded for volume %q, start to update PV's capacity", pv.Name)

    err = ctrl.updatePVCapacity(pv, newSize)
    if err != nil {
        return newSize, fsResizeRequired, err
    }
    klog.V(4).Infof("Update capacity of PV %q to %s succeeded", pv.Name, newSize.String())

    return newSize, fsResizeRequired, nil
}

The resizeVolume method updates the size of the PV. The error is returned by the call to ctrl.resizer.Resize. Looking into Resize with the "provided secret is empty" message in mind, the problem turns out to be in this part:

func (r *csiResizer) Resize(pv *v1.PersistentVolume, requestSize resource.Quantity) (resource.Quantity, bool, error) {
    ···

    var secrets map[string]string
    secreRef := source.ControllerExpandSecretRef
    if secreRef != nil {
        var err error
        secrets, err = getCredentials(r.k8sClient, secreRef)
        if err != nil {
            return oldSize, false, err
        }
    }

    ···
}

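The logic here is: first read the ControllerExpandSecretRef field from the PV. Judging by the name, this field identifies a Secret. No corresponding Secret was found, so the credentials stayed empty and the expansion failed. The SecretReference struct is defined as follows: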

type SecretReference struct {
    // Name is unique within a namespace to reference a secret resource.
    // +optional
    Name string `json:"name,omitempty" protobuf:"bytes,1,opt,name=name"`
    // Namespace defines the space within which the secret name must be unique.
    // +optional
    Namespace string `json:"namespace,omitempty" protobuf:"bytes,2,opt,name=namespace"`
}

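The SecretReference struct holds the Namespace and Name of a Secret.
With the problem code located, the next question is why the PV's ControllerExpandSecretRef field carries no Secret information.

Inspecting the PV in detail shows that it has no ControllerExpandSecretRef field at all, although it does have a similar nodeStageSecretRef field.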
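
To make the failure path concrete, here is a small self-contained Go sketch of the same logic. It does not use the real Kubernetes client: `SecretReference` is a local stand-in for the struct above, while `secretStore`, `resolveSecrets`, and `expandVolume` are hypothetical names invented for this illustration. The driver-side check is an assumption based only on the error text in the log.

```go
package main

import (
	"errors"
	"fmt"
)

// SecretReference is a local stand-in for the Kubernetes type shown above.
type SecretReference struct {
	Name      string
	Namespace string
}

// secretStore is a hypothetical in-memory replacement for the API server,
// keyed by "namespace/name".
type secretStore map[string]map[string]string

// resolveSecrets follows the shape of the Resize excerpt: a nil reference
// is not an error, it simply leaves the credentials map nil.
func resolveSecrets(store secretStore, ref *SecretReference) (map[string]string, error) {
	var secrets map[string]string
	if ref != nil {
		s, ok := store[ref.Namespace+"/"+ref.Name]
		if !ok {
			return nil, fmt.Errorf("secret %s/%s not found", ref.Namespace, ref.Name)
		}
		secrets = s
	}
	return secrets, nil
}

var errEmptySecret = errors.New("provided secret is empty")

// expandVolume is an assumed driver-side check that rejects a request
// arriving without credentials.
func expandVolume(secrets map[string]string) error {
	if len(secrets) == 0 {
		return errEmptySecret
	}
	return nil
}

func main() {
	store := secretStore{
		"rook-ceph/rook-csi-cephfs-provisioner": {"exampleKey": "placeholder"},
	}

	// PV without controllerExpandSecretRef: lookup is skipped, expansion fails.
	secrets, err := resolveSecrets(store, nil)
	fmt.Println("no ref:  ", err, "/", expandVolume(secrets))

	// PV with controllerExpandSecretRef: credentials are found.
	ref := &SecretReference{Name: "rook-csi-cephfs-provisioner", Namespace: "rook-ceph"}
	secrets, err = resolveSecrets(store, ref)
	fmt.Println("with ref:", err, "/", expandVolume(secrets))
}
```

The point of the sketch is that the resizer itself never complains about the missing reference; the error only appears once the empty credentials reach the driver.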

Working toward a fix

  1. The first attempt was to patch the PV definition and add the SecretReference field.

    # kubectl patch pv pvc-176dd919-6a5c-46c5-9a48-82f642d34ead -o jsonpath='{.spec.volumeName}' --patch '{"spec":{"csi":{"controllerExpandSecretRef":{"name":"rook-csi-cephfs-provisioner","namespace":"rook-ceph"}}}}'

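    This had no effect: the command reported no error, but the change was not applied.

  2. Some posts online mention that kubelet needs to be started with --feature-gates "ExpandCSIVolumes=true". This gate is only enabled by default from 1.16 on; before 1.16 it has to be enabled manually. Our cluster happens to run 1.15, but testing showed that enabling it made no difference.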

spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 3Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: cephfs-pvc
    namespace: rook-ceph
    resourceVersion: "438654796"
    uid: 176dd919-6a5c-46c5-9a48-82f642d34ead
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-node
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: myfs
      pool: myfs-data0
      storage.kubernetes.io/csiProvisionerIdentity: 1615980051642-8081-rook-ceph.cephfs.csi.ceph.com
      subvolumeName: csi-vol-c512defd-87b4-11eb-bcbd-26ec2e544591
    volumeHandle: 0001-0009-rook-ceph-0000000000000001-c512defd-87b4-11eb-bcbd-26ec2e544591
  persistentVolumeReclaimPolicy: Retain
  storageClassName: rook-cephfs
  volumeMode: Filesystem
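  3. With nothing else working, I finally decided to change the code and build my own image. I modified csi-resizer.go so that it no longer reads the SecretReference field from the PV and instead defines the struct contents directly in code.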
    secreRef := &v1.SecretReference{
        Namespace: "rook-ceph",
        Name:      "rook-csi-cephfs-provisioner",
    }

    secrets, err := getCredentials(r.k8sClient, secreRef)
    if err != nil {
        return oldSize, false, err
    }
    After the change I built a new image and swapped it in, and expansion worked in testing. The log of a successful expansion:
    I0322 02:26:46.445966       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"6d947adc-99b8-491d-9918-2df162b9e18c", APIVersion:"v1", ResourceVersion:"11769690", FieldPath:""}): type: 'Normal' reason: 'VolumeResizeSuccessful' Resize volume succeeded
    I0322 02:27:55.123523 1 controller.go:281] Started PVC processing "rook-ceph/cephfs-pvc"
    I0322 02:27:55.129411 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"cephfs-pvc", UID:"6d947adc-99b8-491d-9918-2df162b9e18c", APIVersion:"v1", ResourceVersion:"11769899", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-6d947adc-99b8-491d-9918-2df162b9e18c

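Note: this is only a temporary workaround. It is fine for playing around in a test environment, but it must never be used in production.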

Summary

The root cause is that the csi provisioner plugin requires k8s 1.16 or later, while our cluster runs 1.15.
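
A quick pre-flight check along these lines would have flagged the mismatch before deployment. This is a minimal Go sketch, assuming the only rule is the one stated in this post (ExpandCSIVolumes is on by default from 1.16). The function names are made up for illustration, and the version string is passed in by hand rather than fetched from a cluster.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from a version
// string such as "v1.15.3" or "1.16".
func parseMajorMinor(version string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected version %q", version)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	// Some distributions report the minor version with a trailing "+".
	if minor, err = strconv.Atoi(strings.TrimRight(parts[1], "+")); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// expandCSIVolumesOnByDefault reports whether the ExpandCSIVolumes feature
// gate is enabled without an explicit flag, i.e. on 1.16 or later.
func expandCSIVolumesOnByDefault(version string) (bool, error) {
	major, minor, err := parseMajorMinor(version)
	if err != nil {
		return false, err
	}
	return major > 1 || (major == 1 && minor >= 16), nil
}

func main() {
	for _, v := range []string{"v1.15.3", "v1.16.0"} {
		ok, err := expandCSIVolumesOnByDefault(v)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: ExpandCSIVolumes on by default = %v\n", v, ok)
	}
}
```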
