ERPNext Kubernetes installation: pods stuck in ContainerCreating

small problem with the Kubernetes deployment: the pods are stuck in ContainerCreating. Below are the pod listing and the erpnext pod description:

kubectl -n erpnext get pods
NAME READY STATUS RESTARTS AGE
erpnext-upstream-erpnext-7567d74895-9pp68 0/2 ContainerCreating 0 36m
erpnext-upstream-redis-cache-846cbf879d-rkffs 1/1 Running 0 36m
erpnext-upstream-redis-queue-76b459598b-f9ttl 1/1 Running 0 36m
erpnext-upstream-redis-socketio-65fb9d7968-7zdpj 1/1 Running 0 36m
erpnext-upstream-scheduler-79944484b6-jnhdv 0/1 ContainerCreating 0 36m
erpnext-upstream-socketio-679cd546cb-zqh4x 0/1 ContainerCreating 0 36m
erpnext-upstream-worker-d-6cbb54f6f7-5nxmp 0/1 ContainerCreating 0 36m
erpnext-upstream-worker-l-778ff74d-rksxj 0/1 ContainerCreating 0 36m
erpnext-upstream-worker-s-847b85c7dd-x29kz 0/1 ContainerCreating 0 36m

kubectl -n erpnext describe pods erpnext-upstream-erpnext-7567d74895-9pp68
Name: erpnext-upstream-erpnext-7567d74895-9pp68
Namespace: erpnext
Priority: 0
Node: lke46309-74203-61b716abe01a/192.168.128.122
Start Time: Wed, 15 Dec 2021 04:20:42 +0300
Labels: app.kubernetes.io/instance=erpnext-upstream-backend
app.kubernetes.io/name=erpnext-backend
pod-template-hash=7567d74895
Annotations:
Status: Pending
IP:
IPs:
Controlled By: ReplicaSet/erpnext-upstream-erpnext-7567d74895
Containers:
erpnext-assets:
Container ID:
Image: frappe/erpnext-nginx:v13.16.1
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment:
FRAPPE_PY: 0.0.0.0
FRAPPE_PY_PORT: 8000
FRAPPE_SOCKETIO: erpnext-upstream-socketio
SOCKETIO_PORT: 9000
Mounts:
/assets from assets-cache (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xng6p (ro)
/var/www/html/sites from sites-dir (rw)
erpnext-python:
Container ID:
Image: frappe/erpnext-worker:v13.16.1
Image ID:
Port:
Host Port:
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: tcp-socket :8000 delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: tcp-socket :8000 delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
MARIADB_HOST: mariadb.mariadb.svc.cluster.local
DB_PORT: 3306
REDIS_QUEUE: erpnext-upstream-redis-queue:12000
REDIS_CACHE: erpnext-upstream-redis-cache:13000
REDIS_SOCKETIO: erpnext-upstream-redis-socketio:11000
SOCKETIO_PORT: 9000
Mounts:
/home/frappe/frappe-bench/logs from logs (rw)
/home/frappe/frappe-bench/sites from sites-dir (rw)
/home/frappe/frappe-bench/sites/assets from assets-cache (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xng6p (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
assets-cache:
Type: EmptyDir (a temporary directory that shares a pod’s lifetime)
Medium:
SizeLimit:
sites-dir:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: erpnext-upstream
ReadOnly: false
logs:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: erpnext-upstream-logs
ReadOnly: false
kube-api-access-xng6p:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message


Warning FailedScheduling 34m default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 34m default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
Normal Scheduled 34m default-scheduler Successfully assigned erpnext/erpnext-upstream-erpnext-7567d74895-9pp68 to lke46309-74203-61b716abe01a
Warning FailedAttachVolume 22m (x13 over 34m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-6e6058146c3a445c" : rpc error: code = InvalidArgument desc = ControllerPublishVolume Volume capability is not compatible: volume_id:"227816-pvc6e6058146c3a445c" node_id:"32777308" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1639412423746-8081-linodebs.csi.linode.com" >
Warning FailedMount 9m51s (x4 over 23m) kubelet Unable to attach or mount volumes: unmounted volumes=[logs sites-dir], unattached volumes=[kube-api-access-xng6p logs assets-cache sites-dir]: timed out waiting for the condition
Warning FailedMount 5m17s (x3 over 32m) kubelet Unable to attach or mount volumes: unmounted volumes=[sites-dir logs], unattached volumes=[sites-dir kube-api-access-xng6p logs assets-cache]: timed out waiting for the condition
Warning FailedAttachVolume 3m42s (x22 over 34m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-2771b14475004b8b" : rpc error: code = InvalidArgument desc = ControllerPublishVolume Volume capability is not compatible: volume_id:"227817-pvc2771b14475004b8b" node_id:"32777308" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1639412423746-8081-linodebs.csi.linode.com" >
Warning FailedMount 3m3s (x4 over 30m) kubelet Unable to attach or mount volumes: unmounted volumes=[sites-dir logs], unattached volumes=[assets-cache sites-dir kube-api-access-xng6p logs]: timed out waiting for the condition
Warning FailedMount 48s (x4 over 27m) kubelet Unable to attach or mount volumes: unmounted volumes=[logs sites-dir], unattached volumes=[logs assets-cache sites-dir kube-api-access-xng6p]: timed out waiting for the condition

for ERPNext you need a ReadWriteMany (RWX) storage class. The default storage class won't work.

Refer to the shared file system part here: https://helm.erpnext.com/prepare-kubernetes/
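To illustrate, a claim backed by shared storage needs accessModes of ReadWriteMany. Here is a minimal sketch; the storageClassName "nfs" is only a placeholder for whatever RWX-capable class you install, and the Helm chart creates this claim for you once it is pointed at such a class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: erpnext-upstream
  namespace: erpnext
spec:
  # the backend, scheduler, socketio and worker pods all mount the same
  # sites volume, so the claim must allow ReadWriteMany
  accessModes:
    - ReadWriteMany
  storageClassName: nfs   # placeholder: use your RWX-capable class
  resources:
    requests:
      storage: 8Gi        # illustrative size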

hi Revant,
thanks, that was life-saving advice. I just finished setting up Rook NFS in the cluster, it works like a charm, and I succeeded in installing ERPNext; I followed your example in the GitHub repository.

I just finished adding the new site. I expected that applying the Ingress on a managed cluster would create a LoadBalancer automatically, which sadly it didn't. Could it be a missing service.yaml file in the example repository?

install the NGINX Ingress Controller. It will add the load balancer.

adding the Ingress with appropriate annotations will use the previously installed load balancer instead of adding one more.
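Something like this, as a rough sketch; the host is your site name, and the backend service name and port here are assumptions based on the deployments and container ports shown above, not values copied from the chart docs:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: erpnext-site
  namespace: erpnext
  annotations:
    # route this Ingress through the NGINX ingress controller installed above
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: erp.example.com                      # your site name
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: erpnext-upstream-erpnext   # assumed backend service name
                port:
                  number: 8080                   # nginx container port from the pod spec above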

hi Revant,
I was digging into the cause of this issue and found that the create-new-site job stopped with this error.

message: Job has reached the specified backoff limit
reason: BackoffLimitExceeded
status: "True"
type: Failed
failed: 2

and also this error shows up when the backoffLimit is set to 1

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "11cdc73b1c270b782b778d484086206d49639799bff54370d33378e9d5f83395" network for pod "create-new-site--com-fvh2q": networkPlugin cni failed to set up pod "create-new-site--com-fvh2q_erpnext" network: error adding host side routes for interface: calif34ae3ebea5, error: route (Ifindex: 37, Dst: 10.2.0.27/32, Scope: 253) already exists for an interface other than 'calif34ae3ebea5'

you cannot re-create the same job while it still exists

use kubectl delete jobs -n <namespace> <job> to delete the job

make sure to add FORCE=1 to overwrite the already existing/created site

sorry, with my limited experience I didn't understand where to add FORCE=1 to overwrite the already existing site.

you mean this:

kubectl -n erpnext replace --force -f create-*-com.yaml

job.batch "create-new-site--com" deleted
job.batch/create-new-site--com replaced

FORCE is an environment variable passed to the container that creates the new site
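So it goes into the env list of the create-site container in the Job spec, roughly like this fragment (the site name here is just a placeholder):

env:
  - name: "SITE_NAME"
    value: erp.example.com        # placeholder site name
  # FORCE=1 makes the "new" command overwrite an already existing site
  - name: "FORCE"
    value: "1"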

for a Job, I think you can't patch it; you need to delete it and create a new one, or replace it.

I deleted the new-site job pod, then I reviewed the examples listed in the GitHub templates and tried to rebuild the yaml file. I ended up with two pods: one for frappe/erpnext-nginx:v13.16.1, which is working fine, and a second for frappe/erpnext-worker:v13.16.1, which returns the same error, Job has reached the specified backoff limit. Plus, the load balancer is not generated.

I forgot to mention that I switched from the NFS server to Rancher Longhorn, for security and reliability; the PVC is working and reliable.

Deleting the pod doesn’t delete the job.

Did you delete the pod from erpnext deployment? That pod has 2 containers, nginx and python. That is normal.

Job has reached the specified backoff limit means the job ran for the specified number of retries and failed. Check the logs of the pods created by the failed "Job".

The PVC should use a storage class that supports RWX; any provisioner that provides RWX will do.
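For example, assuming the job and namespace names used later in this thread:

# list the pods the job controller created for this Job
kubectl -n erpnext get pods --selector=job-name=create-new-site-w-com
# print the logs from one of those pods
kubectl -n erpnext logs job/create-new-site-w-com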

hi Revant,
Thanks for your follow-up and replies, it is appreciated.

Below you will find the create-site job yaml file I rebuilt.
#############
apiVersion: batch/v1
kind: Job
metadata:
  name: create-new-site-w-com
  namespace: erpnext
spec:
  backoffLimit: 1
  template:
    spec:
      securityContext:
        supplementalGroups: [1000]
      containers:
        - name: create-site
          image: frappe/erpnext-worker:v13.16.1
          args: ["new"]
          imagePullPolicy: IfNotPresent
          env:
            - name: "SITE_NAME"
              value: w.com
            - name: "DB_ROOT_USER"
              value: root
            - name: "MYSQL_ROOT_PASSWORD"
              valueFrom:
                secretKeyRef:
                  key: password
                  name: mariadb-root-password
            - name: "ADMIN_PASSWORD"
              value: admin
            # - name: "INSTALL_APPS"
            #   value: "erpnext"
      restartPolicy: Never
      volumes:
        - name: sites-dir
          persistentVolumeClaim:
            claimName: w-erpnext
            readOnly: false
        - name: logs
          persistentVolumeClaim:
            claimName: w-erpnext-logs
            readOnly: false
        - name: assets-cache
          emptyDir: {}
      # volumeMounts:
      #   - name: sites-dir
      #     mountPath: /home/frappe/frappe-bench/sites
      #   - name: assets-cache
      #     mountPath: /home/frappe/frappe-bench/sites/assets
      #   - name: logs
      #     mountPath: /home/frappe/frappe-bench/logs
      initContainers:
        - name: populate-assets
          image: frappe/erpnext-nginx:v13.16.1
          command: ["/bin/bash", "-c"]
          args:
            - "rsync -a --delete /var/www/html/assets/frappe /assets"
          volumeMounts:
            - name: assets-cache
              mountPath: /assets
#####################

this file creates one job, as below:
(job) create-new-site-w-com

this job runs for two minutes, and spins up two pods, as below:
(pod) create-new-site-w-com-6tv7j from image frappe/erpnext-worker:v13.16.1
(pod) create-new-site-w-com-bdrg7 from image frappe/erpnext-worker:v13.16.1

both pods fail and return these logs:
config file not created, retry 29
config file not created, retry 30
config file not created, retry 31
timeout: config file not created

after that, the job create-new-site-w-com fails and returns this description:
Warning BackoffLimitExceeded 8m job-controller Job has reached the specified backoff limit

with the command below, all of them are deleted:
kubectl -n erpnext delete -f create-w-com.yaml

you need to set permissions on the volume

make sure uid:gid 1000:1000 is able to create files

hi Revant,
the volumes are RWX, as below:

kubectl -n erpnext describe pvc w-erpnext
Name: w-erpnext
Namespace: erpnext
StorageClass: longhorn
Status: Bound
Volume: pvc-e4884a3c-570b-4cce-b368-1a93467cad71
Labels: app=erpnext
app.kubernetes.io/managed-by=Helm
chart=erpnext-3.2.38
heritage=Helm
release=w
Annotations: meta.helm.sh/release-name: w
meta.helm.sh/release-namespace: erpnext
pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 8Gi
Access Modes: RWX
VolumeMode: Filesystem
Used By: w-erpnext-erpnext-5dd9ccddc8-wtpcg
w-erpnext-scheduler-6bb6895cdf-ttn76
w-erpnext-socketio-59564b5dc5-hbwdx
w-erpnext-worker-d-75b9bb6888-6fdsn
w-erpnext-worker-l-7497778b59-2j26s
w-erpnext-worker-s-757699d7bf-qq6cn
Events:

and in the yaml file it's mentioned as below:
spec:
  securityContext:
    supplementalGroups: [1000]

Mount the volume in some pod with the root user and chown -R 1000:1000 /vol (/vol being wherever the PVC whose permissions need changing is mounted).
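A minimal sketch of such a throwaway pod, assuming the PVC names from your Job above; it mounts both claims under /vol, runs as root, and hands ownership to uid:gid 1000:1000:

apiVersion: v1
kind: Pod
metadata:
  name: fix-permissions
  namespace: erpnext
spec:
  restartPolicy: Never
  containers:
    - name: chown
      image: busybox
      # change ownership of the mounted volumes, then exit
      command: ["sh", "-c", "chown -R 1000:1000 /vol/sites /vol/logs"]
      securityContext:
        runAsUser: 0
      volumeMounts:
        - name: sites-dir
          mountPath: /vol/sites
        - name: logs
          mountPath: /vol/logs
  volumes:
    - name: sites-dir
      persistentVolumeClaim:
        claimName: w-erpnext
    - name: logs
      persistentVolumeClaim:
        claimName: w-erpnext-logs

Delete the pod once it has completed.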

but I mentioned this in the yaml file:

spec:
  securityContext:
    supplementalGroups: [1000]

or do you mean to mention it:
- in the RBAC
- as runAsUser?

hi Revant,
still working on creating the site for the cluster; this is the failure log for the worker pod.

Attempt 1 to connect to mariadb.mariadb.svc.cluster.local:3306
Tue, Dec 21 2021 4:33:04 am Attempt 1 to connect to w-erpnext-redis-queue:12000
Tue, Dec 21 2021 4:33:04 am Attempt 1 to connect to w-erpnext-redis-cache:13000
Tue, Dec 21 2021 4:33:04 am Attempt 1 to connect to w-erpnext-redis-socketio:11000
Tue, Dec 21 2021 4:33:04 am Connections OK
Tue, Dec 21 2021 4:33:06 am Traceback (most recent call last):
Tue, Dec 21 2021 4:33:06 am File "/home/frappe/frappe-bench/commands/new.py", line 132, in <module>
Tue, Dec 21 2021 4:33:06 am main()
Tue, Dec 21 2021 4:33:06 am File "/home/frappe/frappe-bench/commands/new.py", line 62, in main
Tue, Dec 21 2021 4:33:06 am _new_site(
Tue, Dec 21 2021 4:33:06 am File "/home/frappe/frappe-bench/apps/frappe/frappe/installer.py", line 61, in _new_site
Tue, Dec 21 2021 4:33:06 am install_db(
Tue, Dec 21 2021 4:33:06 am File "/home/frappe/frappe-bench/apps/frappe/frappe/installer.py", line 108, in install_db
Tue, Dec 21 2021 4:33:06 am setup_database(force, source_sql, verbose, no_mariadb_socket)
Tue, Dec 21 2021 4:33:06 am File "/home/frappe/frappe-bench/apps/frappe/frappe/database/__init__.py", line 16, in setup_database
Tue, Dec 21 2021 4:33:06 am return frappe.database.mariadb.setup_db.setup_database(force, source_sql, verbose, no_mariadb_socket=no_mariadb_socket)
Tue, Dec 21 2021 4:33:06 am File "/home/frappe/frappe-bench/apps/frappe/frappe/database/mariadb/setup_db.py", line 45, in setup_database
Tue, Dec 21 2021 4:33:06 am raise Exception("Database %s already exists" % (db_name,))
Tue, Dec 21 2021 4:33:06 am Exception: Database _7e13c479c8f91c29 already exists

I appreciate any guidance in this regard.

drop the database from MariaDB, or pass the FORCE=1 environment variable to the "new" command.
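For instance, to drop the half-created database named in the traceback above (the pod name mariadb-0 is an assumption; use whatever pod your MariaDB install runs in the mariadb namespace):

kubectl -n mariadb exec -it mariadb-0 -- mysql -u root -p
# then, at the MariaDB prompt:
DROP DATABASE `_7e13c479c8f91c29`;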