What is the best way to gracefully roll out a new version or fix in ERPNext running on k8s?

Carlos_Rios · January 31, 2022, 5:38pm

Hello there,

Currently where I work, we have ERPNext deployed on a Kubernetes cluster, and we have the following pipeline when we deploy a new custom app version:

1. Build a Docker image that includes the new version of our custom app.
2. Deploy the new image for erpnext-erpnext, erpnext-worker-*, and erpnext-schedule using the rolling update strategy.
3. Run the bench migrate command.

Briefly, the rolling update strategy in stage (2) means that for a short period of time the application runs with both versions.
For example, say version A of the app is running on Kubernetes and I want to deploy version B using the rolling update strategy. Kubernetes deploys version B alongside version A. Once Kubernetes is sure that version B is running OK, it sends SIGTERM to version A and waits for a time threshold. If version A has not terminated by then, Kubernetes sends SIGKILL and shuts version A down.
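
To make the SIGTERM, wait, SIGKILL sequence concrete, here is a minimal Python sketch that does the same thing to a local child process. It only mimics what Kubernetes does to a container and is not Kubernetes code. The function name terminate_gracefully is made up for illustration, and the 30 second default mirrors the default of terminationGracePeriodSeconds on a pod.

```python
import subprocess


def terminate_gracefully(proc, grace_seconds=30.0):
    """Ask a child process to stop, then force it if it takes too long.

    Hypothetical helper: mirrors the SIGTERM -> grace period -> SIGKILL
    sequence described above, applied to a local subprocess.
    """
    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"  # the process exited within the grace period
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX systems
        proc.wait()  # reap the killed process
        return "killed"
```

A job that needs longer than the grace period ends up in the "killed" branch, which matches the background jobs that got cut off during the update.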

We have seen that some background jobs that were being processed during the update got terminated, and some users got database lock errors while working on some doctypes while bench migrate was running.

Do you have ideas on how to prevent new background jobs from being processed while updating the app? Should I run bench disable-scheduler?
How can I temporarily disable the web interface while bench migrate is running?

Thank you so much for your attention,
Thanks

revant_one · January 31, 2022, 11:19pm

If you run bench --site site.name migrate, or use the migrate command from the container, it will pause the scheduler and set maintenance mode in site_config.json.

When these variables are set to 1, the site shows a 503 page with a message that the server is being upgraded.
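
For reference, a rough Python sketch of what toggling those flags by hand could look like. It assumes the two keys are maintenance_mode and pause_scheduler, which is my understanding of the Frappe site config but is not spelled out in this thread. The helper name set_maintenance is made up, and the migrate command already does this for you.

```python
import json
from pathlib import Path


def set_maintenance(site_config_path, enabled):
    """Set or clear the maintenance flags in a site_config.json file.

    Assumption: the keys are "maintenance_mode" and "pause_scheduler".
    """
    path = Path(site_config_path)
    config = json.loads(path.read_text())
    flag = 1 if enabled else 0
    config["maintenance_mode"] = flag  # 1 -> the site answers with a 503 page
    config["pause_scheduler"] = flag   # 1 -> the scheduler stops enqueuing jobs
    path.write_text(json.dumps(config, indent=1))
    return config
```

All other keys in the file are left as they are, so the same helper can switch the flags back off after the migration.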

Background workers will keep running unless they are manually scaled to 0 replicas before the upgrade and scaled back to the original number of replicas after it.
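
As a sketch of that order of operations, the Python snippet below only builds the list of commands and does not run anything. The deployment names, replica counts, namespace and site name in the usage note are placeholders, not values from a real cluster, and the migrate step would have to run inside an ERPNext container rather than on your workstation.

```python
def rollout_plan(worker_replicas, site, namespace="erpnext"):
    """Return the commands for scale down -> migrate -> scale up, in order.

    worker_replicas maps a worker deployment name to the replica count it
    should return to after the upgrade. All names are caller-supplied.
    """
    plan = []
    # 1. Stop the workers so no background job is running during the upgrade.
    for name in worker_replicas:
        plan.append(["kubectl", "-n", namespace, "scale", "deployment", name, "--replicas=0"])
    # 2. Migrate the site (pauses the scheduler and sets maintenance mode).
    plan.append(["bench", "--site", site, "migrate"])
    # 3. Bring the workers back to their previous replica counts.
    for name, replicas in worker_replicas.items():
        plan.append(["kubectl", "-n", namespace, "scale", "deployment", name, f"--replicas={replicas}"])
    return plan
```

For example, rollout_plan({"my-worker-default": 2}, "site.name") gives three commands: scale to 0, migrate, scale back to 2.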

Carlos_Rios · February 3, 2022, 11:46pm

Thanks for the information, @revant_one. It clarifies a lot for me.

Regarding background workers: these are implemented using RQ. Taking a look at its documentation, I found this:

Taking Down Workers

If, at any time, the worker receives SIGINT (via Ctrl+C) or SIGTERM (via kill), the worker wait until the currently running task is finished, stop the work loop and gracefully register its own death.

If, during this takedown phase, SIGINT or SIGTERM is received again, the worker will forcefully terminate the child process (sending it SIGKILL), but will still try to register its own death.
RQ: Workers
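
To picture the first paragraph of that quote, here is a toy Python sketch of a work loop that finishes the current job and then stops when SIGTERM arrives. It is not RQ's implementation. The class GracefulWorker is invented for illustration and it does not cover the second-signal case from the quote.

```python
import signal


class GracefulWorker:
    """Toy worker: on SIGTERM, finish the running job, then leave the loop."""

    def __init__(self):
        self.stop_requested = False
        # Must be called from the main thread, like any signal.signal() call.
        signal.signal(signal.SIGTERM, self._request_stop)

    def _request_stop(self, signum, frame):
        # Only record the request; the running job is not interrupted.
        self.stop_requested = True

    def work(self, jobs):
        """Run callables in order until done or until a stop was requested."""
        results = []
        for job in jobs:
            if self.stop_requested:
                break  # do not pick up any new job
            results.append(job())
        return results
```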

So I imagine that if I can catch the SIGTERM signal from Kubernetes in the docker-entrypoint.sh script and pass it to the worker, the worker will finish the current job and stop taking more jobs.
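
A rough sketch of that idea, written in Python instead of shell so it is easy to try out locally: start the worker as a child process and relay SIGTERM and SIGINT to it. The function names forward_signals and run_worker are made up, and the command in the comment is only an example.

```python
import signal
import subprocess


def forward_signals(proc, signums=(signal.SIGTERM, signal.SIGINT)):
    """Relay the given signals from this process to the child process."""
    def handler(signum, frame):
        if proc.poll() is None:  # only signal a child that is still running
            proc.send_signal(signum)

    for signum in signums:
        signal.signal(signum, handler)


def run_worker(cmd):
    """Start the worker, relay termination signals, return its exit code."""
    proc = subprocess.Popen(cmd)
    forward_signals(proc)
    return proc.wait()


# Example (hypothetical command): run_worker(["bench", "worker", "--queue", "default"])
```

In a shell entrypoint, starting the worker with exec should give the same effect without a wrapper, since the worker then replaces the shell and receives the signal directly.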

Best regards

revant_one · February 4, 2022, 6:30am

I guess removing the pod manually or through Kubernetes controllers will trigger the same thing.