My long-shot guess is that when you initially set up ERPNext it created only one worker, based on your machine's core count. My statement may be completely incorrect.
Yeah, I’m killing the machine right now and rerunning my installation scripts. It’s a pay by the month KVM virtual host “in da cloud”. I’ll be very displeased if it’s single threaded or something.
Your problem is in the parameter -t 120. This is too short, so your workers get killed before they have time to respond. Here is the gunicorn manual on this:
Workers silent for more than this many seconds are killed and restarted.
Generally, the default of thirty seconds should suffice. Only set this noticeably higher if you’re certain of the repercussions for sync workers. For the non sync workers it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.
I think the reason you get a response after exactly 120 seconds is that -t 120 setting itself.
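For reference, the same flags can also live in a gunicorn config file instead of on the command line; a minimal sketch (the file name and values are just illustrative):

```python
# gunicorn.conf.py -- equivalent to `gunicorn -w 1 -t 120 ...`
workers = 1    # -w: number of worker processes
timeout = 120  # -t: seconds a sync worker may stay silent before it is killed and restarted
```

If I remember right, recent gunicorn versions pick up a gunicorn.conf.py in the working directory automatically, or you can point at it explicitly with -c.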
6000 makes the session timeout around 1.6 hours, which may be higher than you need. If -t 6000 works, you can try lowering it from there.
Yes. The point of using gunicorn is to have workers in production mode. The issue here is that after receiving the task, the worker passes it to the server (the ERPNext Python process) and waits, and then gets caught by the timeout. Since there is only one worker, you have to wait for that worker to be killed and restarted before you can get the answer back, hence the 120 seconds.
So, the key is to increase the -t timeout parameter to give it enough time.
You see this happen during the ERPNext version 12 start-up setup. Development mode (bench start) does not hit the request timeout issue. Production mode (bench restart), however, does get request timeout problems, which can be solved by increasing the http_timeout parameter with a bench command.
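If I remember right, `bench config http_timeout 6000` just writes the value into your bench's common_site_config.json, ending up with something like this (a sketch, not a full config):

```
{
  "http_timeout": 6000
}
```

After changing it you likely need to regenerate the supervisor config (`bench setup supervisor`) and restart for the new timeout to reach gunicorn.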
With -w 1 I had a deadlock that took 120 seconds to resolve.
With -w 4 I no longer have deadlock and everything processes in about 1 second.
Why would I want to stay with a single worker (-w 1) and increase the deadlock timeout to 1.6 hours?
It seems obvious that the root cause is the attempt at two-way communication with only one worker. The “key” was to increase the number of workers from 1 to 4.
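The single-worker deadlock is easy to reproduce outside gunicorn. Here is a toy sketch (names made up for illustration) using a thread pool in place of the worker pool: the “request” hands work back to the same pool and waits, so with one worker the only way out is the timeout, while a second worker resolves it immediately:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def handle_request(pool_size: int) -> str:
    """Simulate a request whose handler calls back into the same server."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        def outer():
            # The worker forwards the task and blocks waiting for the reply,
            # like the gunicorn worker waiting on the API server's response.
            inner = pool.submit(lambda: "reply")
            try:
                return inner.result(timeout=1.0)  # stand-in for -t
            except FutureTimeout:
                return "timed out"
        return pool.submit(outer).result()

print(handle_request(1))  # the lone worker waits on itself -> "timed out"
print(handle_request(2))  # a second worker takes the callback -> "reply"
```

With pool_size=1, the inner task never gets a thread because the only thread is busy waiting for it; that is exactly the 120-second stall, shrunk to one second.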
Ok good. The recommended number of gunicorn workers is number_of_cores * 2 + 1.
So for a 4 core CPU the recommended setting is 9; for a 2 core CPU it is 5.
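That rule of thumb is easy to compute on the box itself; a quick sketch:

```python
import multiprocessing

def recommended_workers(cores: int) -> int:
    # Gunicorn's documented rule of thumb: (2 x cores) + 1
    return cores * 2 + 1

print(recommended_workers(4))                            # -> 9
print(recommended_workers(multiprocessing.cpu_count()))  # for this machine
```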
With regard to the -t setting, the reason you had to wait 120 seconds with only one worker is that gunicorn killed the worker while it was still waiting for the API server's response.
I think (here I am guessing) that to find a precise timeout setting, you could keep raising -t until the timeout problem disappears with a single worker. Once you have that value, increase the number of gunicorn workers. You then have an efficient system that does not kill workers that are just about to receive the API server's response.