Worker timeout (Stock Entry/transfer on Production Orders)

I am running…
Debian 9.3 stretch
ERPNext: v10.1.5 (master)
Frappe Framework: v10.1.2 (master)

I also have a test box which runs the standard production VM (ubuntu 14.04) on same versions

I imported (without any trouble, and on/onto both machines) a large number of Production Orders, in “Not Started” mode.
Most consist of a tree of BOMs+Raw Materials and these are only 1 level deep - ie. Not MultiBOM.
eg. BOM1=item1+item2+item3
BOM2=item4+item5+item6
BOM3=BOM1 + BOM2 + item7 + item8
BOM4=BOM3 + item9
They are separated like this to cater for the multiple WareHouse manufacture and material transfers
BOM1,2,3 work OK, but …
when I “finish” BOM4 (last one in that sequence), the STE- gets stuck on “Submitting” for a long time, and then gives fails/timeout.

I have checked the threads related to gunicorn session timeouts and followed the recommendations there
To import large data, Increased nginx session timeout, updated gunicorn worker but how to increase the gunicorn session timeout?
Data Import Tool hangs on ERP Next instance running in Production Mode - #6 by gvyshnya
Gunicorn Worker how to increase - #10 by adnan

I increased “maxmemory” in redis cache from 192->512
cat frappe-bench/config/redis_cache.conf | grep maxmemory
maxmemory 512mb

I tuned/tweaked the supervisor.conf and adjusted timeouts and number of workers etc
cat frappe-bench/config/supervisor.conf | grep guni

  1. first settings: (result=timeout)
    command=/home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 4 -t 120 frappe.app:application --preload

  2. second settings: (result=timeout)
    command=/home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 4 -t 180 frappe.app:application --preload --limit-request-line 8190

  3. third settings: (result=timeout)
    command=/home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 5 -t 240 frappe.app:application --preload --limit-request-line 8190

These all produce errors something like this…
tail /home/frappe/frappe-bench/logs/web.error.log
[2018-03-06 02:48:25 +0000] [2367] [INFO] Starting gunicorn 19.7.1
[2018-03-06 02:48:25 +0000] [2367] [INFO] Listening at: http://127.0.0.1:8000 (2367)
[2018-03-06 02:48:25 +0000] [2367] [INFO] Using worker: sync
[2018-03-06 02:48:25 +0000] [2403] [INFO] Booting worker with pid: 2403
[2018-03-06 02:48:25 +0000] [2412] [INFO] Booting worker with pid: 2412
[2018-03-06 02:48:25 +0000] [2415] [INFO] Booting worker with pid: 2415
[2018-03-06 02:48:25 +0000] [2421] [INFO] Booting worker with pid: 2421
[2018-03-06 02:59:24 +0000] [2367] [CRITICAL] WORKER TIMEOUT (pid:2403)
[2018-03-06 02:59:25 +0000] [2580] [INFO] Booting worker with pid: 2580

If I use these settings: (raise workers, shorten timeout, remove limit?? {I think})
command=/home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0
4. I get a fail with errors as follows
[2018-03-06 14:41:37 +0000] [8616] [CRITICAL] WORKER TIMEOUT (pid:8681)
[2018-03-06 14:41:38 +0000] [8812] [INFO] Booting worker with pid: 8812
[2018-03-06 14:43:06 +0000] [8616] [CRITICAL] WORKER TIMEOUT (pid:8812)
[2018-03-06 14:43:06 +0000] [8812] [ERROR] Error handling request /
Traceback (most recent call last):
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/gunicorn/workers/sync.py”, line 135, in handle
self.handle_request(listener, req, client, addr)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/gunicorn/workers/sync.py”, line 176, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/werkzeug/local.py”, line 228, in application
return ClosingIterator(app(environ, start_response), self.cleanup)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/werkzeug/wrappers.py”, line 302, in application
return f(*args[:-2] + (request,))(*args[-2:])
File “/home/frappe/frappe-bench/apps/frappe/frappe/app.py”, line 95, in application
frappe.db.rollback()
File “/home/frappe/frappe-bench/apps/frappe/frappe/database.py”, line 768, in rollback
self.sql(“rollback”)
File “/home/frappe/frappe-bench/apps/frappe/frappe/database.py”, line 176, in sql
_self.cursor.execute(query)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/cursors.py”, line 165, in execute
_result = self.query(query)
_File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/cursors.py”, line 321, in query
conn.query(q)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 860, in query
_self._affected_rows = self.read_query_result(unbuffered=unbuffered)
_File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 1061, in read_query_result
result.read()
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 1349, in read
_first_packet = self.connection.read_packet()
_File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 1005, in read_packet
_% (packet_number, self.next_seq_id))
InternalError: Packet sequence number wrong - got 8 expected 1
[2018-03-06 14:43:06 +0000] [8812] [INFO] Worker exiting (pid: 8812)
[2018-03-06 14:43:07 +0000] [8832] [INFO] Booting worker with pid: 8832

I also increased that -t 60 to -t 360 and got a timeout error

Other settings I am using:
find . -name '*.conf' | xargs grep "client_max_body_size"
./frappe-bench/config/nginx.conf: client_max_body_size 50m;
./bench-repo/bench/config/templates/nginx.conf: client_max_body_size 50m;

#On restart of supervisorctl
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl stop all
sudo supervisorctl start all

#After which, if I check with ps aux | grep gunicorn I have
frappe 8618 730 0 14:39 ? 00:00:12 /usr/bin/redis-server 127.0.0.1:11000
frappe 8619 730 0 14:39 ? 00:00:13 /usr/bin/redis-server 127.0.0.1:13000
frappe 8620 730 0 14:39 ? 00:00:11 /usr/bin/redis-server 127.0.0.1:12000
frappe 12224 730 6 16:27 ? 00:00:01 /home/frappe/frappe-bench/env/bin/python -m frappe.utils.bench_helper frappe schedule
frappe 12225 730 6 16:27 ? 00:00:01 /home/frappe/frappe-bench/env/bin/python -m frappe.utils.bench_helper frappe worker --queue default
frappe 12226 730 6 16:27 ? 00:00:01 /home/frappe/frappe-bench/env/bin/python -m frappe.utils.bench_helper frappe worker --queue long
frappe 12227 730 6 16:27 ? 00:00:01 /home/frappe/frappe-bench/env/bin/python -m frappe.utils.bench_helper frappe worker --queue short
frappe 12236 730 4 16:27 ? 00:00:01 /home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0
frappe 12237 730 3 16:27 ? 00:00:01 /usr/bin/node /home/frappe/frappe-bench/apps/frappe/socketio.js
frappe 12286 12236 0 16:27 ? 00:00:00 /home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0
frappe 12287 12236 0 16:27 ? 00:00:00 /home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0
frappe 12288 12236 0 16:27 ? 00:00:00 /home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0
frappe 12289 12236 0 16:27 ? 00:00:00 /home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0
frappe 12290 12236 0 16:27 ? 00:00:00 /home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0
frappe 12291 12236 0 16:27 ? 00:00:00 /home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 6 -t 60 frappe.app:application --preload --limit-request-line 0

I don’t think diskspace or RAM are a factor

free -h
total used free shared buff/cache available
Mem: 3.8G 1.4G 431M 73M 2.0G 2.0G
Swap: 3.9G 124K 3.9G
df -h
/dev/mapper/DebianERPNext–vg-home 426G 24G 381G 6% /home

Any pointers or parameter tuning clues would be most welcome.

1 Like

Just to add a bit of extra info… not sure if it is relevant or useful.

After the timeout fail, the gunicorn process keeps the CPU at >90% for ~5mins.
The swap was initially unused, but now shows ~500MB, so something isn’t happy, but I am not sure what. No errors in mysql log that i can see.

The machine is on a 192.168.x.x LAN, on the port 8115…
node-socketio.log has a reference to a DNS looked up address which matches the domain used in the SITE setting - although the /etc/hosts file specifies the 192.168.x.x address

{ Error: connect EHOSTUNREACH 199.195.144.26:8115
_at Object.exports.errnoException (util.js:1050:11)
_at exports.exceptionWithHostPort (util.js:1073:20)
at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1093:14)
code: ‘EHOSTUNREACH’,
errno: ‘EHOSTUNREACH’,
syscall: ‘connect’,
address: ‘199.195.144.26’,
port: 8115,
response: undefined }
No response for doc_subscribe
No response for doc_subscribe
No response for doc_subscribe
No response for doc_subscribe
listening on *: 9000
listening on *: 9000
listening on *: 9000

1 Like

I tried some more tuning and tracing.
Added a debug log to the gunicorn settings using recommendations from various searched posts on the topic


command=/home/frappe/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 8 -t 120 frappe.app:application --preload --limit-request-line 2048 –log-file /home/frappe/frappe-bench/logs/frappe.gunicorn.log --log-level debug –keep-alive 60


that log file shows (summarised)…
[2018-03-07 16:55:41 +0000] [18686] [DEBUG] GET /api/method/frappe.async.can_subscribe_doc
[2018-03-07 16:55:45 +0000] [18686] [DEBUG] POST /
[2018-03-07 16:56:07 +0000] [18693] [DEBUG] POST /
[2018-03-07 16:56:37 +0000] [18693] [DEBUG] POST /
[2018-03-07 16:57:37 +0000] [18685] [DEBUG] POST /
[2018-03-07 16:57:45 +0000] [18622] [CRITICAL] WORKER TIMEOUT (pid:18686)
[2018-03-07 16:57:46 +0000] [18781] [INFO] Booting worker with pid: 18781

If I set the timeouts to 180s, the gunicorn process sits at 91% until about 160s (20s before max timeout), it does the timeout. If I set it to 120, then it runs at ~90% for ~100s (again ~20s before the max)
Then it drops to normal once the screen timeout message appears.

It seems (to me as a non-programmer) that the POST after /api/method/frappe.async.can_subscribe_doc is triggering the timeout?
The process is up on that port…
tcp 0 0 127.0.0.1:8000 0.0.0.0:* LISTEN 18038/python

~/frappe-bench/sites/common_site_config.json
{
“auto_update”: false,
“background_workers”: 1,
“file_watcher_port”: 6787,
“frappe_user”: “frappe”,
“gunicorn_workers”: 2,
“rebase_on_pull”: false,
“redis_cache”: “redis://localhost:13000”,
“redis_queue”: “redis://localhost:11000”,
“redis_socketio”: “redis://localhost:12000”,
“restart_supervisor_on_update”: true,
“serve_default_site”: true,
“shallow_clone”: true,
“socketio_port”: 9000,
“update_bench_on_update”: true,
“webserver_port”: 8000
}

~/frappe-bench/sites/site.domain.co.za/site_config.json
{
“db_name”: “*****************”,
“db_password”: “****************”,
“limits”: {
“space_usage”: {
“backup_size”: 36.0,
“database_size”: 115.53,
“files_size”: 2.0,
“total”: 153.53
}
},
“nginx_port”: 8115
}

Still getting timeouts though :neutral_face:

1 Like

I am lost trying to find the issue …
You are using master in production i think, so for every change you need to use bench restart .
and to be more sure you may need to restart nginx after each changes made to config using
services nginx restart .
please try writing the problem as a list of point .

and some times when the function need alot of time you can use
bench console
or
bench execute
adding the path of the function
best regards

Thank you for the reply.

I restarted nginx for the connect_* changes I made, and used the
sudo supervisorctl reread/update/stop/start all
I tried to provide as much detail about what I did to try and fix the probelm - but (sorry) it seems I didn’t describe the problem very well.

I have a BOM based “Item To Manufacture”. It is “In Process”. From there I get to
E > Manufacturing > Production Order, where I must click the “Finish” button to manufacture the goods.
It takes me to
E > Stock > Stock Entry
where I transfer the stock from the WIP(source) W/H to the Final(target) W/H.
I submit http://erpnext:8115/desk#Form/Stock%20Entry/STE-105520 and it gives the timeout error as described above.
I will try the console options - but I am not sure what to put there. Could you point me to an example?

Thank you again

I will try to mimic your scinario and see what happen

Thank you @ahmad18189.
I have tried a few more things to identify the issue - I am not sure if they are useful or not, but for info…

I added pip gevent and eventlet as recommended in some other unrelated but similar threads, and allowed the thread to listen on any address so the line is now
command=/home/frappe/frappe-bench/env/bin/gunicorn -b 0.0.0.0:8000 -w 4 -t 120 frappe.app:application --preload --worker-class gevent --log-file /home/frappe/frappe-bench/logs/frappe.gunicorn.log --log-level DEBUG

At one point the system threw out an additional error…
Form Dict: {
_“action”: “Submit”, _
_“cmd”: “frappe.desk.form.save.savedocs”, _
_“doc”: "{"use_multi_level_bom":0, _
**********************************************************
**********************************************************
**********************************************************
_11:26:26.103122","posting_date":"2015-09-04","production_order":"PRO-100462","_last_sync_on":"2018-03-08T09:30:40.401Z","address_display":null,"purchase_receipt_no":null,"supplier_address":null,"supplier_name":null,"supplier":null,"sales_invoice_no":null,"delivery_note_no":null,"customer_address":null,"customer_name":null,"customer":null}"
}
Request Error
Traceback (most recent call last):
File “/home/frappe/frappe-bench/apps/frappe/frappe/app.py”, line 62, in application
response = frappe.handler.handle()
File “/home/frappe/frappe-bench/apps/frappe/frappe/handler.py”, line 22, in handle
data = execute_cmd(cmd)
File “/home/frappe/frappe-bench/apps/frappe/frappe/handler.py”, line 53, in execute_cmd
return frappe.call(method, **frappe.form_dict)
File “/home/frappe/frappe-bench/apps/frappe/frappe/init.py”, line 939, in call
return fn(*args, **newargs)
File “/home/frappe/frappe-bench/apps/frappe/frappe/desk/form/save.py”, line 19, in savedocs
doc.submit()
File “/home/frappe/frappe-bench/apps/frappe/frappe/model/document.py”, line 831, in submit
_self.submit()
_File “/home/frappe/frappe-bench/apps/frappe/frappe/model/document.py”, line 820, in submit
self.save()
File “/home/frappe/frappe-bench/apps/frappe/frappe/model/document.py”, line 256, in save
_return self.save(*args, **kwargs)
_File “/home/frappe/frappe-bench/apps/frappe/frappe/model/document.py”, line 286, in save
self.check_if_latest()
File “/home/frappe/frappe-bench/apps/frappe/frappe/model/document.py”, line 612, in check_if_latest
where name = %s for update"“”.format(self.doctype), self.name, as_dict=True)
File “/home/frappe/frappe-bench/apps/frappe/frappe/database.py”, line 166, in sql
_self.cursor.execute(query, values)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/cursors.py”, line 165, in execute
_result = self.query(query)
_File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/cursors.py”, line 321, in query
conn.query(q)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 860, in query
_self._affected_rows = self.read_query_result(unbuffered=unbuffered)
_File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 1061, in read_query_result
result.read()
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 1349, in read
_first_packet = self.connection.read_packet()
_File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 1018, in read_packet
packet.check_error()
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/connections.py”, line 384, in check_error
_err.raise_mysql_exception(self.data)
File “/home/frappe/frappe-bench/env/local/lib/python2.7/site-packages/pymysql/err.py”, line 107, in raise_mysql_exception
raise errorclass(errno, errval)
InternalError: (1205, u’Lock wait timeout exceeded; try restarting transaction’)
/var/log/mysql/error.log (which until now had no errors in it)…
2018-02-02 16:53:31 139715583025728 [Note] InnoDB: innodb_empty_free_list_algorithm has been changed to legacy because of small buffer pool size. In order to use backoff, increase buffer pool at least up to 20MB.

I tuned those parameters [I went with a more conservative than recommended option in the mysql.cnf (I have 4GB RAM, so according to what they say I could use 2048/1024/512/256, but have used only 256/128/64/32)]

(Set buffer pool size to 50-80% of your computer’s memory)
innodb_buffer_pool_size=256M
innodb_additional_mem_pool_size=128M
(Set the log file size to about 25% of the buffer pool size)
innodb_log_file_size=64M
innodb_log_buffer_size=32M

restarted the db, which allowed/created…
2018-03-08 11:53:52 140369194930752 [Note] InnoDB: Initializing buffer pool, size = 256.0M
2018-03-08 11:53:52 140369194930752 [Warning] InnoDB: Resizing redo log from 23072 to 24096 pages, LSN=865376310
2018-03-08 11:53:52 140369194930752 [Warning] InnoDB: Starting to delete and rewrite log files.
2018-03-08 11:53:52 140369194930752 [Note] InnoDB: Setting log file ./ib_logfile101 size to 64 MB
2018-03-08 11:53:52 140369194930752 [Note] InnoDB: Setting log file ./ib_logfile1 size to 64 MB
2018-03-08 11:53:52 140369194930752 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0

It looks like it may be a DB optimisation and/or connection issue, but I don’t know for sure.

1 Like

I noted this post…

which just came on. The machine giving most trouble is on latest versions.

I tested on a different VM which is on ERPNext: v10.0.18 (master); Frappe Framework: v10.0.19 (master) and transactions there are successful slightly more often.
To introduce another strange factor…if I leave the system for a few minutes at hi CPU after the timeout, a few time it did a background submit successfully??

1 Like

My apologies for constantly adding to this string of posts - I hope the info is useful somehow.

I have noted another pattern/behaviour. When I submit the last STE- in the particular product that is being manufactured, the system hangs for a long time, and then pops up a Timeout message. IF I click that, the submit fails, BUT…I tried leaving it on screen and waiting for the CPU to return to normal (takes ~2mins) and the submit then works OK. Not ideal to wait for it, but it does get the job done even if it’s slower.

Perhaps one of the selects is caught in a loop on a BOM Explosion item?
(That last one has a deeper/nested BOM which makes it an extra level deep)

1 Like

The saga continues…BUT (though I am hesitant to say this with any certainty)…I think I may have found an issue which could lead to a solution. Not by me, but someone who knows a lot more than me could probably use this…

It seemed like the BOM was triggering a circular reference on the items with a deeper BOM-tree. I can sort of confirm that this is likely to be the case after doing some experimenting.

If I go into the items+BOMs, and manually expand each level of the tree (the blue folder icons), then they open and show without any major issue.

If I click the “Expand All” button near top right of the screen where the red arrow in the above pic is, CPU goes wild, RAM gets eaten alive, and swap starts to climb (2nd pic below is one “instance”). Also nothing happens in terms of expanding.
This inspires us not so bright fellas :smirk: to click that button again, which triggers another thread, spiking CPU again (and still nothing happens)…etc…etc.


I am guessing that the internals of the PRO-/STE- might be using the same block of code as that “Expand All” to get the lists of items to process… which could lead to the problem I am experiencing?
For the record I don’t profess to know this to be true…I am just guessing and hoping it leads to something useful.

As an extra/final “test”, if you are in a BOM that has no subtrees…clicking the “Expand All” does not cause any CPU spike (ie. a 1 level tree does not suffer from this behaviour/misbehaviour).

The normally minimal
logs/nodesocketio.log
then has substantially more content. I could upload/include more of those if they are potentially useful.

  { Error: cannot GET /api/method/frappe.async.get_user_info?sid=d6a879fd58a5fb6f1b93d69bb9fd85af3de2b1573f56430a490a1594 (502)
at Response.toError (/home/frappe/frappe-bench/node_modules/superagent/lib/node/response.js:94:15)
at ResponseBase._setStatusProperties (/home/frappe/frappe-bench/node_modules/superagent/lib/response-base.js:123:16)
at new Response (/home/frappe/frappe-bench/node_modules/superagent/lib/node/response.js:41:8)
at Request._emitResponse (/home/frappe/frappe-bench/node_modules/superagent/lib/node/index.js:742:20)
at IncomingMessage.parser (/home/frappe/frappe-bench/node_modules/superagent/lib/node/index.js:906:38)
at emitNone (events.js:111:20)
at IncomingMessage.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:138:11)
at process._tickCallback (internal/process/next_tick.js:180:9)
    status: 502,

BLAH=BLAH-BLAH to save space


    method: 'GET',
    path: '/api/method/frappe.async.get_user_info?sid=d6a879fd58a5fb6f1b93d69bb9fd85af3de2b1573f56430a490a1594' },
 accepted: false,
 noContent: false,
 badRequest: false,
 unauthorized: false,
 notAcceptable: false,
 forbidden: false,
 notFound: false,
 type: 'text/html',
 links: {},
 setEncoding: [Function: bound ],
 redirects: [] } }

listening on *: 9000

1 Like

Hi trentmu,

Mind to share on how to install gevent in frappe gunicorn ?

Thanks.
Subhajit

I used

sudo pip install gevent eventlet  #I add these in for gunicorn debugging

Hi! How did you do gunicorn debugging?

It was a long time back, so I don’t remember exactly, but I think I searched in the node-socketio.log

Is sit for Vistual Studio Code?

1 Like