API deadlock in Production but not in Development! This is driving me INSANE?

tldr;

  1. custom sales_invoice.js client code invokes custom sales_invoice.py code on server.
  2. custom sales_invoice.py code calls app in local nodejs server
  3. local nodejs code tries to GET Sales_Invoice record through API … 2 minute lockup!!!

All of ERPNext remains frozen for 2 minutes.

Only happens when ERPNext launched by supervisorctl.

details

I have an app installed in ERPNext. It works in all the ways I intended except for a problem when starting with supervisorctl. There is no failure, just an in explicable temporary lock up,

The app customizes 2 files:

accounts/doctype/sales_invoice/sales_invoice.js
accounts/doctype/sales_invoice/sales_invoice.py

It has an embedded NodeJs app that connects to remote services.

When the user selects a specific menu item, the client-side tells the server side to send the sales invoice number to the NodeJs app.

The NodeJs app calls back into ERPNext API to request the full invoice record and the customer record.

After processing and remote connections, the NodeJs app PUTs updates to some fields in the Sales Invoice and then responds with a status message.

The server-side returns the message to the client-side which refreshes itself, showing the fields that were altered by the NodeJs app through the API.

This all works correctly with bench start and the NodeJs launcher added to ProcFile.

The problem occurs with supervisorctl and the NodeJs launcher added to supervisor.conf.

The server-side calls the NodeJs app with the ID of the invoice. Immediately the NodeJs app calls the API, but the entire API hangs completely for 2 minutes!!!

I use Postman to make other calls to the API at the same time, but they also get no response until finally something times out and then all the calls complete and everything works just as in development mode with bench start.

The client-side shows a popup: “Request timed out.” Refreshing the page shows that everything worked as expected.

The lockup also happens if I start the NodeJs app manually and do not include it in supervisor.conf.

I am using V13 in Ubuntu 20.04.


My questions:

Why would an outbound connection from Python lock up the entire API fo two minutes in Production mode but not in Development mode?

How can I find out the reason for it happening?

How can I see the output of Python print statements when using supervisorctl?

on production you can check web.log, web.error.log

Logged results below the dashed line are relevant to the test.
Lines above it were logged after purging all logs and then running supervisorctl restart all.

[2020-08-24 18:40:14 +0200] [1306] [INFO] Starting gunicorn 19.10.0
[2020-08-24 18:40:14 +0200] [1306] [INFO] Listening at: http://127.0.0.1:8000 (1306)
[2020-08-24 18:40:14 +0200] [1306] [INFO] Using worker: sync
/home/erpdev/frappe-bench-EELS/env/lib/python3.8/os.py:1023: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  return io.open(fd, *args, **kwargs)
[2020-08-24 18:40:14 +0200] [1348] [INFO] Booting worker with pid: 1348
< mark before test >
--------------------
/home/erpdev/frappe-bench-EELS/env/lib/python3.8/os.py:1023: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  return io.open(fd, *args, **kwargs)
[2020-08-24 18:44:42 +0200] [1306] [CRITICAL] WORKER TIMEOUT (pid:1348)
[2020-08-24 18:44:42 +0200] [1348] [INFO] Worker exiting (pid: 1348)
[2020-08-24 18:44:43 +0200] [1389] [INFO] Booting worker with pid: 1389

Here you can see that NodeJs closed the connection after waiting for ERPNext for 2 minutes…

Traceback (most recent call last):
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1332, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 719, in urlopen
    retries = retries.increment(
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1332, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/erpdev/frappe-bench-EELS/apps/frappe/frappe/app.py", line 64, in application
    response = frappe.api.handle()
  File "/home/erpdev/frappe-bench-EELS/apps/frappe/frappe/api.py", line 58, in handle
    return frappe.handler.handle()
  File "/home/erpdev/frappe-bench-EELS/apps/frappe/frappe/handler.py", line 30, in handle
    data = execute_cmd(cmd)
  File "/home/erpdev/frappe-bench-EELS/apps/frappe/frappe/handler.py", line 69, in execute_cmd
    return frappe.call(method, **frappe.form_dict)
  File "/home/erpdev/frappe-bench-EELS/apps/frappe/frappe/__init__.py", line 1086, in call
    return fn(*args, **newargs)
  File "/home/erpdev/frappe-bench-EELS/apps/erpnext/erpnext/accounts/doctype/sales_invoice/sales_invoice.py", line 1706, in electronic_invoice
    response = requests.post(url, data = payload, headers={"Content-Type": "application/json"})
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/erpdev/frappe-bench-EELS/env/lib/python3.8/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Log directory before test:

total 48
drwxr-xr-x 2 erpdev erpdev 4096 Aug 24 18:40 ./
drwxr-xr-x 7 erpdev erpdev 4096 Aug 24 18:20 ../
-rw-r--r-- 1 erpdev erpdev  273 Aug 24 18:40 bench.log
-rw-r--r-- 1 erpdev erpdev  192 Aug 24 18:40 frappe.log
-rw-r--r-- 1 erpdev erpdev  370 Aug 24 18:40 frappe.web.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 node-socketio.error.log
-rw-r--r-- 1 erpdev erpdev   76 Aug 24 18:40 node-socketio.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 redis-cache.error.log
-rw-r--r-- 1 erpdev erpdev 1241 Aug 24 18:40 redis-cache.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 redis-queue.error.log
-rw-r--r-- 1 erpdev erpdev 1241 Aug 24 18:40 redis-queue.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 redis-socketio.error.log
-rw-r--r-- 1 erpdev erpdev 1241 Aug 24 18:40 redis-socketio.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 schedule.error.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 schedule.log
-rw-r--r-- 1 erpdev erpdev 2693 Aug 24 18:40 scheduler.log
-rw-r--r-- 1 erpdev erpdev  499 Aug 24 18:40 web.error.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 web.log
-rw-r--r-- 1 erpdev erpdev    0 Aug 24 18:40 worker.error.log
-rw-r--r-- 1 erpdev erpdev  543 Aug 24 18:40 worker.log

Log directory after test:

total 84
drwxr-xr-x 2 erpdev erpdev  4096 Aug 24 18:40 ./
drwxr-xr-x 7 erpdev erpdev  4096 Aug 24 18:20 ../
-rw-r--r-- 1 erpdev erpdev   273 Aug 24 18:40 bench.log
-rw-r--r-- 1 erpdev erpdev  1535 Aug 24 18:42 frappe.log
-rw-r--r-- 1 erpdev erpdev 10654 Aug 24 18:44 frappe.web.log
-rw-r--r-- 1 erpdev erpdev     0 Aug 24 18:40 node-socketio.error.log
-rw-r--r-- 1 erpdev erpdev    76 Aug 24 18:40 node-socketio.log
-rw-r--r-- 1 erpdev erpdev     0 Aug 24 18:40 redis-cache.error.log
-rw-r--r-- 1 erpdev erpdev  1241 Aug 24 18:40 redis-cache.log
-rw-r--r-- 1 erpdev erpdev     0 Aug 24 18:40 redis-queue.error.log
-rw-r--r-- 1 erpdev erpdev  1241 Aug 24 18:40 redis-queue.log
-rw-r--r-- 1 erpdev erpdev     0 Aug 24 18:40 redis-socketio.error.log
-rw-r--r-- 1 erpdev erpdev  1241 Aug 24 18:40 redis-socketio.log
-rw-r--r-- 1 erpdev erpdev     0 Aug 24 18:40 schedule.error.log
-rw-r--r-- 1 erpdev erpdev     0 Aug 24 18:40 schedule.log
-rw-r--r-- 1 erpdev erpdev  9666 Aug 24 18:49 scheduler.log
-rw-r--r-- 1 erpdev erpdev   967 Aug 24 18:44 web.error.log
-rw-r--r-- 1 erpdev erpdev  4616 Aug 24 18:44 web.log
-rw-r--r-- 1 erpdev erpdev     0 Aug 24 18:40 worker.error.log
-rw-r--r-- 1 erpdev erpdev 12639 Aug 24 18:48 worker.log

So the modified files are:

  1. frappe.log
  2. frappe.web.log
  3. scheduler.log
  4. web.error.log
  5. web.log
  6. worker.log

3 & 6 :: scheduler.log and worker.log show what they always show, unchanged.
2 :: frappe.web.log shows page loads
1 :: frappe.log shows my INFO log line just prior to the call to NodeJS

2020-08-24 18:42:42,833 INFO frappe Ready to call NodeJs with payload :: { "invoiceId": "001-002-000002780", "access_key": "2707202001179217775800110010020000027801234567813" }; URL http://erp.erpnext.host:5500/mail 
Site: erp.erpnext.host
Form Dict: {'invoice_id': '001-002-000002780', 'command': 'mail', 'cmd': 'erpnext.accounts.doctype.sales_invoice.sales_invoice.custom_stuff'}

My logging from NodeJS shows execution held on call back to ERPNext:

    const options = { headers: { Authorization: `token ${process.env.ERPNEXT_TOKN}` } };
    let rslt = {};
    const urlInvoice = `${env.REST_SVC}/${env.INVOICE}/${invoiceId}`;
    try {
      console.log(`GETTING INVOICE :: '${urlInvoice}'.  tkn: ${process.env.ERPNEXT_TOKN}`);
      rslt = await axios.get(urlInvoice, options);
      console.log('GOT INVOICE');
    } catch (error) {
      msg = `ERROR when getting invoice #${invoiceId}:\n ${JSON.stringify(error, null, 2)}`;
      console.error(msg);
    }

When running from bench start there is no perceptible delay between

GETTING INVOICE :: 'http://erp.erpnext.host:8000/api/resource/Sales Invoice/001-002-000002780'.   tkn: c26.........bfe:bd8........a074
POST /mail - - ms - -
GOT INVOICE

Through all of that I found only two bits of useful information:

  1. The actual timeout happens with [CRITICAL] WORKER TIMEOUT (pid:1348)
  2. The last action before freeze up is NodeJs calling into ERPNext API

I sent the preceding in a rush because I was in the middle of something else.

I’d like to say that I do very much appreciate your willingness to have a look at it.

I am as much interested in what steps you would take to track down the cause as in learning the exact solution.

Eg, is there a DEBUG level setting I could switch on for the API?

Have you tried increasing the http_timeout?

bench config http_timeout 6000

The deadlock continues for a full 2 minutes, during which time no part of the REST API can be used.

What’s the content of rslt?

In my opinion, since it works with bench start, the API part is ok.

Problems like this is usually related to timeout parameter

On the ERPNext side, it is NGINX. bench config http_timeout 6000 takes care of this

I note that you have your NodeJS system. You also need to check the nodejs part of your code. For example, assuming you are using express:

var server = app.listen(app.get('port'), function() {
  debug('Express server listening on port ' + server.address().port);
});
server.timeout = 6000;

supervisor needs nginx?

https instead of http while production mode?

whitelisted?

Or may be system trying to send mail and expired the timeout?

Js called python 1, python called nodejs 2, nodejs called python 3. supervisor queued 3 after 1. 3 waiting for 1 and 1 waiting for 3 or 1 never finish or 3 waiting forever? raise supervisor or worker or thread or process?

In production mode, nginx acts as the proxy server.
The bench command reconfigures nginx.conf

whitelisted has nothing to do with this.
You use white list to use python function in javascript js front-ends.

Under the wraps, frappe does so much heavy lifting to make code on the server (python) and front end (javascript, jinja, html, css) work like an orchestra.

@Joseph_Marie_Alba1, @adamtang79

I appreciate your thinking about this, I really do!

However, I’m trying to think through how timing could be the root cause here. So far I don’t see it.

Currently I have a work around …

cd ~/frappe-bench
nohup bench start > bench.out  2> bench.err &

… and …

cd ~/frappe-bench/apps/my_app/my_service
nohup npm run inspect > mysrv.out 2> mysrv.err &

Where my package.json script is:

  "scripts": {
    "inspect": "DEBUG='einv-svc:*' nodemon --exec babel-node --inspect=0.0.0.0:9229 ./bin/www.js"
  },

Launched in that way, each entire sequence of operations initiated by the user takes no more than 2 seconds … consistently. On the other hand, the lockout runs for a full 120 seconds.

It is almost as if developer mode is multi-threaded and production mode is single-threaded. This, such that a call out from server-side Python which causes a remote call back into the API would lock up because the server-side Python would stall all and any other execution until it returns or times out.

Can supervisor serve rest without nginx?

If nginx is there may be calling rest with different port than 8000?

Any place to raise number of process max?

Well actually I have set it up so NGinx is not involved between ERPNext an NodeJs

While it is all working on the open Internet with HTTPS and a LetsEncrypt certificate, no port numbers are seen by end users. NGinx handles passing control to the correct port.

Meanwhile calls between ERPNext and NodeJs use port numbers explicitly.

My /etc/hosts has the host URL recorded as 127.0.0.1

So the internal calls pass effectively from ERPNext to 127.0.0.1:5500 and then from NodeJs to 127.0.0.1:8000.

Do you think supervisor intervenes in the process? I had not considered that possibility.

I had understood that supervisor is only concerned with stopping, starting and keeping process running. Does it queue calls between the processes it starts?

Ok. It is clear that the issue is not related to API deadlock in production since the NodeJS is trying to access port 8000. Production mode is on port 80 for http and port 443 for https.

Are you port based or domain based?

I have not yet confirmed or tested on below statements.

call through nginx when in production mode or supervisor is there.

There is always a delay if target backend can’t return to axios call.

But first try to raise those 3 workers numprocs to 3 or 8. I’m curious.