Intermittent 502 errors with AWS ALB

We are running our ERPNext v13 installation on AWS. Our basic architecture looks like this:

DNS → ALB → EC2 instance → RDS instance

We are running Redis on the EC2 instance itself. To ensure that a user always reaches the same EC2 instance (since their session details are stored in the Redis of that instance), we are using sticky sessions on the ALB.

There are, however, some random issues where the application throws 502 and 504 errors.

The cause of these errors is unknown because they occur irrespective of load and of the particular API call.

When we checked the logs, the requests that returned 502 never made it to nginx (they are not in the access logs).

We collected and analysed tcpdump data to check whether the issue was in our application or in the ALB. It shows that the requests reach our EC2 instance, but no response is ever sent back to the load balancer (because the requests never made it to nginx).

This goes on for some time, and during that time the requests result in 502 errors.

What is causing the application to not respond to the incoming requests?

Error 502 means Bad Gateway. This means that the gateway (here, the ALB) did not get a valid response from the server behind it. This might be a problem with your AWS server connection. It can happen on any AWS connection, just as it happens even with sites like Gmail, Facebook, etc. In other words, this is a network problem. Your EC2 and the software it contains are okay.

ALB is saying 502 because it is not getting a reply from the EC2 - so yes it is a network issue for ALB.

What I want to know is why is EC2 not responding to those TCP requests?

That is because an AWS EC2 is the most basic, and I might say, primitive cloud deployment option. However, ERPNext is also very well suited for AWS EC2.

I am certified in cloud computing (professional level) and specialize in corporate software using React, GraphQL, TypeScript, Python, and Go for frontends and API services, and MariaDB, PostgreSQL, and MongoDB for databases, deployed as microservices in Docker containers and Kubernetes clusters on cloud platforms like AWS, AlibabaCloud, Google Cloud, and Contabo.

For mission-critical, always accessible corporate software, I deploy the front-end as microservices on Kubernetes clusters in multiple regions, and the backend consists of database services provided by the cloud platform itself. Within a Kubernetes cluster, there are several nodes (VPS like AWS EC2) running, and Kubernetes constantly checks whether each node is accessible and healthy. If it detects an unhealthy node, that node is decommissioned and replaced with a new one (by Kubernetes itself). Multiple regions means one cluster might be in the US and another in Singapore. This mitigates the risk of one data center becoming inaccessible, since other regions will still serve the clients. Also, clients in Asia will be served by the data center in Asia, while clients in the US will be served in the US. This solution is quite expensive for small or even medium-sized companies.

I described this so you can understand that having occasional 502’s on a low-end AWS EC2 is not too bad.

I just noticed, is there a reason why you are using an ALB? I think it is impractical to use an ALB with ERPNext.

Yes - ALB is there so that we can have at least 2 EC2 instances running and serving the requests.

Do you think ALB is a bad idea?

I thought that you only had one EC2 instance, which is why I wondered why you needed an ALB. But actually, I think the ALB adds an unnecessary layer of complexity to your ERPNext. It might be good to just have separate URLs for your two EC2 instances, so that if you cannot access one URL, you can go directly and specifically to the other URL.

If you’re using an ALB, try configuring the health checks.

Write a simple stateless service that responds with a meaningful status or error.

When these health checks fail, the ALB will get a meaningful response that you can debug further.

A simple health check could test the connection to the DB and the 3 Redis hosts.
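A minimal sketch of such a health check service in Python, using only the standard library. It only tests whether a TCP connection to each dependency opens, which is weaker than a real database query or Redis PING. The database hostname, the Redis ports (taken here to be the usual bench defaults) and the listening port 8081 are all assumptions to replace with your own values.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical targets: replace with your own RDS endpoint and Redis ports.
TARGETS = {
    "db": ("my-db.example.internal", 3306),
    "redis_cache": ("127.0.0.1", 13000),
    "redis_queue": ("127.0.0.1", 11000),
    "redis_socketio": ("127.0.0.1", 12000),
}


def tcp_ok(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_report(targets, timeout=1.0):
    """Check every target; return (http_status, per-target results)."""
    results = {
        name: tcp_ok(host, port, timeout)
        for name, (host, port) in targets.items()
    }
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with a JSON body naming each dependency's state."""

    def do_GET(self):
        status, results = health_report(TARGETS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8081):
    """Run the health endpoint until interrupted (port is an assumption)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. If the ALB target group health check points at that port, a failing check returns 503 with a JSON body showing which dependency was unreachable.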

Yes but our main application can have only 1 URL, right?

So we were considering the ALB for HA purposes, in case of an AZ failure or in case we need to add more servers as the request volume on our application increases.

For AWS ALB, we have 2 load balancer nodes behind the scenes, which make the ALB multi-AZ.

In our case one node is able to serve traffic and get a response to its health check, but the other one does not get a response back.

Therefore, although the overall health checks of the ALB do not fail, we are still unable to serve half of our traffic, since the EC2 won’t reply to one ALB node.

No. You will have two different URLs.
I am making this suggestion because you seem to have problems with your ALB.
You need to determine what causes your problem. Is it the ALB? Or is it the EC2?
Are you using DNS-based multitenancy, or is it port-based? I am asking this because it affects the nginx setup.
Many elements of ERPNext are tightly coupled because the system was conceptualized as an all-in-one, single-installation setup.

We don’t have multi-tenancy. Once a request comes in, nginx simply proxies it to port 9000 for socketio or to port 8000 for a normal call.

It’s a single domain name which points to the same ALB for all requests.

You can try this at the DNS level: point your official URL to the ALB.
Then you can have another set of URLs, each pointing to an EC2.
If you get a 502 on the ALB, try the EC2 URLs.
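A rough sketch of that comparison in Python, standard library only: it requests each URL once and records the HTTP status, or `None` when no HTTP response arrives at all. The function names are made up for illustration, and the hostnames in the usage note are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 502 and 504) land here.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, or a dropped connection.
        return None


def probe_all(urls, timeout=5.0):
    """Probe every named URL and return {name: status code or None}."""
    return {name: probe(url, timeout) for name, url in urls.items()}
```

For example, `probe_all({"alb": "https://erp.example.com/api/method/ping", "ec2_a": "https://erp-a.example.com/api/method/ping"})` run during an incident would show whether the EC2 answers directly while the ALB URL returns 502. Both hostnames are placeholders.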

I must warn you: you might think that deploying several frontend EC2s and separating the database instance will solve the accessibility issues for ERPNext.

However, unlike modern stateless frontends, where the HTML page is composed entirely at the front-end node, Frappe’s webpage definitions are stored in the same database that stores the data.

So there is still the same bottleneck on the database side, because the MariaDB or PostgreSQL database server has to do work to compose the webpage. This was a nice design ten years ago, when accessibility was not an issue, there was no alternative solution, and the upgrade path was vertical (get a more powerful server, increase server memory) instead of today’s horizontal one (add more cheap servers for distributed computing).

Today, however, when it is simple and commonplace to have separate virtual or physical servers, the frontend is programmed to be as stateless as possible, and the HTML and CSS pages are compressed and minified, or even made static instead of dynamic.

What I am saying is that an ALB is a solution for today’s distributed computing, but I do not think it works for the monolithic, data-centric setup of ERPNext.

This seems insightful. Let me try this, and I’ll get back to you in case these errors come back again.