ERPNext High Availability Production Deployment (HA)

I would like to discuss best practices for a high-availability production deployment of ERPNext. Besides our customers, who view and place orders, we have two factories that depend on the system being up and running.

We currently run ERPNext (bench + Redis + MariaDB, etc.) on a bare-metal Ubuntu Linux server located in a datacenter. We do regular backups, but after a catastrophic server failure we would need to set up a new server and restore the files and the DB from those backups. The downtime would probably be about 1 hour if everything goes well. Instead, I would like to set up a highly available web service so that we can still run our company during a server failure or internet connection problems.

Our internet connection is unfortunately not 100% reliable. We have written a connector app that reads and writes information between ERPNext and our assembly machines. If the ERPNext server cannot be reached (e.g. the internet connection is down), our machines are down! We mitigate this with a failover firewall and two separate internet connections, but last year both ISPs were down for about 3 hours during a thunderstorm. At our second facility, both providers went down when a truck crashed into the telephone lines and ripped down the DSL and fibre connections.

MariaDB Cluster

  1. Set up a MariaDB cluster with 3 nodes (web, factory 1, factory 2)
  2. Install Bench on 3 bare metal machines (web, factory 1, factory 2)
  3. Install a load balancer that distributes requests across the 3 servers depending on availability, or just access the local servers by hostname, e.g. www.company.com, factory1.company.com and factory2.company.com
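
To make the "depending on availability" part of step 3 concrete, here is a minimal sketch of what a client-side fallback could look like. It only illustrates the idea and is not a replacement for a real load balancer; the port, the timeout, and the preference order are my assumptions, and only the hostnames come from the list above.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def pick_backend(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first host that passes the probe, or None if none do."""
    for host in hosts:
        if probe(host):
            return host
    return None


# Assumed preference order for a client inside factory 1: local server first.
FACTORY1_PREFERENCE = [
    "factory1.company.com",
    "www.company.com",
    "factory2.company.com",
]
```

The connector app in factory 1 could call pick_backend(FACTORY1_PREFERENCE) before each batch of requests. Passing the probe as a parameter keeps the selection logic testable without a network.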

This should be possible with bench and the framework. Does somebody have experience with a setup of two servers running bench and a MariaDB Galera Cluster? I want to avoid a Master-Slave setup. The downside would definitely be that we would need to run bench update manually on every server.
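
One thing to check against the goal of keeping a factory running while it is offline: Galera only lets a network partition keep accepting writes if it still holds a strict majority of the cluster. The sketch below works through that rule for the 3-node layout above, assuming every node has the same (default) weight. The function name is made up for illustration.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if a partition holds a strict majority of equally weighted nodes."""
    return 2 * reachable_nodes > cluster_size


CLUSTER_SIZE = 3  # web, factory 1, factory 2

# Factory 1 loses both internet links and can only see itself.
isolated_factory = has_quorum(1, CLUSTER_SIZE)

# The web node and factory 2 can still see each other.
remaining_pair = has_quorum(2, CLUSTER_SIZE)

print(f"isolated factory keeps quorum: {isolated_factory}")
print(f"remaining pair keeps quorum:   {remaining_pair}")
```

Under these assumptions the isolated factory node does not keep quorum, so a 3-node cluster by itself would not let a cut-off factory continue writing to its local database.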

Another question would be how to synchronize the private and public files.

Docker

  1. Set up a node cluster online
  2. Set up 3 nodes in the availability zones (web, factory 1, factory 2)
  3. Separate the services that bench starts in docker
  4. Deploy the stack

We would have to make sure the containers in the different availability zones will work when cut off from each other.

Final Thought

I don’t want to jump on the Docker hype, and no, we are not Google, so we don’t need to scale for millions of users. I just want to think about how to have a system that keeps running in case of failure.

Regardless of our specific case, I think an HA setup of ERPNext that is scalable and survives a server failure without user-facing problems is an interesting discussion to have. These are just some ideas; I would appreciate input from people who have had similar experiences!

8 Likes

Unless you need more than 4 nines (99.99%) of availability, a multi-server active/active architecture is not needed. Master/Slave with manual failover is actually a really good option for many scenarios.
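
To put "4 nines" into numbers, here is a quick back-of-the-envelope calculation of the yearly downtime budget. It assumes a 365-day year and counts every minute of the year as service time; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (e.g. 4 -> 99.99%)."""
    return MINUTES_PER_YEAR / 10 ** nines


for n in (2, 3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Four nines works out to a bit under 53 minutes per year, so a single restore from backup that takes about 1 hour, as estimated in the original post, would already use up more than that budget.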

  • If you’re running a MariaDB cluster, you’re going to want a minimum of 5 servers: 3 for the cluster and 2 for HAProxy (or similar) to distribute requests, with a Keepalived virtual IP floating between the HAProxy servers. In addition, you’ll most likely want to install garbd (the Galera Arbitrator daemon) on the ERPNext application server. Running a cluster in production over time is not for the faint of heart. Setting up a cluster is easy. Running one is complex. We actually ran a cluster for a year before moving back to a master-slave architecture, and one consideration in this decision was that cluster-related nuances caused us more downtime than any other issue. If you don’t have the expertise in this area, either don’t do it, or invest in a tool like Severalnines ClusterControl (the enterprise edition is NOT cheap) to help.
  • I would not install bench on the Galera cluster nodes; that will cause a lot of issues with regard to caching and sessions. You will want two separate app servers (apart from the 5 mentioned above) to handle ERPNext and Bench. Just use rsync to keep the public/private files in sync. In an ideal scenario, you would use S3 or NFS for storage, but I have not seen anyone accomplish that publicly.
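
For the rsync part, a rough sketch of a one-way sync from the primary app server to the second one could look like the following. The bench path, site name and target hostname are placeholders I made up, so adjust them to your setup. It assumes rsync over SSH with key-based login, and it would typically be run from cron.

```python
import subprocess
from typing import List

# Hypothetical values, not taken from this thread.
SITE_DIR = "/home/frappe/frappe-bench/sites/site1.local"
TARGET_HOST = "app2.company.com"
FILE_DIRS = ("public/files", "private/files")


def build_rsync_command(source_dir: str, target_host: str, target_dir: str) -> List[str]:
    """Build an rsync call that copies the contents of source_dir to the target."""
    source = source_dir.rstrip("/") + "/"   # trailing slash: copy the contents
    target = target_dir.rstrip("/") + "/"
    # -a preserves permissions and timestamps, -z compresses during transfer.
    return ["rsync", "-az", source, f"{target_host}:{target}"]


def sync_site_files() -> None:
    """Push public and private files to the second app server."""
    for sub in FILE_DIRS:
        path = f"{SITE_DIR}/{sub}"
        subprocess.run(build_rsync_command(path, TARGET_HOST, path), check=True)


if __name__ == "__main__":
    sync_site_files()
```

This deliberately leaves out --delete, so files removed on the primary stay on the second server. Whether deletions should be mirrored depends on which server you treat as the source of truth.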

For a setup that needs 4 nines or less, a single master/slave MariaDB pair, a Redis server, and two app servers kept in sync with rsync will honestly do. Keep it simple.

6 Likes