[Performance] 1 Sales Invoice Save makes ~700+ DB calls. ~1000+ DB calls on Submit

tmatteson · August 8, 2018, 6:14pm

This is phenomenal, great cooperation and outcome. Looking forward to the sprint before the conference.

Sachin_Mane · August 8, 2018, 7:04pm

Thank you @rmehta for opportunity to work with you guys. Lot of new things learned. We will write about it in detail. Look forward for many such collaborative sprints.

jai_kejriwal · August 9, 2018, 2:24am

@rmehta @Sachin_Mane Awesome work. Can someone write a detailed blog/article describing how this improvement was achieved?

Sachin_Mane · August 9, 2018, 2:13pm

Yes. Will summarise changes

jpp · August 9, 2018, 10:02pm

Has this improved just the sales invoice or all doctypes?

Sachin_Mane · August 10, 2018, 8:46am

~4 times lighter and faster ErpNext and more than 4 times happier Community

This article summarise the code changes done as part of the optimisation sprint we had in Frappe office 2 days back…

To start with, following is short ‘Before’ and ‘After’ summary -

Transaction - Submitting Sales Invoice containing 51 Items.

Bench Console

In [3]: si=frappe.get_last_doc('Sales Invoice')

In [4]: len(si.items)
Out[4]: 51
In [2]: %prun -s cumulative si.submit()

Results

Before optimisations
Time Taken : 4.56s
Number of SQL calls : 3625

After optimisations
Time Taken : 1.511s [First Time] and 0.981s [Subsequent Calls]
Number of SQL calls : 1070

It was evident that in a single transaction, same document was getting read from database again and again. get_doc and get_value method was getting called multiple times for the same document and same field.

General approach was to cache repetitively accessed documents in memory to reduce number of database calls. Following is the list of key changes done.

get_cached_doc instead of get_doc

First optimisation done was to cache documents in the local in-memory request level cache. A new request level cache is added in `frappe/__init__py`

def init(site, sites_path=None, new_site=False):
	"""Initialize frappe for the current site. Reset thread locals `frappe.local`"""
       ....
       ....
       ....
       local.document_cache = {}

NOTE -
frappe.local is a request level cache and gets cleared before each request.
frappe.flags is also request level cache

A new method frappe.get_cached_doc is introduced as follows -

def get_cached_doc(*args, **kwargs):
	if args and len(args) > 1 and isinstance(args[1], text_type):
		key = get_document_cache_key(args[0], args[1])
		# local cache
		doc = local.document_cache.get(key)
		if doc:
			return doc

		# redis cache
		doc = cache().hget('document_cache', key)
		if doc:
			doc = get_doc(doc)
			local.document_cache[key] = doc
			return doc

	# database
	doc = get_doc(*args, **kwargs)

	return doc

The above method implements progressive checks in local request cache then in global redis cache and then falls back to database to fetch a document.

the get_doc method is also modified to put the documents retrieved from database into cache as follows -

def get_doc(*args, **kwargs):
	"""Return a `frappe.model.document.Document` object of the given type and name.
         ...
         ...
         ...
	# set in cache
	if args and len(args) > 1:
		key = get_document_cache_key(args[0], args[1])
		local.document_cache[key] = doc                    <-- set in request level cache
		cache().hset('document_cache', key, doc.as_dict()) <-- set in global redis cache

and then there were selective find and replace based on the %prun output analysis. e.g. One sample %prun record showing lot of time spent in get_item_details method.

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
50    0.002    0.000    **1.809**    0.036 get_item_details.py:16(get_item_details)

All such high time taking methods were identified and get_doc calls were replaced with get_cached_doc in the same.

Following is %prun output after this change

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
50    0.002    0.000    **0.529**    0.011 get_item_details.py:19(get_item_details)

get_cached_value instead of get_value

Similar to ```get_doc```, ```get_value``` was another method which was getting called many time for same doctype and field. This was another candidate to be cache enabled.

Following method is introduced which internally calls get_cahed_doc and avoids a database call altogether.

def get_cached_value(doctype, name, fieldname, as_dict=False):
	doc = get_cached_doc(doctype, name)        <-- Gets cached doc. 1 db call saved per call
	if isinstance(fieldname, text_type):
		if as_dict:
			throw('Cannot make dict for single fieldname')
		return doc.get(fieldname)

	values = [doc.get(f) for f in fieldname]
	if as_dict:
		return _dict(zip(fieldname, values))
	return values

Then another selective find and replaced was done after analysing %prun output. Following number of occurrences of get_value were replaced

➜  apps grep -R 'get_cached_value' . | grep -v pyc | wc -l
     186

Fixed Meta caching

Rushabh found out that doctype meta-data was not actually getting saved in redis cache. It was failing to serialise doctype metadata and silently eating up the exception.

A custom serialisation method was introduced to serialize meta data as follows -

	def as_dict(self, no_nulls = False):
		def serialize(doc):
			out = {}
			for key in doc.__dict__:
				value = doc.__dict__.get(key)

				if isinstance(value, (list, tuple)):
					if len(value) > 0 and hasattr(value[0], '__dict__'):
						value = [serialize(d) for d in value]
					else:
						# non standard list object, skip
						continue

				if (isinstance(value, (frappe.text_type, int, float, datetime, list, tuple))
					or (not no_nulls and value is None)):
					out[key] = value

			return out

		return serialize(self)

After the custom serialization, redis started caching it. This was a big improvement reducing number of database calls significantly as it was getting used very frequently.

Although, these changes would improve performance of most of the transactions because of common codebase, we should %prun analyse every transactions to identify most time taking methods and selectively replace get_doc() and get_value() with get_cached_doc() and get_cached_value().

jai_kejriwal · August 10, 2018, 11:26am

Great work

felix · August 10, 2018, 3:58pm

I wonder how this will impact recommendations on how much memory to allocate to redis cache. Right now, even for many concurrent users (100+), a small amount of memory for redis cache is enough. It looks like that will change with this.

akurungadam · August 11, 2018, 5:47am

awesome! way to go, guys!

jai_kejriwal · August 13, 2018, 4:26am

Have you joined the developer group on Telegram. Here is the link:

Himanshu_Mishra · August 16, 2018, 10:27am

I have often encountered the fact that i needed to call a get_doc only to call its instance methods. Often, these instance methods didn’t interact with any of the child tables, and these docs were loaded in hooks, hence, weren’t part of response.

Also, child tables can often be very large. Lazy loading of child tables can surely be a helpful addition to get_doc. Also, lazy loading should be a logical decision, instead of a configuration, which can be implemented by an additional argument called “lazy_load_children” which can be passed to get doc like

frappe.get_doc(doctype, docname, lazy_load_children=True)

bhavikpatel7023 · August 16, 2018, 6:00pm

Based on SP, I think we can optimize ERPNext in better way. Time may less then 1 sec.

Nice analysis @Sachin_Mane
Good work @Sachin_Mane @rmehta

bkm · August 16, 2018, 6:52pm

I worry also about the potential for increased RAM memory requirements of the host platform. Memory is the single largest contributing factor to server costs. So a system with 4 or 5 gunicorn users and 12 live users logged in and processing sales, there could be an increased memory requirement of close to 500mb? Does that sound right?

Memory usually comes in 1gb increments and costs an additional $6 to $10 per month on the server fees invoice.

Have I read this correctly? Is that a reasonable expectation of additional memory overhead?

BKM

felix · August 16, 2018, 7:30pm

If you don’t allocate more memory to redis cache, then that will just mean more cache misses and lead to performance similar to current implementations.

You don’t have to increase redis cache if you don’t want to - everything should still work the same as before (i believe - please correct me if I’m wrong).

The analogy is with the innodb_buffer_pool MariaDB variable you’ve most likely modified in other implementations. You can have the value small, but more memory means better performance. It’ll be the same for this. Finding the recommended values will be the key - and everyone’s value proposition will be different. But until this is tested and someone reports the results, we won’t really know recommendations.

Sachin_Mane · August 17, 2018, 2:47am

Lazy loading is good idea. However it may complicate the caching approach. With document cache enabled, lazy loading wont be needed.

Sachin_Mane · August 17, 2018, 2:51am

Memory is cheap nowadays. You can choose your server configuration based on your current memory utilisation.

It is a tradeoff and as @felix mentioned, it is upto you to decide how much memory should be allocated.

Himanshu_Mishra · August 17, 2018, 3:54am

get_cached_doc and get_cached_value won’t be implementing lazy loading param.

We use get cached doc for master data which is read multiple times, while lazy loading will be for documents which are read only once, and when reading its child tables is not necessary. That way, we can still use instance methods(Without having to manually create instance objects of document classes) without loading child table values (which can be in hundreds in some cases)

Sachin_Mane · August 17, 2018, 7:39am

Makes sense to me.

tmatteson · August 20, 2018, 4:19pm

A design question: currently when get_cached_doc and get_cached_value are called with a bad key they return a DoesNotExist Error rather than None. Is there a reason it works this way? I would think you’d want the two methods to have symmetrical APIs.

Himanshu_Mishra · August 25, 2018, 8:15am

Good point. get_cached_value should return None, while get_cached_doc only should return DoesNotExist Error.