
[Performance] 1 Sales Invoice Save makes ~700+ DB calls. ~1000+ DB calls on Submit

performance

#1

Dear all,
We added a print statement in database.py as follows -

def sql(self, query, values=(), as_dict=0, as_list=0, formatted=0,
		debug=0, ignore_ddl=0, as_utf8=0, auto_commit=0, update=None):
	"""Execute a SQL query and fetch all rows. .... """
	print('QUERY = ', query, values)

and measured the number of queries fired on Sales Invoice save and submit operations (select a sales order, click Make Sales Invoice).
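As an alternative to printing every query, the same measurement can be done with a small counting wrapper that tallies queries per table. A minimal sketch, where fake_sql is a stub standing in for frappe.db.sql (the real method is not imported here):

```python
# Count DB calls per table by wrapping an sql() method, instead of
# printing every query. fake_sql is a stub for frappe.db.sql.
from collections import Counter
import functools
import re

query_counter = Counter()

def count_queries(sql_func):
    """Wrap an sql() function to tally queries by the first `tab...` table name."""
    @functools.wraps(sql_func)
    def wrapper(query, *args, **kwargs):
        match = re.search(r"`(tab[^`]+)`", query)
        query_counter[match.group(1) if match else "<other>"] += 1
        return sql_func(query, *args, **kwargs)
    return wrapper

# --- demo with a stub standing in for frappe.db.sql ---
def fake_sql(query, values=None):
    return []

sql = count_queries(fake_sql)
sql("select * from `tabItem` where `name` = %(name)s", {"name": "toor"})
sql("select * from `tabItem` where `name` = %(name)s", {"name": "toor"})
sql("select * from `tabSingles` where doctype = 'Print Settings'")

print(query_counter.most_common())
```

Sorting the counter at the end of a request gives a per-table breakdown like the one below without flooding the console.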

First things first -
Is this the right way to measure the number of DB calls?

Another declaration -
I am yet to analyse this data fully. Will keep sharing findings. Starting this thread just to get some early inputs if anyone has any. We may be completely wrong.

Following are the findings -

Sales Order (and hence invoice) had 9 Items in it

  1. Save = Around 778 database calls
  2. Submit = Around 1050 database calls.
tabSingles 	60
tabItem 	56
tabCurrency 	39
tabBin 	27
tabAccount 	26
tabSales Invoice 	22
tabSales Order 	21
tabUOM Conversion Detail 	18
tabItem Supplier 	18 (custom)
tabItem Tax 	18
tabItem Variant 	18
tabItem Company Settings 	18 (custom)
tabItem Customer Details 	18
tabItem Quality inspection parameters 	18
tabItem Reorder 	18
tabItem Website Specification 	18
tabWebsite Item Group 	18
tabAttachment Link 	18
tabItem Group 	17
tabSales Invoice Item 	12
tabCompany 	12
tabDocType 	11
tabCustomer 	11
tabProject 	11
tabPrice List 	11
tabWarehouse 	10
tabCost Center 	10
tabSales Order Item 	9
tabSales Taxes and Charges Template 	9
tabItem Price 	9
tabItem 	9
tabProduct Bundle 	9
tabPurchase Order 	9
tabPurchase Receipt Item 	9
tabDocType 	8
tabUOM 	7
tabSales Team 	7
tabAddress 	6
tabParty Account 	5
tabJournal Entry 	5
tabPayment Schedule 	5
tabCustomer Group 	4
tabSales Invoice Payment 	4
tabSales Invoice Timesheet 	4
tabPacked Item 	4
tabCommunication 	4
tabVersion 	3
tabContact 	2
tabSales Person 	1

We need to understand why so many DB calls are being made. Some could come from our custom code (we don't have much). However, we need to take another look at the code design in general. We are still digesting this data. In case anyone has looked into this aspect, can you please comment? We will keep sharing our findings here.

Our first observation is about the tables linked to Item. There are many child tables in the Item doctype on which repeat select queries are happening. A quick traceback.print_stack() showed the select queries originate from

File "/workspace/Dev/frappe-bench/apps/erpnext/erpnext/controllers/accounts_controller.py", line 200, in set_missing_item_details

Even queries to tables like Purchase Order, Purchase Order Item and Purchase Receipt Item originate from set_missing_item_details.


#2

Hi @Sachin_Mane,

This seems unusual. Do you use a special configuration? Can you comment on your setup and version? Independent of the number of queries being executed, do you see performance issues?


#3

Hi Sachin,

you are scaring us…

It's a huge number of DB calls. We do understand there are some repetitive DB calls to ensure data accuracy and consistency.

Since the beginning, I have had a question: why don't we use stored procedures and functions to improve performance?

Is it hard to manage them, or what…

Anyhow, nice observation, and I guess we can optimize the code to improve performance.


#4

Another reading with and without set_missing_item_details

With set_missing_item_details - No of calls - 607
Without set_missing_item_details - No of calls - 520


#5

We are running ERPNext: v10.0.8


#6

@Sachin_Mane Great analysis. I guess a lot of this could be avoided with request-level caching.


#7

@rmehta is request-level caching something we need to implement, or is it already available and we can configure it?

I was thinking we could mix in cache functionality declaratively, e.g. mark a doctype as connected to a redis/in-memory cache. All the read API calls to this doctype would then be wrapped to first try to return from the cache and, if not available, follow the usual course of getting from the DB. Inserts and updates would invalidate the cache. This would help with cross-request calls as well.

Doctypes holding master data like currency conversion, UOM conversion and UOM can be read from the cache.

This will address the multiple-calls problem. However, we need to see if there is any opportunity to tune the functionality itself. Can we do something concrete on it? How about a code sprint for this? Changes like this would need in-depth knowledge of the existing codebase.


#8

@adnan there is nothing to be scared of. The Frappe framework is a metadata-driven framework (doctype), and there are nicely defined common ways of doing CRUD operations. A large number of queries is, I believe, a small side effect of a metadata-driven design. It can be managed. When dealing with a metadata-driven framework, we need to forget about the database and concentrate on our metadata (domain doctypes); other things should fall into place. Hence I don't think the stored procedure approach should even be considered.


#9

Sprint is a good idea… maybe you guys can drive to Mumbai this time :sweat_smile:


#10

Sounds good. Early next week? We will also prepare ourselves in the meantime.


#11

tabSingles
One of the frequent queries to tabSingles comes from the print_settings_for_item_table function.

This function is called in almost all child table doctypes. A quick grep shows the following -

➜  erpnext git:(10.x.x-withrun) ✗ grep -R print_settings_for_item_table . | grep -v pyc | grep -v import
./erpnext/accounts/doctype/purchase_invoice_item/purchase_invoice_item.py:		print_settings_for_item_table(self)
./erpnext/accounts/doctype/sales_invoice_item/sales_invoice_item.py:		print_settings_for_item_table(self)
./erpnext/buying/doctype/purchase_order_item/purchase_order_item.py:		print_settings_for_item_table(self)
./erpnext/buying/doctype/supplier_quotation_item/supplier_quotation_item.py:		print_settings_for_item_table(self)
./erpnext/controllers/print_settings.py:def print_settings_for_item_table(doc):
./erpnext/selling/doctype/quotation_item/quotation_item.py:		print_settings_for_item_table(self)
./erpnext/selling/doctype/sales_order_item/sales_order_item.py:		print_settings_for_item_table(self)
./erpnext/stock/doctype/delivery_note_item/delivery_note_item.py:		print_settings_for_item_table(self)
./erpnext/stock/doctype/purchase_receipt_item/purchase_receipt_item.py:		print_settings_for_item_table(self)

The function is called in the __setup__ hook of the individual class, e.g.

class SalesInvoiceItem(Document):
	def __setup__(self):
		print_settings_for_item_table(self)

This function makes the following DB call -

	doc.flags.compact_item_print = cint(frappe.db.get_value("Print Settings", None, "compact_item_print"))

Print Settings, being a singleton doctype, is retrieved from tabSingles.

Can we simply comment this out? That would eliminate the per-item query to the tabSingles table in all the doctypes above.

Ideally this should be called only once per document. Or it could even be deferred until we actually print the document.
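The once-per-document idea could be sketched with a request-scoped cache. In the sketch below, _request_cache is a plain dict standing in for request-local storage (frappe.local in a real implementation), and get_value_from_db is a hypothetical stand-in for the frappe.db.get_value call; both names are assumptions for illustration:

```python
# Fetch the "compact_item_print" setting once per request and reuse it,
# instead of hitting tabSingles from every child row's __setup__.
db_calls = 0

def get_value_from_db():
    # Hypothetical stand-in for:
    # frappe.db.get_value("Print Settings", None, "compact_item_print")
    global db_calls
    db_calls += 1
    return 1  # pretend compact_item_print is enabled

_request_cache = {}  # stand-in for request-scoped storage (frappe.local)

def get_compact_item_print():
    if "compact_item_print" not in _request_cache:
        _request_cache["compact_item_print"] = get_value_from_db()
    return _request_cache["compact_item_print"]

# Nine item rows in a document all ask for the setting...
results = [get_compact_item_print() for _ in range(9)]
print(db_calls)  # 1 -- one tabSingles query instead of nine
```

The cache would be cleared at the end of each request, so a changed setting takes effect on the next request.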


#12

@Sachin_Mane, I don't think commenting this out is a nice idea, but what about memoizing the database methods with some memoization lib, like functools.lru_cache on Python 3? For more general options there are things like this https://gist.github.com/kwarrick/4247343, where you send the results of the call to redis and query redis instead of the database.

It won't solve all the issues regarding querying the database, but it should give better performance, since every time you re-send the same args, the result comes back from the cache.

Also, LRU is better because it automatically evicts the least recently used results as more calls come in, so little management is needed to ensure you have the most updated value.
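The lru_cache suggestion might look like the minimal sketch below. Here get_value is a hypothetical stand-in for a read-only DB lookup, not the real frappe.db.get_value; in a real integration, writes would still have to invalidate the cache.

```python
# Memoize a read-only lookup with functools.lru_cache. Repeated calls
# with the same arguments are served from the cache, not the database.
import functools

db_hits = 0

@functools.lru_cache(maxsize=1024)
def get_value(doctype, name, field):
    # Hypothetical stand-in for a real DB read
    global db_hits
    db_hits += 1
    return f"{doctype}:{name}:{field}"

# 18 identical lookups (e.g. one per item row) hit the DB only once
for _ in range(18):
    get_value("Currency", "INR", "symbol")

print(db_hits)  # 1, not 18

# On insert/update, stale entries must be dropped:
get_value.cache_clear()
```

Arguments must be hashable for lru_cache, which fits simple (doctype, name, field) lookups well.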


#13

Possible Solution #1 - Lazy load child tables

frappe.get_doc results in select queries onto all child tables. e.g.

In [2]: frappe.get_doc ('Item', 'toor')
(u'QUERY =', u'select * from `tabItem` where `name` = %(name)s order by modified desc', {u'name': u'toor'})
(u'QUERY =', u'select * from `tabItem Reorder` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'reorder_levels'})
(u'QUERY =', u'select * from `tabUOM Conversion Detail` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'uoms'})
(u'QUERY =', u'select * from `tabItem Variant Attribute` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'attributes'})
(u'QUERY =', u'select * from `tabSoft Attributes` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'soft_attributes'})
(u'QUERY =', u'select * from `tabItem Supplier` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'supplier_items'})
(u'QUERY =', u'select * from `tabItem Customer Detail` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'customer_items'})
(u'QUERY =', u'select * from `tabDelivery Option` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'delivery_option_mapping'})
(u'QUERY =', u'select * from `tabItem Tax` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'taxes'})
(u'QUERY =', u'select * from `tabItem Quality Inspection Parameter` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'quality_parameters'})
(u'QUERY =', u'select * from `tabWebsite Item Group` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'website_item_groups'})
(u'QUERY =', u'select * from `tabItem Website Specification` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'website_specifications'})
(u'QUERY =', u'select * from `tabAttachment Link` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'attachment_links'})
(u'QUERY =', u'select * from `tabItem Company Settings` where `parenttype` = %(parenttype)s and `parent` = %(parent)s and `parentfield` = %(parentfield)s order by idx asc', {u'parenttype': u'Item', u'parent': u'toor', u'parentfield': u'company_settings'})
Out[2]: <erpnext.stock.doctype.item.item.Item at 0x10b327a50>

Can we lazy load child tables? Can we use property getters and setters to load child table data only when it is actually accessed?
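A minimal sketch of the property-based lazy load, where load_children is a hypothetical stand-in for the per-child-table select that get_doc currently fires eagerly:

```python
# Lazy-load a child table via a property: the select on the child table
# fires only when the field is first accessed, not in the constructor.
child_queries = 0

def load_children(parent, parentfield):
    # Stand-in for: select * from `tab...` where parent = %(parent)s ...
    global child_queries
    child_queries += 1
    return []

class Item:
    def __init__(self, name):
        self.name = name
        self._taxes = None  # sentinel: not loaded yet

    @property
    def taxes(self):
        if self._taxes is None:
            self._taxes = load_children(self.name, "taxes")
        return self._taxes

doc = Item("toor")
print(child_queries)  # 0 -- constructing the doc fires no child query
_ = doc.taxes
_ = doc.taxes
print(child_queries)  # 1 -- loaded once, on first access only
```

The same pattern would be repeated for each child table field, or generated from the doctype meta.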

When 1 Sales Invoice is saved, the following get_doc calls are made -

18:52:23 web.1            | get_doc called -  Sales Invoice
18:52:24 web.1            | get_doc called -  User
18:52:24 web.1            | get_doc called -  Sales Invoice
18:52:24 web.1            | get_doc called -  Customer
18:52:24 web.1            | get_doc called -  Contact
18:52:24 web.1            | get_doc called -  Item
18:52:24 web.1            | get_doc called -  Item
18:52:24 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:25 web.1            | get_doc called -  Item
18:52:26 web.1            | get_doc called -  Item
18:52:26 web.1            | get_doc called -  Item
18:52:26 web.1            | get_doc called -  Item
18:52:26 web.1            | get_doc called -  Item
18:52:26 web.1            | get_doc called -  Item
18:52:26 web.1            | get_doc called -  Item
18:52:26 web.1            | get_doc called -  Item
18:52:27 web.1            | get_doc called -  Activity Log
18:52:27 web.1            | get_doc called -  Activity Log
18:52:27 web.1            | get_doc called -  Sales Invoice
18:52:27 web.1            | get_doc called -  Version
18:52:27 web.1            | get_doc called -  Version
18:52:27 web.1            | get_doc called -  Sales Invoice

#14

@Sachin_Mane https://gist.github.com/MaxMorais/9ae255a7df742cff4d3adb58a0daea70
This project is a LazyLoader for Links.

It needs one improvement for child tables, but it is doable.
But I don't think lazy loading will solve the issue, because the browser needs to load everything anyway, so the issue persists.


#15

Okay, @max_morais_dmm. I was just listing possible options. Need to do a deeper dive. Maybe we can just cache the document completely, along with its child table data.

Or modify the UI framework so that it loads the basic document initially and retrieves child table data as and when a child table is expanded.

We could even take a mixed approach wherein some child tables are loaded along with the main document and some are lazy loaded, based on per-child-table config.

E.g. Sales Order Item is an important table in Sales Order. It can be loaded along with the order, but other tables like Sales Team etc. can be lazy loaded.


#16

@Sachin_Mane, I agree, but if you look at the calls, the Item tables are the issue


#17

In SAP's material master (Item in ERPNext), table MARC has a maintenance status field which keeps track of which views (purchasing, sales, accounting etc.) have been maintained for the material.

Considering that there are a lot of items with empty associated child tables (optional table-type fields not maintained), and to avoid useless SQL calls to these empty child tables, the proposed draft idea is as follows:

  1. add one childtable status field to the Item table, assigning one letter to each child table (if hard-coding is not desired, this letter can be assigned in the child table's doctype definition as meta)
  2. in get_doc, check the above childtable status field to determine whether to trigger the actual SQL call to the corresponding child table
  3. in the doc save method, update the childtable status based on whether the relevant child table field has value or not

For the above solution, with a small increase in storage space and the overhead of auto-populating this field when updating/saving each doc, overall doc read performance via get_doc will be drastically improved. If the redis cache solution can be implemented too, that will be great.
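The draft idea above could be sketched as below. The field name childtable_status, the letter assignments, and select_child_table are all hypothetical names for illustration:

```python
# Skip child-table selects for tables known to be empty, using a
# one-letter-per-table status field stored on the parent row.
STATUS_LETTERS = {"t": "taxes", "u": "uoms", "r": "reorder_levels"}

child_selects = 0

def select_child_table(parentfield):
    # Stand-in for the actual child-table select in get_doc
    global child_selects
    child_selects += 1
    return ["..."]

def load_item(row):
    doc = dict(row)
    for letter, fieldname in STATUS_LETTERS.items():
        if letter in row.get("childtable_status", ""):
            doc[fieldname] = select_child_table(fieldname)
        else:
            doc[fieldname] = []  # known empty: no query needed
    return doc

# Item with only taxes maintained: one child select instead of three
doc = load_item({"name": "toor", "childtable_status": "t"})
print(child_selects)  # 1
```

The save path would recompute childtable_status from which child fields actually hold rows, keeping the flag and the data consistent.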


#18

@Sachin_Mane Really nice sleuthing.
@max_morais_dmm Is there a solution that is both UI and API friendly? Memoization via Redis caching sounds like it might be a better solution than lazy loading when both cases are considered.


#19

We are doing the following POC. Will share results.

  1. Wrap the get_doc() method to serve documents from the cache if available. If not, populate the cache with the output.
  2. Use a Redis queue to broadcast update events so that individual worker processes can invalidate/reload documents
  3. Worker processes would listen on the broadcast queue for updates

We are trying this with the Item doctype first. We can make this declarative in the doctype by introducing a setting, something like below -

Doctype: Item

cache_in_memory= 1 / 0
or
cache_mode=none / in-memory / redis

Local in-memory caching would be more efficient than Redis caching; Redis can be used for messaging purposes only.

The size of 1 Item object is approximately 12kb in our case. Derived as below -

So, even if we have 5000 Items in our database, we will end up consuming around 12 KB * 5000 = 60 MB of cache per worker. If we run 4 workers, that is 240 MB of additional memory.

Updates to the Item master are rare in our case. Any update would be broadcast on the redis queue, and workers can reload that Item in their local cache.


#20

Good idea. This will help reduce the problem by not trying to fetch empty child tables. However, in our case the child tables have data, but it is not required in many places, so there is no point in fetching it.

Anyway, if the caching approach is successful, then fetching child tables may not be a problem. We can just fetch once and cache.