Trouble with Global Search and Chinese Characters

I am having trouble with the Global Search. We have many Items, Suppliers, etc with Chinese Characters and I have trouble finding them via the Global Search.

I see the same behavior on a local v11 instance as well as on a v12 instance on erpnext.com. Can anyone advise whether anything can be done about this on a global level (for example, character encoding at the OS or database level)?

Add the following parameters to the [mysqld] section of /etc/mysql/mariadb.cnf (for example via sudo nano /etc/mysql/mariadb.cnf):

[mysqld]
innodb_ft_min_token_size=2
ft_min_word_len=2

I’ll check that out tx. Just curious … is that related to length of the Search string?

Yes. Try it and let me know the result.

This did not solve the issue. The behaviour is pretty illogical.

Some examples:

  • I can not find: 常琴
  • I can not find: 吴玲华
  • I can find: 上海
  • I can find: JL

I’ll prepare a small demo in the coming days. In order to not reveal our real data I have to prepare some sort of demo data which will take a bit.

My test finding: the search works if the searched text is at the beginning of the field, and fails if it is in the middle or at the end of the field.

It seems that extra handling is needed to split Chinese text into words.
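The observation above can be reproduced with a toy model. This is only a sketch: it assumes the full-text index splits content on whitespace and punctuation, drops tokens shorter than the minimum token size, and that Global Search matches a query as a word prefix. The function names and the sample supplier name are made up for illustration and are not Frappe or MariaDB code.

```python
import re

def tokenize(content, min_token_size=2):
    """Split on whitespace and a few ASCII delimiters, then drop short tokens."""
    tokens = re.split(r"[\s:|,.;]+", content)
    return [t for t in tokens if len(t) >= min_token_size]

def prefix_search(content, query, min_token_size=2):
    """Return True if any indexed token starts with the query (like `query*`)."""
    return any(t.startswith(query) for t in tokenize(content, min_token_size))

# Hypothetical field content, without and with word segmentation.
unsegmented = "Supplier Name : 上海常琴贸易"
segmented = "Supplier Name : 上海 常琴 贸易"

print(prefix_search(unsegmented, "上海"))  # True: query is at the start of the token
print(prefix_search(unsegmented, "常琴"))  # False: query is in the middle of the token
print(prefix_search(segmented, "常琴"))    # True: segmentation made it its own token

# A two-character code is only indexed if the minimum token size allows it.
print(prefix_search("Code : JL", "JL", min_token_size=3))  # False
print(prefix_search("Code : JL", "JL", min_token_size=2))  # True
```

In this model an unsegmented Chinese string is one long token, so only its beginning can be matched. Inserting spaces between words before indexing turns each word into a searchable token.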

What I did to solve the above problem:

Add Chinese word segmentation support to the full-text search.

Install the Python library pkuseg (https://github.com/lancopku/PKUSeg-python), a Chinese word segmentation library:
./env/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
To upgrade the library to the latest version later, run the same command:
./env/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

Adapt the code so that field content written to __global_search is segmented into words first.

In the hooks.py file of any custom app, add the code below:

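import re

import frappe
import pkuseg
from frappe.utils import global_search, strip_html_tags
from six import text_type

# Load the default pkuseg model once, when hooks.py is imported.
seg = pkuseg.pkuseg()

def get_formatted_value(value, field):
	"""
	Prepare field from raw data, with Chinese word segmentation added.
	:param value: raw field value
	:param field: field definition (fieldtype and label are used)
	:return: "<label> : <segmented value>"
	"""

	# Note: HTMLParser.unescape was removed in Python 3.9; use html.unescape there.
	from six.moves.html_parser import HTMLParser

	if getattr(field, 'fieldtype', None) in ["Text", "Text Editor"]:
		h = HTMLParser()
		value = h.unescape(frappe.safe_decode(value))
		# Drop <script> and <style> blocks. re.DOTALL replaces the trailing (?s),
		# which newer Python versions reject when it is not at the start of the pattern.
		value = re.subn(r'<[\s]*(script|style).*?</\1>', '', text_type(value), flags=re.DOTALL)[0]
		value = ' '.join(value.split())
	value = strip_html_tags(text_type(value))
	try:
		# Put spaces between segmented words so each word is indexed as its own token.
		value = ' '.join(seg.cut(value))
	except Exception:
		# If segmentation fails, index the unsegmented value.
		pass
	return field.label + " : " + value

# Replace Frappe's formatter with the segmenting version.
global_search.get_formatted_value = get_formatted_value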
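To see what the patched function does to the indexed content without installing Frappe or pkuseg, here is a minimal sketch. It swaps in a tiny dictionary-based forward maximum matching segmenter as a stand-in for pkuseg; the vocabulary, the function names and the sample value are assumptions for illustration only, and the stand-in handles nothing beyond this small Chinese example.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

def format_for_index(label, value, cut):
    """Mimic the patched hook: segment the value, keep the raw value on error."""
    try:
        value = " ".join(cut(value))
    except Exception:
        pass
    return label + " : " + value

# Hypothetical vocabulary standing in for the pkuseg model.
vocab = {"上海", "常琴", "贸易"}

def cut(text):
    return forward_max_match(text, vocab)

print(format_for_index("Supplier Name", "上海常琴贸易", cut))
# Supplier Name : 上海 常琴 贸易
```

The spaces inserted between the words are what allow the full-text index to treat each word as a separate token.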

the end result is like this

1st of all … Thanks a ton for getting so involved with this 非常感谢! I really appreciate this.

I am clear about these 2 steps above.

but what do you mean with this?

Those are just my notes on which changes are needed for this solution to work.

Please check the Name field to confirm whether it mixes first name and last name.
Is 常琴 a first name, a last name, or first + last name?

I have installed https://github.com/lancopku/PKUSeg-python as suggested, but unfortunately nothing has changed in my scenario. I have noticed that the search behaves a little strangely with Chinese characters to begin with.

In the end, I can find some strings but not others.

here is a short demo of the behavior

Any ideas what this can be?

2 suggestions:

  1. search 零花 instead of 吴零花

  2. create new supplier 吴玲花, then search 吴玲花。

Share your result, then maybe I can explain something to you.

Fisher

Same behavior as in the first 7 seconds of the gif I shared yesterday, for “零花” as well as for “吴玲华”.

Have you changed this setting and reloaded MariaDB? After that, create new docs to see the result. You can also check the indexed content: run bench mariadb, then
select name, doctype, content from __global_search;

Remember to install a new custom app with the above code in its hooks.py file.

Oh, sorry, I didn’t get that. I thought you meant that if there was an existing custom app, I needed to add this to that app’s hooks.py file. I’ll look into it and see what I can come up with.

I need to understand how to create a custom app.

No, if you already have an existing custom app, you can add the code to its hooks.py, no problem. But you have to run bench restart so that the updated code is reloaded.

Good Luck, waiting for good news.