Supporting full text for Japanese and other ideographic languages

it33 · January 18, 2016, 5:58pm

By default, full text search in English requires a minimum of three characters and requires a * to find subterms (e.g. “real” doesn’t find “realtime”, but “real*” does find “realtime”).

This doesn’t work well for Japanese, Korean, Chinese, or other ideographic languages.
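
To make the behavior described above concrete, here is a minimal toy sketch in Python. It is not the product's or the database's actual implementation; `tokenize` and `matches` are hypothetical helpers that only imitate whole-word matching with a three-character minimum and an optional trailing *.

```python
import re


def tokenize(text):
    # Toy stand-in for a database word parser: runs of word characters become tokens.
    # Assumption: real parsers differ in detail, but also rely on word boundaries.
    return [t.lower() for t in re.findall(r"\w+", text)]


def matches(query, text, min_len=3):
    """Whole-word match, or prefix match when the query ends with '*'."""
    is_prefix = query.endswith("*")
    term = query.rstrip("*").lower()
    if len(term) < min_len:
        return False  # terms below the minimum length are ignored
    tokens = tokenize(text)
    if is_prefix:
        return any(t.startswith(term) for t in tokens)
    return term in tokens


english = "realtime chat"
japanese = "リアルタイム検索を試す"  # no spaces, so the parser sees one long token

print(matches("real", english))    # whole-word lookup
print(matches("real*", english))   # prefix lookup
print(tokenize(japanese))          # a single token for the whole sentence
print(matches("検索", japanese))              # two characters, below the minimum
print(matches("検索*", japanese, min_len=1))  # still fails: 検索 is mid-token
```

In this toy model, the Japanese sentence is never split into words, so a term in the middle of the sentence cannot be found even with the minimum length lowered and a * added.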

Because this behavior is defined in the full text search capabilities of the underlying database (MySQL or Postgres), we’re hoping someone familiar with Japanese would be able to test to see if certain configurations of the database could be used to optimize for searching Japanese:
http://textsearch-ja.projects.pgfoundry.org/index-ja.html

For example, are there full text search parameters for Japanese in Postgres that would change the default behavior so that * wouldn’t be needed to match characters and search would support single characters by default, unlike the default English settings?
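
For anyone testing, it may help to see what "no * needed, single characters supported" could look like in principle. One commonly used approach for text written without spaces is character n-gram indexing. The sketch below is only an illustration of that idea in Python; it is an assumption on my part, not a description of what the linked project or any Postgres configuration does, and `ngrams`, `build_index`, and `search` are hypothetical names.

```python
from collections import defaultdict


def ngrams(text, n):
    # All overlapping character sequences of length n.
    if n <= 0 or len(text) < n:
        return []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    # Map every 1..n character gram to the set of document ids containing it.
    # Indexing single characters too is what makes one-character queries work.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for size in range(1, n + 1):
            for gram in ngrams(text, size):
                index[gram].add(doc_id)
    return index


def search(index, docs, query, n=2):
    size = min(n, len(query))
    grams = ngrams(query, size)
    if not grams:
        return []
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Gram intersection can over-match, so confirm with a substring check.
    return sorted(d for d in candidates if query in docs[d])


docs = {
    1: "リアルタイム検索を試す",
    2: "検索結果がない",
    3: "realtime search",
}
index = build_index(docs)

print(search(index, docs, "検索"))  # two-character term, no wildcard
print(search(index, docs, "索"))    # single character
print(search(index, docs, "real"))  # substring of "realtime"
```

The trade-off in a scheme like this is a larger index and more loosely related hits, since any substring matches rather than only whole words.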

Would anyone be open to helping test and share advice for other international users?

pocket · January 20, 2016, 2:51am

EDIT: After talking about this with my friend, I think I was answering something completely different. I guess I’ll rewrite my answer when I have spare time again.

I’ve noticed you can’t search the chat in Japanese, so I’ll give my 2 cents to help with the problem.
FWIW, I am Japanese, with no experience in Postgres and a bit in Oracle and MySQL.

I’m not sure if I understood your problem, but here’s what I think you are saying.

How do we optimize full text search (which is presumably costly) in multi-byte character languages?
We aren’t able to search multi-byte strings.
I am not fully up to date on the current state of full text search optimization, but there are many full text search engines, such as “mroonga”, that do this kind of optimization. I know this particular product works for Japanese, and I’d assume any popular full text search engine would work for almost any language anyway.
An SQL statement like “SELECT * FROM FOO WHERE FOO.BAR LIKE '(some random Japanese string)%'” works perfectly fine, just like for English. Regex works the same way, so, for example, “(some random Japanese string).*” would match (some random Japanese string) followed by any characters (Japanese or English).
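
The LIKE and regex points above can be tried quickly with a small Python sketch. It uses an in-memory SQLite database purely as a convenient stand-in (an assumption; the thread is about MySQL and Postgres), with the FOO/BAR names from the query above and made-up sample rows.

```python
import re
import sqlite3

rows = ["検索エンジン", "全文検索", "search engine"]  # made-up sample data

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL/Postgres here
conn.execute("CREATE TABLE foo (bar TEXT)")
conn.executemany("INSERT INTO foo (bar) VALUES (?)", [(r,) for r in rows])

# Prefix pattern, as in the LIKE '(some Japanese string)%' example.
prefix_rows = [
    r[0]
    for r in conn.execute(
        "SELECT bar FROM foo WHERE bar LIKE ? ORDER BY rowid", ("検索%",)
    )
]

# Substring pattern: the term may appear anywhere in the value.
infix_rows = [
    r[0]
    for r in conn.execute(
        "SELECT bar FROM foo WHERE bar LIKE ? ORDER BY rowid", ("%検索%",)
    )
]

# The regex equivalent of the prefix pattern; re.match anchors at the start.
regex_hits = [s for s in rows if re.match("検索.*", s)]

print(prefix_rows)
print(infix_rows)
print(regex_hits)
```

Note that this is pattern matching rather than indexed full text search: a pattern with a leading % generally cannot use an ordinary B-tree index, so it tends to scan every row.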

There is no difference between ideographic languages (as you put it) and English, because programs just interpret them as bytes, and the characters just happen to be single bytes or multiple bytes, depending on the language.

Hope this helps.
