Strange behavior with non-English characters

Michael_Schroder · August 17, 2023, 4:15pm

Hello everybody on this forum,

First of all, I want to say that TiddlyWiki is really great software: very productive and very flexible. Thanks for that.

…and now straight to my issue:

Many non-English users have probably noticed this issue, but it seems it has never been reported until now, because it is difficult to describe exactly what is going wrong. In addition, the failure is rarely visible.

I’m working on a wiki with Vietnamese texts about acupuncture and do partial translations into German. During editing, I’ve sometimes noticed that the search didn’t work as expected. Occasionally I also had the problem that links to existing tiddlers were displayed as “missing tiddler”, even though those links were obviously not misspelled.

To further illustrate this issue I made this demo wiki at tiddlyhost: https://issue-with-nonenglish-unicode-chars.tiddlyhost.com/

For some time I had no idea what could be causing this and how to correct it.

Recently, while doing some research on this topic, I found the following article, which describes the problem very aptly and even offers a possible solution:

https://tech.glints.com/vietnamese-for-engineers/

Only rather simple changes seem to be needed, namely processing all input texts through the string.normalize() method. I don’t have the JavaScript experience or the time to dig very deep into the TiddlyWiki code, so I would like to ask the experienced code gurus here to take care of this problem.
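To make the idea concrete, here is a minimal sketch of what normalize() does with the title from the demo wiki. The helper name sameTitle is made up for illustration and is not part of TiddlyWiki; the titles are written with escapes so that the two spellings survive copy and paste.

```javascript
// Two spellings of the same visible title "Chí âm".
const precomposed = "Ch\u00ED \u00E2m";   // í stored as one code point (U+00ED)
const decomposed = "Chi\u0301 \u00E2m";   // plain i followed by U+0301 (combining acute accent)

console.log(precomposed === decomposed);                          // false
console.log(precomposed.normalize() === decomposed.normalize());  // true

// Hypothetical helper (assumption, not a TiddlyWiki function):
// compare two titles regardless of how their accents are composed.
function sameTitle(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameTitle(precomposed, decomposed)); // true
```

Both strings render identically, which is why the problem is so hard to spot by eye.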

Is it possible to change the core of an existing wiki with a plugin, or should it be included in the next update? Both would help me a lot, as it would help many other non-English users.

Best regards Michael Schroeder

More articles are here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

https://www.stefanjudis.com/today-i-learned/string-prototype-normalize-for-safer-string-comparison/

Springer · August 17, 2023, 5:02pm

Thanks for setting up an example wiki to walk us through the problem! I see what you mean; type that is visually identical is not behaving identically.

Alas, your “vietnamese for engineers” link is getting me to a 404. And I’m not among the folks who could tackle the technical details here. Just would love to see world languages supported well.

Meanwhile welcome to the community!

Michael_Schroder · August 17, 2023, 7:24pm

Thank you, Springer,
the link was misspelled.

https://tech.glints.com/vietnamese-for-enginners/ should work now…

TW_Tones · August 18, 2023, 1:16am

Just some research material that should help:

Inspect title of the initial tiddler, copy url
## Chí âm

<h2 class="tc-title">Chí âm</h2>
https://issue-with-nonenglish-unicode-chars.tiddlyhost.com/#Ch%C3%AD%20%C3%A2m

Inspect title of the B tiddler, after creating it, copy url
## Chí âm

<h2 class="tc-title">Chí âm</h2>
https://issue-with-nonenglish-unicode-chars.tiddlyhost.com/#Chi%CC%81%20%C3%A2m

Note the variation in the encoding of the tiddler title.

See %C3%AD vs i%CC%81 in hex.
I use only English; I wonder whether someone who uses other alphabets might get a different result.
Unicode now has many symbols and modifiers. I suggest creating tiddlers from the pasted title, and don't expect a typed title to use the same symbols, although they may look the same.
I think the issue is that the character set or keyboard used differs.

The above hints at a workaround, but we need to consider whether it can be simplified to avoid these symptoms. (But I doubt it, because it's about the source.)

https://unicodelookup.com/

Used in pasted name

Char         Name                                Octal  Decimal  Hex   HTML entity
Ã            latin capital letter a with tilde   0303   195      0xC3  &Atilde;
(invisible)  soft hyphen                         0255   173      0xAD  &shy;

Used in typed name

Char         Name                                Octal  Decimal  Hex   HTML entity
Ì            latin capital letter i with grave   0314   204      0xCC  &Igrave;
\x{81}       xxx                                 0201   129      0x81  &#129;

Conclusion: there are two different tiddler titles that look the same visually.
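A small sketch that reproduces the two URLs above may help here. As far as I can tell, the hex pairs in the URL are UTF-8 bytes rather than separate characters: %C3%AD together encodes the single character í (U+00ED), while %CC%81 encodes the combining acute accent (U+0301) that follows a plain i. The helper codePoints is a made-up name for illustration.

```javascript
const pasted = "Ch\u00ED \u00E2m";   // í as one code point, U+00ED
const typed = "Chi\u0301 \u00E2m";   // i followed by U+0301 (combining acute accent)

// Percent-encoding exposes the UTF-8 bytes behind each title.
console.log(encodeURIComponent(pasted)); // Ch%C3%AD%20%C3%A2m
console.log(encodeURIComponent(typed));  // Chi%CC%81%20%C3%A2m

// Hypothetical helper (assumption): list the code points of a string as U+XXXX.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(codePoints(pasted).join(" ")); // U+0043 U+0068 U+00ED U+0020 U+00E2 U+006D
console.log(codePoints(typed).join(" "));  // U+0043 U+0068 U+0069 U+0301 U+0020 U+00E2 U+006D
```

Looking at code points instead of single bytes shows that only the í differs between the two titles.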

xcazin · August 18, 2023, 12:33pm

Hey @Michael_Schroder, those are great findings!
I tested a few normalised strings in the sandbox on MDN's String.prototype.normalize() page, and it looks like normalize() would solve the issues that were discussed on GitHub, especially regarding the ability to use indexOf() in search…
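Roughly what such a test looks like, as a sketch: normalizedIndexOf is a hypothetical helper name, not something from the TiddlyWiki core, and the sample text is invented for the example.

```javascript
// Hypothetical helper (assumption): substring search that ignores
// whether accents are precomposed or written as combining marks.
function normalizedIndexOf(haystack, needle) {
  return haystack.normalize("NFC").indexOf(needle.normalize("NFC"));
}

const text = "Point: Chi\u0301 \u00E2m";  // tiddler text typed with a decomposed í
const query = "Ch\u00ED";                   // search term with a precomposed í

console.log(text.indexOf(query));            // -1, the plain search misses it
console.log(normalizedIndexOf(text, query)); // 7
```

Note that the returned index refers to the normalized text, which can be shorter than the original string.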

Michael_Schroder · August 18, 2023, 2:55pm

Hey @xcazin, yes, that discussion on GitHub revolves around the same problem. I was also stunned to read in the above article about such a simple solution, one that is already built into JavaScript! Could it be that easy?

A first consideration for implementation would be to normalize() each input string that comes from any input element in TW. Then all text data in all tiddlers would consist of normalized strings and would also be compared to normalized strings. Of course, the import function of TW should then normalize() its input as well.

What do you think?
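As a rough sketch of that idea, under the assumption that normalizing once at every entry point is enough: the TitleStore class below is a made-up stand-in and has nothing to do with TiddlyWiki's real store or API.

```javascript
// Hypothetical stand-in for a tiddler store (assumption, not TiddlyWiki's API).
class TitleStore {
  constructor() {
    this.tiddlers = new Map();
  }
  // Normalize once, at the point where text enters the store
  // (typed input, paste, or import).
  add(title, text) {
    this.tiddlers.set(title.normalize("NFC"), text.normalize("NFC"));
  }
  // Normalize link targets the same way before looking them up.
  has(title) {
    return this.tiddlers.has(title.normalize("NFC"));
  }
}

const store = new TitleStore();
store.add("Ch\u00ED \u00E2m", "imported text");        // precomposed title, e.g. from an import

console.log(store.has("Chi\u0301 \u00E2m"));           // true: typed, decomposed link target
console.log(store.tiddlers.has("Chi\u0301 \u00E2m"));  // false: raw lookup without normalizing
```

With this approach a link typed on one keyboard would still resolve to a tiddler created on another, which is the “missing tiddler” symptom from the first post.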

Mark_S · August 18, 2023, 4:53pm

Hmm. I wonder how “normalize” determines which Unicode set is “normal”?
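From what the MDN page says, normalize() takes an optional form argument ("NFC", "NFD", "NFKC" or "NFKD", as defined by the Unicode standard) and falls back to "NFC" when none is given. A minimal sketch of the two canonical forms:

```javascript
const precomposed = "\u00ED";   // í as one code point
const decomposed = "i\u0301";   // i followed by a combining acute accent

// With no argument, normalize() uses NFC (canonical composition).
console.log(decomposed.normalize() === precomposed);       // true
console.log(decomposed.normalize("NFC").length);           // 1

// NFD goes the other way and splits characters into base letter plus marks.
console.log(precomposed.normalize("NFD") === decomposed);  // true
console.log(precomposed.normalize("NFD").length);          // 2
```

So neither spelling is more “normal” than the other; what matters is that both sides of a comparison use the same form.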

Here’s a tool you can try for normalising your own titles while waiting for an official answer. I tested it with your example and it seemed to work.

Be sure to have a backup of your working file/tiddlers. Import the attached JSON. Save. Reload. You can paste or type your title into the input field of the “Normalise” tiddler and then use the copy-to-clipboard button to copy the normalised text into your clipboard. You can then paste the text wherever you need it.

normalize-filter.json (1.0 KB)

Michael_Schroder · August 18, 2023, 7:18pm

Hello Mark,
I will try your tool immediately. Thank you very much!

I think it will be a vital help for me…

regards Michael