Diacritic insensitive searching and sorting of fields for non-English languages

In a previous discussion with @tw-FRed and @TW_Tones, we agreed it would be great for TiddlyWiki to include diacritic-insensitive flags for searching and sorting.

Both @tw-FRed and I currently work around this by generating “accent-free” duplicates of the fields of interest and searching them alongside the original fields. For sorting, we sort by the “accent-free” version.

I am creating this thread at @TW_Tones's suggestion, but I am limited in the technical contributions I can make.

I can share my current solution for two languages, French and German:

/*\
title: mymacros/deaccent.js
type: application/javascript
module-type: macro
\*/

/*
Macro to deaccent a string
*/

"use strict";
exports.name = "deaccent";
exports.params = [{name: "input"}];


exports.run = function(input) {
    // Define the mapping of accented characters to their simplest equivalents
    const accentMap = {
        // French
        'à': 'a',
        'â': 'a',
        'ä': 'a',
        'ç': 'c',
        'é': 'e',
        'è': 'e',
        'ê': 'e',
        'ë': 'e',
        'î': 'i',
        'ï': 'i',
        'ô': 'o',
        'ö': 'o',
        'ù': 'u',
        'û': 'u',
        'œ': 'oe',

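        // German ('ä' and 'ö' are already listed above; repeating a key
        // would only overwrite the earlier entry)
        'ü': 'u',
        'ß': 'ss'
    };

    // Treat a missing parameter as empty, normalize to NFC so decomposed
    // input still matches the precomposed keys, lowercase, then split
    const characters = (input || "").normalize("NFC").toLowerCase().split('');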

    // Replace each character with its simplest equivalent if it's in the accentMap
    const replacedCharacters = characters.map(char => accentMap[char] || char);

    // Join the characters back into a single string
    return replacedCharacters.join('');
};

The tiddler type must be “application/javascript” and there must be a “module-type” field with the value “macro”.

I think a script like this is clean: the letter replacements are easy to read, edit and extend for each supported language.

But really we should not have to create duplicate diacritic free data, and there should be a way to search or sort that is insensitive to diacritics.

In case it helps in finding a solution: there may already be a JavaScript function that does this, whether built into JavaScript itself, the browser implementation, or somewhere in the TiddlyWiki core. We just need to make it an option.

  • The current JavaScript is in $:/core/modules/filters/search.js

It is highly likely this can be implemented with very few changes and, even more importantly, with very few bytes.

For any core developers reading this: the following should not be taken literally, but it provides useful insight into what to consider.

Notes

Workaround

Some research with an LLM made me realise that you can already construct such searches using regular expressions, so that is the immediate workaround.
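A minimal sketch of that workaround, assuming a small hand-written table of variants. The names `variants` and `accentInsensitivePattern` are made up for illustration, and the table only covers the French and German letters from the macro above.

```javascript
// Hypothetical table: each base letter lists the variants it should match.
const variants = { a: "aàâä", c: "cç", e: "eéèêë", i: "iîï", o: "oôö", u: "uùûü" };

// Turn a plain search term into a regular expression source string.
function accentInsensitivePattern(term) {
    return Array.from(term.toLowerCase(), ch => {
        if (variants[ch]) {
            return "[" + variants[ch] + "]";
        }
        // Escape regex metacharacters so the rest of the term stays literal
        return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }).join("");
}

const pattern = accentInsensitivePattern("cafe");
// pattern is "[cç][aàâä]f[eéèêë]"
```

Because the pattern contains square brackets, it would have to reach the search operator's regexp flag through a variable, with something like `[search:title:regexp<pattern>]`.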

Operator and/or core change

ChatGPT is inspirational for this. I asked the following question, already informed by my work in TiddlyWiki.

Please consider this https://tiddlywiki.com/#%24%3A%2Fcore%2Fmodules%2Ffilters%2Fsearch.js I want some ideas on how to allow searches to be sensitive/insensitive to the use of diacritical markers or words using accents.

The result was as follows. In my view it is worth sharing: it is informative and raises a number of issues:

  • Add to the current flag list
  • Modify options.wiki.search(), which is called from $:/core/modules/filters/search.js (the function itself is defined in $:/core/modules/wiki.js)
  • Add a config tiddler default so users can set “ignore diacritics” globally
    • Only fold (strip diacritics) when the flag is enabled.
  • List edge cases such as ß vs ss

Of note, as raised here, we also need to look at:

  • Wikis whose content contains diacritics could set a global $:/config/Search/IgnoreDiacritics = yes|no.
  • Where else must we deal with this?
    • Search and replace: enable the same diacritic-insensitive search, and perhaps also offer to remove diacritics?

Looking at $:/core/modules/filters/search.js, the filter operator itself is basically a thin wrapper: it parses the field list + flags, and then hands everything off to options.wiki.search() along with options like caseSensitive, literal, words, regexp, etc. (osmarks projects hosting)

So the cleanest way to add “accent/diacritic insensitive” searching is:

  1. Add a new flag to the existing flag list (parallel to casesensitive, literal, words, …).
  2. Implement the behaviour in wiki.search(), because that’s where the actual matching happens. The operator already forwards flags into wiki.search() in a structured way. (osmarks projects hosting)
  3. Optionally, add a config tiddler default so users can set “ignore diacritics” globally.

Below are a few concrete design options (from “simple + practical” to “fancy + linguistically correct”).

Option A: Unicode “folding” (recommended baseline)

Idea

When the new flag is enabled (say diacritic or ignorediacritics), you normalize to a decomposed form and strip combining marks on both the search text and the field text before matching.

In modern JS you can do:

  • str.normalize("NFD") (or "NFKD")
  • remove combining marks with /\p{M}+/gu (Unicode Mark category)

Example folding:

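  • café → cafe
  • naïve → naive
  • Żółć → Zołc (ł is a separate letter with no decomposition, so it survives; see the edge cases further down)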
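A minimal sketch of that folding step in plain JavaScript. The function name `foldDiacritics` is made up for illustration; `String.prototype.normalize` and the `\p{M}` property escape (with the `u` flag) are standard in current browsers and Node.js.

```javascript
// Hypothetical helper, not part of the TiddlyWiki core.
function foldDiacritics(text) {
    return text
        .normalize("NFD")          // split letters from their combining marks
        .replace(/\p{M}+/gu, "")   // drop the combining marks
        .normalize("NFC");         // recompose whatever is left
}

foldDiacritics("café");   // "cafe"
foldDiacritics("naïve");  // "naive"
foldDiacritics("Żółć");   // "Zołc": ł has no canonical decomposition, so it is left alone
```

Input that already arrives in decomposed form (an "e" followed by a combining acute accent) folds to the same result, which a lookup table keyed on precomposed letters would miss.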

Where it fits in TW

  • Add diacritics (or similar) to the operator flags, and pass it into options.wiki.search() alongside caseSensitive etc. The operator is already built for this pattern. (osmarks projects hosting)
  • In wiki.search(), create a small preprocessor function that applies:
    • case folding (if not casesensitive)
    • diacritic folding (if flag enabled)
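As a rough illustration of such a preprocessor, here is a sketch that applies both steps to the field text and the search text before a literal comparison. `makePreprocessor`, `literalMatch` and the `ignoreDiacritics` option are hypothetical names; only `caseSensitive` mirrors an option the text says is already passed to wiki.search().

```javascript
// Hypothetical helper: builds one function that prepares text for matching.
function makePreprocessor(options = {}) {
    return text => {
        let result = String(text);
        if (options.ignoreDiacritics) {
            // Diacritic folding, only when the (proposed) flag is set
            result = result.normalize("NFD").replace(/\p{M}+/gu, "").normalize("NFC");
        }
        if (!options.caseSensitive) {
            // Case folding, unless a case-sensitive search was requested
            result = result.toLowerCase();
        }
        return result;
    };
}

// Literal mode: the prepared search text must occur in the prepared field text.
function literalMatch(fieldText, searchText, options) {
    const prepare = makePreprocessor(options);
    return prepare(fieldText).includes(prepare(searchText));
}

literalMatch("Un Café noir", "cafe", { ignoreDiacritics: true }); // true
literalMatch("Un Café noir", "cafe", {});                         // false
```

Applying the same function to both sides is the important part: folding only the field text would make a search for "café" fail against its own folded form.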

Interaction with existing search modes

The search operator supports multiple modes (literal, words, whitespace, regexp, etc.). (tiddlywiki.com)

  • literal / whitespace / words / some: works great: you just fold both sides and do the same matching logic as today.
  • regexp: tricky. If you fold the haystack but not the regex, results become surprising. You have three reasonable choices:
    1. If regexp + ignorediacritics, fold the regex source too (but that changes character classes, escapes, etc.—can be unsafe/weird).
    2. If regexp + ignorediacritics, disable diacritic folding (document it).
    3. If regexp + ignorediacritics, reject/ignore the flag (fail fast or warn).

Practically: I’d implement A for non-regexp modes first, and decide policy for regexp explicitly.
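One way this policy could look, as a sketch only: the flag list is parsed, folding is switched off whenever regexp is present (choice 2 above), and the words mode folds both sides before checking each word. `resolveOptions`, `wordsMatch` and the flag name `ignorediacritics` are all assumptions, not existing TiddlyWiki code.

```javascript
const fold = s => s.normalize("NFD").replace(/\p{M}+/gu, "");

// Hypothetical flag resolution for a comma-separated flag list.
function resolveOptions(flagList) {
    const flags = flagList.split(",").map(f => f.trim()).filter(Boolean);
    const has = name => flags.includes(name);
    const regexp = has("regexp");
    return {
        regexp,
        words: has("words"),
        caseSensitive: has("casesensitive"),
        // Assumed policy: folding is disabled in regexp mode
        ignoreDiacritics: has("ignorediacritics") && !regexp
    };
}

// Words mode: every whitespace-separated word must occur somewhere in the field.
function wordsMatch(fieldText, searchText, options) {
    const prepare = s => {
        const folded = options.ignoreDiacritics ? fold(s) : s;
        return options.caseSensitive ? folded : folded.toLowerCase();
    };
    const haystack = prepare(fieldText);
    return prepare(searchText).split(/\s+/).filter(Boolean)
        .every(word => haystack.includes(word));
}

const opts = resolveOptions("words,ignorediacritics");
wordsMatch("Crème brûlée maison", "brulee creme", opts); // true
```

Whichever regexp policy is chosen, keeping the decision in one place like this makes it easy to document and to change later.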

Performance note

Folding every field of every tiddler on every search can be expensive. Two easy optimisations:

  • Only fold when the flag is enabled.
  • Cache a “folded” version of each searched field per tiddler (invalidate cache on tiddler change).
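The caching idea could be sketched as below. `FoldedFieldCache` is a made-up class, and how it would be wired to tiddler changes in the core is left open; presumably the wiki's change notifications would call `invalidate`.

```javascript
const fold = s => s.normalize("NFD").replace(/\p{M}+/gu, "");

// Hypothetical cache: one folded copy per (tiddler title, field name).
class FoldedFieldCache {
    constructor() {
        this.byTitle = new Map();
    }
    // Return the cached folded value, computing it on first use.
    get(title, fieldName, rawValue) {
        let fields = this.byTitle.get(title);
        if (!fields) {
            fields = new Map();
            this.byTitle.set(title, fields);
        }
        if (!fields.has(fieldName)) {
            fields.set(fieldName, fold(rawValue));
        }
        return fields.get(fieldName);
    }
    // Must be called whenever the tiddler changes, or stale text is served.
    invalidate(title) {
        this.byTitle.delete(title);
    }
}

const cache = new FoldedFieldCache();
cache.get("Menu", "text", "Crème"); // "Creme", computed once
cache.invalidate("Menu");           // after the tiddler is edited
```

The cost is memory: every searched field is held twice, which is the same trade-off as the duplicate “accent free” fields, only hidden from the user.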

Option B: “Locale aware” base-letter matching (Intl.Collator)

Intl.Collator can compare strings with sensitivity: "base" (ignores accents/case differences). This is linguistically nicer in some locales.

But: collators are built for equality and sorting, not substring search. You can still implement substring-ish matching, but it usually devolves into scanning/sliding windows and gets slower/complex.

So: I’d only reach for this if you specifically want language-sensitive behaviour (e.g. Turkish dotted/dotless i edge cases).
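For the sorting half of the request, a collator is a more natural fit. The sketch below uses the standard `Intl.Collator` API; the helper names and the choice of the "fr" locale are only for illustration.

```javascript
// sensitivity: "base" treats a, á and A as the same letter
const collator = new Intl.Collator("fr", { sensitivity: "base" });

// Whole-string equality that ignores accents and case
const sameBaseLetters = (a, b) => collator.compare(a, b) === 0;

// Sort a list of titles without accents pushing entries to the end
const sortTitles = titles => [...titles].sort(collator.compare);

sameBaseLetters("café", "CAFE");                         // true
sortTitles(["zèbre", "éclair", "abricot", "Eclipse"]);
// ["abricot", "éclair", "Eclipse", "zèbre"]
```

A plain `sort()` on the same list compares code units and puts "Eclipse" first and "éclair" last, which is the behaviour the “accent free” duplicate fields were created to avoid.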

Option C: Expand equivalence classes (best for regexp users, most work)

If you want regexp mode to be “accent-insensitive” without folding away information, you can rewrite the regex so that e.g. e matches [eéèêëēėę…] and likewise for other letters.

Pros:

  • Works with regex.

Cons:

  • Big mapping table, tricky escaping rules, more chance of bugs.

A hybrid approach is common:

  • Option A for non-regexp modes (fast + simple)
  • Option C only when someone explicitly requests accent-insensitive regex.
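The mapping table for Option C does not necessarily have to be typed by hand. As a sketch, it can be derived by folding every character in a code point range and grouping the results by base letter. The range (Latin-1 Supplement plus Latin Extended-A) and the function names are assumptions, and rewriting an arbitrary regex source, with its escaping rules, is deliberately left out.

```javascript
const fold = ch => ch.normalize("NFD").replace(/\p{M}+/gu, "");

// Build a Map such as "e" -> "eèéêë…" by scanning a code point range.
function buildEquivalenceClasses(from = 0x00C0, to = 0x017F) {
    const classes = new Map();
    for (let cp = from; cp <= to; cp++) {
        const ch = String.fromCodePoint(cp);
        const base = fold(ch);
        // Keep only characters that fold to a single, different ASCII letter
        if (base !== ch && /^[A-Za-z]$/.test(base)) {
            classes.set(base, (classes.get(base) || base) + ch);
        }
    }
    return classes;
}

const classes = buildEquivalenceClasses();

// Character class source for one letter, e.g. "[eèéêë…]"
const classFor = letter => "[" + (classes.get(letter) || letter) + "]";

new RegExp("caf" + classFor("e")).test("café"); // true
```

Letters such as ß, æ or ł fold to themselves and therefore never enter a class here; they would still need the hand-written special cases.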

How I’d expose it in TW (concrete)

1) New flag

Add a flag like ignorediacritics (name bikeshed) to the list the operator checks, and pass it through to wiki.search() exactly like casesensitive is passed today. (osmarks projects hosting)

Example usage (conceptually):

  • [search:title,text:words,ignorediacritics[cafe]] matches café in title/text
  • [search::literal,ignorediacritics[naive]] matches naïve

This mirrors the existing “flag list” mechanism described in the docs. (tiddlywiki.com)

2) Global default config (optional but nice)

Add something like:

  • $:/config/Search/IgnoreDiacritics = yes|no

Then in wiki.search():

  • If the flag is explicitly present, it wins
  • Else fall back to config default

This lets users make accent-insensitive search the default without changing every filter.
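The precedence rule is small enough to sketch. `resolveIgnoreDiacritics` and the `getConfigText` lookup are hypothetical; in the core the lookup would presumably read the text of the config tiddler.

```javascript
// explicitFlag: true or false when the filter says so, undefined when it is silent.
// getConfigText: hypothetical function returning a tiddler's text (or undefined).
function resolveIgnoreDiacritics(explicitFlag, getConfigText) {
    if (explicitFlag !== undefined) {
        return explicitFlag; // the filter wins
    }
    const configured = getConfigText("$:/config/Search/IgnoreDiacritics") || "no";
    return configured.trim().toLowerCase() === "yes";
}

// Stand-in for the wiki store, for illustration only
const fakeConfig = { "$:/config/Search/IgnoreDiacritics": "yes" };
const lookup = title => fakeConfig[title];

resolveIgnoreDiacritics(undefined, lookup); // true, from the config default
resolveIgnoreDiacritics(false, lookup);     // false, explicit choice wins
```

One design point this exposes: a flag that can only be present or absent cannot switch folding off once the global default is yes, so a negative form of the flag may be needed as well.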

Edge cases to decide up-front

Unicode diacritic stripping handles combining mark accents, but not always “special letters”:

  • ß vs ss
  • æ vs ae
  • œ vs oe
  • ø vs o
  • đ vs d (a stroke letter with no Unicode decomposition, so normalization alone never reduces it to d)

So you may want a second, optional step in folding:

  • a small transliteration table for common special cases

I’d keep that separate (e.g. foldSpecials: true) because it’s more opinionated than just removing diacritics.
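A sketch of that two-step folding, with the special-letter table as an opt-in second step. The function name `foldText` and the table contents are illustrative assumptions; `foldSpecials` is the option name suggested above.

```javascript
// Hypothetical transliteration table for letters that normalization leaves alone.
const specials = {
    "ß": "ss", "æ": "ae", "œ": "oe", "ø": "o", "đ": "d", "ł": "l",
    "Æ": "AE", "Œ": "OE", "Ø": "O", "Đ": "D", "Ł": "L"
};
const specialsPattern = new RegExp("[" + Object.keys(specials).join("") + "]", "g");

function foldText(text, { foldSpecials = false } = {}) {
    // Step 1: remove combining marks (always)
    let result = text.normalize("NFD").replace(/\p{M}+/gu, "").normalize("NFC");
    // Step 2: transliterate special letters (only on request)
    if (foldSpecials) {
        result = result.replace(specialsPattern, ch => specials[ch]);
    }
    return result;
}

foldText("Straße");                         // "Straße"
foldText("Straße", { foldSpecials: true }); // "Strasse"
foldText("Łódź", { foldSpecials: true });   // "Lodz"
```

Note that step 2 can change the length of the string (ß becomes two letters), which matters if match positions are later used for highlighting.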

If you tell me which behaviour you want for regexp mode (ignore flag / disable folding / expand regex), I can sketch a minimal patch structure that fits TW’s existing search.js → wiki.search() flow and doesn’t blow up performance.
