Levenshtein Distance Operator and Approximate String Matching

vilc · January 5, 2024, 11:43am

It does more after all, in some specific cases. Let’s say we have tiddlers foo bar and bar foo, and input query foo b. Thanks to Levenshtein, foo bar will be show first and bar foo second. If it were simple sorting by length, both results being equal length, they would be shown alphabetically or random. This way the first result most closely matches the search query.

So there is some small benefit to using Levenshtein instead of length sort.

Scott_Sauyet · January 5, 2024, 11:49am

Although the output will mostly be sorted by length, at least for short-ish search strings, what’s being sorted is important, and here that’s a much more helpful search result. Kudos!

saqimtiaz · January 5, 2024, 11:52am

I recommend starting with a ticket and getting feedback before investing the time in a PR.

saqimtiaz · January 5, 2024, 11:57am

That is pretty interesting and I will need to play around with it at some point. My notes wiki uses a custom search operator that prioritizes prefix matches (ignoring stop words) in order of length before all other matches, since that fits my usage pattern the best.

vilc · January 5, 2024, 12:34pm

This seems useful as well, thanks! That’s something I would welcome in my setup. I’ll experiment on combining it with my idea.

@pmario perhaps it is indeed best to create a discussion/ GH issue to gather all such ideas first, and then try to construct a more “intuitive” search filter for tw-com or core. I think that sorting by matched prefix first is just as useful as sorting by length/LD.

For clarification, by sorting by matched prefix I understand that e.g. when searching for ground it will show groundbreaking (14 chars) first, because it begins with the search query, and underground (11 chars) second; whereas my LD-sort filter alone would show underground first as it is shorter.

pmario · January 5, 2024, 12:51pm

Yea. I’ll probably need 5 minutes to create a PR and then the we can spend days to discuss it, if that’s what we want.

There have been several discussions here in the group about different possibilities to improve the search results. Nothing happened, so I think a PR is as good as a new ticket – I may create both

Also see:

pmario · January 5, 2024, 2:24pm

Here is the GH issue: [BUG] The core search result list should return relevant info for basic search terms - early · Issue #7917 · Jermolene/TiddlyWiki5 · GitHub for discussions

Here is the PR: improve default search filter-string by pmario · Pull Request #7918 · Jermolene/TiddlyWiki5 · GitHub

I think the new filter string has good results with 3,4 and 5 characters already. And it seems to be strong, if the user knows the search term.

Demo: https://tiddlywiki5-git-fork-pmario-7917-improve-defau-8958bd-jermolene.vercel.app/

have fun!
mario

vilc · January 6, 2024, 5:05pm

I have constructed this expression to prioritize prefix matches. It is not ideal, as it prioritizes results with any prefix match over those with no prefix match, but at least it is something to start with. This is as far as I could come with pure filter expression (without nesting functions or defining a new operator in JS). I have to see how good it is in actual real life usage.

[!is[system]search:title<userInput>]
:sort:integer[levenshtein<userInput>]
:sort:integer:reverse[split[ ]search:title:some,anchored<userInput>first[1]count[]]
:and[limit[250]]

The first two runs are identical as in examples above.
The third does the following:

Split each result on spaces. Possible to add additional steps e.g. split[/] if one uses/considers / as a “whitespace” between words.
For each word, check if some tokens from the query match its prefix.
Limit to first result, so that titles with multiple occurrences of the same word are not ranked higher, e.g. so that “TiddlyWeb JSON tiddler format” will not rank higher than “Tiddler” for query “tid”.
Count the results – will be 0 or 1 – and sort by this number in reverse, that is results with any prefix match before all those without prefix match.

Results for queries sort and widg, to compare with my examples above:

One could argue that for the query widg the simpler way (just Levenshtein, without prefix matching) yields better/ more useful results, but I think this is due to CamelCase titles like LinkWidget. Were the title “Link Widget”, it would show up in the first results.

Any hints on how to get it further (if at all possible without custom JS solution) welcome!

I think this might be the point where further optimization depends on a particular wiki and individual style of searching. CamelCase titles change a lot as in the example above (maybe results could be split on capital letters to match prefixes as in single words?).

CodaCoder · January 6, 2024, 5:27pm

I personally like what’s coming out of this discussion. And I really like the demo (thanks @pmario!) But…

In the light of this discussion and the numerous galvanizing threads that it spawned, I do wonder if we’re locking ourselves into an echo chamber.

So, “pretending” I’m new to TiddlyWiki and further pretending I want to learn about variables, I just visited
the demo and typed “variable” into the search. What appears is a promising list of choices performing much better than before (I’d say).

Then, with my “I’m a noob” hat on, I started opening the listed choices…

In a word, bewildering. This looks like developer documentation, not anything a noob would benefit from. There are so many tangled opportunities to “get lost” (breadcrumbs anyone?) it’s the kind of thing that makes students nervous – “Have I read everything? What if there’s more here somewhere? How do I tell?”

And then I spot a tag, “Variable Usage”, the same name as a tiddler I already have open, listing even more weird things like “Behaviour of invoked variables depends on how the variable was declared” – which, by the way, is waaaay down the list of choices offered by the search. If the key threshold concept (which, as a noob, I don’t even know I need) is buried that far down the list…

Well, you get the point.

Again, I love what’s coming out of this effort. It’s going to be extremely useful – for me. And of course, my ponderings above could just be a reflection more on the nature and characteristics of the documentation itself – it’s hard to say for certain. But I do wonder if it’s surfacing the stuff we want and ignoring what a notional noob user actually needs.

vilc · January 6, 2024, 6:05pm

Well, in this example with the Levenshtein sorting at least the tiddler “Variables” (the only one potentially useful to a newbie out of all results) is the first one, whereas in the default alphabetic search it is near the end.

It is similar for other important keywords, with Levenshtein the general tiddlers that a newbie might be searching are usually on top or near top.

It is far from ideal, but still better than the current alphabetic sorting imo, also for newbies.

vilc · January 6, 2024, 8:37pm

Of course, considering the new user’s point of view is very important and easily overlooked, as you point out.
In my previous post I just wanted to say that I think the discussion here is a step in a good direction, even if a small one.

Perhaps it would be better to consider these two issues separately:

Improving default core search mechanism, while keeping it generally-oriented, so that it remains solid across many possible TW use cases.
Improving tw-com search with focus on easing access to general/introductory tiddlers that might be of interest to newcomers.

In the discussion above and on GH we have tried achieving 2. with 1., which might not be the best approach. Maybe tw-com needs additional care and some custom setup, like showing up search results from a defined “important” set of tiddlers first.

CodaCoder · January 6, 2024, 8:52pm

Thanks. That’s a very considered response.

TW_Tones · January 6, 2024, 10:39pm

I agree @CodaCoder As tiresome as it may seem sometimes we do need to keep returning to see how a diverse audience will see things. I think @vilc is also “on track”.

The default search is a key access point to the information in any wiki. I have multiple custom search tabs in my key wikis to help, but it is overloaded.

I faced this issue some years ago and asked myself how could we add extra search tools, including alternatives like discussed here, without overloading the standard search?

The approach I came up with back than was what I called “search indicators”, basically it sat above the search terms entry. As soon as a search term was entered it would display different icons according to different conditions like the search term is a prefix to one or more tiddlers, or suffix. Clicking an icon in the indicators would open advanced search with the appropriate filter.

I think this and other innovative approaches should be considered. My own skills have improved and we have new features to make use of.

Perhaps a drop down to select alternate sort’s such as Levenstein
I may build the mechanism for a search indicators solution for which designers can tag their own indicators.
- A more general indicators method could be used elsewhere in tiddlywiki.

The biggest search improvement I would like to see is the ability so scope the search first, with a filter, such as “scope” eg search only tiddlers in the TOC, and invoke this in context where desired, eg in a project tiddler be able to search all project tasks etc…

I think its time to move the standard search ahead

[Edited]
If a new user said search “filtered transclusions” a help indicator could appear linking them to a curated tiddler for that subject rather than leave it to the user to look at all the search hits?

CodaCoder · January 6, 2024, 11:51pm

Thanks, Tones. Another great response.

Listen, if you/we think this is worth continuing, we should tear this off into another thread. Let me know. Otherwise,

sorry for the interruption, normal service is now resumed.

Scott_Sauyet · January 7, 2024, 12:33am

For those who didn’t see it in the GitHub Issue, I presented my own version as an alternative to Levenshtein distance. It’s plain JS for now, and not created as a tiddler. Perhaps I can try that tomorrow. But for now, you can see, running against all the non-system/shadow tiddlers from the main site, the levenshtein version and my metric. You can comment/uncomment some search strings in there, or add/edit your own.

While, yes, Levenshtein puts “Variables” first, it then goes off the rails, as far as I’m concerned:

// Searching for "Variable"
Variables
Apple
Articles
ReadMe
Cascades
FirstOne
Friday
GitLab
Hamlet
Markdown
MathML
Milk
Modals
Pragma

My alternative seems much more reasonable:

// Searching for "Variable"
Variables
Core Variables
Variable Usage
modifier Variable
SetVariableWidget
namespace Variable
variables Operator
dumpvariables Macro
getvariable Operator
thisTiddler Variable
ImportVariablesWidget
storyTiddler Variable
transclusion Variable
tv-wikilinks Variable
Variables in WikiText

It’s also flexible about typos. A really simple extra letter doesn’t hurt much:

// Searching for "Widgdet" (note extra "d")
Widgets
LetWidget
LogWidget
SetWidget
EditWidget
FillWidget
LinkWidget
ListWidget
SlotWidget
TextWidget
VarsWidget
ViewWidget
CodeBlockWidget
CountWidget
ErrorWidget
ImageWidget

And a worse misspelling is not terrible:

// Searching for "Veryable"
variables Operator
Variables
getvariable Operator
DraggableWidget
storyTiddler Variable
modifier Variable
Story River
abs Operator
Icon Gallery
WebServer Parameter: csrf-disable
move Operator
table-example
levenshtein Operator
thisTiddler Variable
Core Variables

I’m trying to figure out a way to emphasize prefix matches a bit more with a similar technique. But I haven’t quite wrapped my head around it.

I haven’t really tried, but I expect that this is much more suitable for title matches than full-text matches.

vilc · January 7, 2024, 8:00am

Your examples look very promising.

Matching by Levenshtein yes, it goes off the rails. But the current default search sorted by Levenshtein is quite good for being so simple and available right way, and it isn’t so “deranged”.

CodaCoder · January 8, 2024, 4:01am

@Scott_Sauyet, keeping your previous in mind and using the same search, “Variable”, try renaming “Core Variables” to “Variables in the core”. When I did that on the demo, it was moved a long way down the list. (I think @vilc had something to say about that upthread.) It would be interesting to hear what your mod returns…

Scott_Sauyet · January 8, 2024, 4:44am

This is what I get:

Variables
Variable Usage
modifier Variable
SetVariableWidget
namespace Variable
variables Operator
dumpvariables Macro
getvariable Operator
thisTiddler Variable
Variables in the core     <----- here
ImportVariablesWidget
storyTiddler Variable
transclusion Variable
tv-wikilinks Variable
Variables in WikiText
actionTiddler Variable

I don’t know if that seems good or bad to you. There is a bias not towards short matches exactly but towards ones about the length of the search string – in practice, the shorter ones.

This is online if you want to try your own tests. The search string is at the top and the list of tiddler names is around line 36. You won’t change anyone else’s view, and you can share your results by copying the (very long) URL. (The built-in shortener has been broken since Google shut down their service; we’ll fix it one of these days. You can always paste into an existing shortener.)

CodaCoder · January 8, 2024, 5:04am

On the face of it, I’d say that’s good. The demo seems to favor shorter results which is not necessarily a good thing. In fact, (though it may not be so relevant here) I tend to favor results returned within some surrounding context.

Scott_Sauyet · January 8, 2024, 2:24pm

If you searched for “Variable Core”, the first result would be “CoreVariables”.

The method creates a measure of similarity between two strings, with results between 0 and 1. It does this by breaking, say, “CoreVariables”, into bigrams, (["co", "or", "re", "ev", "va", "ar", "ri", "ia", "ab", "bl", "le", "es"] ), and then calculating the ratio of the count of the intersection between the two words, and the sum of the count of bigrams for the two stings. This is Dice’s Coefficient. My add to this is to take into account bigram multiplicities, so that, for instance, “cancan” is closer to “cancan” than to “canc”. This is probably only useful for relatively short strings, such as titles and search strings. For full text searches, this likely won’t help.

To demonstrate that shortest doesn’t always win, if we take a longer search string, such as "Widget tm message", we get this result:

WidgetMessage: tm-modal
WidgetMessage: tm-home
WidgetMessage: tm-login
WidgetMessage: tm-print
WidgetMessage: tm-logout
WidgetMessage: tm-notify
WidgetMessage: tm-scroll
WidgetMessage: tm-add-tag
WidgetMessage: tm-navigate
WidgetMessage: tm-add-field
WidgetMessage: tm-permalink
WidgetMessage: tm-permaview
WidgetMessage: tm-save-wiki
WidgetMessage: tm-edit-tiddler
WidgetMessage: tm-set-password

I don’t want to push this too hard. I think of it as a potentially useful alternative to Levenshtein more than as a fantastic search result itself. And I don’t think either one offers much for full text searches.