Levenshtein Distance Operator and Approximate String Matching

TW_Tones · January 6, 2024, 10:39pm

I agree @CodaCoder As tiresome as it may seem sometimes we do need to keep returning to see how a diverse audience will see things. I think @vilc is also “on track”.

The default search is a key access point to the information in any wiki. I have multiple custom search tabs in my key wikis to help, but it is overloaded.

I faced this issue some years ago and asked myself how could we add extra search tools, including alternatives like discussed here, without overloading the standard search?

The approach I came up with back than was what I called “search indicators”, basically it sat above the search terms entry. As soon as a search term was entered it would display different icons according to different conditions like the search term is a prefix to one or more tiddlers, or suffix. Clicking an icon in the indicators would open advanced search with the appropriate filter.

I think this and other innovative approaches should be considered. My own skills have improved and we have new features to make use of.

Perhaps a drop down to select alternate sort’s such as Levenstein
I may build the mechanism for a search indicators solution for which designers can tag their own indicators.
- A more general indicators method could be used elsewhere in tiddlywiki.

The biggest search improvement I would like to see is the ability so scope the search first, with a filter, such as “scope” eg search only tiddlers in the TOC, and invoke this in context where desired, eg in a project tiddler be able to search all project tasks etc…

I think its time to move the standard search ahead

[Edited]
If a new user said search “filtered transclusions” a help indicator could appear linking them to a curated tiddler for that subject rather than leave it to the user to look at all the search hits?

CodaCoder · January 6, 2024, 11:51pm

Thanks, Tones. Another great response.

Listen, if you/we think this is worth continuing, we should tear this off into another thread. Let me know. Otherwise,

sorry for the interruption, normal service is now resumed.

Scott_Sauyet · January 7, 2024, 12:33am

For those who didn’t see it in the GitHub Issue, I presented my own version as an alternative to Levenshtein distance. It’s plain JS for now, and not created as a tiddler. Perhaps I can try that tomorrow. But for now, you can see, running against all the non-system/shadow tiddlers from the main site, the levenshtein version and my metric. You can comment/uncomment some search strings in there, or add/edit your own.

While, yes, Levenshtein puts “Variables” first, it then goes off the rails, as far as I’m concerned:

// Searching for "Variable"
Variables
Apple
Articles
ReadMe
Cascades
FirstOne
Friday
GitLab
Hamlet
Markdown
MathML
Milk
Modals
Pragma

My alternative seems much more reasonable:

// Searching for "Variable"
Variables
Core Variables
Variable Usage
modifier Variable
SetVariableWidget
namespace Variable
variables Operator
dumpvariables Macro
getvariable Operator
thisTiddler Variable
ImportVariablesWidget
storyTiddler Variable
transclusion Variable
tv-wikilinks Variable
Variables in WikiText

It’s also flexible about typos. A really simple extra letter doesn’t hurt much:

// Searching for "Widgdet" (note extra "d")
Widgets
LetWidget
LogWidget
SetWidget
EditWidget
FillWidget
LinkWidget
ListWidget
SlotWidget
TextWidget
VarsWidget
ViewWidget
CodeBlockWidget
CountWidget
ErrorWidget
ImageWidget

And a worse misspelling is not terrible:

// Searching for "Veryable"
variables Operator
Variables
getvariable Operator
DraggableWidget
storyTiddler Variable
modifier Variable
Story River
abs Operator
Icon Gallery
WebServer Parameter: csrf-disable
move Operator
table-example
levenshtein Operator
thisTiddler Variable
Core Variables

I’m trying to figure out a way to emphasize prefix matches a bit more with a similar technique. But I haven’t quite wrapped my head around it.

I haven’t really tried, but I expect that this is much more suitable for title matches than full-text matches.

vilc · January 7, 2024, 8:00am

Your examples look very promising.

Matching by Levenshtein yes, it goes off the rails. But the current default search sorted by Levenshtein is quite good for being so simple and available right way, and it isn’t so “deranged”.

CodaCoder · January 8, 2024, 4:01am

@Scott_Sauyet, keeping your previous in mind and using the same search, “Variable”, try renaming “Core Variables” to “Variables in the core”. When I did that on the demo, it was moved a long way down the list. (I think @vilc had something to say about that upthread.) It would be interesting to hear what your mod returns…

Scott_Sauyet · January 8, 2024, 4:44am

This is what I get:

Variables
Variable Usage
modifier Variable
SetVariableWidget
namespace Variable
variables Operator
dumpvariables Macro
getvariable Operator
thisTiddler Variable
Variables in the core     <----- here
ImportVariablesWidget
storyTiddler Variable
transclusion Variable
tv-wikilinks Variable
Variables in WikiText
actionTiddler Variable

I don’t know if that seems good or bad to you. There is a bias not towards short matches exactly but towards ones about the length of the search string – in practice, the shorter ones.

This is online if you want to try your own tests. The search string is at the top and the list of tiddler names is around line 36. You won’t change anyone else’s view, and you can share your results by copying the (very long) URL. (The built-in shortener has been broken since Google shut down their service; we’ll fix it one of these days. You can always paste into an existing shortener.)

CodaCoder · January 8, 2024, 5:04am

On the face of it, I’d say that’s good. The demo seems to favor shorter results which is not necessarily a good thing. In fact, (though it may not be so relevant here) I tend to favor results returned within some surrounding context.

Scott_Sauyet · January 8, 2024, 2:24pm

If you searched for “Variable Core”, the first result would be “CoreVariables”.

The method creates a measure of similarity between two strings, with results between 0 and 1. It does this by breaking, say, “CoreVariables”, into bigrams, (["co", "or", "re", "ev", "va", "ar", "ri", "ia", "ab", "bl", "le", "es"] ), and then calculating the ratio of the count of the intersection between the two words, and the sum of the count of bigrams for the two stings. This is Dice’s Coefficient. My add to this is to take into account bigram multiplicities, so that, for instance, “cancan” is closer to “cancan” than to “canc”. This is probably only useful for relatively short strings, such as titles and search strings. For full text searches, this likely won’t help.

To demonstrate that shortest doesn’t always win, if we take a longer search string, such as "Widget tm message", we get this result:

WidgetMessage: tm-modal
WidgetMessage: tm-home
WidgetMessage: tm-login
WidgetMessage: tm-print
WidgetMessage: tm-logout
WidgetMessage: tm-notify
WidgetMessage: tm-scroll
WidgetMessage: tm-add-tag
WidgetMessage: tm-navigate
WidgetMessage: tm-add-field
WidgetMessage: tm-permalink
WidgetMessage: tm-permaview
WidgetMessage: tm-save-wiki
WidgetMessage: tm-edit-tiddler
WidgetMessage: tm-set-password

I don’t want to push this too hard. I think of it as a potentially useful alternative to Levenshtein more than as a fantastic search result itself. And I don’t think either one offers much for full text searches.

CodaCoder · January 8, 2024, 2:46pm

Thanks for the digest. Appreciated.

I don’t see anything “wrong” with the output here. Like I intimated above, this (or your minor adaptation) is going to be a bonus going forward.

TY again.

TiddlyTitch · January 9, 2024, 8:48pm

I once met Levenshtein in Palestine.
She had a very nice moustache.