Levenshtein Distance Operator and Approximate String Matching

I played with levenshtein Operator in below wikitext. I tried to find tiddlers with title similar to the keywords I enter in the searchbox.

<$edit-text tiddler="$:/temp/lev" field=text tag=input type=search />

<$set name=term tiddler="$:/temp/lev" field=text>
<$list filter="[<term>trim[]limit[3]] :map:flat[all[tiddlers]!is[system]]
               :filter[levenshtein<term>compare:number:lteq[5]]
               :sort:number[levenshtein<term>] :and[first[250]]">
<$link /> (<$text text={{{ [<currentTiddler>levenshtein<term>] }}} />)
<br>
</$list>
</$set>
  • Now enter something in searchbox like Tiddly and see the results

Remarks

  • filter first checks the term and ignore empty or blank search term
  • it next selects all non-system tiddlers
  • it then evaluates selection by distance and remove those have a levenshtein distance more than 5
  • finally sorts the result by levenshtein distance and limit output to first 250 matches.
2 Likes

I tried to see if one can implement a fuzzy-like search by approximate string matching!

See this modified code: It force to have the case insensitive keyword in tiddler title

<$edit-text tiddler="$:/temp/lev" field=text tag=input type=search />

<$set name=term tiddler="$:/temp/lev" field=text>
<$list filter="[<term>trim[]limit[3]] :map:flat[all[tiddlers]!is[system]]
               :filter[search:title:caseinsitive<term>levenshtein<term>compare:number:lteq[5]]
               :sort:number[levenshtein<term>] :and[first[250]]
">
<$link /> (<$text text={{{ [<currentTiddler>levenshtein<term>] }}} />)
<br>
</$list>
</$set>
1 Like

12 posts were split to a new topic: Fuzzy Search – use diff-match-patch or other algorithms

If anyone can share some “test data examples” for search, match and patch or source with changes that I can publish at the existing site https://test-data.tiddlyhost.com/ please let me know (eg private message).

This will allow us to test against common examples.

I have found a different interesting use for the levenshtein operator in searching.

While it is not so good for finding matches, due to all the reasons discussed above, I found it is quite useful for sorting matches.

Import this $__core_ui_DefaultSearchResultList.json (1.3 KB) to tiddlywiki.com or edit $:/core/ui/DefaultSearchResultList and change the field first-search-filter to [!is[system]search:title<userInput>limit[250]] :sort:integer[levenshtein<userInput>].
This sorts the title matches by the Levenshtein distance to the query. The effect is that the shorter titles will be shown first.

For example, searching for widg in default search:

Searching for widg with sorting of title matches by Levenshtein distance:

Searching for sort in default search:
image

Searching for sort with Levenshtein sorting:
image

Whether this sorting is useful will largely depend on the type of the wiki, what title conventions are in use, what searches are performed most frequently.
It performs well on my domain-specific wiki with lots of acronym titles and where I search for the “main” tiddler about “Foo” more often than “Specific thing about foo and bar”.
It also performs well for my searches done on tiddlywiki.com, as I look for the general tiddlers on concepts like procedures or filter expression much more often than the long specific titles.


Edit: I just realized this whole thing might not be doing much more than sorting by length :man_shrugging:. Well, at least I have found something useful for myself.

But maybe sorting fuzzy search results by Levenshtein would make more sense? So that the exact matches will be shown before fuzzy matches in case of similar words? If the fuzzy-match mechanisms don’t cover it already.

1 Like

Nice idea @vilc I followed your lead but rather than edit the core tiddler renamed your tiddler and it works nicely as a custom search tab, in the search dropdown, or advanced search standard tab.

Its also easy to compare the differences. levenshtein_DefaultSearchResultList.json (1.2 KB)

1 Like

I think it would be very useful for tw-com, but I would write it a bit differently. I would set the limit at the very end of the filter eg:

[!is[system]search:title<userInput>] :sort:number[levenshtein<userInput>] :and[first[250]]

I’ll create a PR for the core. I think this filter already is a 100 times more useful than the current search, since it “prefers” shorter title matches.

According to the the feedback here in the group we already know that users do not find stuff if they do not know, what to exactly search for.

So IMO a fuzzy or “weighted” search as mentioned in the discussion still would make sense.

5 Likes

Yes! Shorter first in the core would be great. Thanks all.

It does more after all, in some specific cases. Let’s say we have tiddlers foo bar and bar foo, and input query foo b. Thanks to Levenshtein, foo bar will be show first and bar foo second. If it were simple sorting by length, both results being equal length, they would be shown alphabetically or random. This way the first result most closely matches the search query.

So there is some small benefit to using Levenshtein instead of length sort.

Although the output will mostly be sorted by length, at least for short-ish search strings, what’s being sorted is important, and here that’s a much more helpful search result. Kudos!

1 Like

I recommend starting with a ticket and getting feedback before investing the time in a PR.

That is pretty interesting and I will need to play around with it at some point. My notes wiki uses a custom search operator that prioritizes prefix matches (ignoring stop words) in order of length before all other matches, since that fits my usage pattern the best.

1 Like

This seems useful as well, thanks! That’s something I would welcome in my setup. I’ll experiment on combining it with my idea.

@pmario perhaps it is indeed best to create a discussion/ GH issue to gather all such ideas first, and then try to construct a more “intuitive” search filter for tw-com or core. I think that sorting by matched prefix first is just as useful as sorting by length/LD.

For clarification, by sorting by matched prefix I understand that e.g. when searching for ground it will show groundbreaking (14 chars) first, because it begins with the search query, and underground (11 chars) second; whereas my LD-sort filter alone would show underground first as it is shorter.

Yea. I’ll probably need 5 minutes to create a PR and then the we can spend days to discuss it, if that’s what we want.

There have been several discussions here in the group about different possibilities to improve the search results. Nothing happened, so I think a PR is as good as a new ticket – I may create both


Also see:

2 Likes

Here is the GH issue: [BUG] The core search result list should return relevant info for basic search terms - early · Issue #7917 · Jermolene/TiddlyWiki5 · GitHub for discussions

Here is the PR: improve default search filter-string by pmario · Pull Request #7918 · Jermolene/TiddlyWiki5 · GitHub

I think the new filter string has good results with 3,4 and 5 characters already. And it seems to be strong, if the user knows the search term.

Demo: https://tiddlywiki5-git-fork-pmario-7917-improve-defau-8958bd-jermolene.vercel.app/

have fun!
mario

3 Likes

I have constructed this expression to prioritize prefix matches. It is not ideal, as it prioritizes results with any prefix match over those with no prefix match, but at least it is something to start with. This is as far as I could come with pure filter expression (without nesting functions or defining a new operator in JS). I have to see how good it is in actual real life usage.

[!is[system]search:title<userInput>]
:sort:integer[levenshtein<userInput>]
:sort:integer:reverse[split[ ]search:title:some,anchored<userInput>first[1]count[]]
:and[limit[250]]

The first two runs are identical as in examples above.
The third does the following:

  • Split each result on spaces. Possible to add additional steps e.g. split[/] if one uses/considers / as a “whitespace” between words.
  • For each word, check if some tokens from the query match its prefix.
  • Limit to first result, so that titles with multiple occurrences of the same word are not ranked higher, e.g. so that “TiddlyWeb JSON tiddler format” will not rank higher than “Tiddler” for query “tid”.
  • Count the results – will be 0 or 1 – and sort by this number in reverse, that is results with any prefix match before all those without prefix match.

Results for queries sort and widg, to compare with my examples above:

image

One could argue that for the query widg the simpler way (just Levenshtein, without prefix matching) yields better/ more useful results, but I think this is due to CamelCase titles like LinkWidget. Were the title “Link Widget”, it would show up in the first results.

Any hints on how to get it further (if at all possible without custom JS solution) welcome!

I think this might be the point where further optimization depends on a particular wiki and individual style of searching. CamelCase titles change a lot as in the example above (maybe results could be split on capital letters to match prefixes as in single words?).

I personally like what’s coming out of this discussion. And I really like the demo (thanks @pmario!) But…

In the light of this discussion and the numerous galvanizing threads that it spawned, I do wonder if we’re locking ourselves into an echo chamber.

So, “pretending” I’m new to TiddlyWiki and further pretending I want to learn about variables, I just visited
the demo and typed “variable” into the search. What appears is a promising list of choices performing much better than before (I’d say).

Then, with my “I’m a noob” hat on, I started opening the listed choices…

In a word, bewildering. This looks like developer documentation, not anything a noob would benefit from. There are so many tangled opportunities to “get lost” (breadcrumbs anyone?) it’s the kind of thing that makes students nervous – “Have I read everything? What if there’s more here somewhere? How do I tell?”

And then I spot a tag, “Variable Usage”, the same name as a tiddler I already have open, listing even more weird things like “Behaviour of invoked variables depends on how the variable was declared” – which, by the way, is waaaay down the list of choices offered by the search. If the key threshold concept (which, as a noob, I don’t even know I need) is buried that far down the list…

Well, you get the point.

Again, I love what’s coming out of this effort. It’s going to be extremely useful – for me. And of course, my ponderings above could just be a reflection more on the nature and characteristics of the documentation itself – it’s hard to say for certain. But I do wonder if it’s surfacing the stuff we want and ignoring what a notional noob user actually needs.

1 Like

Well, in this example with the Levenshtein sorting at least the tiddler “Variables” (the only one potentially useful to a newbie out of all results) is the first one, whereas in the default alphabetic search it is near the end.

It is similar for other important keywords, with Levenshtein the general tiddlers that a newbie might be searching are usually on top or near top.

It is far from ideal, but still better than the current alphabetic sorting imo, also for newbies.

Of course, considering the new user’s point of view is very important and easily overlooked, as you point out.
In my previous post I just wanted to say that I think the discussion here is a step in a good direction, even if a small one.

Perhaps it would be better to consider these two issues separately:

  1. Improving default core search mechanism, while keeping it generally-oriented, so that it remains solid across many possible TW use cases.
  2. Improving tw-com search with focus on easing access to general/introductory tiddlers that might be of interest to newcomers.

In the discussion above and on GH we have tried achieving 2. with 1., which might not be the best approach. Maybe tw-com needs additional care and some custom setup, like showing up search results from a defined “important” set of tiddlers first.

1 Like

Thanks. That’s a very considered response. :+1: