Fuzzy Search -- use diff-match-patch or other algorithms

Hi @Mohammad, a little while back I also proposed adding a fuzzy mode to search:

Not sure if this will ever go anywhere, but at least it’s out there.

Hi @Yaisog

This is amazing! I also checked the link you provided (Diff, Match and Patch: Demo of Match (fraser.name)). This can empower the TiddlyWiki search operator.

I’d be interested in seeing matching based on algorithms other than Levenshtein distance, perhaps something based on Dice’s Coefficient. The NPM package string-similarity does this, and I published what I think is an improvement on StackOverflow. This would give us faster searching, and what sometimes seems like a better fuzzy-matching technique than Levenshtein.
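(For reference, the package’s API is tiny; from memory it looks roughly like this, so check its README for the exact signatures.)

// npm install string-similarity  (Dice's Coefficient over bigram sets)
const stringSimilarity = require("string-similarity");

// Score between 0 (no bigrams in common) and 1 (identical bigram sets).
console.log(stringSimilarity.compareTwoStrings("tiddlywiki", "tiddly wiki"));

// Rank a list of candidate titles against a search term.
console.log(stringSimilarity.findBestMatch("tidlywiki",
  ["tiddlywiki", "tiddlers", "wikitext"]).bestMatch);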

Looking at Dice’s coefficient:

DSC = 2 |X ∩ Y| / (|X| + |Y|)

where |X| and |Y| are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

I have written filters that retrieve all the fieldnames in two or more tiddlers and then compare this set with the current tiddler’s fields; the same calculation could be done this way to compare tiddlers.
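For example, if one tiddler had the fields title, text, tags, created and modified, and another had title, text and caption, the two sets would share two fieldnames, giving 2 × 2 / (5 + 3) = 0.5.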

Just to be clear, the sets to be considered here are the two-letter combinations in the original strings.

So "tiddlywiki" would become ['ti', 'id', 'dd', 'dl', 'ly', 'yw', 'wi', 'ik', 'ki']. In the standard Dice’s Coefficient, we would simply count the existence of each of these items in one string or the other and the number that are in both, and apply that simple formula. My version is slightly more complex as it counts the multiplicities of these pairs (called bigrams.) Thus mine would convert "'Tiddlywiki users tidied their tiddlers" into

{
  "ti":3,"id":3,"dd":2,"dl":2,"ly":1,"yw":1,"wi":1,"ik":1,"ki":1,"i ":1," u":1,
  "us":1,"se":1,"er":2,"rs":2,"s ":1," t":3,"di":1,"ie":1,"ed":1,"d ":1,"th":1,
  "he":1,"ei":1,"ir":1,"r ":1,"le":1
}

and convert "Try to tidy the tiddlers in your Tiddlywiki" into

{
  "tr":1,"ry":1,"y ":2," t":5,"to":1,"o ":1,"ti":3,"id":3,"dy":1,"th":1,"he":1,
  "e ":1,"dd":2,"dl":2,"le":1,"er":1,"rs":1,"s ":1," i":1,"in":1,"n ":1," y":1,
  "yo":1,"ou":1,"ur":1,"r ":1,"ly":1,"yw":1,"wi":1,"ik":1,"ki":1
}

and apply a formula that is more complex than the original, but still fairly simple, to give a distance of 0.6329113924050633. The raw numbers of these distances are not all that important: they hit 0 and 1 in the appropriate cases, and they are ordered in a way that tends to match our intuition. That should be enough for fuzzy matching.

(I notice looking back on this that there’s a bug in that implementation. Single-character or empty strings will have a division-by-zero error… Be right back… Ok, fixed!)

There are possible improvements to this. We might want to remove bigrams that contain a space, for instance. We might want to consider punctuation as well as regular characters. Those would be easy.


While we could do this with TW’s tools, this is a place I would expect to descend into raw JS. I would be curious as to the speed differences. It probably wouldn’t matter for testing a few fields, but if we were testing the text of tiddlers, JS’s raw speed would probably be important.
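Something like this, say. This is not my published code, but this multiplicity-counting sketch reproduces the 0.6329… figure above and includes a guard for the empty / single-character case mentioned earlier:

// Sketch of a Dice-style similarity that counts bigram multiplicities.
const bigramCounts = (str) => {
  const counts = new Map();
  const s = str.toLowerCase();
  for (let i = 0; i < s.length - 1; i++) {
    const pair = s.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
};

const diceSimilarity = (a, b) => {
  const countsA = bigramCounts(a);
  const countsB = bigramCounts(b);
  let shared = 0;
  for (const [pair, count] of countsA) {
    shared += Math.min(count, countsB.get(pair) || 0);
  }
  let total = 0;
  for (const count of countsA.values()) { total += count; }
  for (const count of countsB.values()) { total += count; }
  // Empty and single-character strings have no bigrams: avoid dividing by zero.
  if (total === 0) { return a.toLowerCase() === b.toLowerCase() ? 1 : 0; }
  return (2 * shared) / total;
};

console.log(diceSimilarity(
  "Tiddlywiki users tidied their tiddlers",
  "Try to tidy the tiddlers in your Tiddlywiki"
)); // 0.6329113924050633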

The issue with Levenshtein distance is that it is designed to measure the character-by-character similarity between a pair of strings, so it is much stronger at measuring the similarity of word pairs than of phrase pairs. The algorithm is not meant to measure the similarity of phrases or sentences. For example, Levenshtein would give “color” vs “colours” a high similarity score, but give a low similarity score to “red blue” vs “blue red”.

What’s needed is to couple a phrase-pair similarity algorithm with Levenshtein’s word-pair similarity algorithm. One such phrase-pair similarity algorithm I have used in the past is the Smith-Waterman-Gotoh algorithm; at least, that’s what it was called back in 2004 when I used it. However, I believe there have been improvements made to the algorithm since then, and it works much more efficiently these days.

I believe combining Smith-Waterman and Levenshtein should be what’s needed for fuzzy searching.

When creating the PR referenced above, I originally thought that we could simply use the match_main() function of the already included diff-match-patch library, with some arbitrary threshold. This would probably add the fewest possible lines to the core code.
However, that function requires an estimated location of the match and applies scoring penalties when the match is further away. So, it’s probably not straightforward to use as a fuzzy search function, even if it were a suboptimal one.
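To illustrate, this is roughly how that function wants to be called (from memory of the diff-match-patch API docs, so treat the details as approximate):

// match_main() needs the text, the pattern, and an expected location. Each candidate
// is scored by how many edits it needs and how far it lies from that location, and
// Match_Threshold is the cut-off on that combined score.
const dmp = new diff_match_patch();
dmp.Match_Threshold = 0.4;   // 0.0 admits only perfect matches, 1.0 admits almost anything
dmp.Match_Distance = 1000;   // how far from 'loc' a match may drift before the location penalty dominates

// Returns the index of the best match, or -1 if nothing clears the threshold.
const loc = dmp.match_main("The quick brown fox jumps over the lazy dog", "qiuck", 4);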
Since we already have the levenshtein filter operator, we can make a fuzzy search based on that by combining it with e.g. sortsub or :sort. So there’s really no need to add suffixes to search.

I’d not used this algorithm, and only just read the Wikipedia page. I can’t tell from that how appropriate it is for text, which is probably less noisy than genetic sequences.

I’d be interested to hear about the improvements; I can’t see a way to reduce this from O(m * n), whereas Dice’s Coefficient, and my variant of it, should be O(m + n). But I tried something with my code that I had never tried before, and I’m now much less happy with it: I tested longer (paragraph-length) strings, one from Alice in Wonderland and one from the Gettysburg Address, and the similarity score between them was above 0.6, which seems way too high. It was only slightly lower than a comparison between two nearby paragraphs from Alice. So while the Dice techniques are fine for short, title-length, or perhaps sentence-length strings, it looks like they break down for longer ones.

Ah well, so it goes.

My apologies, I should have provided a clearer explanation; there were certain things I did not spell out in my haste to make the previous post. The key point I am trying to make is that if you want to measure the similarity of word phrases, you need to measure similarity at the character level, and again at the word level.

Here’s a simple example of two phrases, “Neo took the red pill” versus “Neo took the blue pill”.

Using Levenshtein/edit distance at the character level (counting insertions and deletions), the difference between the two phrases is 5, because you need five deletions and insertions to edit one phrase into the other: delete the characters “r” and “d”, and insert the characters “b”, “l” and “u”.

Just purely as an example, we can continue using Levenshtein/edit distance at word level too. The edit distance between both phrases is just 1 because all you need to do is substitute one word, “red” for “blue”.

I hope this example illustrates why it is not ideal to use character-level similarity for any string that consists of more than one word. Thus when it comes to comparing the similarity of titles, you need a similarity measure that considers both character-level and word-level similarity. If you are comparing paragraphs or entire articles, you will need to consider sentence-level similarity, and even paragraph-level similarity too.

As an aside, the difference between character level and word level is just the size of the vocabulary. At the character level there are only the 26 letters of the alphabet to consider, so the vocabulary size is 26, whereas at the word level the vocabulary size is much larger. However, the same edit distance algorithms continue to work at both levels.
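To make that concrete, here is a quick sketch of my own (illustrative only, not library code): one generic edit distance that can be fed either characters or words. Note that the classic Levenshtein formulation also allows substitutions, so for the pill example above it reports 4 at the character level rather than 5, and 1 at the word level:

// Plain Levenshtein distance over any array of tokens (characters, words, ...).
function editDistance(a, b) {
  const cols = b.length + 1;
  let prev = Array.from({ length: cols }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,         // delete a[i - 1]
        curr[j - 1] + 1,     // insert b[j - 1]
        prev[j - 1] + cost   // substitute, or keep when the tokens match
      );
    }
    prev = curr;
  }
  return prev[cols - 1];
}

const a = "Neo took the red pill";
const b = "Neo took the blue pill";

// Character level: the vocabulary is individual characters.
console.log(editDistance([...a], [...b]));             // 4

// Word level: the same algorithm, but the vocabulary is whole words.
console.log(editDistance(a.split(" "), b.split(" "))); // 1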

The Smith-Waterman-Gotoh algorithm is just another edit-distance algorithm: if you look at its guts, it is still counting the number of insertions and deletions required between two sequences. All I am doing is applying it at the word level instead of the character level. Theoretically, you could continue using Levenshtein distance too, just with a much expanded vocabulary. I jumped the gun in my original post and mentioned Smith-Waterman-Gotoh because it was what I used in my computational linguistics master’s thesis back in 2004-2005 for a similar purpose, and it had some properties I found useful there. The S-W-G algorithm itself isn’t critical for this use case and can be substituted with any edit-distance algorithm.

The final point I need to make is that once you have an algorithm for character-level similarity and one for word-level similarity, you will need to combine the scores from the two levels into a single similarity score. And that will be your algorithm for fuzzy matching titles.
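Purely as an illustration (the equal weighting here is arbitrary and would need tuning), the combination could be as simple as averaging a normalized score from each level, reusing the editDistance sketch above:

// Blend a character-level and a word-level similarity into one score.
// The 0.5 / 0.5 weights are arbitrary; this is a sketch, not a recommendation.
function fuzzyTitleScore(a, b) {
  const chars = (s) => [...s.toLowerCase()];
  const words = (s) => s.toLowerCase().split(/\s+/).filter(Boolean);

  const charSim = 1 - editDistance(chars(a), chars(b)) /
                      Math.max(chars(a).length + chars(b).length, 1);
  const wordSim = 1 - editDistance(words(a), words(b)) /
                      Math.max(words(a).length + words(b).length, 1);

  return 0.5 * charSim + 0.5 * wordSim;
}

console.log(fuzzyTitleScore("Neo took the red pill", "Neo took the blue pill"));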

Hmm, it would be very interesting to apply this same idea with the Dice technique, because each level should be O(m + n) (with some slight fudging for the cost of splitting strings into characters, bigrams, words, and paragraphs). Does S-W-G yield a true metric? Dice only supplies a pseudo-metric, not that I think it matters much here. But it would be interesting to know.

I am a little iffy about using Dice’s coefficient for one reason, but I am not certain whether that reason is a valid one for the specific case of measuring word similarity.

My training is in the use of machine learning for natural language processing, and I came from the era in the mid-2000s when most techniques relied heavily on n-grams (unigrams, bigrams, trigrams, and so on). The primary reason the entire field moved away from n-grams is what is colloquially known as the “bag of words” issue: n-grams do not explicitly care about the ordering of things, whereas we know the ordering of characters and words is vital in language. The use of bigrams, trigrams and higher-order n-grams was an attempt at implicitly addressing this issue. It certainly helped, but did not completely get rid of all the problems the field had with using n-grams for language tasks.

The above is why I am iffy about using Dice’s coefficient. That said, at the moment I cannot recall any specific negative issues with using it as an “edit distance” algorithm. So iffy concerns aside, I guess it should be okay?

As for your question regarding “true metrics”, I presume you want to treat the algorithm’s distance as a pseudo-probability that always takes a value between 0 and 1? That shouldn’t be an issue. The worst case in edit distance is two completely different sequences, which requires m deletes and n inserts to edit one sequence into the other, or m + n edits. Just dividing the edit distance algorithm’s numeric output by m + n should give you a workable pseudo-probability that always falls between 0 and 1.
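In code terms, with any editDistance like the sketch earlier in the thread, that normalization is just (a trivial sketch):

// Divide by the worst case m + n to pin the score to [0, 1].
// (If the algorithm also allows substitutions, its output is at most
// max(m, n), which is still <= m + n, so the ratio stays in range.)
function pseudoProbability(a, b) {
  const m = a.length, n = b.length;
  if (m + n === 0) return 0;   // two empty sequences: distance 0, i.e. identical
  return editDistance(a, b) / (m + n);
}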

Right, and that’s what bit me when I tried longer samples. Too many common bigrams in English mean that very different texts end up being reported as matches. I may, when I have time, try to extend this to work with words, sentences, and paragraphs as well, combining the results from the various levels. But it would be mostly an ad hoc attempt at solving the problem while keeping the algorithm speedy.

No, that’s a bit different. There are a million ways to turn a distance measure into something useful for matching on [0, 1]. But there are nice features available in metric spaces when the distance function satisfies the metric laws. However, n-gram-based distances usually (always?) fail to satisfy the triangle inequality. I didn’t know whether S-W-G happens to generate a true metric. It’s probably only of academic concern, but I was curious if you happened to know.
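(A quick toy example of the failure: with bigram sets X = {ab}, Y = {ab, cd} and Z = {cd}, the Dice dissimilarities are d(X, Y) = d(Y, Z) = 1/3 but d(X, Z) = 1, which violates d(X, Z) ≤ d(X, Y) + d(Y, Z).)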