Fuzzy Search -- use diff-match-patch or other algorithms

Hi @Mohammad, a little while back I also proposed adding a fuzzy mode to search:

Not sure if this will ever go anywhere, but at least it’s out there.

Hi @Yaisog

This is amazing! I also checked the link you provided (Diff, Match and Patch: Demo of Match (fraser.name)). This can empower the TiddlyWiki search operator.

I’d be interested in seeing matching based on algorithms other than Levenshtein distance, perhaps something based on Dice’s Coefficient. The NPM package string-similarity does this, and I published what I think is an improvement on StackOverflow. This would give us faster searching, and what sometimes seems like a better fuzzy-matching technique than Levenshtein.
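(For reference, the package’s API is tiny; from memory it looks roughly like this, so check its README for the exact signatures.)

// npm install string-similarity  (Dice's Coefficient over bigram sets)
const stringSimilarity = require("string-similarity");

// Score between 0 (no bigrams in common) and 1 (identical bigram sets).
console.log(stringSimilarity.compareTwoStrings("tiddlywiki", "tiddly wiki"));

// Rank a list of candidate titles against a search term.
console.log(stringSimilarity.findBestMatch("tidlywiki",
  ["tiddlywiki", "tiddlers", "wikitext"]).bestMatch);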

Looking at Dice’s coefficient:

DSC = 2 |X ∩ Y| / (|X| + |Y|)

where |X| and |Y| are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

I have written filters that retrieve all the fieldnames in two or more tiddlers and then compare this set with the current tiddler’s fields; the same calculation could be done this way to compare tiddlers.
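For example, if one tiddler had the fields title, text, tags, created and modified, and another had title, text and caption, the two sets would share two fieldnames, giving 2 × 2 / (5 + 3) = 0.5.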

Just to be clear, the sets to be considered here are the two-letter combinations in the original strings.

So "tiddlywiki" would become ['ti', 'id', 'dd', 'dl', 'ly', 'yw', 'wi', 'ik', 'ki']. In the standard Dice’s Coefficient, we would simply count the existence of each of these items in one string or the other and the number that are in both, and apply that simple formula. My version is slightly more complex as it counts the multiplicities of these pairs (called bigrams.) Thus mine would convert "'Tiddlywiki users tidied their tiddlers" into

{
  "ti":3,"id":3,"dd":2,"dl":2,"ly":1,"yw":1,"wi":1,"ik":1,"ki":1,"i ":1," u":1,
  "us":1,"se":1,"er":2,"rs":2,"s ":1," t":3,"di":1,"ie":1,"ed":1,"d ":1,"th":1,
  "he":1,"ei":1,"ir":1,"r ":1,"le":1
}

and convert "Try to tidy the tiddlers in your Tiddlywiki" into

{
  "tr":1,"ry":1,"y ":2," t":5,"to":1,"o ":1,"ti":3,"id":3,"dy":1,"th":1,"he":1,
  "e ":1,"dd":2,"dl":2,"le":1,"er":1,"rs":1,"s ":1," i":1,"in":1,"n ":1," y":1,
  "yo":1,"ou":1,"ur":1,"r ":1,"ly":1,"yw":1,"wi":1,"ik":1,"ki":1
}

and apply a formula that is more complex than the original, but still fairly simple, to give a distance of 0.6329113924050633. The raw numbers of these distances are not all that important: they hit 0 and 1 in the appropriate cases, and they are ordered in a way that tends to match our intuition. That should be enough for fuzzy matching.

(I notice looking back on this that there’s a bug in that implementation. Single-character or empty strings will have a division-by-zero error… Be right back… Ok, fixed!)

There are possible improvements to this. We might want to remove bigrams that contain a space, for instance. We might want to consider punctuation as well as regular characters. Those would be easy.


While we could do this with TW’s tools, this is a place I would expect to descend into raw JS. I would be curious as to the speed differences. It probably wouldn’t matter for testing a few fields, but if we were testing the text of tiddlers, JS’s raw speed would probably be important.
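Something like this, say. This is not my published code, but this multiplicity-counting sketch reproduces the 0.6329… figure above and includes a guard for the empty / single-character case mentioned earlier:

// Sketch of a Dice-style similarity that counts bigram multiplicities.
const bigramCounts = (str) => {
  const counts = new Map();
  const s = str.toLowerCase();
  for (let i = 0; i < s.length - 1; i++) {
    const pair = s.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
};

const diceSimilarity = (a, b) => {
  const countsA = bigramCounts(a);
  const countsB = bigramCounts(b);
  let shared = 0;
  for (const [pair, count] of countsA) {
    shared += Math.min(count, countsB.get(pair) || 0);
  }
  let total = 0;
  for (const count of countsA.values()) { total += count; }
  for (const count of countsB.values()) { total += count; }
  // Empty and single-character strings have no bigrams: avoid dividing by zero.
  if (total === 0) { return a.toLowerCase() === b.toLowerCase() ? 1 : 0; }
  return (2 * shared) / total;
};

console.log(diceSimilarity(
  "Tiddlywiki users tidied their tiddlers",
  "Try to tidy the tiddlers in your Tiddlywiki"
)); // 0.6329113924050633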

The issue with Levenshtein distance is that it is designed to measure the character-by-character similarity between a pair of strings, so it is much stronger at measuring the similarity of word pairs than of phrase pairs. The algorithm is not meant to measure the similarity of phrases or sentences. For example, Levenshtein would give “color” vs “colours” a high similarity score, but give a low similarity score to “red blue” vs “blue red”.

What’s needed is to couple a phrase-pair similarity algorithm with Levenshtein’s word-pair similarity algorithm. One such phrase-pair similarity algorithm I have used in the past is the Smith-Waterman-Gotoh algorithm; at least, that’s what it was called back in 2004 when I used it. However, I believe there have been improvements made to the algorithm since then, and it works much more efficiently these days.

I believe combining Smith-Waterman and Levenshtein should be what’s needed for fuzzy searching.

When creating the PR referenced above, I originally thought that we could simply use the match_main() function of the already included diff-match-patch library, with some arbitrary threshold. This would probably add the fewest possible lines to the core code.
However, that function requires an estimated location of the match and applies scoring penalties when the match is further away. So, it’s probably not straightforward to use as a fuzzy search function, even if it were a suboptimal one.
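To illustrate, this is roughly how that function wants to be called (from memory of the diff-match-patch API docs, so treat the details as approximate):

// match_main() needs the text, the pattern, and an expected location. Each candidate
// is scored by how many edits it needs and how far it lies from that location, and
// Match_Threshold is the cut-off on that combined score.
const dmp = new diff_match_patch();
dmp.Match_Threshold = 0.4;   // 0.0 admits only perfect matches, 1.0 admits almost anything
dmp.Match_Distance = 1000;   // how far from 'loc' a match may drift before the location penalty dominates

// Returns the index of the best match, or -1 if nothing clears the threshold.
const loc = dmp.match_main("The quick brown fox jumps over the lazy dog", "qiuck", 4);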
Since we already have the levenshtein filter operator, we can make a fuzzy search based on that by combining it with e.g. sortsub or :sort. So there’s really no need to add suffixes to search.

I’d not used this algorithm, and only just read the Wikipedia page. I can’t tell from that how appropriate it is for text, which is probably less noisy than genetic sequences.

I’d be interested to hear about the improvements; I can’t see a way to reduce this from O(m * n), whereas Dice’s Coefficient, and my variant of it, should be O(m + n). But I tried something with my code that I had never tried before, and I’m now much less happy with it: I tested longer (paragraph-length) strings, one from Alice in Wonderland and one from the Gettysburg Address, and the similarity score between them was above 0.6, which seems way too high. It was only slightly lower than a comparison between two nearby paragraphs from Alice. So while the Dice techniques are fine for short, title-length, or perhaps sentence-length strings, it looks like they break down for longer ones.

Ah well, so it goes.

My apologies, I should have provided a clearer explanation; there were certain things I did not spell out in my haste to make the previous post. The key point I am trying to make is that if you want to measure the similarity of word phrases, you need to measure similarity at the character level, and again at the word level.

Here’s a simple example of two phrases, “Neo took the red pill” versus “Neo took the blue pill”.

Using Levenshtein/edit distance at the character level (counting insertions and deletions), the difference between the two phrases is 5, because you need five deletions and insertions to edit one phrase into the other: delete the characters “r” and “d”, and insert the characters “b”, “l” and “u”.

Just purely as an example, we can continue using Levenshtein/edit distance at word level too. The edit distance between both phrases is just 1 because all you need to do is substitute one word, “red” for “blue”.

I hope this example illustrates why it is not ideal to use character-level similarity for any string that consists of more than one word. Thus when it comes to comparing the similarity of titles, you need a similarity measure that considers both character-level and word-level similarity. If you are comparing paragraphs or entire articles, you will need to consider sentence-level similarity, and even paragraph-level similarity too.

As an aside, the difference between character level and word level is just the size of the vocabulary. At the character level there are only the 26 letters of the alphabet to consider, so the vocabulary size is 26, whereas at the word level the vocabulary size is much larger. However, the same edit distance algorithms continue to work at both levels.
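To make that concrete, here is a quick sketch of my own (illustrative only, not library code): one generic edit distance that can be fed either characters or words. Note that the classic Levenshtein formulation also allows substitutions, so for the pill example above it reports 4 at the character level rather than 5, and 1 at the word level:

// Plain Levenshtein distance over any array of tokens (characters, words, ...).
function editDistance(a, b) {
  const cols = b.length + 1;
  let prev = Array.from({ length: cols }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,         // delete a[i - 1]
        curr[j - 1] + 1,     // insert b[j - 1]
        prev[j - 1] + cost   // substitute, or keep when the tokens match
      );
    }
    prev = curr;
  }
  return prev[cols - 1];
}

const a = "Neo took the red pill";
const b = "Neo took the blue pill";

// Character level: the vocabulary is individual characters.
console.log(editDistance([...a], [...b]));             // 4

// Word level: the same algorithm, but the vocabulary is whole words.
console.log(editDistance(a.split(" "), b.split(" "))); // 1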

The Smith-Waterman-Gotoh algorithm is just another edit-distance algorithm: if you look at its guts, it is still counting the number of insertions and deletions required between two sequences. All I am doing is applying it at the word level instead of the character level. Theoretically, you could continue using Levenshtein distance too, just with a much expanded vocabulary. I jumped the gun in my original post and mentioned Smith-Waterman-Gotoh because it was what I used in my computational linguistics master’s thesis back in 2004-2005 for a similar purpose, and it had some properties I found useful there. The S-W-G algorithm itself isn’t critical for this use case and can be substituted with any edit-distance algorithm.

The final point I need to make is that once you have an algorithm for character-level similarity and one for word-level similarity, you will need to combine the scores from the two levels into a single similarity score. And that will be your algorithm for fuzzy matching titles.
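Purely as an illustration (the equal weighting here is arbitrary and would need tuning), the combination could be as simple as averaging a normalized score from each level, reusing the editDistance sketch above:

// Blend a character-level and a word-level similarity into one score.
// The 0.5 / 0.5 weights are arbitrary; this is a sketch, not a recommendation.
function fuzzyTitleScore(a, b) {
  const chars = (s) => [...s.toLowerCase()];
  const words = (s) => s.toLowerCase().split(/\s+/).filter(Boolean);

  const charSim = 1 - editDistance(chars(a), chars(b)) /
                      Math.max(chars(a).length + chars(b).length, 1);
  const wordSim = 1 - editDistance(words(a), words(b)) /
                      Math.max(words(a).length + words(b).length, 1);

  return 0.5 * charSim + 0.5 * wordSim;
}

console.log(fuzzyTitleScore("Neo took the red pill", "Neo took the blue pill"));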

Hmm, it would be very interesting to apply this same idea with the Dice technique, because each level should be O(m + n) (with some slight fudging for the cost of splitting strings into characters, bigrams, words, and paragraphs). Does S-W-G yield a true metric? Dice only supplies a pseudo-metric, not that I think it matters much here. But it would be interesting to know.

I am a little iffy about using Dice’s coefficient for one reason, but I am not certain whether that reason is a valid one for the specific case of measuring word similarity.

My training is in the use of machine learning for natural language processing, and I came from the era in the mid-2000s when most techniques relied heavily on n-grams (unigrams, bigrams, trigrams, and so on). The primary reason the entire field moved away from n-grams is what is colloquially known as the “bag of words” issue: n-grams do not explicitly care about the ordering of things, whereas we know the ordering of characters and words is vital in language. The use of bigrams, trigrams and higher-order n-grams was an attempt at implicitly addressing this issue. It certainly helped, but did not completely get rid of all the problems the field had with using n-grams for language tasks.

The above is why I am iffy about using Dice’s coefficient. That said, at the moment I cannot recall any specific negative issues with using it as an “edit distance” algorithm. So iffy concerns aside, I guess it should be okay?

As for your question regarding “true metrics”, I presume you want to treat the algorithm’s distance as a pseudo-probability that always takes a value between 0 and 1? That shouldn’t be an issue. The worst case in edit distance is two completely different sequences, which requires m deletes and n inserts to edit one sequence into the other, or m + n edits. Just dividing the edit distance algorithm’s numeric output by m + n should give you a workable pseudo-probability that always falls between 0 and 1.
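In code terms, with any editDistance like the sketch earlier in the thread, that normalization is just (a trivial sketch):

// Divide by the worst case m + n to pin the score to [0, 1].
// (If the algorithm also allows substitutions, its output is at most
// max(m, n), which is still <= m + n, so the ratio stays in range.)
function pseudoProbability(a, b) {
  const m = a.length, n = b.length;
  if (m + n === 0) return 0;   // two empty sequences: distance 0, i.e. identical
  return editDistance(a, b) / (m + n);
}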

Right, and that’s what bit me when I tried longer samples. Too many common bigrams in English mean that very different texts end up being reported as matches. I may, when I have time, try to extend this to work with words, sentences, and paragraphs as well, combining the results from the various levels. But it would be mostly an ad hoc attempt at solving the problem while keeping the algorithm speedy.

No, that’s a bit different. There are a million ways to turn a distance measure into something useful for matching on [0, 1]. But there are nice features available in metric spaces when the distance function satisfies the metric laws. However, n-gram-based distances usually (always?) fail to satisfy the triangle inequality. I didn’t know whether S-W-G happens to generate a true metric. It’s probably only of academic concern, but I was curious if you happened to know.
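(A quick toy example of the failure: with bigram sets X = {ab}, Y = {ab, cd} and Z = {cd}, the Dice dissimilarities are d(X, Y) = d(Y, Z) = 1/3 but d(X, Z) = 1, which violates d(X, Z) ≤ d(X, Y) + d(Y, Z).)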