Are there any tools for clustering of tiddlers in a wiki?

I was cleaning up a big digital archive with fslint and wondered if there are similar tools/plugins for TiddlyWiki related activities.

Let’s say there’s a single HTML file wiki (if we talk about a Node.js one, then each tiddler is a file and regular file tools - same fslint - can be used for such tasks, but it’s not always possible to use Node.js) which has tiddlers with copypasted text fields. And I don’t mean just one tiddler copypasted multiple times, I mean multiple tiddlers copypasted into more tiddlers. Does the TiddlyWiki ecosystem contain any tools for clustering tiddlers into subsets of tiddlers with identical text field for example?

The logical assumption here is to use wikitext scripting, but that’s not an end user tool, that’s a programming process.

Closest I know for tiddler management tasks is Tiddler Commander plugin, can it do such things?

I needed to google that. Duplicate file finder.

I suppose it would depend on the nature of the data. Duplicate files are more likely to exist in a file system than in TiddlyWiki, because you can have the same name in any folder. In TiddlyWiki each tiddler requires a unique title.

If, however, you mean comparing the text fields to see whether any tiddlers have an identical text field, a batch process can be written in TiddlyWiki, the advantage being that you can then delete the offending tiddlers.

  • But if you can limit the tiddlers to compare you can make it more performant.

This sounds more like a one-off data cleanup, so I suggest using the tool you feel most comfortable with. Once it’s done you can forget it.

The lead question “Are there any tools for clustering of tiddlers in a wiki?” I would answer yes, as list manipulation is at the heart of TiddlyWiki.

I elaborated above. Two files with the same content but different names are still duplicates. In the context of TiddlyWiki, I stated it’s about a text field match. A practical scenario is having journal tiddlers with copypasted text fields. With a small number of them (when you have only just started a journal wiki) it’s not a problem. But when you copypaste the same daily routine over and over, the size grows to the point where you want to keep just one piece of text (or more later) and use transclusion in the journal tiddlers.

Ah, OK. I used to use a template for journal tiddlers with regular items, so each journal entry would be very repetitive, especially if I did not “do the standing tasks”. The trick in TiddlyWiki is to use a template that only overlays the journal tiddler: the tiddler stores only the information I edited, and a standing task only sets a field in the journal tiddler. If I don’t use the journal, it will be empty.

The issue you have is that you may need to compare every journal tiddler’s text field with every other journal tiddler’s text field, then do this for every journal tiddler.

  • However, if you know what a “neglected journal tiddler” text field looks like, you could just compare all journal tiddlers with that text. That would be easier.

It’s all about your data :nerd_face:

This may be needed, but if it’s a simple duplicate you can ask something like this:

[all[current]get[text]match{example tiddler}]

This is a whole-text comparison.

In these cases, if the number of tiddlers is high, I recommend you set a limit and, after testing, make it restartable so that you can continue where you left off.

Are you happy enough to include trivial matches, such as the empty string or "yes" / "no"?

If so, then I think the suggestion from @wiki_user is appropriate: hash the values and use the hashes as keys paired with values that are lists of titles.

I think Levenshtein and DiffText would not be suitable for even moderately large wikis, as they would require full application of non-trivial algorithms to every pair of tiddlers in the wiki, so O(n²), with a large constant. Hashing algorithms like MD5 or SHA make much more sense. And I’m pretty sure at least one is built into TW.
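To make the hash-and-group suggestion concrete, here is a minimal sketch in Python rather than wikitext. It assumes the tiddlers are already available as a list of dicts with `title` and `text` keys. The function name `group_by_text_hash`, the `min_length` cutoff for skipping trivial values, and the sample data are all made up for illustration. SHA-256 from the standard library stands in for whichever hash TW itself offers.

```python
import hashlib
from collections import defaultdict


def group_by_text_hash(tiddlers, min_length=1):
    """Return {digest: [titles]} for text values shared by 2+ tiddlers.

    Hypothetical helper: `tiddlers` is a list of dicts with "title" and
    "text" keys; values shorter than `min_length` are ignored so that
    trivial matches ("" or "yes"/"no") can be left out.
    """
    groups = defaultdict(list)
    for tiddler in tiddlers:
        text = tiddler.get("text", "")
        if len(text) < min_length:
            continue  # skip trivial values
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(tiddler["title"])
    # One pass over the tiddlers; keep only digests seen more than once
    return {d: titles for d, titles in groups.items() if len(titles) > 1}


# Made-up sample data
sample = [
    {"title": "Journal 1", "text": "Morning routine\nGym"},
    {"title": "Journal 2", "text": "Morning routine\nGym"},
    {"title": "Journal 3", "text": "Something else"},
    {"title": "Flag A", "text": "yes"},
    {"title": "Flag B", "text": "yes"},
]

clusters = group_by_text_hash(sample, min_length=4)
for titles in clusters.values():
    print(titles)
```

With `min_length=4` the two "yes" tiddlers are skipped; with the default of 1 they would form a second cluster. Each tiddler is hashed once, so the work grows linearly with the number of tiddlers instead of with the number of pairs.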

But if you don’t want to run this inside the wiki browser process, it should be pretty easy to parse the HTML file, find the tiddler store, extract the JSON, and do a similar comparison on the tiddlers found that way.
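A rough sketch of that external approach in Python, using only the standard library. It assumes the single-file format of recent TiddlyWiki releases (5.2.0 and later, as far as I know), where tiddlers sit as a JSON array inside `<script class="tiddlywiki-tiddler-store" type="application/json">` elements. Older files keep tiddlers in a `storeArea` div and would need different parsing, and encrypted wikis are not handled. The helper name `extract_tiddlers`, the demo HTML string and the file name in the comment are illustrative only.

```python
import json
import re

# Assumption: the store is a <script> element carrying this class and
# holding a JSON array of tiddler objects (one object per tiddler).
STORE_RE = re.compile(
    r'<script[^>]*class="tiddlywiki-tiddler-store"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_tiddlers(html):
    """Collect tiddler dicts from every JSON store block found in `html`."""
    tiddlers = []
    for match in STORE_RE.finditer(html):
        tiddlers.extend(json.loads(match.group(1)))
    return tiddlers


# Tiny stand-in for a real wiki file. For an actual file, something like
#   html = open("wiki.html", encoding="utf-8").read()
# would be used instead ("wiki.html" is a placeholder name).
demo_html = (
    "<html><body>"
    '<script class="tiddlywiki-tiddler-store" type="application/json">'
    '[{"title":"A","text":"same"},'
    '{"title":"B","text":"same"},'
    '{"title":"C","text":"x \\u003C y"}]'
    "</script></body></html>"
)

# Group titles by their exact text value
by_text = {}
for tiddler in extract_tiddlers(demo_html):
    by_text.setdefault(tiddler.get("text", ""), []).append(tiddler["title"])

duplicates = [titles for titles in by_text.values() if len(titles) > 1]
print(duplicates)
```

The regular expression relies on the store content never containing a literal `</script>`, which holds if `<` is written as `\u003C` inside the JSON, as in the demo string. For anything beyond a one-time cleanup, a real HTML parser would be the safer choice.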

This idea sounds very reasonable and I’m familiar with handling JSON in Python. Indeed, this is an option that would spare me from both trying to solve the problem inside a single file wiki, using pure wikitext programming, as well as digging into Node.js TiddlyWiki to get a bunch of regular files and work with them. Thank you!

PS: hypothetically, it would be a tad harder to work with encrypted wikis (because I don’t know how to do crypto voodoo in Python), but for one-time linting, it’s possible to save a decrypted copy on a trusted device, do the cleanup, then save it back encrypted.

marlon

“You probably don’t want to do that – just trust me, okay?”

I’d prefer you to elaborate.

Don’t place on an unsecured medium (e.g. disk), that which should remain encrypted. IOW, decrypted/unencrypted data should be placed in volatile memory only.

Of course, you can do whatever you want, but that will serve you well as a guideline.

Here is a quick attempt with jupyter notebook using BeautifulSoup to compare tiddler text of a standalone wiki. No clue how to handle if encrypted.

https://github.com/Drevarr/Notebooks/blob/main/Compare_Tid_Text.ipynb

Perhaps I should’ve elaborated too. By trusted computer I mean a computer with full disk encryption and no physical access for anyone but me. In this scenario, is keeping a wiki file not encrypted (by TiddlyWiki’s encryption mechanism) really less secure?

print(f"✅ Found {len(tiddlers)} tiddlers")

# --- Grouping and comparison

Non-ASCII characters in printed string and the --- comment starting with uppercase patterns triggered my paranoia twice (that this was written by AI) :smile:

This may be useful for you. It searches for identical text fields among the list of tiddlers you specified in “allValueList” and lists them out. Put the code below in a tiddler and update “allValueList” accordingly to check your list of tiddlers.

I wrote it for myself and had to adapt it to work with the text field, which may contain linefeeds that mess up list operations. I don’t have a lot of tiddlers with big text fields to test with, though, so I don’t know how well this will work out. Give it a try.

\function line.feed() [charcode[10]]
\procedure findDuplicate( fieldToCheck )
<$let allValueList={{{ =[!is[system]has<fieldToCheck>get<fieldToCheck>!is[blank]search-replace:gm:regexp[\n],[␊]format:titlelist[]join[ ]]}}}  >
    <$list filter="=[enlist:raw<allValueList>] -[enlist<allValueList>] +[unique[]]" variable="dupValue" counter=ct >
            <$let textContent={{{[<dupValue>search-replace:gim[␊],<line.feed>]}}} > 

!!! <<ct>>. <<fieldToCheck>> : "<$list filter="[<textContent>split[]limit[80]join[]]"><$view field="title"/>"</$list>

         <$transclude $variable="list-links" filter=`[has$(fieldToCheck)$field:$(fieldToCheck)$[$(textContent)$]]` />
             </$let>
    </$list>
</$let>
\end

<<findDuplicate "text">>

I copypasted this into a new tiddler in my wiki, but text values in the displayed list contain tiny pieces consisting of parts from single lines from my actual tiddlers. They are tiddlers with multiline wikitext, including [[ and ]] markers, which have special meaning in filters. At least if the value of text in a list item ends with ]], the next list item is Filter error: Syntax error in filter expression.

Unfortunately this is a private wiki with personal data, so I can’t paste it here, but I may try to assemble a minimal test case that replicates the issue.

fwiw afaik it is possible to “export” to a folder

tiddlywiki --help savewikifolder

--savewikifolder <wikifolderpath> [<filter>] [ [<name>=<value>] ]*

Try this. It should handle those pesky brackets. I went to bed after sending the post and it occurred to me also that those brackets might break the filter as my test data are all text, but I was too tired to get out. There might still be other cases that will break the filter though, but it should be obvious when it does.

\function line.feed() [charcode[10]]
\define rsb() ]]
\define lsb() [[
\procedure findDuplicate( fieldToCheck )
<$let allValueList={{{ =[!is[system]has<fieldToCheck>get<fieldToCheck>!is[blank]search-replace:gm:regexp[\n],[␊]search-replace:gm<rsb>,[》]search-replace:gm<lsb>,[《]format:titlelist[]join[ ]]}}}  >
    <$list filter="=[enlist:raw<allValueList>] -[enlist<allValueList>] +[unique[]]" variable="dupValue" counter=ct >
            <$let textContent={{{[<dupValue>search-replace:gim[␊],<line.feed>search-replace:gm[》],<rsb>search-replace:gm[《],<lsb>]}}} > 

!!! <<ct>>. <<fieldToCheck>> : '<$list filter="[<textContent>split[]limit[80]join[]]"><$view field="title"/> ...'</$list>
         <$transclude $variable="list-links" filter=`[has$(fieldToCheck)$field:$(fieldToCheck)$<textContent>]` />

             </$let>
    </$list>
</$let>
\end

<<findDuplicate "text">>

This is for completeness. I’m doing this for my own use, and found more hidden characters (sigh) that will break the filters. This version is tested against 50 MB of tiddlers (h0p3 philosopher TW) plus the official TW 5.3.8 tiddlers, so it should be reasonably robust against regular wikitext.

\function line.feed() [charcode[10]]
\function carriage.return() [charcode[13]]
\function line.terminator() [charcode[8232]]
\function paragraph.terminator() [charcode[8233]]
\define lsb() [[
\define rsb() ]]
\procedure findDuplicate( fieldToCheck )
<$let allValueList={{{ =[!is[system]has<fieldToCheck>get<fieldToCheck>!is[blank]search-replace:g<line.feed>,[␊]search-replace:g<carriage.return>,[␍]search-replace:g<lsb>,[⦕]search-replace:g<rsb>,[⦖]search-replace:g<line.terminator>,[␤]search-replace:g<paragraph.terminator>,[¶]format:titlelist[]join[ ]]}}}  >
    <$list filter="=[enlist:raw<allValueList>] -[enlist<allValueList>] +[unique[]]" variable=dupValue counter=ct >
            <$let textContent={{{[<dupValue>search-replace:g[␊],<line.feed>search-replace:g[␍],<carriage.return>search-replace:g[⦕],<lsb>search-replace:g[⦖],<rsb>search-replace:g[␤],<line.terminator>search-replace:g[¶],<paragraph.terminator>]}}} >

!!! <<ct>>. <<fieldToCheck>> : <$list filter="[<textContent>split[]limit[80]join[]]"> '<$view field="title"/> ...' </$list>
                 <$transclude $variable="list-links" filter=`[all[]!is[system]has[$(fieldToCheck)$]field:$(fieldToCheck)$<textContent>]` />
            </$let>
    </$list>
</$let>
\end

<<findDuplicate "text">>

If you’re running in relatively modern JS environments, then you might look to regexp escape to avoid doing this manually.