Unique word count to ignore punctuation (and mystery question)

Hi,

I wanted a unique word count but then decided to look at the list of words that was produced:

{{{ [all[tiddlers]!is[system]get[text]splitregexp[\s+]unique[]] }}}

and noticed that ‘apple’ and ‘apple,’ count as two words.

So what additions are needed to ignore punctuation?

I also found that the above produced what appeared to be an encrypted tiddler (so very long) beginning with /9j/4AAQSkZJRg

This was strange as I don’t have any encrypted tiddlers and its presence is presumably also distorting the word count.

However, if you apply the code above to tiddlywiki.com the same tiddler comes up. So curious to know what this is.

Thanks
Jon

For me it just finds the starlight theme!

OK, so if you do an advanced search, that’s what you find. But if you do it using the code above as suggested, it looks very different and I’d like to know how it impacts the word count.

Try this (with caution—it will be very slow in any sizable wiki! Personally, I replaced all[tiddlers]!is[system] with tag[HelloThere] for testing on tiddlywiki.com.)

\define non-alpha() [^A-Za-z]+

{{{ [all[tiddlers]!is[system]get[text]splitregexp<non-alpha>lowercase[]unique[]] }}}

You can add +[sort[]] to your original filter and this one to make it easy to compare results.

I also added lowercase[] to prevent A and a from being counted separately… this might eliminate a few proper nouns that should arguably be counted as distinct words, but I think it’s a reasonable trade-off.
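To see the difference between the two splitting rules outside TiddlyWiki, here is a rough Python sketch of the same logic. The function names and the sample sentence are made up for illustration; only the two regular expressions come from the filters above.

```python
import re

def unique_words_whitespace(text):
    # Mirrors splitregexp[\s+] followed by unique[]: split on whitespace only.
    return sorted({w for w in re.split(r"\s+", text) if w})

def unique_words_alpha(text):
    # Mirrors splitregexp<non-alpha>lowercase[]unique[]:
    # split on runs of anything that is not an ASCII letter, then lowercase.
    return sorted({w.lower() for w in re.split(r"[^A-Za-z]+", text) if w})

sample = "Apple pie, apple tart; an apple, please."
print(unique_words_whitespace(sample))
print(unique_words_alpha(sample))
```

With the whitespace rule, Apple, apple and apple, all survive as separate entries; with the non-alpha rule plus lowercasing they collapse into a single apple.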

This fragment seems to be coming from $:/themes/tiddlywiki/starlight/styles.tid, where it’s the first part of the embedded background image used by the Starlight theme. I’m not sure why it comes up in the results; it ought to be excluded by !is[system]. But if you have embedded images (or fonts, or other files you’ve dropped into your wiki) that are not excluded by the filter, I’d expect them to impact your total word count as well. In fact, my punctuation-stripping filter might make the issue considerably worse, as it would split the encoded file data at every slash and digit.
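A small Python sketch of that effect, using only the first ten bytes of a typical JPEG (JFIF) file as a stand-in for a real embedded image. The variable names are illustrative; the point is that base64 text contains no spaces, so the whitespace rule sees one giant "word" while the non-alpha rule chops it into fragments.

```python
import base64
import re

# First ten bytes of a typical JPEG (JFIF) file; a stand-in for a real image.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46])
encoded = base64.b64encode(jpeg_header).decode("ascii")
print(encoded)  # /9j/4AAQSkZJRg==

# Whitespace splitting: the whole encoded blob is a single token.
whitespace_tokens = [t for t in re.split(r"\s+", encoded) if t]

# Non-alpha splitting: every slash, digit, plus sign and equals sign is a break.
alpha_tokens = [t for t in re.split(r"[^A-Za-z]+", encoded) if t]

print(len(whitespace_tokens), whitespace_tokens)
print(len(alpha_tokens), alpha_tokens)
```

Even these sixteen characters become two bogus words under the non-alpha rule; a full image would contribute a great many more.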

That works great and I take your point about the performance bit so I’m counting using tags as you suggest. I can see all the punctuation is stripped out.

Ah - of course - embedded images, that’s what they look like.

Many thanks
Jon

Just a quick note: For most purposes the @etardiff solution (strip away the non-alpha) is efficient and will do the job!

Some odd bits will show up when the name Martínez turns into mart and nez in the wordlist, and wouldn’t turns into wouldn and t (etc.). And we may or may not want alphanumeric strings to show up as whole words on our list, as in 4’33", 1970s, TRS-80 and 50th…

Alas, apostrophes, closing single quotes, and prime marks (“dumb” apostrophes) can’t be cleanly classified as either text or punctuation: they can appear within a word (when the word is a contraction), and they can also serve as punctuation right up against a word on either end. We can safely say that any apostrophe or prime mark counts as part of a word IFF it’s sandwiched by alphanumeric characters on both sides (letting the word ’twas become twas, small loss!). At any rate, the more precision matters, and the more diverse the real-world input we’re fielding, the more complex our task becomes. 🙂
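The "sandwiched on both sides" rule can be sketched with a single regular expression. This is a Python illustration, not a TiddlyWiki filter, and it rests on two assumptions of mine: "alphanumeric" is taken to mean any Unicode letter or digit (so Martínez stays whole), and the set of apostrophe-like characters is limited to the three listed in the code.

```python
import re

# Assumption: any Unicode letter or digit counts as alphanumeric.
ALNUM = r"[^\W_]"
# Assumption: straight apostrophe, right single quote, and prime mark.
APOSTROPHES = "'\u2019\u2032"

# A word is a run of alphanumerics, optionally continued by an apostrophe
# that is immediately followed by more alphanumerics (i.e. sandwiched).
WORD = re.compile(rf"{ALNUM}+(?:[{APOSTROPHES}]{ALNUM}+)*")

def words(text):
    return WORD.findall(text)

sample = "’Twas odd: Martínez wouldn’t play 4’33\" on a TRS-80, 'really'."
print(words(sample))
```

Here wouldn’t and 4’33 survive intact, ’Twas loses its leading apostrophe, the quotes around really are dropped, and TRS-80 still splits at the hyphen, which this rule does not address.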

EDIT: One tweak I might experiment with: rather than imposing lowercase from the get-go, find the unique strings first, and then drop the capitalized variant of any string that appears in both capitalized and non-capitalized forms. It’s still not perfect, especially for short texts, but you’d be hedging against capitalized versions of many common words (assuming the sample is decently long), and for most names and other proper nouns you’d get only the capitalized appearance. Perhaps AI could give us a compact list of ordinary words that are disproportionately likely to appear only in sentence-initial positions, such as nevertheless, however, firstly… Then again, if your subject matter makes only limited and predictable use of strings that shouldn’t appear in lowercase, it would indeed be wise to start with etardiff’s pattern of imposing lowercase, and then apply an ad hoc list of word-strings whose case needs fixing (McCarthy, IBM, VoIP, I [the pronoun]) before the list is rendered.
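A minimal Python sketch of the "drop the capitalized variant" idea, just to make the rule concrete. The function name and sample list are hypothetical, and "capitalized variant" is interpreted narrowly here as a word that differs from another word in the list only by an upper-case first letter.

```python
def drop_capitalized_duplicates(words):
    unique = set(words)
    kept = []
    for word in sorted(unique):
        # The same word with only its first letter lowercased.
        lowered_initial = word[:1].lower() + word[1:]
        # Drop "However" when "however" is also present; keep everything else.
        if word != lowered_initial and lowered_initial in unique:
            continue
        kept.append(word)
    return kept

sample = ["However", "the", "apple", "fell", "Apple", "pie",
          "however", "IBM", "McCarthy", "I", "The"]
print(drop_capitalized_duplicates(sample))
```

However, Apple and The are dropped because their lowercase forms also occur, while IBM, McCarthy and I are kept because no lowercase-initial twin appears in the list. A word that only ever occurs at the start of a sentence would still be kept in its capitalized form, which is the weakness noted above for short texts.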
