How do I count the total number of words in my TiddlyWiki tiddlers' titles?

Dear friends

I want to see the total number of words across all my tiddler titles.

For example, if I have three tiddlers whose titles are 3, 4, and 5 words long, I get 3+4+5=12.

The following code shows the number of words in the knowledge base you have created:

<mark>My Knowledge base Total word count: <$text text={{{ [!is[system]get[text]split[ ]splitregexp[(.)]!is[blank]count[]] }}}/></mark>

Any reply would be greatly appreciated

Wouldn’t you have to make one big string out of all the titles first?

Also, shouldn’t you use the title field instead of the text field (which holds the body text typed into the tiddler)?

Maybe the first part of your filter run could be like this:

[!is[system]get[title]join[ ]]

That seems to make a huge string with all the titles together. Then maybe you can count it. Or maybe you don’t need to join it first before counting it. I don’t know. Good luck.


Thank you very much. With this, you can count the total number of words in all the headings. Strangely, I had the impression that this same attempt did not succeed when I tried it before, which is a little odd.

[!is[system]get[title]split[ ]splitregexp[(.)]!is[blank]count[]]

Not seeing why you would need regexp here.

{{{ [!is[system]get[title]] +[enlist-input[]count[]] }}}

The logic of the previous inline filter is correct

Somewhat less

I have more than 1,500 notes; most note titles are 6 to 9 characters long.

I suspect what you actually want is a filter that will count space-separated words (English) + individual characters (Chinese). But I don’t know the regex to find Chinese characters only, so someone else will have to help you with that.

Are you looking for the total number of words/characters used in all your titles, or the total number of unique words? If you care about unique words, you’ll probably also want to strip out numbers and punctuation (so “Tiddlywiki” and “Tiddlywiki:” aren’t considered different words).
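
For instance, a rough JavaScript sketch of that stripping (the normalize helper here is hypothetical, just for illustration):

// Hypothetical helper: lowercase a word and strip leading/trailing
// punctuation, so "Tiddlywiki" and "Tiddlywiki:" compare equal.
// (Assumes Latin-script words; \W would also strip CJK characters.)
const normalize = w => w.toLowerCase().replace(/^\W+|\W+$/g, "");
normalize("Tiddlywiki:")  //=> "tiddlywiki"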

The number of characters contained in all headings of all notes created by the user

I’m afraid there may be some language barriers here.

When you say “notes”, do you mean “tiddlers”? Or is this some subset of the tiddlers, perhaps those tagged “Note”?

When you say “headings”, do you mean the “title” field? Or are you talking about the (possibly generated) H1, H2, … H6 tags, which might come from wikitext such as !! Heading Level Two, or !!!! Heading Level Four?

Finally, I know that breaking Chinese text into words is more complex than in English or other Latin alphabet languages. So I would expect that splitting on spaces is not enough, but it’s the only technique I know. So would this be acceptable?:

const text = "你说到这是一个测试"
text.split('')
//=> ["你", "说", "到", "这", "是", "一", "个", "测", "试"] // giving nine "words"

If so, that can be done with the split operator, using various kinds of input:

[{!!title}split[]]
// or
[<my-text>split[]]
// or
[[你说到这是一个测试]split[]]

I have corrected this to the proper expression.

Yes, that is what I mean.

By using the ‘[{!!title}split[]]’ filter, the result is:

That’s a somewhat bizarre result. When I try that, I do not get the two and's, just the expected e's.

But if you want to split English strings (or those of any Latin-script language, I believe) into words, you can split on spaces, with split[ ]. (Note the empty space between the brackets.) That will have some flaws, for instance, including starting/ending punctuation with the words… I don’t think there’s anything entirely perfect, capturing all possible strings, but this should be pretty good: splitregexp[\W+].

For instance

[[This--and that--for "those" 123 people who care?]splitregexp[\W+]]

yields

  • This
  • and
  • that
  • for
  • those
  • 123
  • people
  • who
  • care

(Note, though, that this still has issues. daughter-in-law would become three distinct words.)
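
The same behaviour is easy to check with the equivalent JavaScript split:

"daughter-in-law".split(/\W+/)
//=> ["daughter", "in", "law"]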

But this doesn’t work with Chinese characters. I don’t know Chinese, so I may be mistaken, but I don’t think there is any simple rule to split a string of Chinese characters into distinct words.

I imagine there’s a technique with regex to split a string into sections of Latin characters and Chinese ones. So if you have a string → words technique for Chinese, we might be able to break the string into language-grouped sections, apply the appropriate rule to each section, and combine the results back into a single word list. But I don’t know a complete technique.
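
As a very rough starting point, though, here is a JavaScript sketch of the counting idea above. It assumes the CJK Unified Ideographs block (\u4e00–\u9fff) is a good-enough test for Chinese characters, counting each Chinese character as one word and each run of Latin letters/digits as one word:

// Rough sketch only: counts each CJK character as a word, plus each
// run of Latin letters/digits. Assumes \u4e00-\u9fff covers the
// Chinese characters in question (extension blocks are not covered).
function countWords(text) {
  const cjk = text.match(/[\u4e00-\u9fff]/g) || [];
  const latin = text.match(/[A-Za-z0-9]+/g) || [];
  return cjk.length + latin.length;
}

countWords("TiddlyWiki 测试")  //=> 3 (one Latin word plus two characters)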

The result is the same

This can be achieved with Python’s Chinese word-segmentation library, jieba:

pip install jieba

Yes, but that’s a big library, not a “simple rule”. To do this in Tiddlywiki, we could import the JS port of jieba to do this work, but that would involve including an 11+ MB dictionary. This is not for the faint-of-heart.
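
For anyone who does want to try it, a minimal sketch with the nodejieba port might look like this (I am assuming the package name and its cut() call from its README; check the current API before relying on it):

// Minimal sketch, assuming the nodejieba package and its cut() API.
// Installing it pulls in the large dictionary mentioned above.
const nodejieba = require("nodejieba");
console.log(nodejieba.cut("你说到这是一个测试"));  // logs an array of segmented words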


In JavaScript there is this API: Intl.Segmenter (see “Intl.Segmenter - JavaScript | MDN”).
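
A minimal sketch of how it could segment Chinese into countable words (word granularity, keeping only word-like segments):

// Word-granularity segmentation; keep only word-like segments
// (skipping spaces and punctuation) and collect their text.
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const words = [...segmenter.segment("你说到这是一个测试")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
console.log(words.length);  // number of segmented words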

Nice! I’ve never run across that.