How to extract words with regular expressions

TW_Tones · February 18, 2024, 4:49am

Folks, I still have a form of dyslexia when reading regular expressions. I have followed multiple links inside talk.tiddlywiki and on Google, so I hope someone can help.

Here is a list widget I am using to extract words from the documentation tiddlers I placed in a plugin.

<$list filter="[[TiddlyWiki-Documentation]plugintiddlers[]limit[5]get[text]splitregexp[\s]sort[]unique[]!subfilter{English Stop words}!subfilter{Common English verbs}]">

</$list>

The problem is it does not remove punctuation at the end of words such as , and . and a few others like " and < ( { ! ?

Can you tell me an appropriate regex to use in the regexp operator please?

buggyj · February 18, 2024, 9:43am

maybe

\define nonwords() [^a-zA-Z]
<$list filter="[[atiddler]get[text]splitregexp<nonwords>sort[]unique[]]">

TW_Tones · February 18, 2024, 9:57am

Yes, I had thought of that but for some reason excluded it.

I am curious why you call it nonwords, when it results in words?

Thanks, that seems to do what I expect.

buggyj · February 18, 2024, 12:03pm

It splits on the non-words, so the resulting fragments are words. In the regex, [^…] means “not any of …”, so [^a-zA-Z] matches any single character that is not a letter.
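A small JavaScript sketch of the same idea may help (TiddlyWiki's regexp operators use the JavaScript regex engine). The sample sentence and the name splitOnNonWords are made up for illustration, and the trailing + plus the empty-string filter are additions that are not part of the nonwords definition above.

```javascript
// Split on runs of non-letters; what remains between the separators are the words.
// The "+" (one or more) is an addition to the original [^a-zA-Z] so that
// consecutive punctuation such as `, "` produces one split instead of several.
function splitOnNonWords(text) {
  return text
    .split(/[^a-zA-Z]+/)
    .filter((fragment) => fragment !== ""); // drop empties at the start/end
}

const sample = 'Hello, "world" (again)!';
console.log(splitOnNonWords(sample)); // [ 'Hello', 'world', 'again' ]
```

Without the + each punctuation character splits separately, so neighbouring separators leave empty strings in the result.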

pmario · February 18, 2024, 2:22pm

If you want to learn regexp, IMO this is the best source I know: https://www.regular-expressions.info/

Start with the QuickStart info, and be prepared to work through it. Reading alone is not enough to understand it.

Secret-HQ · February 18, 2024, 8:11pm

@TW_Tones —

I was thinking you may want to include digits in @buggyj’s nonwords definition, updating it to: [^A-Za-z0-9].

That way, you wouldn’t be splitting words with numbers in them, like 1Password or Voice2TXT.

But the drawback there is that you’re still splitting on characters like -, á, &, ¢, €, §, etc., so you break up words like 50¢, 90°, super-hero, and AT&T. Or, in your use case here, TiddlyWiki-Documentation.
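To make the difference concrete, here is a rough JavaScript comparison of the two separators. The helper name words and the sample string are assumptions for illustration, and the + after each character class is an addition so that runs of separators split only once.

```javascript
// Compare the letters-only separator with the letters-and-digits one.
const lettersOnly = /[^A-Za-z]+/;
const lettersAndDigits = /[^A-Za-z0-9]+/;

// Hypothetical helper: split and discard empty fragments.
const words = (text, separator) => text.split(separator).filter(Boolean);

const sample = "1Password Voice2TXT super-hero AT&T";
console.log(words(sample, lettersOnly));
// [ 'Password', 'Voice', 'TXT', 'super', 'hero', 'AT', 'T' ]
console.log(words(sample, lettersAndDigits));
// [ '1Password', 'Voice2TXT', 'super', 'hero', 'AT', 'T' ]
```

Adding digits rescues 1Password and Voice2TXT, but super-hero and AT&T are still broken apart.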

Maybe the best split would be at: (?<!...)\b(?!...) where you replace the ... with all the characters you don’t want to split on, separated by the pipe (|).

For example:

(?<!-|&|¢)\b(?!-|&|¢)

This tells your filter:

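Break on word boundaries (\b) that are neither preceded by nor followed by the characters you specify (-|&|¢). The (?<!...) part is a negative lookbehind (“not preceded by”) and the (?!...) part is a negative lookahead (“not followed by”).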

If you want to be comprehensive, you’ll end up with a really long string where -|&|¢ is — but since you’re using it in two places (before and after \b), you may want to store it in a tiddler and reference it from there anyway, so it would be easy to add to.

And, of course, you don’t have to try to anticipate every wacky “w̄õ%řd́” you may want to preserve in the future. You can just add exceptions as you come to them.
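As a sketch of how this behaves in the JavaScript regex engine that TiddlyWiki relies on, the snippet below builds the guarded boundary from a single list of characters. The variable names and the sample sentence are assumptions for illustration. Note that splitting on \b also returns the gaps between words, and that a fragment such as 50¢ keeps the whitespace after it, so the sketch trims each fragment and drops those without a letter or digit; that clean-up step is an addition, not part of the expression above.

```javascript
// Characters that should NOT trigger a split, written once and reused
// in both lookarounds (mirrors the idea of keeping the list in one tiddler).
const keep = "-|&|¢";

// (?<!…) = negative lookbehind, \b = word boundary, (?!…) = negative lookahead.
const guardedBoundary = new RegExp(`(?<!${keep})\\b(?!${keep})`);

const sample = "AT&T sells a super-hero toy for 50¢ today.";
const fragments = sample.split(guardedBoundary);

// The raw fragments include separators such as " " and ".",
// and "50¢ " still carries its trailing space.
const wordsOnly = fragments
  .map((fragment) => fragment.trim())
  .filter((fragment) => /[A-Za-z0-9]/.test(fragment));

console.log(wordsOnly);
// [ 'AT&T', 'sells', 'a', 'super-hero', 'toy', 'for', '50¢', 'today' ]
```

Lookbehind needs a reasonably recent JavaScript engine, so it is worth checking the result in the browsers you care about.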

TW_Tones · February 18, 2024, 11:32pm

Thanks @Secret-HQ, I will keep these notes handy should I need to enhance the definition of words. In the current application I am trying to get plain language words and compare them against plain language word lists. So at this stage I would not bother with the other forms, although later I may.

One peculiar pattern I saw: so-called words consisting of repeated letters were in the sample, e.g. aaa and aaaaaaaaaaaaa etc…

I wonder if there is a way to exclude these from the list of words found? Perhaps anything with 3 or more repeated letters.

Another thing to look at that is related is to spell check the resulting words, because we need to correct these.

Secret-HQ · February 19, 2024, 3:14am

Hmmm. That might be a challenge, especially in the same expression.

Once you have the list, though, this should find any words with three repeated letters — a sure sign of something amiss (at least in English):

\b.*?(.)\1{2}.*?\b

(.)\1{2} is the potent bit, with

. being a character,
( and ) “capturing” it for use in the rest of our expression,
\1 referring to the first (and here, only) thing we captured, and
{2} indicating two instances of it back to back (following the first instance, where we captured it — for a total of three instances)
.*? at the beginning and end allow for other characters that could be before or after the triple-character string we’re searching for
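For a quick check outside the wiki, the core of that expression can be tried in JavaScript. This is only a sketch: the word list and the name hasTripleRepeat are invented for the example, and it uses just (.)\1{2}, since the surrounding \b.*? parts are not needed for a yes/no test on a single word.

```javascript
// True when any character appears three or more times in a row.
const hasTripleRepeat = (word) => /(.)\1{2}/.test(word);

const found = ["tiddler", "aaa", "filter", "aaaaaaaaaaaaa", "bookkeeper", "zzz"];
const cleaned = found.filter((word) => !hasTripleRepeat(word));
console.log(cleaned); // [ 'tiddler', 'filter', 'bookkeeper' ]
```

Doubled letters (tiddler, bookkeeper) pass through; only runs of three or more are rejected. Inside a filter, a step along the lines of !regexp[(.)\1{2}] should have the same effect, though that is untested here.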

Bob_Jansen · February 19, 2024, 11:54pm

@TW_Tones,

with my Proper Noun Extraction app, I have a similar problem, but most probably a bit easier than yours.

I basically split the text into words at space characters and then check the first character of each ‘word’ for upper case. Noun Phrases are handled by assuming subsequent words, with upper case first characters, belong to the proper noun phrase. I remove punctuation at the beginning and end of words but this also terminates noun identification.

I also have a dictionary of stopwords, so that individual words that start with upper case are not proper nouns if they also have an entry in this dictionary.

For your info, here is my code (for Xojo, essentially modern Basic)

var i, j, numWords as integer
var properNoun, word, nextWord  as string
var alphanumeric as string = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
var endofProperNoun, b as boolean
numWords = fileText.text.CountFields(" ")

properNoun = ""
endofProperNoun = false
properNounsExtracted = false

for i = 1 to numWords 'NthField counts fields from 1
  word = fileText.Text.NthField(" ",i).Trim
  
  'if word.Contains("Arp") then //debug code
  'word = word
  'end if
  
  if IsNumeric(word) then
    continue
  end if
  
   While  not alphanumeric.Contains(word.Right(1))
    word = word.Left(word.Length - 1)
    endofProperNoun = true
  Wend
  While  not alphanumeric.Contains(word.Left(1))
    word = word.Right(word.Length - 1)
    'endofProperNoun = true
  Wend
  
  if word.Length < 2 then
    continue
  end if
  if word.Right(2) = "'s" then
    word = word.Left(word.Length - 2)
  end if
  
  if word.Left(1).Asc >= 65 then 'uppercase A
     if word.Left(1).Asc <=90 then 'uppercase Z
      if endofProperNoun then
        if properNoun = "" then
          properNoun = word
        else
          properNoun = properNoun + " " + word
        end if
        addtoDictionary(properNoun)
        properNoun = ""
        endofProperNoun = false
      else
        if properNoun = "" then
          properNoun = word
        else
          properNoun = properNoun + " " + word
        end if
      end if
    else
      addtoDictionary(properNoun)
      properNoun = ""
      endofProperNoun = false
    end if
  else
    addtoDictionary(properNoun)
    properNoun = ""
    endofProperNoun = false
  end if
next

The stopword dictionary is amended after every document is processed to incorporate the new stop words.
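For readers more at home in JavaScript, the approach described above might be paraphrased roughly as follows. This is a loose sketch, not a translation of the Xojo code: the function name, the tiny stop-word set and the sample sentence are all hypothetical, and the stop-word check is folded into the loop for brevity.

```javascript
// Hypothetical stop-word list; the real one grows as documents are processed.
const stopWords = new Set(["the", "a", "in", "on", "yesterday"]);

function extractProperNouns(text) {
  const found = [];
  let phrase = [];
  const flush = () => {
    if (phrase.length > 0) found.push(phrase.join(" "));
    phrase = [];
  };

  for (const raw of text.split(/\s+/)) {
    // Trailing punctuation ends the current phrase; leading punctuation does not.
    const endsPhrase = /[^A-Za-z0-9]$/.test(raw);
    let word = raw.replace(/^[^A-Za-z0-9]+|[^A-Za-z0-9]+$/g, "");
    word = word.replace(/'s$/, ""); // drop a possessive 's
    if (word.length < 2 || /^\d+$/.test(word)) continue;

    const capitalised = /^[A-Z]/.test(word);
    const isStopWord = stopWords.has(word.toLowerCase());
    if (capitalised && !isStopWord) {
      phrase.push(word);
      if (endsPhrase) flush();
    } else {
      flush(); // a lower-case word or stop word closes the phrase
    }
  }
  flush(); // emit a phrase that runs to the end of the text
  return found;
}

console.log(
  extractProperNouns(
    "Yesterday Jeremy Ruston visited New South Wales, then the Sydney Opera House."
  )
);
// [ 'Jeremy Ruston', 'New South Wales', 'Sydney Opera House' ]
```

"Yesterday" is capitalised only because it starts the sentence, which is exactly the case the stop-word dictionary is meant to catch.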

TW_Tones · February 20, 2024, 12:44am

Thanks for sharing @Bob_Jansen

It is interesting to see a different approach.

I will have to look at Xojo, as in the past I was an advanced BASIC programmer. That was a long time ago, but I have been looking for a good version, primarily for data conversion like you are doing.

However at this point I am trying to implement this in tiddlywiki script.
For other reasons, I would like to write executable programs as data filters at the command prompt.

I am also taking a slightly different approach here: whilst I ignore case, and even lose the case, in word lists, I plan later to test whether an occurrence of a word is capitalised in the current content and thus discover if it is possibly a proper noun.

Bob_Jansen · February 20, 2024, 2:19am

@TW_Tones, happy to return the favour.

I settled on Xojo (or RealBasic, as it was originally called) many years ago and now only build standalone apps with it, since, after purchasing the appropriate license, I can deliver on OSX, Windows, Android and iOS. However, the IDE is free to download and you can run your app in debug mode as often as you like for no charge with full functionality. This is how I am using it at the moment. I will only purchase a license for compiling if and when I have a paying client.

bobj

Scott_Sauyet · February 20, 2024, 5:26pm

I don’t know that there is a perfect solution. Regular expressions are for regular languages and English is definitely not one.

But this seems to be a reasonably good version:

<$let text={{{ [["But," he said, "the rain, in Spáin (my dear-one!), falls mainly 'in the plain.'"]] }}}>
<<list-links filter:"[<text>splitregexp[(?:^\W*)|(?:\W*\s+\W*)|(?:\W*$)]!match[]]" >>
</$let>

Which yields

But
he
said
the
rain
in
Spáin
my
dear-one
falls
mainly
in
the
plain

An explanation

The regex is (?:^\W*)|(?:\W*\s+\W*)|(?:\W*$), which is three separate possibilities, joined by or separators (|):

               (?:^\W*)|(?:\W*\s+\W*)|(?:\W*$)
First group ---\______/ \___________/ \______/--- Third group
                       |      |      |
          Separator ---+      |      +---  Separator
                        Second group

All groups use (?: something ). The parentheses establish a group, and the ?: means that the group itself will not be captured; we only use these parentheses to group our matches.

Inside the first group is ^\W*. The initial ^ matches the beginning of the string. Where \w matches word characters, the capital inverts it, so that \W matches all non-word characters, including punctuation. When followed by an asterisk (*), this matches any number of non-word characters. So this group matches any punctuation at the beginning of a string.

The third group is much the same. $ represents the end of the string, so \W*$ matches all punctuation at the end of the string.

The middle group does the bulk of the work. \W*\s+\W* matches sequences of at least one whitespace character surrounded by optional punctuation.

Used together these split our string almost perfectly. The only trouble is that the first and third groups add empty strings captured at the beginning and end. So we add !match[] to remove them.
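The same split can be reproduced in plain JavaScript, which is a convenient way to experiment with the regex outside the wiki. This is a sketch that assumes splitregexp behaves like String.prototype.split with the same pattern; the variable names are just for illustration.

```javascript
const text = `"But," he said, "the rain, in Spáin (my dear-one!), falls mainly 'in the plain.'"`;
const separator = /(?:^\W*)|(?:\W*\s+\W*)|(?:\W*$)/;

const pieces = text.split(separator);
// The first and third groups leave an empty string at each end.
console.log(pieces[0] === "", pieces[pieces.length - 1] === ""); // true true

const words = pieces.filter((piece) => piece !== ""); // the !match[] step
console.log(words.join(" "));
// But he said the rain in Spáin my dear-one falls mainly in the plain
```

Spáin survives even though á is not a word character in this engine, because the pattern only strips non-word characters that touch whitespace or the ends of the string.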

I know it’s not perfect. But if we find additional cases to consider, we can probably add to the regex or to additional !match clauses.

I would need a lot more data to test on, but I would expect this to be a performance improvement:

get[text]splitregexp[\s]unique[]sort[]

If we do the unique call first, there will be fewer things to sort. It probably wouldn’t show itself until the number of words gets into the thousands, but eventually it could make a difference.
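The ordering argument can be illustrated with a short JavaScript sketch. The word list is made up, and Set is used here only as a stand-in for the unique[] step.

```javascript
const words = ["the", "rain", "in", "the", "plain", "in", "the", "rain"];

// sort-then-unique: the sort has to compare every occurrence
const sortedFirst = [...new Set([...words].sort())];

// unique-then-sort: duplicates are dropped before any comparison happens
const uniqueFirst = [...new Set(words)].sort();

console.log(uniqueFirst); // [ 'in', 'plain', 'rain', 'the' ]
console.log(JSON.stringify(sortedFirst) === JSON.stringify(uniqueFirst)); // true
```

Both orders give the same list; the second simply sorts four items instead of eight.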

TW_Tones · February 20, 2024, 11:59pm

Yes, I noticed that after posting.

Thanks for the comprehensive solution and explanation @Scott_Sauyet

I have applied this to the tiddlywiki.com documentation tiddlers and seem to be getting the following results;

Total words found 81,812
Removing stop words, common verbs, common adjectives I now have ONLY 1,392 words (Including version numbers). Quite a good “compression rate”.

But the devil is in the details.

Perhaps we can also reduce the numbers?

Examples;

1
1,2,3
2
2016
2017
25年を目指すtiddlywiki
3
3,4,5
4
4,5,6
5
5.0.0-alpha.11
5.0.0-alpha.12
5.0.0-alpha.13
5.0.0-alpha.14
5.0.0-alpha.15
5.1.0
5.1.1
5.1.10
5.1.11
tags/rawmarkupwikified/bottombody
tags/rawmarkupwikified/topbody
tags/rawmarkupwikified/tophead

Looking at the resulting words, it would be quite easy to build or acquire a few additional word lists of things we do not need to include, such as:

TiddlyWiki-specific terms, e.g. butfirst or actionconfirmwidget
Obvious filenames, e.g. apocalypse.png
Hyphenated words, e.g. authenticated-user-header

But I thought I would ask ChatGPT which of the resulting words could be pronouns and for the first time ever, I put it into a loop and it started repeating itself like a “mad man”. With further questions it went on to “hallucinate”.

However, the approach is still valid: once these unique words are extracted, it should be quite easy to filter them based on other lists.
The key trick is to take any analysis that groups words and store the result; that list is then available going forward and can be used to reduce the list of found words in the current or new content. That is, the “knowledge” grows over time, and the exceptions diminish.
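As a rough illustration of that idea in JavaScript, the sketch below keeps several stored word lists and reports only the words that none of them explains yet. The list names and contents are hypothetical; in the wiki each list would live in its own tiddler.

```javascript
// Hypothetical stored lists; in the wiki each would live in its own tiddler.
const knownLists = {
  stopWords: new Set(["the", "and", "of"]),
  commonVerbs: new Set(["use", "make"]),
  tiddlywikiTerms: new Set(["butfirst", "actionconfirmwidget"]),
};

// Keep only words that appear in none of the known lists.
function unknownWords(words, lists) {
  return words.filter((word) =>
    Object.values(lists).every((list) => !list.has(word.toLowerCase()))
  );
}

const found = ["the", "butfirst", "use", "apocalypse", "widget"];
console.log(unknownWords(found, knownLists)); // [ 'apocalypse', 'widget' ]

// Once a word has been classified, record it so it never shows up again.
knownLists.tiddlywikiTerms.add("widget");
console.log(unknownWords(found, knownLists)); // [ 'apocalypse' ]
```

Each classification shrinks the list of unknowns for every later run.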

Scott_Sauyet · February 21, 2024, 3:53pm

That is very impressive compression!

We can add another filter step: !search::regexp[\d|\/|\.|\-]. The regex tests for digits (\d) or slashes (\/) or periods (\.) or hyphens (\-), joined with the vertical bar (|) representing “or”. (Note that \d is different from the others: we’re not escaping “d”, but using the character class \d, which is all digits. The others are escaped literal characters. Only the period strictly needs the escape, since an unescaped . matches any character; escaping / and - is harmless here and guards against contexts where they are special, such as regex literals and character classes.)

We can see it in this:

<$let text={{{ [["But," he said, "the rain, in Spáin (my dear-one!), falls mainly 'in the plain.'"  
                 With 42 numbers, 11teen version numbers (such as 5.0.0 and 5.1.2019) and some lists of paths/through/the/system that we 
                 might/care/about, plus hyphenated-words and file.names and some I just don't understand, like "25年を目指すtiddlywiki".]] }}}>
<$list filter=
    "[<text>splitregexp[(?:^\W*)|(?:\W*\s+\W*)|(?:\W*$)]!match[]!search::regexp[\d|\/|\.|\-]]" 
><$text text={{{ [{!!title}] }}}/> </$list>

</$let>

which yields these words, that still need to be de-duplicated:

But he said the rain in Spáin my falls mainly in the plain 
With numbers version numbers such as and and some lists of that we 
plus and and some I just don't understand like

(line breaks added)

This would not help you with butfirst and the like; you’d need a separate mechanism for that.
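The effect of that extra step can be checked in JavaScript with the same pattern. This is only a sketch: the word list is a made-up sample, and Array.prototype.filter stands in for the !search::regexp step.

```javascript
const words = ["rain", "42", "11teen", "5.0.0", "paths/through", "dear-one", "file.names", "plain"];

// Same pattern as in the filter step: a digit, a slash, a period or a hyphen.
const unwanted = /\d|\/|\.|\-/;

const kept = words.filter((word) => !unwanted.test(word));
console.log(kept); // [ 'rain', 'plain' ]
```

Anything containing even one of those characters is dropped, which is why dear-one and file.names disappear along with the numbers.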

I also would worry a bit about removing the hyphenated words, as I personally find this to be a super-useful bit of punctuation in my writing. And your other removal targets might or might not be useful depending on how you want to use the results. I imagine that in searching the documentation, I might really want version numbers, for instance.

TW_Tones · February 26, 2024, 1:04am

Actually, I only want to create a word list from the content where those words may be of “interest”. I have yet to work on the second phase: revisiting the original text to highlight the words of interest, and providing a set of tools to manipulate them. For example, select one or more words and:

create a tiddler, or a missing tiddler as a link
indicate it is a pronoun, person or organisation name
identify sentences containing verbs or questions

I have realised we totally underutilise plain text; after all, it contains information embedded in its words and phrases. What we need are ways to add further value by taking such content, collecting information from it, and representing it in TiddlyWiki's structure: titles, tiddlers, tags, categories, links…

Sure, language is complex, but the users understand the language, and as soon as someone identifies and “calls out” information, e.g. a person or organisation name, it can be captured and remembered forever. This provides a way to accrue and build on knowledge, which is in some ways the opposite of “diminishing returns”: over time fewer and fewer exceptions will appear, i.e. “diminishing exceptions”.