I don't know that there is a perfect solution. Regular expressions are for regular languages, and English is definitely not one.
But this seems to be a reasonably good version:
<$let text={{{ [["But," he said, "the rain, in Spáin (my dear-one!), falls mainly 'in the plain.'"]] }}}>
<<list-links filter:"[<text>splitregexp[(?:^\W*)|(?:\W*\s+\W*)|(?:\W*$)]!match[]]" >>
</$let>
Which yields
- But
- he
- said
- the
- rain
- in
- Spáin
- my
- dear-one
- falls
- mainly
- in
- the
- plain
An explanation
The regex is (?:^\W*)|(?:\W*\s+\W*)|(?:\W*$), which is three separate possibilities, joined by "or" separators (|):
               (?:^\W*)|(?:\W*\s+\W*)|(?:\W*$)
First group ---\______/ \___________/ \______/--- Third group
                       |      |      |
          Separator ---+      |      +--- Separator
                        Second group
All groups use (?: something ). The parentheses establish a group, and the ?: means that the group will not be captured itself; we're only using these parentheses to group our matches.
Inside the first group is ^\W*. The initial ^ matches the beginning of the string. Where \w matches word characters (letters, digits, and underscore), the capital inverts it, so that \W matches all non-word characters, including punctuation and whitespace. When followed by an asterisk (*), this matches any number of non-word characters, including none. So this group matches any punctuation at the beginning of the string.
The third group is much the same. $ represents the end of the string, so \W*$ matches all punctuation at the end of the string.
The middle group does the bulk of the work. \W*\s+\W* matches sequences of at least one whitespace character surrounded by optional punctuation.
Used together, these split our string almost perfectly. The only trouble is that the first and third groups leave empty strings at the beginning and end of the results. So we add !match[] to remove them.
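To see the split and the cleanup outside the wiki, here is a minimal JavaScript sketch. It assumes splitregexp behaves like JavaScript's String.prototype.split with the same pattern, which I have not checked against the TiddlyWiki source. The names wordSplitter, splitWords and sample are made up for illustration.

```javascript
// Sketch only: mimics the filter with plain JavaScript, assuming
// splitregexp works like String.prototype.split.
const wordSplitter = /(?:^\W*)|(?:\W*\s+\W*)|(?:\W*$)/;

function splitWords(text) {
  // The first and third groups leave "" at the start and end of the result.
  const pieces = text.split(wordSplitter);
  // Equivalent of !match[]: drop the empty strings.
  return pieces.filter((piece) => piece !== "");
}

const sample =
  `"But," he said, "the rain, in Spáin (my dear-one!), falls mainly 'in the plain.'"`;

console.log(splitWords(sample));
```

Logging sample.split(wordSplitter) without the filter step shows the two empty strings that !match[] is there to remove.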
I know it's not perfect. But if we find additional cases to consider, we can probably add to the regex or to additional !match clauses.
I would need a lot more data to test on, but I would expect this to be a performance improvement:
get[text]splitregexp[\s]unique[]sort[]
If we do the unique call first, there will be fewer things to sort. It probably wouldn't show itself until the number of words gets into the thousands, but eventually it could make a difference.
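The same idea in a rough JavaScript sketch: both orderings give the same list, but deduplicating first hands the sort fewer items. The function names are hypothetical, and the default Array sort here is only a stand-in for TiddlyWiki's sort[] operator, which may order things differently.

```javascript
// Sketch only: compare "unique, then sort" with "sort, then unique".
function uniqueThenSort(words) {
  // The Set removes duplicates, so the sort sees each word once.
  return [...new Set(words)].sort();
}

function sortThenUnique(words) {
  // Sorts every occurrence, duplicates included, before removing them.
  return [...new Set([...words].sort())];
}

const words = "the rain in the plain in the rain".split(/\s/);

console.log(uniqueThenSort(words)); // [ 'in', 'plain', 'rain', 'the' ]
console.log(sortThenUnique(words)); // same result, but 8 items were sorted instead of 4
```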