How to Extract Delimited Substrings from a Longer String

jacng · September 20, 2024, 8:46am

I saw the other thread "The User Defined Extract Filter Operator", started to reply, then realized it would be more appropriate to elaborate on my solution here instead.

Firstly, for searching substrings marked by two delimiter strings, regexp seems suitable for the task. In other tools, the usual approach is a regexp search for all occurrences of the shortest strings that match the pattern "<start_delimiter>.*?<end_delimiter>", where “.*” = any string and “?” = ‘non-greedy’ or shortest, requesting that all matching results, excluding the delimiters, be returned in an array.
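As a minimal sketch of that "usual approach" outside TW, here is how it might look in plain JavaScript (chosen because TiddlyWiki regexps are JavaScript regexps). The delimiters "##" and "@@" and the sample text are borrowed from the examples later in this post; the variable names are only illustrative.

```javascript
// Sketch: collect every shortest substring between two delimiters.
const text = "This is ##a very@@ ##short sample@@ text.";

// (.*?) is the non-greedy capture; the "g" flag requests all matches.
const pattern = /##(.*?)@@/g;

// matchAll yields one match per occurrence; index 1 is the captured group,
// so the delimiters themselves are excluded from the result.
const found = Array.from(text.matchAll(pattern), (m) => m[1]);

console.log(found); // [ 'a very', 'short sample' ]
```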

An example is in this Stack Overflow Q&A on regexp, “Regex Match all characters between two strings - Stack Overflow”, which uses the more advanced “lookahead” and “lookbehind” syntax to search for the delimiters without including them in the returned result.
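For illustration, a lookbehind/lookahead version could be sketched in JavaScript as follows. This is an assumption about what such a pattern looks like for the "##" and "@@" delimiters used in this post, not a quote from that Q&A.

```javascript
// Sketch: lookbehind (?<=##) and lookahead (?=@@) assert the delimiters
// without consuming them, so the match itself is only the inner text.
const text = "This is ##a very@@ ##short sample@@ text.";

// String.match with the "g" flag returns all matches, or null if none.
const found = text.match(/(?<=##).*?(?=@@)/g) ?? [];

console.log(found); // [ 'a very', 'short sample' ]
```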

However, in TW, if we search using a filter expression, it returns a list of titles that match the expression, but not the matched strings within the titles. The workaround I know is to do a global search and replace on a title for a search pattern, removing all non-matching portions so that only the matched strings from the title are returned.

To fulfill “remove ALL non-matching portions” is a little tricky. For this purpose, it helps to visualize a title as consisting of a series of strings, like this (.*? means shortest string):

".*?<start_delimiter>.*?<end_delimiter>"
".*?<start_delimiter>.*?<end_delimiter>"
…
".*?<start_delimiter>.*?<end_delimiter>"
"remainder unmatched text".

The search pattern then becomes ".*?<start_delimiter>(.*?)<end_delimiter>". The replacement pattern is $1, which is the string captured by the first set of brackets in the search pattern.
The first search for ".*?<start_delimiter>(.*?)<end_delimiter>" finds the first string in that sequence, by virtue of it being the first shortest string that matches the criteria, then the next, and the next, and so on. A global search and replace will replace that title with "<matched string 1><matched string 2><matched string 3>... remainder text", like this:

<$let t="This is ##a very@@ ##short sample@@ text.">
{{{[<t>search-replace:g:regexp[.*?##(.*?)@@],[($1)]]}}}
</$let>

Output: (a very)(short sample) text.
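To make the "series of strings" picture concrete, the following JavaScript sketch lists what each match consumes and what it keeps. It assumes that the search-replace operator behaves like a global JavaScript String.replace; the names `segments` and `replaced` are only illustrative.

```javascript
// Sketch: each match swallows the leading unmatched text plus one
// delimited substring; only the captured group survives the replace.
const text = "This is ##a very@@ ##short sample@@ text.";
const pattern = /.*?##(.*?)@@/g;

// m[0] is the whole consumed segment, m[1] is the part that is kept.
const segments = Array.from(text.matchAll(pattern), (m) => ({
  consumed: m[0],
  kept: m[1],
}));

// The trailing " text." matches nothing, so it is left in place.
const replaced = text.replace(pattern, "($1)");

console.log(segments);
console.log(replaced); // (a very)(short sample) text.
```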

That leaves the matter of removing the remaining unmatched string, what to do if nothing matches, and how to return the matched strings. I do not know the best or usual approach to this. What I ended up with is to append a rare unused character “┋” to each matched string as a separator, return blank if there is no match by searching for that character, then split and remove the remaining unmatched text, which also returns the matched strings nicely in a list, like this:

<$let t="This is ##a very@@ ##short sample@@ text.">
{{{[<t>search-replace:g:regexp[.*?##(.*?)@@],[$1┋]search[┋]split[┋]butlast[]format:titlelist[]join[ ]]}}}
</$let>

Output: [[a very]] [[short sample]]
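The separator trick can be sketched in JavaScript like this. The function name `extractAll` is hypothetical, and the mapping to filter steps (includes for search[┋], slice for butlast[]) is my reading of the filter above rather than TW's actual implementation.

```javascript
// Sketch: mark each match with a rare separator, bail out if nothing
// matched, then split and drop the trailing unmatched remainder.
const SEP = "┋";

function extractAll(text) {
  // Replace each "leading text + ##match@@" segment with "match┋".
  const marked = text.replace(/.*?##(.*?)@@/g, "$1" + SEP);
  // No separator means no match at all: return an empty list.
  if (!marked.includes(SEP)) return [];
  // The last piece is the unmatched tail (possibly empty), so drop it.
  return marked.split(SEP).slice(0, -1);
}

console.log(extractAll("This is ##a very@@ ##short sample@@ text."));
console.log(extractAll("no delimiters here"));
```

The last piece after splitting is always the leftover tail, even when the text ends exactly on the end delimiter (the tail is then an empty string), which is why dropping the last item is safe.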

Then, if a title is not just a single line but multi-line text, what I found works is to replace linefeeds with another rare unused character and restore them later, like this:

\function line.feed() [charcode[10]]
<$let t="""
This is ##a very@ short ##sample@ text.
This is another very short ##sample text.
This is@ a third short sample ##text@.
"""
>
{{{[<t>search-replace:g:regexp<line.feed>,[♭]search-replace:g:regexp[.*?##(.*?)@],[$1┋]search[┋]search-replace:g:regexp[♭],<line.feed>split[┋]butlast[]format:titlelist[]join[ ]]}}}
</$let>

Output: [[a very]] sample [[sample text. This is]] text
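The linefeed swap can be sketched in JavaScript as follows. The function name `extractAcrossLines` is hypothetical, and the sketch restores linefeeds after splitting rather than before, which gives the same result.

```javascript
// Sketch: "." does not match a linefeed, so swap linefeeds for a rare
// placeholder first, run the single-line extraction, then swap back.
const SEP = "┋";
const NL = "♭";

function extractAcrossLines(text) {
  const flat = text.replace(/\n/g, NL);
  const marked = flat.replace(/.*?##(.*?)@/g, "$1" + SEP);
  if (!marked.includes(SEP)) return [];
  return marked
    .split(SEP)
    .slice(0, -1)
    .map((piece) => piece.replace(/♭/g, "\n")); // restore linefeeds
}

const t = [
  "",
  "This is ##a very@ short ##sample@ text.",
  "This is another very short ##sample text.",
  "This is@ a third short sample ##text@.",
  "",
].join("\n");

console.log(extractAcrossLines(t));
```

With this input the third item spans two lines, matching the "sample text. This is" entry in the output above.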

Oh, the delimiters used in the regexp should be escaped with “escaperegexp[]”, which I didn’t do for brevity.
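To show why escaping matters, here is a JavaScript sketch with delimiters that are themselves regexp metacharacters. The helper `escapeRegExp` is a hypothetical stand-in written for this example; it is not TiddlyWiki's escaperegexp[] operator, though it is meant to serve the same purpose.

```javascript
// Sketch: escape regexp metacharacters in a delimiter before building
// the search pattern from it.
function escapeRegExp(s) {
  // Prefix every metacharacter with a backslash; "$&" is the matched text.
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const start = "^"; // would mean "start of input" if left unescaped
const end = "$";   // would mean "end of input" if left unescaped

const pattern = new RegExp(
  ".*?" + escapeRegExp(start) + "(.*?)" + escapeRegExp(end),
  "g"
);

const marked = "This is ^a very$ short ^sample$ text.".replace(pattern, "[$1]");
console.log(marked); // [a very][sample] text.
```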

That’s it. Hope this is clearer and I didn’t miss anything.

Mohammad · September 20, 2024, 9:09am

Hello @jacng
Very nice writeup! I really enjoyed reading it!
It shows how one can extract substring using the current features of TiddlyWiki.

P.S.: I will add this to TW-Scripts.

Readers may like to see the other discussion here: “A Simpler Solution for Find Macro in modern TiddlyWiki 5.3.5 - Discussion - Talk TW”.

Mohammad · September 20, 2024, 9:28am

It’s unfortunate that TW doesn’t recognize the single line s flag. I’m referring to something like search-replace:gs:regexp.

CodaCoder · September 20, 2024, 1:01pm

Indeed! Excellent breakdown, @jacng.

I do a lot of this and have a few times wondered how the code would look redone with .functions[]. You almost did it for me!

jacng · September 21, 2024, 3:43am

The function version could be something like this? Someone can improve on it.

Have a great weekend !

\function substring.withDelimiters( str1, str2 )
[all[]] :map[function[search.string],<str1>,<str2>function[swap_param],<currentTiddler>]
\end

\function search.string( str1, str2 ) [[.*?]] [<str1>escaperegexp[]] [[(.*?)]] [<str2>escaperegexp[]] +[join[]]

\function line.feed() [charcode[10]]

\function swap_param( inputText )
[all[]] :map[<inputText>search-replace:g:regexp<line.feed>,[♭]search-replace:g:regexp<currentTiddler>,[$1┋]search[┋]search-replace:g:regexp[♭],<line.feed>split[┋]butlast[]format:titlelist[]join[ ]]
\end

<$let t="""
This is ^a very$ short ^sample$ text.
This is another very short sample ^text.
This is$ a third short sample text.
"""
>
t: <$text text=<<t>> /> <br><br>
output: {{{[<t>function[substring.withDelimiters],[^],[$]]}}}
</$let>

t: This is ^a very$ short ^sample$ text. This is another very short sample ^text. This is$ a third short sample text.

output: [[[a very]] sample [[text. This is]]]

atronoush · April 16, 2025, 4:36am

Great solution! How can I use this code for Extract Text Between Two Delimiters?

I mean, can I use this code to find @@.red content here... @@ if and only if “@@.red” and “@@” are on their own lines, like:

This is @@.red Hello TW.@@ now I start another line
The second line
@@.red
I am the right section.
@@
End of this tiddler

Expected output: I am the right section.

jacng · April 17, 2025, 1:06am

Mmm, try using “@@.red♭” and “@@♭” as delimiters. Linefeed characters are globally replaced by the “♭” character in the beginning to enable search across lines, and restored later in the returned titles. Then, to handle the case where your last line is “@@” without a linefeed, preemptively add a linefeed to the end of your text using addsuffix like this:

\function swap_param( inputText )
[all[]] :map[<inputText>addsuffix[♭]search-replace:g:regexp<line.feed>,[♭]......
\end

Any unmatched text at the end of the text will be removed anyway, so that extra linefeed shouldn’t affect the output.

This is a quick response; I didn’t actually test this myself.
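The same idea can at least be checked outside TW with a JavaScript sketch. This mirrors the steps described above (placeholder suffix, linefeed swap, delimiters “@@.red♭” and “@@♭”) but it is not the TW function itself, and the name `extractOwnLineBlocks` is hypothetical.

```javascript
// Sketch: treat "@@.red" + linefeed and "@@" + linefeed as the delimiters,
// after appending one placeholder so a final "@@" line still qualifies.
const SEP = "┋";
const NL = "♭";

function extractOwnLineBlocks(text) {
  const flat = (text + NL).replace(/\n/g, NL);
  const marked = flat.replace(/.*?@@\.red♭(.*?)@@♭/g, "$1" + SEP);
  if (!marked.includes(SEP)) return [];
  return marked
    .split(SEP)
    .slice(0, -1)
    .map((piece) => piece.replace(/♭/g, "\n"));
}

const sample = [
  "This is @@.red Hello TW.@@ now I start another line",
  "The second line",
  "@@.red",
  "I am the right section.",
  "@@",
  "End of this tiddler",
].join("\n");

console.log(extractOwnLineBlocks(sample));
```

In this sketch the inline “@@.red Hello TW.@@” is skipped and the returned string keeps the linefeed that precedes the closing “@@”. Note that the pattern only requires a linefeed after each delimiter, so a line such as “foo @@.red” would also be treated as a start delimiter.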

jacng · April 21, 2025, 5:22am

Yes, the workaround using global replace of linefeed is somewhat clumsy.

I looked it up. In regexp, the ‘s’ flag (I think it’s commonly called ‘dotall’) allows ‘.’ to match any character, including newline. Mmm, I found that using “(.|\n)” (i.e. “.” or linefeed) in place of “.” works for searching across lines.
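A small JavaScript sketch of the difference, using a made-up two-line sample: plain “.” stops at the linefeed, while both “(.|\n)” and the ‘s’ flag (where the regexp engine accepts it) cross it.

```javascript
// Sketch: compare ".", "(.|\n)" and the dotall "s" flag on text in which
// the delimited substring spans a linefeed.
const text = "a ##one\ntwo@ b";

// Plain "." never matches "\n", so no match is found at all.
const plain = /##(.*?)@/.exec(text);

// "(.|\n)" explicitly allows a linefeed as an alternative to ".".
const alternation = /##((.|\n)*?)@/.exec(text);

// With the "s" flag, "." itself matches linefeeds too.
const dotAll = /##(.*?)@/s.exec(text);

console.log(plain);          // null
console.log(alternation[1]); // "one\ntwo"
console.log(dotAll[1]);      // "one\ntwo"
```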

So my earlier example code is now :

<$let t="""
This is ##a very@ short ##sample@ text.
This is another very short ##sample text.
This is@ a third short sample ##text@.
"""
>

<<t>> <br><br>
output: <$text text=
{{{[<t>search-replace:g:regexp[(.|\n)*?##((.|\n)*?)@],[$2┋]search[┋]split[┋]butlast[]format:titlelist[]join[ ]]}}}  />
</$let>

Note that $1 has been changed to $2 to refer to the second set of brackets in the regexp.
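The group numbering can be checked with a JavaScript sketch. It also shows a possible alternative, non-capturing groups “(?:.|\n)”, which would keep the wanted text in $1; that variant is my own suggestion and I have not tried it inside a TW filter. The helper name `extract` and the sample text are made up.

```javascript
// Sketch: brackets are numbered by their opening position, so wrapping "."
// in "(.|\n)" pushes the wanted capture from $1 to $2.
const SEP = "┋";

function extract(text, pattern, replacement) {
  const marked = text.replace(pattern, replacement);
  return marked.includes(SEP) ? marked.split(SEP).slice(0, -1) : [];
}

const text = "x ##one\ntwo@ y ##three@ z";

// Capturing groups: the wanted text is group 2.
const withGroups = extract(text, /(.|\n)*?##((.|\n)*?)@/g, "$2" + SEP);

// Non-capturing groups (?:...) are not numbered, so $1 works again.
const nonCapturing = extract(text, /(?:.|\n)*?##((?:.|\n)*?)@/g, "$1" + SEP);

// Using $1 with the capturing version returns only the last character
// matched by the first group, not the delimited text.
const wrongGroup = text.replace(/(.|\n)*?##((.|\n)*?)@/g, "$1" + SEP);

console.log(withGroups, nonCapturing, JSON.stringify(wrongGroup));
```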

The “function” version (still a little clumsy with the need to swap parameters ) :

\function substring.withDelimiters( str1, str2 )
[all[]] :map[function[search.expr],<str1>,<str2>function[swap_param],<currentTiddler>]
\end

\function search.expr( str1, str2 ) [[(.|\n)*?]] [<str1>escaperegexp[]] [[((.|\n)*?)]] [<str2>escaperegexp[]] +[join[]]

\function swap_param( inputText )
[all[]] :map[<inputText>search-replace:g:regexp<currentTiddler>,[$2┋]search[┋]split[┋]butlast[]format:titlelist[]join[ ]]
\end

<$let t="""
This is ^a very$ short ^sample$ text.
This is another very short sample ^text.
This is$ a third short sample text.
"""
>
t: <$text text=<<t>> /> <br><br>
output: {{{[<t>function[substring.withDelimiters],[^],[$]]}}}
</$let>

Mohammad · April 21, 2025, 5:07pm

Excellent work, @jacng.
Your solution was truly enjoyable. I think now one can extract any complex form of text.