How to Extract Delimited Substrings from a Longer String

I saw the other thread "The User Defined Extract Filter Operator", started to reply, and then realized it would be more appropriate to elaborate on my solution here instead.

First, regexp seems well suited to finding substrings marked by two delimiter strings. In other tools, the usual approach is a regexp search for all occurrences of the shortest strings that match the pattern "<start_delimiter>.*?<end_delimiter>", where “.*” = any string and “?” = ‘non-greedy’ or shortest, with all matching results, excluding the delimiters, returned in an array.
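
To make the greedy versus non-greedy distinction concrete, here is a minimal sketch in Python. Python's `re` module is only a stand-in for illustration (TW uses JavaScript regexps), and the sample text and variable names are my own assumptions; for this simple pattern the two engines should behave the same way.

```python
import re

text = "This is ##a very@@ ##short sample@@ text."

# Non-greedy ".*?": each match stops at the nearest end delimiter.
non_greedy = re.findall(r"##(.*?)@@", text)

# Greedy ".*": one match runs on to the last end delimiter in the string.
greedy = re.findall(r"##(.*)@@", text)

print(non_greedy)  # ['a very', 'short sample']
print(greedy)      # ['a very@@ ##short sample']
```

With a capture group, `findall` returns only the captured text, which is the "array of results excluding the delimiters" described above.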

An example is the Stack Overflow Q&A “Regex Match all characters between two strings”, which uses the more advanced “lookahead” and “lookbehind” syntax to match the delimiters without including them in the returned result.
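
For readers who have not met lookaround before, a small Python sketch of that idea follows. It is not taken from the linked Q&A; the sample text and the `##` / `@@` delimiters are assumptions borrowed from the examples further down. Note that Python only accepts fixed-width lookbehind, which a literal delimiter satisfies.

```python
import re

text = "This is ##a very@@ ##short sample@@ text."

# (?<=##) requires "##" just before the match, (?=@@) requires "@@" just after.
# Neither assertion consumes characters, so the delimiters stay out of the result.
pattern = re.compile(r"(?<=##).*?(?=@@)")

matches = pattern.findall(text)
print(matches)  # ['a very', 'short sample']
```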

However, in TW, a search with a filter expression returns the list of titles that match the expression, not the matched strings within those titles. The workaround I know of is to do a global search and replace on a title, removing all non-matching portions so that only the matched strings remain.

Fulfilling “remove ALL non-matching portions” is a little tricky. It helps to visualize a title as a series of strings, like this (.*? means shortest string):

".*?<start_delimiter>.*?<end_delimiter>"
".*?<start_delimiter>.*?<end_delimiter>"

".*?<start_delimiter>.*?<end_delimiter>"
"remainder unmatched text".

  • The search pattern then becomes ".*?<start_delimiter>(.*?)<end_delimiter>". The replacement pattern is $1 which is the matched string delimited by the (first) set of brackets in the search pattern.

  • The first search for ".*?<start_delimiter>(.*?)<end_delimiter>" will find the first string in that sequence, because it is the first shortest string that matches the criteria, then the next, and so on. A global search and replace will replace that title with "<matched string 1><matched string 2><matched string 3>... remainder text", like this:

<$let t="This is ##a very@@ ##short sample@@ text.">
{{{[<t>search-replace:g:regexp[.*?##(.*?)@@],[($1)]]}}}
</$let>

Output: (a very)(short sample) text.

That leaves three matters: removing the remaining unmatched text, deciding what to do if nothing matched, and returning the matched strings. I do not know the best or usual approach. What I ended up with is to append a rarely used character “┋” to each matched string as a separator, return blank on no match by searching for that character, then split on it and drop the last item (the remaining unmatched text), which also returns the matched strings neatly as a list, like this:

<$let t="This is ##a very@@ ##short sample@@ text.">
{{{[<t>search-replace:g:regexp[.*?##(.*?)@@],[$1┋]search[┋]split[┋]butlast[]format:titlelist[]join[ ]]}}}
</$let>

Output: [[a very]] [[short sample]]
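
The same replace, check, split and drop-last sequence can be sketched outside TW. The Python below is only an illustration of the steps, not TW code: `extract` and `SEP` are hypothetical names, and it assumes the separator never occurs in the input.

```python
import re

SEP = "┋"  # assumption: this character never appears in the input text

def extract(text, start, end):
    """Return the substrings found between start and end delimiters."""
    pattern = ".*?" + re.escape(start) + "(.*?)" + re.escape(end)
    # Replace each "<junk><start>match<end>" with "match" plus the separator.
    marked = re.sub(pattern, lambda m: m.group(1) + SEP, text)
    # No separator means nothing matched, so return an empty result.
    if SEP not in marked:
        return []
    # The last piece is the unmatched remainder, so drop it (like butlast[]).
    return marked.split(SEP)[:-1]

print(extract("This is ##a very@@ ##short sample@@ text.", "##", "@@"))
# ['a very', 'short sample']
print(extract("no delimiters here", "##", "@@"))
# []
```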

If a title is not a single line but multi-line text, "." does not match a linefeed, so a match cannot span lines. What I found works is to replace the linefeeds with another rarely used character and restore them afterwards, like this:

\function line.feed() [charcode[10]]
<$let t="""
This is ##a very@ short ##sample@ text.
This is another very short ##sample text.
This is@ a third short sample ##text@.
"""
>
{{{[<t>search-replace:g:regexp<line.feed>,[♭]search-replace:g:regexp[.*?##(.*?)@],[$1┋]search[┋]search-replace:g:regexp[♭],<line.feed>split[┋]butlast[]format:titlelist[]join[ ]]}}}
</$let>

Output: [[a very]] sample [[sample text. This is]] text
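
A rough Python rendering of the linefeed swap, using the same sample text, shows why the swap matters. Python's `.` also stops at a linefeed by default, so it makes a fair stand-in here; the variable names are my own.

```python
import re

SEP, NL = "┋", "♭"  # assumption: neither character occurs in the input

text = (
    "\nThis is ##a very@ short ##sample@ text.\n"
    "This is another very short ##sample text.\n"
    "This is@ a third short sample ##text@.\n"
)

# Without the swap, "." cannot cross a linefeed, so the match that
# spans lines 2 and 3 is missed.
without_swap = re.findall(r"##(.*?)@", text)

# With the swap: hide linefeeds, mark matches, restore linefeeds, split.
flat = text.replace("\n", NL)
marked = re.sub(r".*?##(.*?)@", lambda m: m.group(1) + SEP, flat)
with_swap = marked.replace(NL, "\n").split(SEP)[:-1]

print(without_swap)  # ['a very', 'sample', 'text']
print(with_swap)     # ['a very', 'sample', 'sample text.\nThis is', 'text']
```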

Note that delimiters used in a regexp should be escaped with “escaperegexp[]”, which I skipped above for brevity.
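
To show what goes wrong without escaping, here is a small Python sketch in which `re.escape` plays the role that escaperegexp[] plays in TW. The `^` and `$` delimiters and the `build_pattern` helper are assumptions chosen for illustration, because both characters are regexp anchors.

```python
import re

text = "This is ^a very$ short ^sample$ text."

def build_pattern(start, end):
    # Escape the delimiters so they are matched literally.
    return ".*?" + re.escape(start) + "(.*?)" + re.escape(end)

escaped = build_pattern("^", "$")
unescaped = ".*?" + "^" + "(.*?)" + "$"  # delimiters act as anchors here

print(escaped)                       # .*?\^(.*?)\$
print(re.findall(escaped, text))     # ['a very', 'sample']
print(re.findall(unescaped, text))   # the whole string, as a single match
```

Unescaped, `^` and `$` anchor to the start and end of the string, so the capture group swallows the entire text instead of the delimited parts.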

That’s it. Hope this is clearer and I didn’t miss anything.


Hello @jacng
Very nice writeup! I really enjoyed reading it!
It shows how one can extract substrings using the current features of TiddlyWiki.

P.S.: I will add this to TW-Scripts.


Readers may like to see the other discussion, “A Simpler Solution for Find Macro in modern TiddlyWiki 5.3.5”, here on Talk TW.

It’s unfortunate that TW doesn’t recognize the single-line s flag, which lets "." match linefeeds too. I’m referring to something like search-replace:gs:regexp.
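
For anyone unsure what the s flag would buy, a short Python sketch: `re.DOTALL` is Python's equivalent of that flag. The last line uses the character class `[\s\S]`, which matches any character including linefeeds without needing a flag, and which JavaScript regexps also support. Whether it works unchanged inside a TW search-replace is an assumption I have not tested, and since the class contains square brackets it would presumably have to be passed in through a variable rather than a literal operand.

```python
import re

text = "one ##sample text.\nThis is@ two"

# Default: "." stops at the linefeed, so nothing matches.
default = re.findall(r"##(.*?)@", text)

# With the s flag (re.DOTALL in Python), "." also matches the linefeed.
dotall = re.findall(r"##(.*?)@", text, re.DOTALL)

# Flag-free alternative: [\s\S] matches any character at all.
portable = re.findall(r"##([\s\S]*?)@", text)

print(default)   # []
print(dotall)    # ['sample text.\nThis is']
print(portable)  # ['sample text.\nThis is']
```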

Indeed! Excellent breakdown, @jacng.

I do a lot of this and have a few times wondered how the code would look redone with .functions[]. You almost did it for me! ;-)

A function version could be something like this. Someone can improve on it.

Have a great weekend!

\function substring.withDelimiters( str1, str2 )
[all[]] :map[function[search.string],<str1>,<str2>function[swap_param],<currentTiddler>]
\end

\function search.string( str1, str2 ) =[[.*?]] =[<str1>escaperegexp[]] =[[(.*?)]] =[<str2>escaperegexp[]] +[join[]]

\function line.feed() [charcode[10]]

\function swap_param( inputText )
[all[]] :map[<inputText>search-replace:g:regexp<line.feed>,[♭]search-replace:g:regexp<currentTiddler>,[$1┋]search[┋]search-replace:g:regexp[♭],<line.feed>split[┋]butlast[]format:titlelist[]join[ ]]
\end

<$let t="""
This is ^a very$ short ^sample$ text.
This is another very short sample ^text.
This is$ a third short sample text.
"""
>
t: <$text text=<<t>> /> <br><br>
output: {{{[<t>function[substring.withDelimiters],[^],[$]]}}}
</$let>

Rendered result:

t: This is ^a very$ short ^sample$ text. This is another very short sample ^text. This is$ a third short sample text.

output: [[[a very]] sample [[text. This is]]]