Should there be a substring filter operator?

tw-FRed · April 26, 2024, 11:05pm

Even though I didn’t comment, I read this thread with great interest, and I would gladly benefit from additions to the string filter operators library.
For example, I often miss a substring operator. Even though it can already be built from existing operators, a JavaScript implementation would make sense performance-wise.

Fred

TW_Tones · April 26, 2024, 11:24pm

Can you point me to a definition of this in plain language?

I think I am looking for something similar

Since I have started using ChatGPT to generate filter operators it is becoming obvious that string operators in particular are really simple JavaScript functions.

Also, custom operators offer a lot of opportunities to build sophisticated and easy-to-use string operators, but we need to ensure we have practical and fundamental operators to do this easily, such as a way to sort strings by length (and more) before processing.

tw-FRed · April 27, 2024, 12:57am

A substring operator specification could be:

A substring operator returns the part of a string (its input) located between a starting and an ending position.
A missing ending parameter means the substring continues to the end of the input.
Examples in pseudo-filter syntax: [[ABCDE]substring[3],[4]] would be “CD”. [[ABCDE]substring[4]] would be “DE”.
Some variants of substring functions give negative starting or ending position values a special meaning: they count the position from the end of the string instead of the start.
Example of negative ending position: [[ABCDE]substring[2],[-1]] would be “BCD”.
Other variants may consider the second parameter is the length of the substring, instead of the ending position, thus [[ABCDE]substring[3],[1]] would be “C”.
One point worth considering is whether the parameters should be 0- or 1-based. Some languages even let you choose! Option base 1 anyone?
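
As a rough illustration of the start/end variant described above, here is a minimal JavaScript sketch. The name `substring1` is hypothetical, and the sketch assumes 1-based positions with an inclusive ending position, which is what the examples suggest. A negative ending position is passed straight to JS `slice`, so `-1` drops the last character. Negative starting positions are not handled and are simply clamped to the start of the string.

```javascript
// Hypothetical helper, not an existing TiddlyWiki operator.
// Assumption: 1-based start, inclusive 1-based end.
function substring1(input, start, end) {
  // Convert the 1-based start to a 0-based JS index (clamped at 0).
  const from = Math.max(start - 1, 0);
  // A 1-based inclusive end is numerically the same as a 0-based exclusive
  // end, so a positive end needs no shift; a negative end keeps its JS meaning.
  return end === undefined ? input.slice(from) : input.slice(from, end);
}

console.log(substring1("ABCDE", 3, 4));  // CD
console.log(substring1("ABCDE", 4));     // DE
console.log(substring1("ABCDE", 2, -1)); // BCD
```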

Fred

Charlie_Veniot · April 27, 2024, 1:07am

https://support.microsoft.com/en-us/office/mid-function-2eba57be-0c05-4bdc-bf81-5ecf4421eb8a

Just because “sub” is already used in a few TW things (sortsub, subfilter, substitute, subtiddlerfields, subtract), it might be worth referring to it as “MidString” since “Mid” isn’t used anywhere. “Midstring” might not have the semantic goodness of “SubString”, but “MidString” would fit well with a buffet of string functions, like “LeftString” and “RightString”.

TW_Tones · April 27, 2024, 1:18am

@Charlie_Veniot yes, that was my experience with Advanced BASIC: Left, Right, Mid.

@tw-FRed I think this is all quite achievable with custom operators; however, I need to check the handling of spaces in the string. [Edit] Yes, fine with spaces.

There are places where spaces will sometimes be trimmed and sometimes not.
Perhaps a JavaScript one would be faster?

[Edit] Keep in mind the existing Extended Listops Filters and others.

[Edit2] Here are simple examples of the first and last operators as two cases of a substring that are already available, but it would be nice to reduce the repetition by moving it into a function where you only provide first[4] or last[4] or from[2],[5]

{{{ [[AB C D E]split[]first[4]join[]] }}}
{{{ [[AB C D E]split[]last[4]join[]] }}}

Scott_Sauyet · April 27, 2024, 3:20pm

Not just for performance, but also for readability. When I brought this up before, @saqimtiaz pointed me to the following discussion, which bogged down trying to find the correct API: https://github.com/Jermolene/TiddlyWiki5/issues/5824.

I personally would be fine with the JS API, shifted for TW’s 1-based indices, which is pretty well what you described in your first few bullets. I don’t know if I would also include negative indices; I like the idea in that thread of an :end suffix for that case, but it’s not as flexible as, say, .slice(2, -2).

TW_Tones:

Here are simple examples of the first and last operators as two cases of a substring that are already available, but it would be nice to reduce the repetition by moving it into a function where you only provide first[4] or last[4] or from[2],[5]
{{{ [[AB C D E]split[]first[4]join[]] }}}
{{{ [[AB C D E]split[]last[4]join[]] }}}

I don’t know if the internals can be extended to handle it in any simple manner, but it would be really useful to recognize a more abstract type of finite ordered collections, and allow first, last and the like to operate on them, so that first[3] applied to a list of numbers would give you a list of the first three numbers in the list, but first[3] applied to a string would give you a new string containing its first three characters.

That is, I would love it if this:

{{{ I don't like guns, so take away your foolish barbarian bazookas! +[last[3]] :map[first[3]] }}}

would yield

foo bar baz

because

{{{ I don't like guns, so take away your foolish barbarian bazookas! +[last[3]] }}}

yields

foolish barbarian bazookas!

(which already works)

and

{{{ foolish barbarian bazookas! :map[first[3]] }}}

yields

foo bar baz

(which is the change I’d love to see.)

As noted, :map[split[]first[3]join[]] already works, but it’s much harder to write and to read than :map[first[3]].
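
For comparison, the same "last three items, first three characters of each" pipeline is short in plain JavaScript, where arrays and strings both have a `slice` method. This is only an analogy: splitting the sentence on whitespace is a simplification of how TiddlyWiki turns a filter into a list of titles.

```javascript
// Plain-JS analogy of: +[last[3]] :map[split[]first[3]join[]]
const sentence = "I don't like guns, so take away your foolish barbarian bazookas!";

// Simplification: treat whitespace-separated words as the list of titles.
const titles = sentence.split(/\s+/);

// Like last[3]: keep the last three titles.
const lastThree = titles.slice(-3);

// Like split[]first[3]join[] applied to each title: keep its first three characters.
const prefixes = lastThree.map((title) => title.slice(0, 3));

console.log(prefixes.join(" ")); // foo bar baz
```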

If this sounds to you like a contradiction to my point of simplicity in Idea: Enhance the length operator, then first, congratulations on paying attention! But second, it’s not really a contradiction. Here I am simply proposing that we think of our function as acting on a more abstract type than a list of strings; we think of it operating on any (finite) ordered collection of similar items. A list of strings is an ordered collection of strings. And a string is an ordered collection of characters.

But if I’m not careful here, I’ll end up discussing Algebraic Data Types, Category Theory, and the FantasyLand Specification. And no one wants that!

tw-FRed · April 27, 2024, 3:40pm

Scott_Sauyet:

I don’t know if the internals can be extended to handle it in any simple manner, but it would be really useful to recognize a more abstract type of finite ordered collections, and allow first, last and the like to operate on them, so that first[3] applied to a list of numbers would give you a list of the first three numbers in the list, but first[3] applied to a string would give you a new string containing its first three characters.

That is, I would love it if this:
{{{ I don't like guns, so take away your foolish barbarian bazookas! +[last[3]] :map[first[3]] }}}
would yield
foo bar baz

I like the idea, but can’t help wondering how difficult this would be to explain to newcomers, especially in the next example, where first wouldn’t seem to do the same thing depending on its position in the filter:

{{{ foolish barbarian bazookas short example +[first[3]] :map[first[3]] }}}

My two cents,

Fred

Scott_Sauyet · April 27, 2024, 3:50pm

Looking again, I’m not sure it’s even a coherent idea in TW, where the notion of lists of strings is often carried by single JS strings.

{{{ xylophone kazoo bongos +[first[1]] }}}

yields

xylophone

But what should this yield?:

{{{ xylophone +[first[1]] }}}

xylophone or x? Oops.
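
The ambiguity can be made concrete with a small JavaScript sketch. The `first` helper below is hypothetical and purely illustrative. In JS it can tell an array from a string by its type, whereas a filter step only ever receives a list of titles, so a one-item list and a bare string look the same to it.

```javascript
// Hypothetical polymorphic helper: works on arrays and on strings,
// because both have a slice method.
function first(n, collection) {
  return collection.slice(0, n);
}

// A list of three titles: the first item is a whole title.
console.log(JSON.stringify(first(1, ["xylophone", "kazoo", "bongos"]))); // ["xylophone"]

// A string: the first item is a single character.
console.log(first(1, "xylophone")); // x

// JS distinguishes these two inputs by type. A filter that receives the
// single title "xylophone" has no such type information to go on.
console.log(JSON.stringify(first(1, ["xylophone"]))); // ["xylophone"]
```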

Another ~~good~~ idea died aborning! But that doesn’t stop me wanting it, damn it!

And I still want a substring operator, regardless.

Springer · April 27, 2024, 6:36pm

@Scott_Sauyet: this is the best use ever of “foo bar baz” — to which I am usually allergic — in this forum!

But it should use :map[split[]first[3]join[]] to get down to character level; otherwise we have lost the thread of what a list means…

As noted in another recent thread, I migrated to TiddlyWiki from FileMaker, and I have often missed not having an analogue to the Middle(string,startposition,charcount) function there.

Cobbling the effect together out of butfirst[startposition] and first[charcount] works, but it does feel like one slice too many for Occam’s razor.
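
A minimal JavaScript sketch of that composition, assuming a 1-based start position followed by a character count. The name `middle` and the argument order follow the description above; the helper itself is hypothetical. With a 1-based start, the "butfirst" step has to skip `startPosition - 1` characters.

```javascript
// Hypothetical Middle(string, startPosition, charCount) with a 1-based start.
function middle(input, startPosition, charCount) {
  // Roughly like split[]: Array.from splits the string by Unicode code point.
  const chars = Array.from(input);
  // Like butfirst[...]: drop the characters before the start position.
  const rest = chars.slice(Math.max(startPosition - 1, 0));
  // Like first[charCount] followed by join[].
  return rest.slice(0, Math.max(charCount, 0)).join("");
}

console.log(middle("ABCDE", 3, 1));  // C
console.log(middle("ABCDE", 2, 3));  // BCD
console.log(middle("ABCDE", 4, 10)); // DE
```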

TW_Tones · April 27, 2024, 9:38pm

This middle is quite easy to move into a custom operator, especially if the second parameter is a length, not a position.

Of course we can use split[] / join[], but it seems we could do with a mechanism or annotation that switches from title/string to character-level manipulation. In many ways they are the same thing, but when blended it gets difficult.

We can use filter runs for this, so I wonder if there could be a stringify filter run to which each title is given one at a time, and within the run it acts on the characters? This allows reuse of existing filters, but now acting on characters within the string. Importantly, it would make this clear to the reader of the filter.

I think we need to try and find a systematic solution to this.

I recently came up with an approach for fixed-length strings that I am developing primarily to validate user input or for data cleaning.
This is not sufficient for many variable-length fields.

Scott_Sauyet · April 28, 2024, 12:25am

Springer:

Scott_Sauyet:
{{{ I don't like guns, so take away your foolish barbarian bazookas! +[last[3]] :map[first[3]] }}}
would yield
foo bar baz
@Scott_Sauyet: this is the best use ever of “foo bar baz” — to which I am usually allergic — in this forum!

I know that there are many here who don’t like those metasyntactic variables. For programmers, they are often quite useful, serving the same role that Lorem ipsum does in layout and design: holding the place of real variables (or text) without distracting the reader with irrelevant content.

This was an attempt to restore them in people’s good graces, here. I actually wanted a sentence where they came in the beginning and was going to use it to demonstrate my short-lived notion of two different uses of first. I couldn’t come up with anything, but this worked for last/first which was close enough. Now it seems easy enough: Foolish barbarian bazookas have no place in a civilized household; get rid of the guns! Oh, well. Too late!

Right, this was my abortive attempt to suggest that first, last and their friends could extend to strings as well as title lists. I quickly realized that won’t work. It was a short-lived detour on the discussion about having a useful substring operator.

bluepenguindeveloper · April 28, 2024, 11:00am

I don’t mind them in principle, but I do think they can sound like a secret language to some and be off-putting or an obstacle to those who are fully capable of programming but don’t know anything about it. I think that they can actually be more distracting and not less.

As such, I’m a little more fond of the names of such metasyntactic variables used in the Python community (spam, ham, and eggs).

Ste_W · April 28, 2024, 11:44am

I did not know that! …but of course they would.

Mario · April 29, 2024, 6:33am

7 posts were split to a new topic: Is it useful to use foo, bar, baz in examples?

pmario · April 29, 2024, 6:42am

Please let us stay on topic here. Should there be a substring filter operator?

I think there should be one, and it should have the same functionality as the underlying JS substring function, but it should be adapted for use in a TW context.

Yaisog · April 29, 2024, 11:39am

Please also consider that we already have character position identification with the focusSelectFromStart and focusSelectFromEnd attributes of the EditTextWidget.
If at all possible, the solution for substring should not clash too hard with those attributes, to make it easier for users to remember.

TW_Tones · April 29, 2024, 11:58am

Can you give an example what it may look like?

If we used some advanced features from JavaScript it may be fine if we document it well, or will we follow standard filter syntax?

Scott_Sauyet · April 29, 2024, 1:41pm

[[abcdefghijklmnopqrstuvwxyz]slice[1],[6]]     <!-- yields "abcde"                     (Note 1)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[9],[10]]    <!-- yields "i"                         (Note 2)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[14]]        <!-- yields "nopqrstuvwxyz"             (Note 3)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[24],[30]]   <!-- yields "xyz"                       (Note 4)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[50],[100]]  <!-- yields ""                          (Note 5)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[20],[10]]   <!-- yields ""                          (Note 6)  -->

<!-- Additionally, we might also use the JS negative integer rules: -->

[[abcdefghijklmnopqrstuvwxyz]slice[1],[-5]]     <!-- yields "abcdefghijklmnopqrstu"    (Note 7)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[-15],[-10]]  <!-- yields "lmnop"                    (Note 8)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[-5]]         <!-- yields "vwxyz"                    (Note 9)  -->
[[abcdefghijklmnopqrstuvwxyz]slice[-30]]     <!-- yields "abcdefghijklmnopqrstuvwxyz"  (Note 10) -->

Note 1: The JS slice API is (fromIndex, toIndex), inclusive in fromIndex, exclusive in toIndex. That inclusive/exclusive pattern is very useful in many programming situations, but might be more confusing to newcomers than it’s worth, in which case we might do inclusive/inclusive; then this example would yield “abcdef”, and many of the remaining examples would also need to be altered.
Note 2: In general, the length of the returned string will be toIndex - fromIndex, but see an exception in Note 4. (It would be one larger if we used inclusive/inclusive.)
Note 3: If toIndex is not supplied, we slice to the end of the string.
Note 4: If toIndex is larger than the length of the string, we slice to the end of the string.
Note 5: If fromIndex is larger than the length of the string, we return an empty string.
Note 6: If toIndex is not larger than fromIndex, we return an empty string.
Note 7: Negative indices count from the end of the string. With the inclusive/exclusive convention used in these examples, this means that slice[1],[-5] is equivalent to slice[1],[22], where 22 = 26 - 5 + 1.
Note 8: This applies to fromIndex as well as toIndex.
Note 9: Similarly, slice[-n] for an integer n selects the last n characters from the string.
Note 10: If a negative index shifts before the beginning of the string, we just count from the beginning.
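
The rules above can be sketched in a few lines of JavaScript. The names `toJsIndex` and `twSlice` are hypothetical. The sketch assumes positive indices are 1-based, negative indices keep their JS meaning, and an index of 0 (which the notes do not cover) is treated like 1.

```javascript
// Hypothetical sketch of the proposed string logic, not an existing operator.
function toJsIndex(index) {
  // Positive indices are 1-based, so shift them down by one.
  // Negative indices already count from the end, as in JS.
  // Assumption: 0 is treated like 1.
  return index > 0 ? index - 1 : index;
}

function twSlice(input, fromIndex, toIndex) {
  const from = toJsIndex(fromIndex);
  if (toIndex === undefined) {
    return input.slice(from); // no toIndex: slice to the end of the string
  }
  // String.prototype.slice already clamps out-of-range indices and
  // returns "" when the end is not after the start.
  return input.slice(from, toJsIndex(toIndex));
}

const alphabet = "abcdefghijklmnopqrstuvwxyz";
console.log(twSlice(alphabet, 1, 6));     // abcde
console.log(twSlice(alphabet, 24, 30));   // xyz
console.log(twSlice(alphabet, 1, -5));    // abcdefghijklmnopqrstu
console.log(twSlice(alphabet, -15, -10)); // lmnop
```

Because JS `slice` already handles clamping and empty results, the only real work is the 1-based shift for positive indices.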

This is essentially describing the JS slice API. JS has two others, substring which is similar, but doesn’t handle negative indices, and substr, which is deprecated and probably not worth discussing.
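
A short sketch of how the three JS methods differ on the same input, using only standard `String.prototype` methods:

```javascript
const s = "abcdef";

// slice: negative indices count from the end; a start past the end gives "".
console.log(s.slice(-3));       // def
console.log(s.slice(4, 1) === ""); // true

// substring: negative arguments are treated as 0, and the two arguments
// are swapped when the first is larger than the second.
console.log(s.substring(-3));   // abcdef
console.log(s.substring(4, 1)); // bcd

// substr (deprecated): the second argument is a length, not an end index.
console.log(s.substr(1, 3));    // bcd
```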

The inclusive/exclusive design is probably the most common version in programming languages, but inclusive/inclusive is far from unknown. The biggest advantage of the former is that, for instance, [<string>slice[1],[n]] followed by [<string>slice[n]] gives back <string>, once the two pieces are joined, for any n. It is generally easier to break strings up this way, where the toIndex of one piece is the fromIndex of the next. But either API would be reasonable.
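
The splitting property can be checked with a small JavaScript sketch. Both helpers are hypothetical, take 1-based positive indices only, and exist just to contrast the two conventions.

```javascript
// Hypothetical 1-based, inclusive/exclusive slice (positive indices only).
const sliceIE = (s, from, to) =>
  to === undefined ? s.slice(from - 1) : s.slice(from - 1, to - 1);

// Hypothetical 1-based, inclusive/inclusive variant for comparison.
const sliceII = (s, from, to) =>
  to === undefined ? s.slice(from - 1) : s.slice(from - 1, to);

const text = "abcdefghijklmnopqrstuvwxyz";

// Inclusive/exclusive: the toIndex of the head is the fromIndex of the tail,
// so head + tail rebuilds the string for every split point n.
let rebuildsForAllN = true;
for (let n = 1; n <= text.length + 1; n++) {
  if (sliceIE(text, 1, n) + sliceIE(text, n) !== text) rebuildsForAllN = false;
}
console.log(rebuildsForAllN); // true

// Inclusive/inclusive: reusing n duplicates the character at position n ...
console.log(sliceII(text, 1, 3) + sliceII(text, 3)); // abccdefghijklmnopqrstuvwxyz
// ... so the tail has to start at n + 1 instead.
console.log(sliceII(text, 1, 3) + sliceII(text, 4) === text); // true
```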

TW_Tones · May 1, 2024, 12:00am

Thanks @Scott_Sauyet for the effort here. Very good.

I think there is value in taking the JavaScript method and name “slice” and documenting that fact; people who are familiar with it will then find it quick to learn. Even though it’s not my favorite word for this, it makes sense to map to its JavaScript function. It’s growing on me.

My thought is: what if it defaults to inclusive/inclusive and we document this for users with no mention of inclusive/exclusive,

then, below all that documentation, offer an option to change it to inclusive/exclusive and document that as needed?
- Do exclusive/inclusive and exclusive/exclusive have any validity?

One thing that is obvious here is that this operates on an un-split string. Earlier in the discussion we were splitting the string with split[] and then applying operators to the resulting list or set of titles (in fact now characters), for example first/last.

Is this something we need to study further?
Either in documentation/examples or other places we will need to make this difference clear.
I say this because currently the documentation is perhaps what I would call minimalist or frugal.

Other notes;

I/we do need to think about the difference between characters split and direct actions on a title (as slice does).
I still think there is value in providing a simplified mid (positional) to complement first/last, but it could be a custom filter operator that uses slice.
- Similarly we could add a second parameter to nth[] and zth[] to give a length; however, these relate to titles, not characters, unless you split[]
Ultimately, to make use of slice and similar operators we need the integer values to use. With fixed-length strings these are arguably known; for others they may need to be determined. However, if one is searching a string we are more likely to use other methods that do not return integer positions.
Could split be designed to act on titles, that is, strings that have been split[]?

Thanks very much @Scott_Sauyet. I think you have given us enough for a requirements definition, so we can move to creating an operator from this. Perhaps I can even get ChatGPT to write it?
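
As a starting point, here is a hedged sketch of what such an operator could look like in JavaScript. It assumes the usual shape of a TiddlyWiki filter operator module, as far as I understand it: a function that receives a `source` iterator and an `operator` object whose `operands` array holds the parameters. That assumption should be checked against the TiddlyWiki developer documentation. The name `sliceOperator` and the `fakeSource` stand-in are hypothetical; inside a real module the function would be exported under the operator's name.

```javascript
// Sketch only, written as a plain function so it can run on its own.
// Assumption: operator.operands is an array of strings such as ["1", "6"].
function sliceOperator(source, operator) {
  const results = [];
  const operands = operator.operands || [];
  // Parse an operand as an integer; missing or non-numeric operands are ignored.
  const parse = (text) => {
    const n = parseInt(text, 10);
    return Number.isNaN(n) ? undefined : n;
  };
  // 1-based positive indices, JS-style negative indices.
  const toJs = (i) => (i > 0 ? i - 1 : i);
  const from = parse(operands[0]);
  const to = parse(operands[1]);
  source((tiddler, title) => {
    const start = from === undefined ? 0 : toJs(from);
    results.push(to === undefined ? title.slice(start) : title.slice(start, toJs(to)));
  });
  return results;
}

// Hypothetical stand-in for the iterator a filter step receives.
const fakeSource = (callback) =>
  ["abcdefghijklmnopqrstuvwxyz", "xylophone"].forEach((title) => callback(undefined, title));

console.log(JSON.stringify(sliceOperator(fakeSource, { operands: ["1", "6"] }))); // ["abcde","xylop"]
```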

bluepenguindeveloper · May 1, 2024, 12:20am

Just want to add a suggestion to have a suffix to determine whether the second parameter would be an index or a length. Sometimes we want [...slice<n>,<m>] to give a substring from index n to index m, but sometimes it’s nice to have [substring<p>,<length>] to give a substring of length <length> starting at position (index) p. Since I think both would be useful and both are often called by the same name (substring), I would suggest a suffix to distinguish the two. (Alternatively, “slice” can refer specifically to the method with start and end indices, and “substring” can refer to the length-based version.)
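
A minimal JavaScript sketch of the suffix idea, with the suffix modelled as a fourth argument. The name `substringBy` and the suffix values `"index"` and `"length"` are hypothetical. The sketch assumes a 1-based start and, in index mode, an exclusive end index, matching the slice examples earlier in the thread.

```javascript
// Hypothetical: one function, with a "suffix" choosing how the second
// parameter is interpreted.
function substringBy(input, start, second, suffix = "index") {
  const from = Math.max(start - 1, 0); // 1-based start to 0-based index
  if (second === undefined) {
    return input.slice(from); // no second parameter: run to the end
  }
  if (suffix === "length") {
    // second is a character count
    return input.slice(from, from + Math.max(second, 0));
  }
  // second is a 1-based end index, exclusive (assumption)
  return input.slice(from, Math.max(second - 1, 0));
}

console.log(substringBy("abcdefgh", 3, 6));           // cde
console.log(substringBy("abcdefgh", 3, 6, "length")); // cdefgh
```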