Web address detection regular expression etc

TW_Tones · March 30, 2024, 10:23pm

Folks, if a string begins with www. it is quite easy to determine programmatically that it refers to an internet resource; likewise if it begins with a protocol such as https://

I have, however, come across environments where a string ending with .com, .au or another country code is also recognised as an address.

Starting with a regular expression and perhaps a dictionary of High Level Domains (HLD) I am thinking about a solution, to add to tiddlywiki, that can recognise such addresses in titles and wiki text, perhaps even the parser would be helpful.

I feel TiddlyWiki could benefit from this so I ask for contributions please.

If you can craft a regular expression for this please also put it in TiddlyWiki form and explain its structure if possible.
The list of HLD’s is growing.

clsturgeon · March 30, 2024, 11:58pm

I see regular expressions online (not in TW format), plus a couple of JavaScript solutions that look more appealing. What form of result are you looking for? The JS solution could be used as a filter operator or macro/procedure. Regardless, what would you like returned? True/false? Or perhaps it would return an empty string if not a valid URL, and the URL string if it is?

TW_Tones · March 31, 2024, 4:42am

Perhaps ideally a modified version, or redefinition of the $link widget, that the parser uses when it already detects links?

The current parser detects CamelCase, protocol:// and [[tiddler titles]], and I would like it to detect these “domain/link” formats. One “telltale” is something.something, which does not occur in normal text, although it may in some titles, variables or function names.
Then we add the TLD knowledge to check whether the right-most “word” is a valid TLD.
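
As a rough illustration of that last step, the check on the right-most “word” could look something like the JavaScript below. The names `SAMPLE_TLDS` and `hasKnownTld` are made up for this sketch, and the set holds only a handful of entries; a real version would need to be loaded from a full list of TLDs.

```javascript
// Hypothetical sample only: a real list would hold every TLD, not these few.
const SAMPLE_TLDS = new Set(["com", "org", "net", "au", "uk"]);

// Returns true when the text after the last dot is in the sample set.
function hasKnownTld(candidate) {
  const lastDot = candidate.lastIndexOf(".");
  // Need at least one character before the dot and one after it.
  if (lastDot < 1 || lastDot === candidate.length - 1) {
    return false;
  }
  const tld = candidate.slice(lastDot + 1).toLowerCase();
  return SAMPLE_TLDS.has(tld);
}

console.log(hasKnownTld("tiddlywiki.com")); // true
console.log(hasKnownTld("example.com.au")); // true
console.log(hasKnownTld("my.variable"));    // false
```

A set lookup like this only answers “is the last label a known TLD?”; it cannot tell whether the author actually meant a web address.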

Once detected as a $link, the link widget can be modified to default to an external link, e.g. https://, with other appropriate settings in the “a record” used to render it in HTML.

The only part I can’t do is:

Adjust the parser to detect based on a regular expression and a database of TLDs and invoke the $link widget
- See module-type=wikirule such as $:/core/modules/parsers/wikiparser/rules/extlink.js
I know how to subsequently enhance the link widget through its redefinition.

And all of the above if possible.

jeremyruston · March 31, 2024, 4:41pm

Hi @TW_Tones the idea of matching the complete list of Top Level Domains may not be practical: there are now nearly 1,500 of them in the official registry:

https://data.iana.org/TLD/tlds-alpha-by-domain.txt

However, the core of this question crops up regularly in general programming discussions, and you’ll find many answers on StackOverflow offering suggestions of regexes that match some useful subset of possible URLs.

Once you’ve selected your regex, you’ll need to override the parse rule in “$:/core/modules/parsers/wikiparser/rules/extlink.js”.
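
A rough sketch of what such a rule might look like follows. It is modelled loosely on the shape of the core extlink rule, so treat the rule interface (`init`, `parse`, `matchRegExp`, `parser.pos`) and the parse-tree node layout as assumptions to check against the core source. The rule name `baredomain`, the deliberately simplistic pattern and the `parseFirst` stand-in are all made up for illustration. In a real wikirule module the members would be assigned to `exports`; here they sit on a plain object so the logic can be tried outside TiddlyWiki.

```javascript
// Sketch only: "baredomain" is a hypothetical rule name.
const bareDomainRule = {
  name: "baredomain",
  types: { inline: true },

  init(parser) {
    this.parser = parser;
    // Optional "~" to suppress, then dot-separated labels ending in a
    // label of two or more letters. Deliberately simplistic.
    this.matchRegExp = /~?\b[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}\b/mg;
  },

  parse() {
    const text = this.match[0];
    // Move the parser past the matched text.
    this.parser.pos = this.matchRegExp.lastIndex;
    if (text.charAt(0) === "~") {
      // Suppressed: emit plain text without the tilde.
      return [{ type: "text", text: text.slice(1) }];
    }
    return [{
      type: "element",
      tag: "a",
      attributes: {
        href: { type: "string", value: "https://" + text },
        "class": { type: "string", value: "tc-tiddlylink-external" },
        target: { type: "string", value: "_blank" },
        rel: { type: "string", value: "noopener noreferrer" }
      },
      children: [{ type: "text", text: text }]
    }];
  }
};

// Stand-in for the part of the parser that finds the next match.
function parseFirst(source) {
  const parser = { source: source, pos: 0 };
  bareDomainRule.init(parser);
  bareDomainRule.match = bareDomainRule.matchRegExp.exec(source);
  return bareDomainRule.match ? bareDomainRule.parse() : [];
}

console.log(parseFirst("Visit tiddlywiki.com today")[0].attributes.href.value);
console.log(parseFirst("~my.variable")[0]);
```

A pattern this loose will also pick up things like file.txt, so in practice it would need narrowing, for example with a TLD check.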

Scott_Sauyet · March 31, 2024, 8:56pm

I think that’s going to be a real problem, though. It’s difficult to distinguish whether, say, colors.blue represents a bit of UI code or the domain colors on the TLD blue.

TW_Tones · March 31, 2024, 11:36pm

Scott,

If someone gives me the regex to find something.something and something.something.something etc… I should be able to handle the scope of its application.

Scott_Sauyet · April 1, 2024, 12:39am

A slightly modified version of one of the multitude found online looks like this:

/(https?:\/\/)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()!@:%_\+.~#?&\/=]*)/mg

If you want to use it like Jeremy suggested, you can use that intact. If you want it in a function, you will need to drop the initial / and final /mg:

\define url-regexp() (https?:\/\/)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()!@:%_\+.~#?&\/=]*)

\function find.urls() [search:text:regexp<url-regexp>]

(untested)

It’s not perfect, but I don’t believe there are perfect ones. The question is whether it meets your needs.

I will try to find time tomorrow to report how this particular regex works, but you can see one tool’s analysis at https://regex101.com/r/ybn2nx/1.
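
For a quick feel of its behaviour, the pattern above can be run as-is in plain JavaScript. The helper name `findUrls` and the sample strings are made up for this sketch.

```javascript
// The pattern quoted above, unchanged.
const urlPattern = /(https?:\/\/)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()!@:%_\+.~#?&\/=]*)/mg;

function findUrls(text) {
  // match() with the g flag returns every match, or null if there are none.
  return text.match(urlPattern) || [];
}

console.log(findUrls("Docs at https://tiddlywiki.com/#Filters and talk.tiddlywiki.org"));
// [ 'https://tiddlywiki.com/#Filters', 'talk.tiddlywiki.org' ]

console.log(findUrls("colors.blue"));
// [ 'colors.blue' ]
```

The second call shows the ambiguity mentioned earlier: the pattern has no way of knowing whether colors.blue is a domain or a bit of code.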

TW_Tones · April 1, 2024, 12:45am

Thanks @Scott_Sauyet

I do believe it would be fantastic if we documented thoroughly on tiddlywiki how, given a regular expression found online, we can convert it for use in TiddlyWiki. Then those like me, who seem to have a regex form of dyslexia, could use online resources to generate them and then convert them for TiddlyWiki applications.

Scott_Sauyet · April 1, 2024, 1:31am

If it’s a JS regular expression, the process is quite simple.

In JS, a regex is delineated with slashes, /regex-here/. It may be followed by some single-character flags:

/regex-here/igm

According to regexp Operator, the only one generally useful in TW is i, which means the matching is case-insensitive.

To create a TW regex from this you can define a variable to hold the characters of the regex, stripping off the slashes, and then include it in a call to the regexp Operator or the search Operator:

\define xyz() regex-here
...   [...regexp<xyz>...] ...

(or)

... [...search:field1,field2,...:regexp<xyz> ... ]

If you want to include the flags, they can precede or succeed the body:

\define xyz() (?ig)regex-here

or

\define xyz() regex-here(?i)

Theoretically, there are times you can define the regexp inline,

...  [...regexp[regex-here]...]

but so often a regular expression contains characters which can’t be used that way that it’s probably easiest to always use the external definition.
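
The steps above are mechanical enough to sketch in JavaScript. The helper name `toTwRegexp` is made up, and it assumes the input is a simple /body/flags literal; it keeps only the g, i and m flags mentioned in this thread and places them in a leading (?…) group, as in the \define example above.

```javascript
// Hypothetical helper: turn the text of a JS regex literal, e.g. "/abc/gi",
// into the body-plus-flags form used in the \define examples above.
function toTwRegexp(literal) {
  const lastSlash = literal.lastIndexOf("/");
  if (literal.charAt(0) !== "/" || lastSlash < 1) {
    throw new Error("Expected a /body/flags literal");
  }
  const body = literal.slice(1, lastSlash);
  // Keep only the flags mentioned in this thread (g, i, m).
  const flags = literal.slice(lastSlash + 1).replace(/[^gim]/g, "");
  return flags ? "(?" + flags + ")" + body : body;
}

console.log(toTwRegexp("/regex-here/igm")); // (?igm)regex-here
console.log(toTwRegexp("/\\d+/"));          // \d+
```

The returned string is what would go after `\define xyz()`.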

I have not seen regex-generation tools; they may exist, but I think they would be tricky to get right. There are a good number of regex-testing tools, though, and you can use them to see if you’ve got it right. Of course there are many regular expressions online, and a large portion of those are JS-compatible (every language has a slightly different flavor of regex), so you can often search for them and just use them directly with the minor conversion above.

TW_Tones · April 1, 2024, 1:49am

I have found ChatGPT useful as a regular expression writer, but there is the gap to TiddlyWiki.

In TiddlyWiki, regular expressions (regex) are utilized in several areas, including filter operators and widgets, to match or manipulate text. TiddlyWiki’s regex capabilities are essentially those provided by JavaScript’s regex engine, as TiddlyWiki is built on JavaScript. This means any valid JavaScript regular expression should work within TiddlyWiki.

Here are some key points about using regular expressions in TiddlyWiki:

Syntax: JavaScript regex patterns are enclosed within forward slashes (/). For example, /pattern/flags. Common flags include:

g: Global search.

i: Case-insensitive search.

m: Multi-line search.

Character Classes: You can use character classes such as \d for digits, \w for word characters (alphanumeric and underscore), and \s for whitespace characters.

Quantifiers: Quantifiers like * (zero or more), + (one or more), and ? (zero or one) are available to control how many instances of a character or group must be present for a match.

Groups and Ranges: Parentheses () are used for grouping, and square brackets [] for character ranges. For example, [a-z] matches any lowercase letter, and (abc)+ matches one or more repetitions of the sequence “abc”.

Special Characters and Escaping: Special characters like ., *, ?, etc., have specific meanings in regex. To use them as literal characters, they need to be escaped with a backslash \.

Lookahead and Lookbehind: JavaScript regex supports lookahead ((?=...) and (?!...)) for matching a group after or not after a certain point. Lookbehind ((?<=...) and (?<!...)) is supported in newer JavaScript engines, so its availability might depend on the browser or environment TiddlyWiki is running in.

Using Regex in TiddlyWiki

When using regex within TiddlyWiki, particularly within filter operators, you typically pass the regex as a string argument. For example, the search operator allows you to filter tiddlers based on a regex match against their titles or content. The regex should be properly escaped according to TiddlyWiki’s parsing rules for filter strings.

Example

A filter to find tiddlers with titles containing one or more digits could look like this:
[title[My Tiddlers]search:title/\\d+/]
Notice the double backslashes \\ are used to escape the \d within the TiddlyWiki filter string.

In summary, any regular expression that’s valid in JavaScript should work in TiddlyWiki, but be mindful of the context in which you’re using them, especially with respect to escaping characters in filter strings and other TiddlyWiki-specific syntax.

see also http://tw-regexp.tiddlyspot.com/

Scott, perhaps you could edit the above to ensure it is compatible with TiddlyWiki?

I will have a go at writing some additional doco based on the above, including your notes and Mohammad’s site. It may help that I am somewhat naive.

Scott_Sauyet · April 1, 2024, 2:26pm

I’m assuming that quoted material is ChatGPT output.

I find it useless, I’m afraid. It has a lot of irrelevancies regarding basic regex syntax. It would likely be better to simply link to a useful regex tutorial for that. It has wrong and confusing syntax. What in the world is this supposed to mean?

[title[My Tiddlers]search:title/\\d+/]

What is the title for? Where is the regexp flag? Why is its argument missing the [ ] wrapper? Why are the JS regex slash delimiters included when not used in TW?

This is entirely wrong:

Notice the double backslashes \\ are used to escape the \d within the TiddlyWiki filter string.

The summary tells you to “be mindful of the context in which you’re using them, especially with respect to escaping characters in filter strings and other TiddlyWiki-specific syntax,” which is often irrelevant in TW, but doesn’t tell you how to handle characters which can’t appear in [ ]

I think the world has gone gaga over LLMs like ChatGPT, expecting much more from them than they can actually supply. The fact that they create grammatical and somewhat realistic-looking answers doesn’t compensate for their hallucinations and overconfidence.

I find this particularly apt: The LLMentalist Effect: how chat-based Large Language Models replicate the mechanisms of a psychic's con.

TW_Tones · April 1, 2024, 7:34pm

I always apply scepticism to LLM outputs and new edits, and knew it was wrong; I just thought it would provide structure.

Don’t edit it if it is of no value to you.

TiddlyTitch · April 2, 2024, 8:25am

(note: “TLD” not “HLD”)

@TW_Tones, @Scott_Sauyet, @clsturgeon, @jeremyruston …

This is a fascinating and, I think, finally a very interesting thread of applied thinking. I am very aware it has potential. It is, in one way, simple. In another way it is a likely rabbit-hole of juggling TLDs (defined external to TW); JS regex; TW regex operators; the translation of the JS syntax to TW Operator syntax; the parsing of parsing solutions … etc.

Regarding regex matching in all this … @Scott_Sauyet correctly sees that there is (needs be?) an heuristic component in play. There is not one way to solve the issue of the OP. It could range from sophistication-central to minimal-good-enough.

As far as I understand the OP … @TW_Tones wants to …
(1) discover titles that are (potentially) domain addresses;
(2) do some kind of validation that the TLD component IS valid.

Yes???

[ UPDATE: “No” on (1). @TW_Tones meant inside the text field. ]

This is just a comment
IF I can, I’ll comment, more constructively I hope, later, on a couple of the specific issues that came up.

TT

TW_Tones · April 2, 2024, 8:39am

I am not so concerned about this.

Perhaps something.something and something.something.something would be sufficient, as long as it does not interfere with custom widgets, functions and variables.

With so many TLDs now I am not sure a more detailed examination would be needed, but perhaps we can use the tilde ~ to defeat it if necessary: ~something.something

TiddlyTitch · April 2, 2024, 8:51am

You are right!

BUT is this really an issue IF you are only concerned with title: ? (That is how I read the OP.)

I can see the issue of messing up a TW IF the scope were more than titles. There it looks a tad complicated.

TW_Tones · April 2, 2024, 10:11am

I am interested in, say, tiddlywiki.com being identified as an internet link, somewhat like [[tiddlywiki.com]]; however, tiddlywiki.com would be parsed into https://tiddlywiki.com, whereas [[tiddlywiki.com]] is the title of the tiddler tiddlywiki.com.

This seems to make sense to me.

TiddlyTitch · April 2, 2024, 10:36am

Right! Makes sense to me.

But I’m not sure you quite got my last question (which I hope is practical!).

Viz, are you ONLY concerned in the OP with domain addresses in a title field?

I ask again because solving that is much easier than if it were in any field.
Formally in regex that is.

P.S. This might be its own rabbit hole?

TW_Tones · April 2, 2024, 11:17am

I am after it in the parsing of the text field, typically allowing me to type or paste internet addresses without the protocol https://

If this works, other fields such as title and custom fields should be possible.
Yes, tiddler titles would be easier; I can do that now and it is worth doing.

TiddlyTitch · April 2, 2024, 12:18pm

Ah. Do I have it right that what you want is …

1 - (Ideally) to have an on-load Parser that will auto-render a … (sub-sub-dom.sub-dom.)domain.TLD as an actionable URL even though it is without a protocol?

2 - Avoid potential conflicts with any tiddler that contains a “TW.construct.thing”

Have I understood yet?

P.S. Workaround: one might just search and replace using regex to upgrade selected inactive “links” to active links in the conventional “protocol format”?
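
A minimal sketch of that kind of search and replace in JavaScript is shown below. The pattern and the name `upgradeLinks` are made up for illustration, and the pattern is deliberately naive: it only looks at names that stand alone between whitespace, and it would also rewrite things like file.txt.

```javascript
// Naive pattern for bare "something.something" names: dot-separated labels
// ending in a label of two or more letters, standing alone between
// whitespace (optionally followed by one punctuation mark).
const bareDomain = /(^|\s)((?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,})(?=[.,;:!?]?(?:\s|$))/g;

// Prefix each bare domain with https:// so the existing parser links it.
function upgradeLinks(text) {
  return text.replace(bareDomain, "$1https://$2");
}

console.log(upgradeLinks("Visit tiddlywiki.com today"));
// Visit https://tiddlywiki.com today

console.log(upgradeLinks("Already https://tiddlywiki.com here"));
// Already https://tiddlywiki.com here
```

Because the match must follow whitespace or the start of the text, addresses that already carry a protocol are left alone.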

TW_Tones · April 2, 2024, 7:37pm

Yes, you understand. I would hope it has a rule like wikilinks.

I could build workarounds, but an integrated solution is desirable.
At present regular expressions and the parser are my weaknesses.