Folks, if a string begins with www. it is quite easy to determine programmatically that it refers to an internet resource; likewise if it begins with a protocol such as https://
I have, however, come across environments where a string ending with .com, .au or another country code is also recognised as an address.
Starting with a regular expression and perhaps a dictionary of Top-Level Domains (TLDs), I am thinking about a solution to add to TiddlyWiki that can recognise such addresses in titles and wikitext; perhaps even the parser would be helpful.
I feel TiddlyWiki could benefit from this so I ask for contributions please.
If you can craft a regular expression for this please also put it in TiddlyWiki form and explain its structure if possible.
I see regular expressions online (not in TW format), plus a couple of JavaScript solutions that look more appealing. What form of result are you looking for? The JS solution could be used as a filter operator or a macro/procedure. Regardless, what would you like returned? True/false? Or perhaps it would return an empty string if the input is not a valid URL, and the URL string if it is?
Perhaps ideally a modified version, or redefinition of the $link widget, that the parser uses when it already detects links?
The current parser detects CamelCase, protocol:// and [[tiddler titles]], and I would like it to detect these “domain/link” formats too. One telltale sign is something.something, which does not occur in normal text, although it may in some titles, variables or function names.
Then we add the TLD knowledge to check whether the rightmost “word” is a valid TLD.
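A minimal sketch of that TLD check, assuming a small hand-picked set rather than the full registry. `KNOWN_TLDS` and `hasKnownTld` are hypothetical names for illustration, not TiddlyWiki APIs.

```javascript
// Hypothetical, deliberately tiny TLD sample; a real list would come from the registry.
const KNOWN_TLDS = new Set(["com", "org", "net", "au", "uk", "io"]);

// Returns true when `candidate` has at least two non-empty dot-separated
// labels and its rightmost label is in the supplied TLD set.
function hasKnownTld(candidate, tlds = KNOWN_TLDS) {
  const labels = candidate.toLowerCase().split(".");
  if (labels.length < 2 || labels.some((label) => label === "")) {
    return false;
  }
  return tlds.has(labels[labels.length - 1]);
}

console.log(hasKnownTld("tiddlywiki.com"));     // rightmost label "com" is in the sample set
console.log(hasKnownTld("TW.construct.thing")); // "thing" is not in the sample set
```

Keeping the TLD list as a plain set means it could be swapped for a larger or smaller list without touching the detection logic.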
Once detected as a $link, the link widget can be modified to default to an external link, e.g. https://, with other appropriate settings on the “a” element used to render it in HTML.
The only part I can’t do is:
Adjust the parser to detect links based on a regular expression and a database of TLDs, and invoke the $link widget.
See module-type=wikirule such as $:/core/modules/parsers/wikiparser/rules/extlink.js
I know how to subsequently enhance the link widget through its redefinition.
Hi @TW_Tones the idea of matching the complete list of Top Level Domains may not be practical: there are now nearly 1,500 of them in the official registry:
However, the core of this question crops up regularly in general programming discussions, and you’ll find many answers on StackOverflow suggesting regexes that match some useful subset of possible URLs.
Once you’ve selected your regex, you’ll need to override the parse rule in “$:/core/modules/parsers/wikiparser/rules/extlink.js”.
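For orientation, here is a rough sketch of what such a rule could look like, modelled on the general shape of the core extlink rule as I understand it (a name, an inline type, an `init` that sets `matchRegExp`, and a `parse` that returns parse-tree nodes). The rule name `domainlink` and the regex are assumptions. In a real wikirule module these properties would be assigned to `exports`, and the rule base class is assumed to set `this.match` before `parse` is called; a plain object is used here so the sketch runs on its own. Check it against the actual core source before relying on it.

```javascript
// Standalone sketch of an inline wikirule for bare domains (hypothetical rule).
const domainLinkRule = {
  name: "domainlink", // hypothetical rule name
  types: { inline: true },

  init(parser) {
    this.parser = parser;
    // Optional "~" to suppress linking, then one or more "label." groups
    // followed by an alphabetic final label of at least two characters.
    this.matchRegExp = /~?\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b/gi;
  },

  parse() {
    const matched = this.match[0];
    // Move the parser position past the matched text.
    this.parser.pos = this.matchRegExp.lastIndex;
    if (matched.startsWith("~")) {
      // Suppressed: emit the text without the tilde and without a link.
      return [{ type: "text", text: matched.slice(1) }];
    }
    return [{
      type: "element",
      tag: "a",
      attributes: {
        href: { type: "string", value: "https://" + matched },
        "class": { type: "string", value: "tc-tiddlylink-external" },
        target: { type: "string", value: "_blank" },
        rel: { type: "string", value: "noopener noreferrer" }
      },
      children: [{ type: "text", text: matched }]
    }];
  }
};
```

This sketch adds a separate rule rather than overriding extlink.js, and it links anything shaped like a domain without checking the TLD, so it would also turn something like colors.blue into a link.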
I think that’s going to be a real problem, though. It’s difficult to distinguish whether, say, colors.blue represents a bit of UI code or the domain colors on the TLD blue.
If someone gives me the regex to find something.something and something.something.something etc… I should be able to handle the scope of its application.
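One possible JS regex for something.something and something.something.something, with its structure spelled out in comments. It is only a sketch: the names `DOTTED_NAME` and `findDottedNames` are made up for illustration, and the pattern makes no attempt to check the TLD, so it will also match things like colors.blue.

```javascript
// Structure of the pattern:
//   \b                                  start at a word boundary
//   (?: label \. )+                     one or more "label." groups, where
//     label = [a-z0-9](?:[a-z0-9-]*[a-z0-9])?
//                                       letters/digits, hyphens only in the middle
//   [a-z]{2,}                           final label: letters only, two or more
//   \b                                  end at a word boundary
// Flags: g = find every match, i = ignore case.
const DOTTED_NAME = /\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b/gi;

// Returns every dotted name found in `text`, or an empty array if none.
function findDottedNames(text) {
  return text.match(DOTTED_NAME) || [];
}

console.log(findDottedNames("Visit tiddlywiki.com or talk.tiddlywiki.org today."));
// → [ 'tiddlywiki.com', 'talk.tiddlywiki.org' ]
```

Requiring an alphabetic final label of two or more characters is what keeps numbers such as 3.14 and abbreviations such as e.g. from matching.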
I do believe it would be fantastic if we thoroughly documented how a regular expression found online can be converted for use in TiddlyWiki. Then those like me, who seem to have a regex form of dyslexia, could use online resources to generate them and convert them for TiddlyWiki applications.
If it’s a JS regular expression, the process is quite simple.
In JS, a regex is delimited with slashes, /regex-here/. It may be followed by some single-character flags:
/regex-here/igm
According to the regexp Operator documentation, the only one generally useful in TW is i, which makes the matching case-insensitive.
To create a TW regex from this, you can define a variable to hold the characters of the regex, stripping off the slashes, and then include it in a call to the regexp Operator or the search Operator:
\define xyz() regex-here
... [...regexp<xyz>...]
If you want to include the i flag, it can precede or follow the body:
\define xyz() (?i)regex-here
or
\define xyz() regex-here(?i)
Theoretically, there are times you can define the regexp inline,
... [...regexp[regex-here]...]
but a regular expression so often contains characters which can’t be used that way that it’s probably easiest to always use the external definition.
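The conversion is mechanical enough to sketch in JavaScript. The helper names below (`toTiddlyWikiPattern`, `canBeInlineOperand`) are hypothetical, and the inline check rests on my assumption that a filter operand in [ ] ends at the first closing square bracket.

```javascript
// Convert a JS regex literal written as text, e.g. "/regex-here/igm", into the
// string form used in a TW variable: no slash delimiters, and a "(?i)" prefix
// when the literal carried the i flag. The g and m flags are dropped.
function toTiddlyWikiPattern(literal) {
  const parts = /^\/(.*)\/([a-z]*)$/.exec(literal);
  if (!parts) {
    throw new Error("Not a /pattern/flags literal: " + literal);
  }
  const body = parts[1];
  const flags = parts[2];
  return (flags.includes("i") ? "(?i)" : "") + body;
}

// Assumption: a pattern containing "]" cannot be written inline as
// regexp[...] and has to live in a variable instead.
function canBeInlineOperand(pattern) {
  return !pattern.includes("]");
}

console.log(toTiddlyWikiPattern("/regex-here/igm")); // → (?i)regex-here
console.log(canBeInlineOperand("[a-z]+"));           // → false
```

The returned string is what would go after \define xyz(), to be referenced from the filter as a variable.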
I have not seen regex-generation tools; they may exist, but I think they would be tricky to get right. There are a good number of regex-testing tools, though, and you can use them to see if you’ve got it right. Of course there are many regular expressions online, and a large portion of those are JS-compatible (every language has a slightly different flavor of regex), so you can often search for them and just use them directly with the minor conversion above.
I have found ChatGPT a capable regular expression writer, but there is still the gap to TiddlyWiki.
In TiddlyWiki, regular expressions (regex) are utilized in several areas, including filter operators and widgets, to match or manipulate text. TiddlyWiki’s regex capabilities are essentially those provided by JavaScript’s regex engine, as TiddlyWiki is built on JavaScript. This means any valid JavaScript regular expression should work within TiddlyWiki.
Here are some key points about using regular expressions in TiddlyWiki:
Syntax: JavaScript regex patterns are enclosed within forward slashes (/). For example, /pattern/flags. Common flags include:
g: Global search.
i: Case-insensitive search.
m: Multi-line search.
Character Classes: You can use character classes such as \d for digits, \w for word characters (alphanumeric and underscore), and \s for whitespace characters.
Quantifiers: Quantifiers like * (zero or more), + (one or more), and ? (zero or one) are available to control how many instances of a character or group must be present for a match.
Groups and Ranges: Parentheses () are used for grouping, and square brackets [] for character ranges. For example, [a-z] matches any lowercase letter, and (abc)+ matches one or more repetitions of the sequence “abc”.
Special Characters and Escaping: Special characters like ., *, ?, etc., have specific meanings in regex. To use them as literal characters, they need to be escaped with a backslash \.
Lookahead and Lookbehind: JavaScript regex supports lookahead ((?=...) and (?!...)) for matching a group after or not after a certain point. Lookbehind ((?<=...) and (?<!...)) is supported in newer JavaScript engines, so its availability might depend on the browser or environment TiddlyWiki is running in.
Using Regex in TiddlyWiki
When using regex within TiddlyWiki, particularly within filter operators, you typically pass the regex as a string argument. For example, the search operator allows you to filter tiddlers based on a regex match against their titles or content. The regex should be properly escaped according to TiddlyWiki’s parsing rules for filter strings.
Example
A filter to find tiddlers with titles containing one or more digits could look like this:
[title[My Tiddlers]search:title/\\d+/]
Notice the double backslashes \\ are used to escape the \d within the TiddlyWiki filter string.
In summary, any regular expression that’s valid in JavaScript should work in TiddlyWiki, but be mindful of the context in which you’re using them, especially with respect to escaping characters in filter strings and other TiddlyWiki-specific syntax.
I’m assuming that quoted material is ChatGPT output.
I find it useless, I’m afraid. It has a lot of irrelevancies regarding basic regex syntax. It would likely be better to simply link to a useful regex tutorial for that. It has wrong and confusing syntax. What in the world is this supposed to mean?
[title[My Tiddlers]search:title/\\d+/]
What is the title for? Where is the regexp flag? Why is its argument missing the [ ] wrapper? Why are the JS regex slash delimiters included when not used in TW?
This is entirely wrong:
Notice the double backslashes \\ are used to escape the \d within the TiddlyWiki filter string.
The summary tells you to “be mindful of the context in which you’re using them, especially with respect to escaping characters in filter strings and other TiddlyWiki-specific syntax,” which is often irrelevant in TW, but doesn’t tell you how to handle characters which can’t appear in [ ].
I think the world has gone gaga over LLMs like ChatGPT, expecting much more from them than they can actually supply. The fact that they create grammatical and somewhat realistic-looking answers doesn’t compensate for their hallucinations and overconfidence.
This is a fascinating and, I think, finally, a very interesting thread of applied thinking. I am very aware it has potential. It is, in one way, simple; in another, a likely rabbit hole of juggling TLDs (defined external to TW); JS regex; TW regex operators; the translation of JS syntax to TW operator syntax; the parsing of parsing solutions … etc.
Regarding regex matching in all this … @Scott_Sauyet correctly sees that there is (needs be?) an heuristic component in play. There is not one way to solve the issue of the OP. It could range from sophistication-central to minimal-good-enough.
As far as I understand the OP … @TW_Tones wants to …
(1) discover titles that are (potentially) domain addresses;
(2) do some kind of validation that the TLD component IS valid.
Yes???
[ UPDATE: “No” on (1). @TW_Tones meant inside the text field,]
This is just a comment
IF I can, I’ll comment, more constructively I hope, later, on a couple of the specific issues that came up.
Perhaps something.something and something.something.something would be sufficient, as long as it does not conflict with custom widgets, functions and variables.
With so many TLDs now, I am not sure a more detailed examination would be needed, but perhaps we could use the tilde ~ to defeat it where necessary: ~something.something
I am interested in, say, tiddlywiki.com being identified as an internet link. It is somewhat equivalent to [[tiddlywiki.com]]; however, tiddlywiki.com would be parsed into https://tiddlywiki.com, whereas [[tiddlywiki.com]] links to the tiddler titled tiddlywiki.com.
1 - (Ideally) to have an on-load Parser that will auto-render a … (sub-sub-dom.sub-dom.)domain.TLD as an actionable URL even though it is without a protocol?
2 - Avoid potential conflicts with any tiddler that contains a “TW.construct.thing”
Have I understood yet?
P.S. Workaround: one might just search and replace using regex to upgrade selected inactive “links” to active links in the conventional “protocol format”?
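A rough sketch of that workaround in plain JavaScript, run over a tiddler's text outside the parser. `BARE_DOMAIN` and `upgradeBareDomains` are hypothetical names, https:// is assumed as the protocol, and the skip rules are heuristics that will miss some cases (for example dotted names inside a query string).

```javascript
// One or more "label." groups followed by an alphabetic final label.
const BARE_DOMAIN = /\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b/gi;

// Prefix bare domains with "https://", leaving alone anything that already
// follows a protocol, an "@" (email), a "/", a "." or a "~" (suppressed).
function upgradeBareDomains(text) {
  return text.replace(BARE_DOMAIN, (match, offset) => {
    const before = text.slice(Math.max(0, offset - 3), offset);
    if (before.endsWith("://") || /[@.~\/]$/.test(before)) {
      return match;
    }
    return "https://" + match;
  });
}

console.log(upgradeBareDomains("See tiddlywiki.com and https://talk.tiddlywiki.org"));
// → See https://tiddlywiki.com and https://talk.tiddlywiki.org
```

Because this rewrites the stored text rather than changing how it is rendered, it would be worth reviewing the matches before saving, given the TW.construct.thing style of false positive discussed above.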