How does TW translate wikitext to HTML

Surge · September 27, 2024, 3:14am

Hi Eric,

Resurrecting this old thread for a somewhat related issue. Here’s the wiki text I have (i replaced the triple quotes with ticks because it messes up the formatting here):

<<.warning "HELLO
'''bash
some code
'''
">>

The html renders as:

<p>
  <div class="doc-icon-block">
    <div class="doc-block-icon"><svg><path></path></svg></div> HELLO
  <code>`bash
  some code
  </code>
  <code>
    &lt;/div&gt;
  </code>
  </div>
</p>

Using the code widget works, i’m just trying to understand the mechanics here, how does TW arrive at that?

pmario · September 27, 2024, 4:01am

Have a closer look at: Create Code-Blocks Using Backticks in Discourse Threads to see how to create code blocks in discourse

pmario · September 27, 2024, 4:18am

For me your code does not make too much sense.

You wrote: “(i replaced the triple quotes with ticks because it messes up the formatting here)”

I think you meant: you replaced triple backticks ``` with triple single-quotes '''
But if I do replace the single-quotes with backticks tiddlywiki.com produces this.

<p>
  <div class="doc-icon-block doc-warning">
    <div><strong>Warning</strong></div>
    <div class="doc-block-icon"><svg><path></path></svg></div>
    HELLO
    <code>`bash
    some code
    </code>
    <code>
    </code>
  </div>
</p>

That is different from your HTML output, and it produces an inline code block, which is probably not what you expected, right?

I assume what you want is a “warning box” that contains some text and a code block with some BASH commands, right?

I think the <<.warning>> macro can not do what you expect at the moment. We will have to have a closer look at the code.

Surge · September 27, 2024, 2:07pm

Thank you for the link on the syntax for this in discourse and sorry for the confusion caused. So the code in question looks like this:

<<.warning "HELLO
```bash
some code
```
">>

My interest is purely scientific here. I’d like to understand why that html is produced. But if it’s not immediately obvious, I’m good with that.

Scott_Sauyet · September 27, 2024, 7:52pm

I think you can get somewhat close with

<<.warning """HELLO
<pre><code class="language-bash">some code</code></pre>
""">>

But you won’t get the language-level highlighting for the code-block. I don’t know if there is a setting somewhere to enable that.

I don’t know if your question was more general and this happened to be your example, or if you were just trying to format that warning correctly. If you’re asking how the wikitext → HTML conversion happens, I’m sure it’s a substantial and intricate answer. I would love it if Jeremy, Eric, Saq, or Mario could shed some light on that. But I really want to know too, and it’s probably the next thing I’m going to dig into for TW after I clear some other work off my plate. If I do figure out something significant, I will try to post my findings in a thread here, hoping for corrections from the core team.

pmario · September 28, 2024, 7:57am

You are right, it is not immediately obvious, but there are some interesting mechanisms involved. Let’s start at the beginning, because it is important for context.

TiddlyWiki Concepts

A) Everything in TiddlyWiki is a tiddler
B) WikiText is a concise, expressive way of typing a wide range of text formatting, hypertext and interactive features.
C) TiddlyWiki’s display is driven by an underlying collection of widgets

Wikitext Rules

TW wikitext follows several rules. The rules to identify a paragraph look like this:

1. If wikitext is followed by 2 new-line characters, it is a paragraph.
2. If text within a tiddler ends without any new-lines, it is a paragraph.
3. If a paragraph is identified, use that text and inspect it for “formatting rules”, like bold, italic and so on.
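
To make the first two paragraph rules concrete, here is a minimal sketch. It is not TiddlyWiki's real block parser, and the function name splitParagraphs is made up for this example.

```javascript
// Toy sketch only: not TiddlyWiki's actual block parser.
// splitParagraphs is a hypothetical helper name.
function splitParagraphs(source) {
  return source
    .split(/(?:\r?\n){2,}/)       // two or more new-lines end a paragraph
    .map(block => block.trim())
    .filter(block => block.length > 0); // text without trailing new-lines still counts
}

console.log(splitParagraphs("first\n\nsecond line\nstill second"));
```

Each returned string would then be handed on for inspection by the formatting rules.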

There are similar rules for eg: Headings, Comment block, Code blocks … and so on. Let’s continue with

Formatting Rules

Bold text can be created using 2 single-quotes as start- and end-marker: ''bold text''
Italic text can be created using 2 forward slashes as start- and end-marker: //italic text//

Let’s go with bold. Usually bold text means that something is important or of special interest.

So we need to know how we can define “important” text in HTML syntax. It is: <strong>: The Strong Importance element - HTML: HyperText Markup Language | MDN

That’s the goal

We need a way to go from ''important'' to <strong>important</strong>

Wikitext Parsing

To achieve this goal, in TW there are 3 elements needed.

A text parser, that converts human readable text into a computer readable text description following the “fundamental rules”. We call this description: parse-tree
Some javascript logic, that reads the parse-tree and converts it into TW widgets. We call it widget-tree
Some javascript code, that converts the widget-tree to HTML elements. We call that one “the renderer”
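
The three elements above can be sketched as a tiny pipeline. This is only an illustration: every function name is invented, it understands nothing but ''bold'' inside a single paragraph, and it renders to an HTML string instead of real DOM nodes.

```javascript
// Toy three-stage pipeline. All names are invented for illustration;
// TiddlyWiki's real parser, widgets and renderer are far more capable.

function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Stage 1: text -> parse-tree (one paragraph, one inline rule: ''bold'').
function parse(text) {
  const children = [];
  const re = /''(.+?)''/g;
  let last = 0;
  let m;
  while ((m = re.exec(text)) !== null) {
    if (m.index > last) {
      children.push({ type: "text", text: text.slice(last, m.index) });
    }
    children.push({ type: "element", tag: "strong", children: [{ type: "text", text: m[1] }] });
    last = re.lastIndex;
  }
  if (last < text.length) {
    children.push({ type: "text", text: text.slice(last) });
  }
  return [{ type: "element", tag: "p", children }];
}

// Stage 2: parse-tree -> widget-tree (objects that know how to render themselves).
function makeWidget(node) {
  if (node.type === "text") {
    return { render: () => escapeHtml(node.text) };
  }
  const kids = node.children.map(makeWidget);
  return { render: () => `<${node.tag}>${kids.map(k => k.render()).join("")}</${node.tag}>` };
}

// Stage 3: widget-tree -> HTML.
function renderToHtml(text) {
  return parse(text).map(makeWidget).map(w => w.render()).join("");
}

console.log(renderToHtml("''important''")); // <p><strong>important</strong></p>
```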

Why do we need those steps?

Because text parsing is “expensive” in terms of CPU cycles needed. So instead of parsing and rendering HTML text over and over again, whenever a tiddler is displayed, we try to re-use as much information as we have, to speed up the rendering.

Why is the parser performance sensitive?

Because of the “Wikitext Rules” that I pointed out above. In reality there are:

14 “block rules” similar to paragraph
24 “inline rules” like “bold”, “italic” and so on. And there are
8 “pragma” rules like \function, \procedure and so on

The whole wikitext has to be checked by those rules character by character.

Starting with “pragmas”, which are valid on a per tiddler base
Next “block rules” are used to identify blocks, like paragraphs and headings
If a block is identified, it has to be handed over to the inline-parser
The inline-parser identifies “Formatting Rules”
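
The order pragma → block → inline can be sketched like this. It is a toy with one rule of each kind (a pragma line starting with a backslash, a heading starting with !, and ''bold''); the function names and the returned shape are assumptions, not TiddlyWiki's real rule classes.

```javascript
// Toy dispatcher showing the order pragma -> block -> inline.
// The rule set and all names are invented; TiddlyWiki's real rules differ.

// Inline stage: a single rule, ''bold''.
function parseInline(text) {
  return text.split(/(''.+?'')/).filter(Boolean).map(part =>
    part.startsWith("''") && part.endsWith("''") && part.length > 4
      ? { rule: "bold", text: part.slice(2, -2) }
      : { rule: "text", text: part });
}

function parseWikitext(source) {
  const lines = source.split("\n");
  const pragmas = [];
  // 1. Pragmas: this toy only looks for them at the top of the text.
  while (lines.length > 0 && lines[0].startsWith("\\")) {
    pragmas.push(lines.shift());
  }
  // 2. Block rules: split the rest into blocks and classify each one.
  const blocks = lines.join("\n").split(/\n{2,}/)
    .map(block => block.trim())
    .filter(Boolean)
    .map(block => {
      const isHeading = block.startsWith("!");
      const text = isHeading ? block.replace(/^!+\s*/, "") : block;
      // 3. Each identified block is handed over to the inline stage.
      return { rule: isHeading ? "heading" : "paragraph", children: parseInline(text) };
    });
  return { pragmas, blocks };
}

console.log(JSON.stringify(parseWikitext("! Title\n\nsome ''bold'' text"), null, 1));
```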

What can we do to make it performant?

We try to parse wikitext only once. That means, if a tiddler is shown, we can have a look, if it has been parsed already and we can reuse the parse-tree.
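
A minimal sketch of that idea, assuming a cache keyed by tiddler title that is invalidated when the text changes. ParseCache and expensiveParse are hypothetical names, not TiddlyWiki APIs.

```javascript
// Toy parse cache: reuse the parse-tree as long as the tiddler text is unchanged.
// ParseCache and expensiveParse are hypothetical names for this sketch.
let parseCount = 0;

function expensiveParse(text) {
  parseCount += 1; // count how often we really parse
  return { type: "element", tag: "p", children: [{ type: "text", text }] };
}

class ParseCache {
  constructor() {
    this.entries = new Map(); // title -> { text, tree }
  }
  get(title, text) {
    const hit = this.entries.get(title);
    if (hit && hit.text === text) {
      return hit.tree; // unchanged tiddler: reuse the stored parse-tree
    }
    const tree = expensiveParse(text); // new or changed tiddler: parse again
    this.entries.set(title, { text, tree });
    return tree;
  }
}
```

Calling get() twice with the same title and text returns the same tree object and parses only once; changing the text triggers a fresh parse.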

OK → But what does a parse-tree look like?

Eg. text: ''important''

The TW editor can show the parse-tree if the “Internals Plugin” is installed.
Since there is only one line in this tiddler, Wikitext Rule - 2 (see above) is active. The whole line is wrapped in a P(aragraph) element
The P element has some children
The Formatting Rule - bold was triggered by ''
As described above, bold uses the HTML element <strong>, which they describe as html-tags. The STRONG tag has 1 child.
The child of STRONG is an HTML TEXT element, which starts at character position 2 and ends at character position 11 (9 characters)
The P tag starts at: 0 and ends at: 13
The wikitext rule “bold” was identified from (0 → 13)
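
Written out as data, the tree described above might look roughly like this. The property names are an assumption about what the preview shows; the positions are the ones quoted in the list.

```javascript
// Approximate parse-tree for the text ''important'' (13 characters).
// Property names are an assumption; positions come from the description above.
const source = "''important''";

const parseTree = [{
  type: "element", tag: "p", start: 0, end: 13,
  children: [{
    type: "element", tag: "strong", rule: "bold", start: 0, end: 13,
    children: [{ type: "text", text: "important", start: 2, end: 11 }]
  }]
}];

// The text node's positions index straight back into the source string.
const textNode = parseTree[0].children[0].children[0];
console.log(source.slice(textNode.start, textNode.end)); // important
```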

The parse-tree does not change unless the tiddler is changed. So it is the perfect candidate to be stored and reused internally.

Good – performance improved – What’s next?

As written above at: TiddlyWiki Concepts → C

C) TiddlyWiki’s display is driven by an underlying collection of widgets

The parse tree has to be converted to a widget-tree, which already looks very much like the HTML DOM-tree that browsers use to render HTML content. See: DOM (Document Object Model) - MDN Web Docs Glossary: Definitions of Web-related terms | MDN

We can switch the editor preview mode to “widget tree”

As seen above in C) widgets are used to “render” the HTML content. TW widgets do have a javascript .render() and a .refresh() function
There is an element-widget, which renders HTML elements. In this case P tag, a paragraph
Inside the paragraph there is a STRONG tag also created by the element widget
TEXT is a basic HTML DOM node itself and is rendered as plain text.

That is it - We achieved “the goal”

If we switch the preview to “raw HTML”, we do get:

<p><strong>important</strong></p>

But, but – Then we could directly use HTML Code – and bypass all the parsing and widgeting stuff.

Yes we can – use HTML code directly.
But.
No we can not bypass the parse- and widget-tree for security reasons.

Even if we directly write <strong>important</strong> into the TW editor.
The whole thing will start at:
Wikitext Rules, but instead of parsing wikitext, we will now parse HTML text

Why did you mention .render() and .refresh() in widgets

Because TW widgets use those js-functions to finally convert the internal data-structure called widget-tree into HTML DOM nodes. .render() is used to create DOM nodes.

Since manipulating the DOM can be slow, TW uses “refresh()” only if the widget-tree changes. This is especially important for the list-widget.
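
The difference between the two functions can be sketched with a toy widget. There is no real DOM here, just a plain object standing in for a node, and the class name and the shape of changedTiddlers are assumptions made for this example.

```javascript
// Toy widget showing render() vs refresh().
// ToyTextWidget and the changedTiddlers shape are assumptions for this sketch.
class ToyTextWidget {
  constructor(store, title) {
    this.store = store;   // Map of title -> text, standing in for the wiki
    this.title = title;
    this.domNode = null;
    this.renderCount = 0;
  }
  // render(): build the output node from scratch.
  render() {
    this.domNode = { nodeType: "text", textContent: this.store.get(this.title) || "" };
    this.renderCount += 1;
    return this.domNode;
  }
  // refresh(): re-render only if something this widget depends on has changed.
  refresh(changedTiddlers) {
    if (changedTiddlers[this.title]) {
      this.render();
      return true;  // output was updated
    }
    return false;   // nothing to do, the existing node is kept
  }
}
```

A refresh for an unrelated tiddler leaves the node untouched, which is the cheap path the post describes.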

One more optimization to be performant

@Surge … That should finally answer your “purely scientific question”.

Hope that makes sense.

Have fun!
Mario

Scott_Sauyet · September 28, 2024, 1:54pm

Thank you @pmario!

This was the sort of thing I was thinking of doing… except that I wanted to also go further into the weeds and determine where in the code all these things are happening. Perhaps I’ll get into that at some point.

I believe this answer would be a useful addition to TW documentation, if not at tiddlywiki.com, then at least in a community wiki entry here.

linonetwo · September 28, 2024, 4:06pm

Have a look at how TW translates wikitext to wikitext …

github.com/TiddlyWiki/TiddlyWiki5

feat: serialize AST node back to wikitext string

TiddlyWiki:master ← linonetwo:feat/to-string

opened 05:28PM - 12 Jun 24 UTC

linonetwo

+1257 -142

closes #8255

Currently just a demo, I will gradually move code from https://…github.com/tiddly-gittly/wikiast/tree/master/packages/wikiast-util-to-wikitext to here.

Try it with tiddler (Well, if official website have share plugin installed, I can put a link here):

```
AAA test
---
\```js
var match = reEnd.exec(this.parser.source)
\```
end
```

and run this in console:

```js
$tw.utils.serializeParseTree($tw.wiki.parseTiddler('New Tiddler').tree)
```

Going to HTML takes 2 more steps: it will create new widgets, like new ListWidget(), based on the AST, and each widget’s render method will create the HTML for its part.
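
A toy version of those two steps, assuming a simple lookup from AST node type to widget class. The registry and the class names are invented for this sketch and are not TiddlyWiki's real widget classes; text is not escaped here.

```javascript
// Toy version of "create a widget per AST node, then let each render its part".
// The registry and class names are invented for this sketch.
class ToyText {
  constructor(node) {
    this.node = node;
  }
  render() {
    return this.node.text; // no escaping in this toy
  }
}

class ToyElement {
  constructor(node) {
    this.node = node;
    // Child widgets are created from child AST nodes via the same registry.
    this.children = (node.children || []).map(createWidget);
  }
  render() {
    const inner = this.children.map(child => child.render()).join("");
    return `<${this.node.tag}>${inner}</${this.node.tag}>`;
  }
}

const registry = { text: ToyText, element: ToyElement };

function createWidget(node) {
  const WidgetClass = registry[node.type];
  if (!WidgetClass) {
    throw new Error("No widget for node type: " + node.type);
  }
  return new WidgetClass(node);
}

const ast = { type: "element", tag: "p", children: [{ type: "text", text: "hi" }] };
console.log(createWidget(ast).render()); // <p>hi</p>
```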