Help needed about HTML2TW: Macro

arunnbabu81 · August 6, 2022, 10:34pm

@Mark_S
I use the HTML2TW macro you created, along with tiddlyclip, to clip from web pages in wikitext format. I got it from this Google Group discussion: https://groups.google.com/g/tiddlywiki/c/2uh1pQLNKHM/m/CTcPAA8lAgAJ

It works well for text. But for images, I found an odd formatting error. Although the images are displayed, the links for the images are also displayed, with an error in the link format. See the image below.

Also, is there a way to get code blocks from GitHub converted into code blocks in wikitext?

image1778×516 51.3 KB

I clipped this code block from github and got this

image1746×296 30.4 KB

I have created a demo at this link. Click on the Tiddlyclip link in the topbar to see the demo.

Mark_S · August 7, 2022, 6:23pm

The error is happening because there is a link around the image. WikiText doesn’t handle links around wikitext (external) images. I guess one solution would be to have a special rule that looks for images within links, and then moves the link outside the image.

arunnbabu81 · August 7, 2022, 7:52pm

These are the types of images I found:

Github images

The actual HTML markup is given below:

<a href="https://user-images.githubusercontent.com/67494083/183275174-ad3dfe37-cf0d-4eff-a7ff-cfa83145a4a0.jpg" rel="noopener noreferrer"><img style="max-width: 100%;" alt="Screenshot_20220807-095033_Chrome" src="https://user-images.githubusercontent.com/67494083/183275174-ad3dfe37-cf0d-4eff-a7ff-cfa83145a4a0.jpg"></a>

But after wikitext conversion it appears like this. The image is displayed, but an extra link appears below it and an extra [[ appears before it.

[[

[img[https://user-images.githubusercontent.com/67494083/183275174-ad3dfe37-cf0d-4eff-a7ff-cfa83145a4a0.jpg]]

|https://user-images.githubusercontent.com/67494083/183275174-ad3dfe37-cf0d-4eff-a7ff-cfa83145a4a0.jpg]]

Article from a rad website

The actual HTML markup is given below:

<img loading="lazy" alt="Figure." title="" data-lg-src="/cms/10.1148/rg.2018170097/asset/images/large/rg.2018170097.fig1.jpeg" src="https://pubs.rsna.org/cms/10.1148/rg.2018170097/asset/images/medium/rg.2018170097.fig1.gif" class="figure__image"><figcaption><strong></strong><span class="figure__caption hlFld-FigureCaption"><p><span class="captionLabel">

But after wikitext conversion it appears like this, and the image is broken.

[img[/cms/10.1148/rg.2018170097/asset/images/large/rg.2018170097.fig1.jpeg]]

If I manually add https://pubs.rsna.org before /cms/10.1148/rg.2018170097/asset/images/large/rg.2018170097.fig1.jpeg, the image displays again.

Can a special rule be added to the macro for the above two behaviours?
It would be something similar to these, right?

    /* ANCHORS */
    pattern = new RegExp ("<a [^>]*?href=\"(.*?)\".*?>(.*?)</a>" ,"gi") ; 
    intext = intext.replace(pattern, "[[$2|$1]]") ;

    /* SPANS */
    pattern = new RegExp ("<span *.*?>(.*?)</span>","gi") ;
    while(intext.search(pattern) > -1) { 
        intext = intext.replace(pattern,"$1") ;
    }

    /* IMAGES */
    pattern = new RegExp ("<img.*?src=\"(.*?)\".*?>" ,"gi") ; 
    intext = intext.replace(pattern, "@@[img[$1]]@@") ;

Images from many other websites are clipped somewhat correctly.

Mark_S · August 7, 2022, 9:53pm

There’s an assortment of issues. But the main one is that you can’t write

[[ [img[http://myimage.png]]|[[http://somesite.html]]

in wikitext. That is, you can’t wrap a pretty link around [img[image.png]] . So the link and the image would have to be separated. Like:

[img[http://myimage.png]] [[Link|http://somesite.html]]
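
A rule along these lines could produce that separated form. This is only a minimal sketch, not the code in the published macro: the function name `splitLinkedImages` is made up for illustration, and it assumes the rule runs before the existing ANCHORS and IMAGES rules so that the linked image is consumed first.

```javascript
// Hypothetical helper, not part of the published HTML2TW macro.
// Finds <a href="..."><img ... src="..."></a> and emits the image and the
// link as two separate pieces of wikitext.
function splitLinkedImages(intext) {
  // Group 1: href of the surrounding link. Group 2: src of the image.
  // The \s before src avoids matching attributes that merely end in "src".
  var pattern = /<a\s[^>]*?href="([^"]*)"[^>]*>\s*<img[^>]*?\ssrc="([^"]*)"[^>]*>\s*<\/a>/gi;
  return intext.replace(pattern, "[img[$2]] [[Link|$1]]");
}

console.log(splitLinkedImages(
  '<a href="http://somesite.html"><img alt="x" src="http://myimage.png"></a>'
));
// [img[http://myimage.png]] [[Link|http://somesite.html]]
```

Text that contains no linked image passes through unchanged, so the rule can sit in front of the others without affecting ordinary anchors.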

arunnbabu81 · August 7, 2022, 10:08pm

Is there a solution for this? Can the link be omitted completely, so that only the image is converted from the HTML?

arunnbabu81:

Article from a rad website

The actual HTML markup is given below:
<img loading="lazy" alt="Figure." title="" data-lg-src="/cms/10.1148/rg.2018170097/asset/images/large/rg.2018170097.fig1.jpeg" src="https://pubs.rsna.org/cms/10.1148/rg.2018170097/asset/images/medium/rg.2018170097.fig1.gif" class="figure__image"><figcaption>
But after wikitext conversion it appears like this, and the image is broken.
[img[/cms/10.1148/rg.2018170097/asset/images/large/rg.2018170097.fig1.jpeg]]
If I manually add https://pubs.rsna.org before /cms/10.1148/rg.2018170097/asset/images/large/rg.2018170097.fig1.jpeg, the image displays again.

About this issue, is there a way to add the initial part of the link to the converted portion of the image using a regexp?

I have seen something similar done in tiddlyclip, to remove the initial part of the link from the tiddler title.
See this discussion: How to apply a regular expression to use only a part of a URL as a title. · Discussion #47 · buggyj/tiddlyclip-plugin · GitHub

Is this possible? It would make clipping from GitHub discussions very easy.
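
For pages that really do use a relative `src`, one possible approach is to resolve each converted image path against the address of the clipped page. The sketch below is an assumption-laden illustration, not something the macro does: `absolutizeImages` and its `pageUrl` parameter are hypothetical, and it assumes the clipper can hand the page address to the conversion step. It relies only on the standard `URL` constructor.

```javascript
// Hypothetical helper; pageUrl is assumed to be supplied by the clipper.
// Rewrites [img[...]] so that relative paths become absolute URLs.
function absolutizeImages(wikitext, pageUrl) {
  return wikitext.replace(/\[img\[([^\]]+)\]\]/g, function (match, src) {
    try {
      // new URL(relative, base) resolves src against the page address;
      // an already absolute src is kept as it is.
      return "[img[" + new URL(src, pageUrl).href + "]]";
    } catch (e) {
      return match; // leave anything that cannot be parsed untouched
    }
  });
}

console.log(absolutizeImages("[img[/cms/fig1.jpeg]]", "https://pubs.rsna.org/"));
// [img[https://pubs.rsna.org/cms/fig1.jpeg]]
```

Only the plain `[img[source]]` form is handled here; image wikitext with tooltips or attributes would need a wider pattern.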

Mark_S · August 8, 2022, 6:33pm

I’ve put an updated version of the HTML2TW macro here. This fixes the image link by emitting the image and the link as separate wikitext. Keep in mind that in some cases (most, actually) the link around the image and the image itself will be two different things. So we don’t want to throw away the link. In many cases, the link will be more important than the image, which might just be a stock image.

I wanted to test from a copy of your MW43 site. But in edit mode, why, for the love of all that is alphanumeric, does Shift-L invoke a popup? This means I can’t type important words like “Liberty”, “Litany”, “Liturgical”, and most importantly, “Link”.

When running the web-clipper, I got an error “This page can not find removelfmarco”. I searched for “removelfmarco” on MW43, but couldn’t find it. The error didn’t block clipping, so I was able to proceed. Maybe you know how to fix it?

Do you have a link to the article?

Thanks!

arunnbabu81 · August 8, 2022, 9:01pm

Thank you @Mark_S for fixing it. I am currently not at my desktop, so I am unable to test it.

I disabled that keyboard shortcut. I hope it won’t get triggered when you press Shift+L from now on. It came from some modal search experiments I did recently.

I didn’t see this error when I last clipped from the demo wiki. Since it’s based on a multicolumn layout, I think only one ensemble of this wiki has tiddlyclip enabled. You have to click on the Tiddlyclip link in the topbar to go to the ensemble created for the Tiddlyclip demo. I will confirm once I am back at my desktop. I’m not sure why that error occurred.

This is the link to that article. It’s one of our main journals, so many of the references will be taken from this site. If this site is covered, it will be a huge relief.

Mark_S · August 8, 2022, 11:23pm

I’ve included a fix for this. It turns out they had their own source attribute which was confusing the regular expression.

arunnbabu81 · August 9, 2022, 3:26am

Both work fine. Thank you once again. Where in the macro code was the second fix added? I just wanted to learn how to do it in case I find similar issues on other websites.

Is there any solution for the code blocks issue?

Do you use this macro these days? After these fixes, I found it to be more reliable than a markdown add-on I tried recently.

Mark_S · August 9, 2022, 5:44pm

Do you have a link to a sample? This is a little off from HTML2TW conversion (since now it’s more like JSON2TW )

The original code looked like this.

   /* IMAGES */
    pattern = new RegExp ("<img.*?src=\"(.*?)\".*?>" ,"gi") ;

The problem was that the image tag had its own attribute containing “src” (data-lg-src). Because that attribute comes before the real src, it is the one the regular expression was finding:

<img title="" data-lg-src="/cms/10.11..." src="https:...gif" class="figure__image" >

The fix was to require a white space (\s) in front of the “src” attribute. Perhaps a word boundary would work as well or better. @TiddlyTitch might be able to comment:

   /* IMAGES */
    pattern = new RegExp ("<img.*?\\ssrc=\"(.*?)\".*?>" ,"gi") ;
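
The difference between the patterns can be checked on a shortened tag. The sketch below uses made-up paths and a hypothetical helper `firstSrc`; only the three patterns matter. It also suggests that a word boundary would probably not help in this case, because `\b` matches between the hyphen and `src` inside `data-lg-src`.

```javascript
// Shortened stand-in for the journal's tag; the paths are invented.
var tag = '<img title="" data-lg-src="/cms/large/fig1.jpeg" ' +
          'src="https://example.org/medium/fig1.gif" class="figure__image">';

// Hypothetical helper: returns the first captured group, or null.
function firstSrc(pattern, html) {
  var m = pattern.exec(html);
  return m ? m[1] : null;
}

var original     = /<img.*?src="(.*?)".*?>/i;    // rule before the fix
var withSpace    = /<img.*?\ssrc="(.*?)".*?>/i;  // rule after the fix
var withBoundary = /<img.*?\bsrc="(.*?)".*?>/i;  // word-boundary variant

console.log(firstSrc(original, tag));     // /cms/large/fig1.jpeg
console.log(firstSrc(withSpace, tag));    // https://example.org/medium/fig1.gif
console.log(firstSrc(withBoundary, tag)); // /cms/large/fig1.jpeg
```

The whitespace version works because an attribute name is always preceded by whitespace inside a tag, whereas a hyphenated name such as `data-lg-src` still contains a word boundary in front of `src`.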

arunnbabu81 · August 9, 2022, 7:57pm

I have added 8 example tiddlers arranged in 3 columns. Each tiddler has 2 parts: the first part is clipped with the HTML2TW macro, and the second part is an HTML clip that shows the HTML markup. All tiddlers are folded to make them easier to find. Click on the tiddlyclip link in the topbar to see all relevant tiddlers together. I hope you can identify the tiddlers using the permalink. If you want to see only one tiddler at a time, use the fullscreen button in the view toolbar of each tiddler.

Column 1 - first tiddler - Missing images
I tried to clip the OP of this post in the Talk TiddlyWiki forum, which had multiple images. All images after the first were missing in the clipped tiddler.

Column 1 - second tiddler - Missing text
This was clipped from GitHub. Some parts of the text were missing. (I saw the disclaimer regarding missing text in the macro tiddler. Still, if something can be done it would be nice.)

Column 2 - first tiddler - Another example of missing text while clipping from Talk TiddlyWiki

Column 2 - second tiddler - Example of image with anchors showing bug with link display
The link is not shown separately from the image in spite of the updated macro being used. (I found it accidentally. The new macro was working well for this feature in all other cases.)

Column 3 - first tiddler - Example for quotes and single backtick code blocks in GitHub

Column 3 - second tiddler - Example of triple backtick code block in GitHub

Column 3 - third tiddler - Example of single backtick code block in Talk TiddlyWiki
Also, the numbered list is seen as a bullet list on clipping.

Column 3 - fourth tiddler - Example of triple backtick code block in Talk TiddlyWiki
The code is not seen at all, including in the HTML clip.

I just wanted to point out these issues, which I found in addition to the request for code and quote block support. If they are not fixable, I guess there are workarounds, like selective HTML clipping, or clipping selectively with the HTML2TW macro instead of clipping an entire post in a single session.
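
For the code block request, a regular-expression rule in the same spirit as the macro's other rules might look like the sketch below. It is not taken from the macro: `convertCode`, `plainCode` and `decodeEntities` are hypothetical names, and it assumes the page wraps block code in `<pre><code>` and inline code in a bare `<code>`, which highlighted GitHub blocks do not always do. It would need to run before any rule that strips or rewrites tags.

```javascript
// Hypothetical rules, not taken from the published macro.
var FENCE = "`".repeat(3); // wikitext code block delimiter

function decodeEntities(s) {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" ends up as "&lt;"
}

// Drop highlighting tags inside the code, then decode the entities.
function plainCode(body) {
  return decodeEntities(body.replace(/<[^>]+>/g, ""));
}

function convertCode(intext) {
  // First alternative: block code. Second alternative: inline code.
  var pattern = /<pre[^>]*>\s*<code[^>]*>([\s\S]*?)<\/code>\s*<\/pre>|<code[^>]*>([\s\S]*?)<\/code>/gi;
  return intext.replace(pattern, function (match, block, inline) {
    if (block !== undefined) {
      var code = plainCode(block).replace(/\n$/, "");
      return "\n" + FENCE + "\n" + code + "\n" + FENCE + "\n";
    }
    return "`" + plainCode(inline) + "`";
  });
}
```

Handling both cases in one pass means that code which itself mentions a `<code>` tag is not converted a second time.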

Mark_S · August 9, 2022, 10:46pm

Ok. See if current version fixes this.

Edit: How do I shrink/expand columns?

arunnbabu81 · August 10, 2022, 2:39am

Wow, it works now. I have updated that tiddler.

Ctrl+Alt+B to reduce the number of columns.
Ctrl+Alt+N to increase the number of columns.

TW_Tones · August 10, 2022, 4:40am

I am keen to incorporate HTML2TW into a dropbar: if HTML is dropped on it, it appends the HTML as TW markup to the text field.

What level of maturity is HTML2TW at?

arunnbabu81 · August 10, 2022, 8:27am

I think @Mark_S is the better person to answer that.

From this post, it seems this macro was created way back in 2016, so many people must have used it (just an assumption).
I only started using it recently. It’s reasonably reliable except for the few bugs I have reported in this thread. It also depends on the type of websites we need to clip content from. I will definitely use it.

In its current state, it’s best used along with tiddlyclip. Tiddlyclip and this macro are a deadly combo if some of these bugs are fixed. In two or three clicks, you can clip text and images to your wiki with wikitext formatting preserved.

I wonder why people hardly talk about these two tools here in the forum. I don’t know whether it’s because few people use them or because nobody has any questions about their usage. Also, many people might not be aware of these tools.

Mark_S · August 10, 2022, 5:07pm

I’ve made some tweaks that should fix (sort of) problems 1 thru 4.

It’s six years old. How many mature six year olds do you know?

There’s been almost no interest in the last 6 years. I think part of the problem is that it’s kind of hard to figure out the configuration for bj’s web clipper. The documentation is old, scattered in various places, and doesn’t include the latest developments. It’s not an out-of-the-box experience like with Evernote or some other app.

I know for myself, I generally just use zotero or something to save articles. Lazy.

That said, it’s surprising that there isn’t more interest.

I can only assume that most people don’t need web-clipping, or they are content with HTML clipping, regardless of the management implications.

The HTML2TW macro is all based on regular expressions. This is very flexible and easy to understand. But it can go off the rails with poorly formatted text. Probably a more robust method would be to put the text in an HTML object and then parse the tree using standard JS tools. I sort of thought someone would have done that by now.
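
A tree-based converter of that kind might start out like the sketch below. It is only an outline under stated assumptions: `nodeToWikitext` and `htmlToWikitext` are hypothetical names, only a handful of tags are handled, and the `DOMParser` step works in a browser but not in plain Node.

```javascript
// Hypothetical sketch: walk a DOM tree and build wikitext by recursion.
// Tags that are not listed simply pass their children's text through.
function nodeToWikitext(node) {
  if (node.nodeType === 3) return node.nodeValue; // text node
  if (node.nodeType !== 1) return "";             // comments etc. are dropped

  var inner = "";
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    inner += nodeToWikitext(kids[i]);
  }

  switch (node.nodeName.toLowerCase()) {
    case "img":
      // getAttribute reads the real src, so data-lg-src cannot be mistaken for it
      return "[img[" + node.getAttribute("src") + "]]";
    case "a":
      var href = node.getAttribute("href");
      // A link around an image is written after the image, not around it
      if (inner.indexOf("[img[") !== -1) return inner + " [[Link|" + href + "]]";
      return "[[" + inner + "|" + href + "]]";
    case "b":
    case "strong":
      return "''" + inner + "''";
    case "i":
    case "em":
      return "//" + inner + "//";
    default:
      return inner;
  }
}

// Browser-only entry point: DOMParser is a standard browser API.
function htmlToWikitext(html) {
  var doc = new DOMParser().parseFromString(html, "text/html");
  return nodeToWikitext(doc.body);
}
```

Because the parser repairs unclosed or badly nested tags before the walk starts, poorly formatted input is less likely to derail the conversion than it is with pattern matching.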

TiddlyTitch · August 10, 2022, 5:26pm

Bingo! The man (buggyj · GitHub) is brilliant! Likely fed up with supporting something with no revenue stream for years?

TT

arunnbabu81 · August 10, 2022, 5:42pm

That’s true. I have asked him a great many questions about tiddlyclip and his other plugins on GitHub, and he has replied to all of them without delay. Just look at the discussion page of the Tiddlyclip plugin and the number of topics he has answered, maybe within the last year or less. I hope to set up a demo wiki for tiddlyclip, demonstrating with examples the various clipping methods I have learned from him.

Sadly that also must be true.

arunnbabu81 · August 10, 2022, 6:13pm

@Mark_S Thank you. I have updated the demo wiki. The first 4 image-related issues are fixed now, except for one thing I forgot to mention before. In the 4th issue, the numbered list was shown as a bullet list after conversion, with some extra gap between each bullet.

Earlier today I found another bug related to images. See this tiddler. The first image is missing from this clip (HTML added for comparison). With the earlier version of the macro, the first two images were missing.

I think if I use it more, there will be some random issues like this. But that doesn’t limit the usefulness of this macro. It saves us so much time. Should I report such small issues here or just ignore them?

Mark_S · August 10, 2022, 6:25pm

Maybe you’re using the version before the update? When I try clipping the first 3 images and some text, it seems to be all there. I’m a bit overwhelmed by the size of the images.

redsign-of-landing-page.json (1.3 KB)

Re multi-columns, how do you change the size of the columns (not the number, which I was able to find easily)?

Thanks!