Convert a Word document into tiddlers

quarktasche · August 13, 2024, 7:52am

Hello everybody.
I’m wondering whether anyone has ever thought about converting a Word document into separate tiddlers.
The goal is to convert a manual written in a Word document into tiddlers of a wiki.
The chapters and subchapters should therefore be used as tags.
I already know how to structure the tiddlers using tags:
I create a tiddler for each chapter, and in that tiddler I arrange all the subchapter tiddlers as tabs.

My problem is how to convert the Word document, perhaps into a JSON file, so that I can import it into a TiddlyWiki.

Thanks for your answers in advance.

Springer · August 13, 2024, 1:54pm

Hello @quarktasche,

Certainly, I’ve converted Word docs into TiddlyWiki.

The right structure of tiddlers, of course, depends on the structure of the original and the desired kind of interaction you want with the final result.

I don’t think you need to do the splitting in advance (that is, to import many json for one document). Just paste the whole document into a tiddler, and then use various steps to “digest” it into the right tiddler-sized bits…

I recall that someone had made a button, at one point, to split a tiddler into one tiddler for each paragraph in the original. But at some point I stopped having a need for it, and can’t easily track it down in my own wikis now…

However, a quick search for tiddler “slice” gives this — which represents at least one version of that approach:

https://tiddlywiki.com/editions/text-slicer/

quarktasche · August 14, 2024, 5:05am

Thanks for your answer, @Springer.
I’ll give it a try and post my experiences.

jms19 · August 14, 2024, 10:21am

I’d guess, with no experience of doing so, that it should be done in two stages:

- Export/save the Word document in some open structured form, possibly HTML or XML.
- Write something in your favourite language to process that into TiddlyWiki-importable JSON.
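The second stage could be sketched roughly as follows, assuming the Word document was saved as HTML with proper h1 to h3 headings. The class and function names (`HeadingSplitter`, `html_to_tiddlers`) and the single shared tag are made up for illustration; formatting and images are dropped, so treat it as a starting point only.

```python
import json
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Collect the plain text found under each h1-h3 heading."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []      # one dict per heading, in document order
        self._in_heading = False
        self._current = None    # section currently being filled

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._current = {"title": "", "text": ""}
            self.sections.append(self._current)
        elif tag == "p" and self._current and self._current["text"].strip():
            # Separate consecutive paragraphs with a blank line.
            self._current["text"] += "\n\n"

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._current is None:
            return  # text before the first heading is ignored in this sketch
        key = "title" if self._in_heading else "text"
        self._current[key] += data


def html_to_tiddlers(html, tag):
    """Return a list of tiddler dicts, one per heading, all sharing one tag."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return [
        # Tags are a space-separated title list, so bracket the tag.
        {"title": s["title"].strip(), "text": s["text"].strip(), "tags": "[[%s]]" % tag}
        for s in parser.sections
    ]


def tiddlers_to_json(tiddlers):
    """Serialise to a JSON array of tiddler objects for import."""
    return json.dumps(tiddlers, indent=2, ensure_ascii=False)
```

Writing the result of `tiddlers_to_json(html_to_tiddlers(html, "Manual"))` to a `.json` file should give something that can be dragged onto the wiki for import.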

TW_Tones · August 14, 2024, 11:10pm

Along with @Springer’s advice, it has been suggested that saving a Word document as XML is an avenue towards importing Word documents, with formatting, into TiddlyWiki. This will mean the content is in XML rather than simple TiddlyWiki markup.

There are possibly hundreds of paths to import Word to TiddlyWiki so it is worth doing some exploration.

Some tips

- Export your document to Markdown, import it as Markdown and install the Markdown plugin; your document will then be in an easy-to-edit form with formatting. See Pandoc.
- Once it is imported as a monolithic document, use it as is; if you want to break it down, use the excise tool, e.g. excise each chapter and other sections while keeping a single view in the original tiddler (e.g. replace with transclusion).
  - Do the same for each resulting chapter tiddler so the sections get the chapter tag.
- One could keep the monolithic document and only excise the parts you want to reference separately, e.g. a quote.

If you import the document as Markdown, it may be possible to write some batch solutions to split the document into pieces, provided you can identify each subsection.
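One possible batch approach, as a rough sketch: split the Markdown on its headings and tag each piece with its parent heading. This assumes the headings are written in `#` style (levels 1 to 3) and that titles are unique, since duplicate titles would overwrite each other on import. The function name `split_markdown` and the root title are invented for the example.

```python
import re

# Matches "# Title", "## Title" or "### Title"; deeper levels stay in the body.
HEADING = re.compile(r"^(#{1,3})\s+(.*?)\s*#*\s*$")
FENCE = "`" * 3  # start of a fenced code block


def split_markdown(md_text, root_title):
    """Split Markdown into tiddler dicts, each tagged with its parent heading."""
    root = {"title": root_title, "tags": "", "type": "text/x-markdown", "lines": []}
    tiddlers = [root]
    parents = {0: root_title}   # heading level -> latest title seen at that level
    current = root
    in_fence = False
    for line in md_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence         # do not treat "#" inside code as headings
        match = None if in_fence else HEADING.match(line)
        if match is None:
            current["lines"].append(line)
            continue
        level, title = len(match.group(1)), match.group(2)
        # Forget headings at the same or a deeper level, then pick the nearest parent.
        parents = {lvl: t for lvl, t in parents.items() if lvl < level}
        parent_title = parents[max(parents)]
        parents[level] = title
        current = {
            "title": title,
            "tags": "[[%s]]" % parent_title,
            "type": "text/x-markdown",
            "lines": [],
        }
        tiddlers.append(current)
    for tiddler in tiddlers:
        tiddler["text"] = "\n".join(tiddler.pop("lines")).strip()
    return tiddlers
```

Dumping the returned list with `json.dumps` gives a JSON array that can be imported; the `text/x-markdown` type relies on the Markdown plugin being installed.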

Perhaps something can be done using “JSON Mangler” against an XML document?

Is this a one-off or an ongoing need? If ongoing, consider automating, and lean on the community, because this may be a shared need.

quarktasche · August 15, 2024, 8:26am

Thanks for your answers and hints.
For me, splitting up Word documents is not a regular need.
I just have to split up one document, and I will try out the different hints.
The problem is that the document seems to be too big, so the import leads to bad performance in the wiki I am trying to build.
Maybe I have to split up the Word document before I import it into the wiki.

hiddengarden · August 25, 2024, 11:45pm

If the Word file is structured well with H1 to H3 headings, I would run a macro in Word to flag them accordingly. Then I would paste it into an Excel spreadsheet and use VBA to generate the JSON file for import.

I did this a long time ago, thinking that if I could digest the content of a book, the knowledge would somehow get into my brain and stay there. I did this once or twice before I realised that this will not work for me unless I do the reading and take the notes myself…


Springer · August 26, 2024, 1:17am

I hear you. I have tried all kinds of osmosis, outlining, and x-ray vision. In grad school, I even tried touch-typing a whole book that was on brief interlibrary loan (at a time when OCR and electronic text were not yet widely available), figuring this approach would be faster than formulating careful notes. I’m afraid I got little more than an imperfectly-searchable text that I still had not digested… At least for the works that are really important and provocative, nothing quite does the trick like starting at the beginning and sitting with the text.

quarktasche · August 26, 2024, 5:03am

A little update to this topic.

I have stopped pursuing this idea because it is just a one-time project and my knowledge of converting via VBA and other programming languages is very limited.
The other point is that, maybe because of the many pictures in the text, some functions I tried didn’t work.
I also don’t have enough time to spend on this project because a lot of other work has to be done.
Thanks for all of your answers and input.

hiddengarden · August 26, 2024, 10:03am

Very true. An image is worth a thousand words. I too couldn’t figure that one out.

You can extract images from a Word file in bulk, but it exports them as “Image 1, 2, 3, etc.”, disconnecting them from the text. Adding images manually in my scenario, with limited context, takes the joy out of reading.

Perhaps, if someone could help me handle the context of the images, I would be happy to share the macro files, because I still believe there is value in this.

My challenge then would be to make Excel write a TOC JSON/tiddler that would keep information sequential. And of course make it usable by others.
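For the sequential TOC part, one option that avoids Excel (swapping in Python for VBA here) is to lean on the way TiddlyWiki orders tagged tiddlers: the `list` field of the tag tiddler sets the order of its children. A minimal sketch, assuming the headings have already been extracted as `(level, title)` pairs in document order; `build_toc` and `title_list` are illustrative names, and titles containing `]]` are not handled.

```python
def title_list(titles):
    """Format titles as a TiddlyWiki title list, bracketing any that contain spaces."""
    return " ".join("[[%s]]" % t if " " in t else t for t in titles)


def build_toc(root_title, headings):
    """Turn (level, title) pairs into tiddlers whose tags and list fields keep the order."""
    children = {root_title: []}     # title -> child titles in document order
    parent_of = {}
    stack = [(0, root_title)]       # open ancestors, root has level 0
    for level, title in headings:
        # Close every heading at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1]
        children[parent].append(title)
        children[title] = []
        parent_of[title] = parent
        stack.append((level, title))

    tiddlers = [{"title": root_title, "tags": "", "list": title_list(children[root_title])}]
    for _, title in headings:
        tiddlers.append({
            "title": title,
            "tags": title_list([parent_of[title]]),
            "list": title_list(children[title]),
        })
    return tiddlers
```

With these fields in place, a `<<toc "Manual">>` call in some tiddler should render the chapters and sections in their original sequence, where "Manual" stands for whatever root title was used.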

john.edw_gmail.com · August 26, 2024, 8:32pm

Pandoc should do the trick. You could write a bash script to convert a whole directory of files.

Here is an example conversion from the command line:


pandoc -s file-sample_100kB.docx --wrap=none --reference-links -t markdown_strict --extract-media=./file-sample_100kB -o file-sample_100kB.md

This will convert file-sample_100kB.docx to markdown_strict and extract the media to the ./file-sample_100kB/media subdirectory.
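If bash is not available, the same directory-wide conversion could be sketched in Python instead. This simply rebuilds the command above for every `.docx` file in a folder; it assumes `pandoc` is on the PATH, and the helper names `pandoc_command` and `convert_directory` are made up for the example.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path):
    """Build the pandoc call shown above for one .docx file."""
    docx_path = Path(docx_path)
    media_dir = docx_path.with_suffix("")   # e.g. file-sample_100kB
    return [
        "pandoc", "-s", str(docx_path),
        "--wrap=none", "--reference-links",
        "-t", "markdown_strict",
        f"--extract-media={media_dir}",
        "-o", str(docx_path.with_suffix(".md")),
    ]


def convert_directory(directory):
    """Run pandoc on every .docx file directly inside the given directory."""
    for docx_path in sorted(Path(directory).glob("*.docx")):
        # check=True stops at the first file pandoc fails to convert.
        subprocess.run(pandoc_command(docx_path), check=True)
```

Calling `convert_directory(".")` would then leave one `.md` file and one media folder next to each Word file.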

original docx: File Examples | Download redirect...

pandoc output: Processing: file-sample_100kB.md…

pandoc output pasted into a markdown tiddler:
file-sample_100kB.md.tid (7.4 KB)

media folder contents:

hiddengarden · August 29, 2024, 1:33pm

Awesome! Can you confirm this works in Windows?

john.edw_gmail.com · September 3, 2024, 2:19pm

Works on my win10 setup. There is a pandoc python library if you need to do some manipulation other than just converting from the command line.

ShadowTiddlers · September 3, 2024, 6:57pm

In theory, I imagine .odt files could be importable, as they store data as XML inside a zip archive, the zip likely being there for images and any other embedded files. It’s a neat thing I’d like to look into at some point, as docx should be convertible to odt. A plugin enabling something like application/open doc might be possible for native TiddlyWiki, but it may end up acting like an embedded PDF. Still, I imagine it should be possible if odt is mostly XML, as that should at least be readable, but I’m pretty sure docx is overly complex. A new thing I’d personally like to tinker with.
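As a first experiment along those lines, outside TiddlyWiki itself, the headings and paragraphs of an .odt file can be read with nothing but the Python standard library, since the body text sits in `content.xml` inside the zip. A minimal sketch; the function name `read_odt_blocks` is invented here, and tables, lists and images are not treated specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of text elements in OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_odt_blocks(odt_file):
    """Return (kind, level, text) tuples for headings and paragraphs, in document order."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    blocks = []
    for element in root.iter():
        if element.tag == f"{{{TEXT_NS}}}h":
            # Headings carry their depth in the text:outline-level attribute.
            level = int(element.get(f"{{{TEXT_NS}}}outline-level", "1"))
            blocks.append(("heading", level, "".join(element.itertext())))
        elif element.tag == f"{{{TEXT_NS}}}p":
            blocks.append(("paragraph", 0, "".join(element.itertext())))
    return blocks
```

The resulting tuples could then be grouped by heading and written out as tiddlers in whatever structure suits the wiki.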