Archiving the list

I see the posts made from GG (Google Groups) are being archived at www.mail-archive.com. It doesn’t look as though the ones posted from Discourse are being captured. I have no idea if it’s easy or possible to get them to archive a Discourse board, but it might be worth considering.

I don’t know anything about Mail Archive in particular (in the sense that I guess it’s a free service and we hope they stay around?). Looks like someone subscribed an email there at some point. There seem (from search results) to be various things that grab GG content.

For Discourse, since it isn’t email-first, the methods are slightly different. We have backups, plus the entire server is backed up. It should also be web-crawlable, so e.g. the Internet Archive / Wayback Machine can capture content as needed.

I did a little research, and the Discourse Data Explorer plugin (which “explores” the data in the database) can be used to export to CSV, among other formats.
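
For illustration, this is the sort of query one might paste into Data Explorer. The table and column names below are my assumptions based on Discourse’s standard schema, so they’d need verifying against the live database:

```python
# A minimal sketch of a Data Explorer export query; table/column names
# (posts, topics, users, raw, cooked, etc.) are assumptions to verify.
EXPORT_POSTS_SQL = """
SELECT t.title,
       p.post_number,
       u.username,
       p.created_at,
       p.raw              -- original markdown; p.cooked holds rendered HTML
FROM   posts  p
JOIN   topics t ON t.id = p.topic_id
JOIN   users  u ON u.id = p.user_id
WHERE  p.deleted_at IS NULL
ORDER  BY t.id, p.post_number
"""
```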

@clutterstack can you say more about what you think should be done?

Do you want to take on researching this and making recommendations?

For me, the ideal archive for the forums would be a GitHub repo that anyone could clone and use offline.

Given that Discourse uses an RDBMS, presumably we’d need a cron job/GitHub Action that periodically uses the API to update the archive and commit it back.
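
As a rough sketch of what that periodic job could do: Discourse exposes public JSON endpoints by appending .json to most URLs, so something like the following could snapshot topics into files that an Action then commits. The forum URL is a placeholder, and this is only a sketch, not a finished tool:

```python
import json
import pathlib
import time

import requests  # pip install requests

BASE = "https://talk.tiddlywiki.org"  # placeholder: whichever forum is archived
OUT = pathlib.Path("archive")
OUT.mkdir(exist_ok=True)

# Page through the public topic listing and snapshot each topic's JSON.
# Caveat: for long topics, /t/{id}.json only includes the first chunk of
# posts; the full list of post ids is in post_stream["stream"], and the
# remainder would need extra requests.
page = 0
while True:
    listing = requests.get(f"{BASE}/latest.json", params={"page": page}).json()
    topics = listing["topic_list"]["topics"]
    if not topics:
        break
    for topic in topics:
        detail = requests.get(f"{BASE}/t/{topic['id']}.json").json()
        (OUT / f"topic-{topic['id']}.json").write_text(json.dumps(detail, indent=2))
        time.sleep(0.5)  # be polite; Discourse rate-limits anonymous traffic
    page += 1
```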

There seems to be a data-explorer plugin.

So once we know how to extract the content in a format that works, there are JSON and CSV export buttons in the UI screenshots. …

The Docker container running the app seems to have the PostgreSQL CLI installed. But I don’t know how easy it will be to send data to the outside. … It should be possible, though.
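
If going the database route, something along these lines might work from the host. The container name “app”, the database name “discourse”, and running psql as the postgres user are all assumptions based on the standard discourse_docker setup, and would need checking:

```python
import subprocess

# Assumptions to verify: the container is named "app" (the discourse_docker
# default), the database is "discourse", and psql runs as the postgres user.
SQL = (r"\copy (SELECT id, topic_id, post_number, raw, created_at "
       r"FROM posts WHERE deleted_at IS NULL) TO STDOUT WITH CSV HEADER")

with open("posts.csv", "w") as out:
    subprocess.run(
        ["docker", "exec", "app",
         "su", "postgres", "-c", f'psql discourse -c "{SQL}"'],
        stdout=out, check=True)
```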

@jeremyruston What format would the archive ideally take on GitHub?

It looks as though lots of people have talked about archiving Discourse sites, but there doesn’t seem to be one obvious solution.

Here’s one fairly info-rich thread on it: A basic Discourse archival tool - dev - Discourse Meta. After a quick scan, it looks like everybody there is talking about crawling the forum site and making a kind of limited static copy rather than using database queries.

Suggestions there:

HTTrack Website Copier (an older, general-purpose tool; infinite scroll may be a barrier?)

A Jupyter notebook by the thread’s OP, Mark McClure, for archiving his Discourse site (example output)

A fork of Mark McClure’s ArchiveDiscourse by GitHub user kitsandkats, which notably grabs all the posts in a topic. Results: https://discuss-learn.media.mit.edu/ (note the lack of dates).

User Brecht Machiels mentions at the end of the thread that he used wget to download his site and, in ways I haven’t investigated yet, ended up with a searchable archive using replayweb.page (a viewer) and the WACZ format (a “working draft/proposal for a directory structure + ZIP format specification for sharing and distributing web archives”). This was interesting because he mentions search.

I’m not advocating any of these approaches, just exploring.


Thanks for doing this research, it’s a great start!

I think Jeremy is thinking about portability rather than backup (so JSON files or HTML for example).

Including making it work offline, specifically for searchability.

There are a number of minimal search engines that could do a good job at this.
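
For instance (a sketch of the idea, not a recommendation), Whoosh is one small, pure-Python engine that could index exported posts for offline search; the field names here are made up for the example:

```python
import os

# pip install whoosh -- one example of a minimal pure-Python search engine;
# any similar library would work for an offline index over exported posts.
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(title=TEXT(stored=True), post=ID(stored=True), body=TEXT)
os.makedirs("searchindex", exist_ok=True)
ix = create_in("searchindex", schema)

# Index one sample document; a real tool would loop over the exported posts.
writer = ix.writer()
writer.add_document(title="Archiving the list", post="1",
                    body="posts made from GG are being archived")
writer.commit()

# Query the index and print matching titles.
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("archived")
    for hit in searcher.search(query):
        print(hit["title"], hit["post"])
```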

And, unspoken: not requiring the rather hefty “stack” needed to run a Discourse server with database, email, etc. The backup function handles that Discourse restore use case.

I think if we’re going this direction, turning a Discourse into a TiddlyWiki would actually be a reasonable process.

It could start by dumping CSVs into a GitHub repo, followed by a process for turning those CSVs into tiddlers.
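
A rough sketch of that CSV-to-tiddlers step, assuming columns matching the query sketched earlier (title, post_number, raw, created_at, username); the column names and output filename are placeholders:

```python
import csv
import json
from datetime import datetime, timezone

def tw_timestamp(timestamp):
    """Convert a Postgres-style timestamp to TiddlyWiki's 17-digit format."""
    dt = datetime.fromisoformat(timestamp)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S") + "000"

# Column names follow the Data Explorer sketch above; adjust to the export.
tiddlers = []
with open("posts.csv", newline="") as f:
    for row in csv.DictReader(f):
        tiddlers.append({
            "title": f"{row['title']}/{row['post_number']}",
            "text": row["raw"],
            "created": tw_timestamp(row["created_at"]),
            "creator": row["username"],
            "tags": "DiscourseArchive",
        })

# The resulting JSON file can be dragged onto a TiddlyWiki to import it all.
with open("tiddlers.json", "w") as f:
    json.dump(tiddlers, f, indent=2)
```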

I also think this would be of interest to the wider Discourse community.

I know, so tempting.

There seems to be a big gap for the general Discourse-using community, even if just for a good static site conversion. Unless I am missing a big obvious thing.

So, I experimented with Data Explorer (which requires writing some SQL queries inside it), exported to CSV, used @joshuafontany’s JSONMangler (which has a CSVImport inside it) … and exported/imported all posts from a food Discourse of mine into my TWGroceries 🙂

I’ll do a full write-up rather than this teaser, but suffice it to say that there definitely is a pipeline to export a Discourse forum to a standalone TW.

Thanks for kicking this off @clutterstack!


Hmm. We could make this as simple as dragging and dropping the CSV onto a TiddlyWiki to import it, by writing a custom deserializer module.

Right. No reason you should. It has been around forever. It is mentioned in the GG info header for TW because I, long ago, pointed out it was actually an easier, more reliable resource for advanced search than what Google do in the GG!

That said, Mail Archive is resolutely rooted in PLAIN TEXT stuff that was the daily fodder of Usenet and Listserv, from which it evolved.

RESULT? I DO think a copy of every Discourse post sent there would still exist well into the future. However, it does not support HTML email natively, given its nativity was in a world of plain text only.

BUT, I would certainly consider its relevance for CONTINUITY in some ways. It may still actually be better than GG itself for finding stuff.

Best wishes
TT

Re: Mail Archive, I meant its business model, who runs it, and how long it is likely to stick around.

As answered above, Discourse isn’t a mailing list, so sending posts to Mail Archive isn’t possible.

I think we’re going in a good direction to make content available.

As @boris suggests, I’m interested in a few goals:

  • Making sure that we have a durable backup of discussions that anyone in the community can verify or use
  • Lowering the barriers to TW-based experiments with the discussion forum content
  • e.g. an offline search engine

Given Discourse’s tech stack, perhaps a reasonable goal would be HTML snapshots of posts (and images and attachments) in a format that TW can import, such as JSON. It would be a pretty generic tool that should be useful to other communities.
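
As a sketch of what that could look like, using Discourse’s public /t/{id}.json endpoint: each post’s “cooked” field is the rendered HTML, and typing the tiddler as text/html lets TiddlyWiki display it as-is after import. The forum URL and topic id below are placeholders:

```python
import json

import requests  # pip install requests

BASE = "https://talk.tiddlywiki.org"  # placeholder: the forum to snapshot
TOPIC_ID = 42                         # hypothetical topic id for the sketch

topic = requests.get(f"{BASE}/t/{TOPIC_ID}.json").json()

# Wrap each post's rendered HTML ("cooked") in a TW-importable tiddler.
tiddlers = [{
    "title": f"{topic['title']}/{post['post_number']}",
    "text": post["cooked"],
    "type": "text/html",
    "creator": post["username"],
    "tags": "DiscourseArchive",
} for post in topic["post_stream"]["posts"]]

with open(f"topic-{TOPIC_ID}.json", "w") as f:
    json.dump(tiddlers, f, indent=2)
```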