Can I get the Talk forum and GG dataset to train an AI?

I bought a 3090 Ti recently, and I think it's time to train an AI that can auto-reply to newbie-level questions, as well as write some basic macro and widget-call wikitext.

QA AI

I would need some publicly accessible data to build the dataset, and crawling this forum is not as efficient as a direct download. @boris, it seems you host the forum server? Do you think it's OK to send me a zip or SQLite dump of the public conversation data on this forum? (Excluding user info and private DMs.)

Would people here mind this?
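If a direct export isn't possible, crawling is still an option, since Discourse serves a JSON version of every public page (append `.json` to the URL). A minimal sketch of that fallback, assuming Node 18+; the output file name and the one-second delay are my own choices:

```js
// fetch-talk.js: pull public topics from a Discourse forum via its
// built-in JSON endpoints (append ".json" to any URL). Node 18+ ships
// a global fetch. The output path and rate limit here are examples.
const fs = require("node:fs");

const BASE = "https://talk.tiddlywiki.org";

async function fetchJson(path) {
  const res = await fetch(`${BASE}${path}`);
  if (!res.ok) throw new Error(`${path}: HTTP ${res.status}`);
  return res.json();
}

async function main() {
  // /latest.json lists recent public topics.
  const latest = await fetchJson("/latest.json");
  const out = fs.createWriteStream("talk-topics.jsonl");
  for (const topic of latest.topic_list.topics) {
    // /t/<id>.json returns the topic with its posts in post_stream.
    const full = await fetchJson(`/t/${topic.id}.json`);
    const posts = full.post_stream.posts.map((p) => ({
      topic: topic.title,
      // "cooked" is the rendered HTML of the post body; usernames are
      // deliberately dropped, per the privacy concern above.
      html: p.cooked,
    }));
    out.write(JSON.stringify({ id: topic.id, posts }) + "\n");
    await new Promise((r) => setTimeout(r, 1000)); // be polite to the server
  }
  out.end();
}

main().catch(console.error);
```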

For the Google Groups forum, I may have to write a crawler.

If you have other suggested datasets, please reply. For example, a wiki that is full of tutorials? I can't just crawl https://links.tiddlywiki.org/, since some of the linked sites may contain personal info unrelated to TiddlyWiki.

Code AI

I think I will download TiddlyWiki plugins from GitHub. Then I need to add comments to the plugins (with the help of the QA AI), so the trained AI can generate more reasonable wikitext using chain-of-thought (output the reasoning in a comment, then the code).
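For example, a single training sample might look like the following (a sketch: the prompt and the macro are invented, but the comment-then-code shape is the chain-of-thought format I mean, and `<!-- -->` comments and `\define` macros are standard TW5 wikitext):

```js
// One hypothetical chain-of-thought training sample: the completion
// opens with a wikitext comment explaining the reasoning, then the code.
const sample = {
  prompt: "Write a macro that lists all tiddlers tagged Tutorial.",
  completion: [
    "<!-- Reasoning: a $list widget with a tag filter enumerates the",
    "     tiddlers; wrapping it in \\define makes it reusable. -->",
    "\\define tutorialList()",
    "<$list filter=\"[tag[Tutorial]]\">",
    "<<currentTiddler>><br/>",
    "</$list>",
    "\\end",
  ].join("\n"),
};
console.log(JSON.stringify(sample));
```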


Most of the GG forum contains TiddlyWiki Classic information, which would get mixed up with TW5 content. So you need to be careful to separate the GG posts into TWC and TW5 for training data.

Is there an easy way to filter them? By post date?

@jeremyruston are you the owner of the TW Google Group? It seems GG can also export data:
https://support.google.com/groups/answer/9975859?hl=en

Hi @linonetwo, in the past, because I've always subscribed to the group via email, I've used an export from my email app to make an MBOX file of the TW Google Group content.

I didn’t realise Google offered a direct download of Groups data. I’ve now initiated that process and will be happy to pass you the results when they appear.

Quite early on the community adopted the convention of using [TW5] in the subject line of posts concerned with TW5. At some point, we established the separate TiddlyWiki Classic Google Group, and I think from that point there isn’t much TWC content in the group.
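That convention should make the split mostly mechanical. A sketch of the idea (assuming an mbox export, where each message begins with a line starting "From "; the file names are invented) that buckets each message by whether a Subject header contains [TW5]:

```js
// split-mbox.js: split an mbox export into TW5 / TWC buckets using the
// "[TW5]" subject-line convention. In mbox format, each message begins
// with a line starting "From ". File names below are just examples.
const fs = require("node:fs");
const readline = require("node:readline");

async function main() {
  const rl = readline.createInterface({
    input: fs.createReadStream("tiddlywiki-group.mbox"),
    crlfDelay: Infinity,
  });
  const tw5 = fs.createWriteStream("tw5.mbox");
  const twc = fs.createWriteStream("twc.mbox");

  let message = [];
  const flush = () => {
    if (message.length === 0) return;
    const isTw5 = message.some(
      (l) => l.startsWith("Subject:") && l.includes("[TW5]")
    );
    (isTw5 ? tw5 : twc).write(message.join("\n") + "\n");
    message = [];
  };

  for await (const line of rl) {
    if (line.startsWith("From ")) flush(); // mbox message separator
    message.push(line);
  }
  flush(); // don't forget the final message
  tw5.end();
  twc.end();
}

main().catch(console.error);
```

(Folded Subject headers and quoted "From " lines in message bodies would need slightly more care than this.)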

Since this is already public data, do you think uploading it to a GitHub repo is acceptable? Many datasets are on GitHub now, and it would be more convenient for me to download it from there.

If it is, I would suggest https://github.com/tiddlywiki/dataset or something similar. And we would need another project to prepare some high-quality QA sets there in a collaborative way.
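For instance (purely illustrative; the file layout and field names are my own invention), contributors could each add one-line JSONL records, so pull requests merge cleanly:

```js
// One hypothetical record in, say, qa/transclusion.jsonl: one QA pair
// per line. The field names here are invented, not a fixed schema.
const record = {
  question: "How do I show the text of another tiddler inside mine?",
  answer:
    "Use a transclusion: {{OtherTiddler}} embeds that tiddler's text " +
    "field, and {{OtherTiddler!!caption}} embeds a specific field.",
  source: "https://tiddlywiki.com/#Transclusion",
};
console.log(JSON.stringify(record));
```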

Hi @admins ,

Can you help export the latest data from the forum? It would be better to put it in a GitHub repo, so we don't need a cloud drive to share it, and anyone else interested in making an AI copilot for writing wikitext can use it to train a private AI for themselves too.

I did do that once, but we pulled it once we realised that the data includes unredacted email addresses. In other words, the copy of the data in my email inbox is not actually the same as the publicly available copy of the data on the Google Groups servers.

I think we’d need to obfuscate the email address in order to be able to publish the data.

Okay, I can clean it up before uploading if you don't have time for it. But we have to find a way to transfer it, maybe via Google Drive? Or you could clean it first, when you have time, and upload it to GitHub directly?

It has been a year since the GPT-3 release and we still have no progress on this. @jeremyruston @admins, can you provide some training data so my 4090 Ti can start working?

I'm not a resident expert on the subject, but a 4090 Ti might not be able to fully train an LLM very quickly.

Something I've read up on is that there are companies that will let you rent their systems for training models; someone I watch on YouTube was having a hard time on their own with 64 gigabytes of RAM and a 4080.

Just felt like sharing. I'm not sure how resource-intensive TiddlyWiki would prove to be, but if a 4090 Ti isn't beefy enough, I hope you're able to find a budget-friendly service to use :grinning_face_with_smiling_eyes:

Thanks, luckily I don't have to worry about this: I have an AI company and recently bought a few 4090 Tis, so I can use them. But I don't have time to prepare the data myself. I don't even know how big the dataset will be; there isn't much good training data about TiddlyWiki.

Hi @linonetwo, with respect to the Google Groups archives that I have, I am concerned that passing them to you without redacting the email addresses would be a breach of EU and other privacy laws. As the administrator of the group, I think I am the data controller, which means that I only have access to personal data for the limited purposes of operating the group.

I don’t actually think that the obfuscation need necessarily be a huge task. The mbox files are plain text files, and while I understand that it’s not reliable to use a regex to find email addresses, I don’t think that it would be hard to find a reliable email address parser that one could roll up into a JS command line tool.
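Something along these lines, perhaps (a sketch: the regex below is exactly the unreliable shortcut I mentioned, standing in until a proper address parser replaces it; the file names are invented):

```js
// redact-mbox.js: stream an mbox file and mask anything that looks
// like an email address. NOTE: the regex is a naive stand-in; a proper
// RFC 5322 address parser should replace it before publishing anything.
const fs = require("node:fs");
const readline = require("node:readline");

// Catches common addresses; misses exotic but valid ones.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

async function main() {
  const rl = readline.createInterface({
    input: fs.createReadStream("tiddlywiki-group.mbox"),
    crlfDelay: Infinity,
  });
  const out = fs.createWriteStream("tiddlywiki-group.redacted.mbox");
  for await (const line of rl) {
    out.write(line.replace(EMAIL, "redacted@example.com") + "\n");
  }
  out.end();
}

main().catch(console.error);
```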

I understand, Jeremy. I can wait for that. Or maybe you could start a new community project to collect useful training data? For example, create a community-editable wiki or repo that would be full of QA pairs. I think this also needs to be started by you to have enough appeal.

I believe a dedicated 3B or even 1B model will perform better than the expensive GPT-4.

I'm asking people to create a Chinese-language dataset to train an AI. Now the problem is similar to https://talk.tiddlywiki.org/t/community-curated-editions-how-best-to-coordinate-our-efforts/