Can I get the Talk forum and GG dataset to train an AI?

I bought a 3090 Ti recently, and I think it's time to train an AI that can auto-reply to newbie-level questions, as well as write some basic macro and widget-call wikitext.

QA AI

I would need some publicly accessible data to build the dataset, and crawling this forum is not as efficient as a direct download. @boris, it seems you host the forum server? Do you think it's OK to send me a zip or SQLite dump of the public conversation data on this forum? (Excluding user info and private DMs.)

Would people here mind this?
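If a direct export isn't possible, crawling is still an option, since Discourse serves a JSON version of every public page (append `.json` to the URL). A minimal sketch of that fallback, assuming Node 18+; the output file name and the one-second delay are my own choices:

```js
// fetch-talk.js: pull public topics from a Discourse forum via its
// built-in JSON endpoints (append ".json" to any URL). Node 18+ ships
// a global fetch. The output path and rate limit here are examples.
const fs = require("node:fs");

const BASE = "https://talk.tiddlywiki.org";

async function fetchJson(path) {
  const res = await fetch(`${BASE}${path}`);
  if (!res.ok) throw new Error(`${path}: HTTP ${res.status}`);
  return res.json();
}

async function main() {
  // /latest.json lists recent public topics.
  const latest = await fetchJson("/latest.json");
  const out = fs.createWriteStream("talk-topics.jsonl");
  for (const topic of latest.topic_list.topics) {
    // /t/<id>.json returns the topic with its posts in post_stream.
    const full = await fetchJson(`/t/${topic.id}.json`);
    const posts = full.post_stream.posts.map((p) => ({
      topic: topic.title,
      // "cooked" is the rendered HTML of the post body; usernames are
      // deliberately dropped, per the privacy concern above.
      html: p.cooked,
    }));
    out.write(JSON.stringify({ id: topic.id, posts }) + "\n");
    await new Promise((r) => setTimeout(r, 1000)); // be polite to the server
  }
  out.end();
}

main().catch(console.error);
```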

For the Google Groups forum, I may have to write a crawler.

If you have other suggested datasets, please reply. For example, a wiki that is full of tutorials? I can't just crawl https://links.tiddlywiki.org/, since some of the linked sites may contain personal info unrelated to TiddlyWiki.

Code AI

I think I will download TiddlyWiki plugins from GitHub. Then I need to add comments to the plugins (with the help of the QA AI), so the trained AI can generate more reasonable wikitext using chain-of-thought (output the reasoning in a comment, then the code).
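For example, a single training sample might look like the following (a sketch: the prompt and the macro are invented, but the comment-then-code shape is the chain-of-thought format I mean, and `<!-- -->` comments and `\define` macros are standard TW5 wikitext):

```js
// One hypothetical chain-of-thought training sample: the completion
// opens with a wikitext comment explaining the reasoning, then the code.
const sample = {
  prompt: "Write a macro that lists all tiddlers tagged Tutorial.",
  completion: [
    "<!-- Reasoning: a $list widget with a tag filter enumerates the",
    "     tiddlers; wrapping it in \\define makes it reusable. -->",
    "\\define tutorialList()",
    "<$list filter=\"[tag[Tutorial]]\">",
    "<<currentTiddler>><br/>",
    "</$list>",
    "\\end",
  ].join("\n"),
};
console.log(JSON.stringify(sample));
```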


Most of the GG forum contains TiddlyWiki Classic information, which would get mixed up with TW5 content. So you need to be careful to separate the GG posts into TWC and TW5 for training data.

Is there an easy way to filter them? By post date?

@jeremyruston are you the owner of the TW Google Group? It seems GG can also export data:
https://support.google.com/groups/answer/9975859?hl=en

Hi @linonetwo, in the past, because I've always subscribed to the group via email, I've used an export from my email app to make an MBOX file of the TW Google Group content.

I didn’t realise Google offered a direct download of Groups data. I’ve now initiated that process and will be happy to pass you the results when they appear.

Quite early on the community adopted the convention of using [TW5] in the subject line of posts concerned with TW5. At some point, we established the separate TiddlyWiki Classic Google Group, and I think from that point there isn’t much TWC content in the group.
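That convention should make the split mostly mechanical. A sketch of the idea (assuming an mbox export, where each message begins with a line starting "From "; the file names are invented) that buckets each message by whether a Subject header contains [TW5]:

```js
// split-mbox.js: split an mbox export into TW5 / TWC buckets using the
// "[TW5]" subject-line convention. In mbox format, each message begins
// with a line starting "From ". File names below are just examples.
const fs = require("node:fs");
const readline = require("node:readline");

async function main() {
  const rl = readline.createInterface({
    input: fs.createReadStream("tiddlywiki-group.mbox"),
    crlfDelay: Infinity,
  });
  const tw5 = fs.createWriteStream("tw5.mbox");
  const twc = fs.createWriteStream("twc.mbox");

  let message = [];
  const flush = () => {
    if (message.length === 0) return;
    const isTw5 = message.some(
      (l) => l.startsWith("Subject:") && l.includes("[TW5]")
    );
    (isTw5 ? tw5 : twc).write(message.join("\n") + "\n");
    message = [];
  };

  for await (const line of rl) {
    if (line.startsWith("From ")) flush(); // mbox message separator
    message.push(line);
  }
  flush(); // don't forget the final message
  tw5.end();
  twc.end();
}

main().catch(console.error);
```

(Folded Subject headers and quoted "From " lines in message bodies would need slightly more care than this.)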

Since this is already public data, do you think uploading it to a GitHub repo is acceptable? Many datasets are on GitHub now, and it would be more convenient for me to download it from there.

If it is, I would suggest https://github.com/tiddlywiki/dataset or something similar. And we would need another project to prepare some high-quality QA sets there in a collaborative way.
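For instance (purely illustrative; the file layout and field names are my own invention), contributors could each add one-line JSONL records, so pull requests merge cleanly:

```js
// One hypothetical record in, say, qa/transclusion.jsonl: one QA pair
// per line. The field names here are invented, not a fixed schema.
const record = {
  question: "How do I show the text of another tiddler inside mine?",
  answer:
    "Use a transclusion: {{OtherTiddler}} embeds that tiddler's text " +
    "field, and {{OtherTiddler!!caption}} embeds a specific field.",
  source: "https://tiddlywiki.com/#Transclusion",
};
console.log(JSON.stringify(record));
```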

Hi @admins ,

Can you help export the latest data from the forum? It would be better to put it in a GitHub repo, so we don't need a cloud drive to share it, and anyone else interested in making an AI copilot for writing wikitext can use it to train a private AI for themselves too.

I did do that once, but we pulled it once we realised that the data includes unredacted email addresses. In other words, the copy of the data in my email inbox is not actually the same as the publicly available copy of the data on the Google Groups servers.

I think we’d need to obfuscate the email address in order to be able to publish the data.

Okay, I can clean it up before uploading if you don't have time for it. But we have to find a way to transfer it, maybe via Google Drive? Or you could clean it first, when you have time, and upload it to GitHub directly?

It has been a year since the GPT-3 release and we still have no progress on this. @jeremyruston @admins, can you provide some training data so my 4090 Ti can start working?

I'm not a resident expert on the subject, but a 4090 Ti might not be able to fully train an LLM very quickly.

Something I've read up on is that there are companies that will let you rent their systems for training models; someone I watch on YouTube was having a hard time on their own with 64 gigabytes of RAM and a 4080.

Just felt like sharing. I'm not sure how resource-intensive TiddlyWiki would prove to be, but if a 4090 Ti isn't beefy enough, I hope you're able to find a budget-friendly service to use :grinning_face_with_smiling_eyes:

Thanks, luckily I don't have to worry about this: I have an AI company and recently bought a few 4090 Tis, so I can use them. But I don't have time to prepare the data myself. I don't even know how big the dataset will be; there isn't much good training data about TiddlyWiki.

Hi @linonetwo, with respect to the Google Groups archives that I have, I am concerned that passing them to you without redacting the email addresses would be a breach of EU and other privacy laws. As the administrator of the group, I think I am the data controller, which means that I only have access to personal data for the limited purposes of operating the group.

I don’t actually think that the obfuscation need necessarily be a huge task. The mbox files are plain text files, and while I understand that it’s not reliable to use a regex to find email addresses, I don’t think that it would be hard to find a reliable email address parser that one could roll up into a JS command line tool.
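Something along these lines, perhaps (a sketch: the regex below is exactly the unreliable shortcut I mentioned, standing in until a proper address parser replaces it; the file names are invented):

```js
// redact-mbox.js: stream an mbox file and mask anything that looks
// like an email address. NOTE: the regex is a naive stand-in; a proper
// RFC 5322 address parser should replace it before publishing anything.
const fs = require("node:fs");
const readline = require("node:readline");

// Catches common addresses; misses exotic but valid ones.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

async function main() {
  const rl = readline.createInterface({
    input: fs.createReadStream("tiddlywiki-group.mbox"),
    crlfDelay: Infinity,
  });
  const out = fs.createWriteStream("tiddlywiki-group.redacted.mbox");
  for await (const line of rl) {
    out.write(line.replace(EMAIL, "redacted@example.com") + "\n");
  }
  out.end();
}

main().catch(console.error);
```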

I understand, Jeremy. I can wait for that. Or maybe you could start a new community project to collect useful training data? For example, create a community-editable wiki or repo that would be full of QA pairs. I think this also needs to be started by you to have enough appeal.

I believe a dedicated 3B or even 1B model will perform better than the expensive GPT-4.

I'm asking people to create a Chinese-language dataset to train an AI. Now the problem is similar to https://talk.tiddlywiki.org/t/community-curated-editions-how-best-to-coordinate-our-efforts/