Preparing AI training materials Season 1 - Explain TW core WikiText

  1. Explain TW core WikiText with LLM <- we are here
  2. human validation of dataset
  3. train a wikitext generation expert model
  4. get talk forum and gg datasets to train an AI chat model
  5. fine-tune the latest open-source LLM
  6. RLHF

I'm working on step 1. Read and comment on GitHub - tiddly-gittly/TiddlyWiki-LLM-dataset: WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki (WIP) if you have any doubts or ideas.

You can help add more tasks in TiddlyWiki-LLM-dataset/wiki/tiddlers/prompts/data at master · tiddly-gittly/TiddlyWiki-LLM-dataset · GitHub; a PR or a comment on this post are both OK.

If you are interested in step 2, please stay tuned for my follow-up post; I will invite reviewers to help when the AI-generated dataset is ready.

Can we feed the entire TW GitHub repo to an LLM as a dataset?

Sure, the gg forums and talk.tiddlywiki have valuable wikitext formation info. I think it would be great if the AI not just learned wikitext but also had a deep knowledge of how TW works by ingesting the microkernel and understanding the JS that builds the application.

The question I always come back to is: would there be enough high-quality training data there for a good-quality model?

Explaining Tiddlywiki to an existing LLM works because it has extensive high-quality training on language as a whole – "language" here being an ambiguous term; we are really talking about the ability to recognize patterns and draw on a cache of tokens based on previous tokens.

Let's remind ourselves and be conscientious of our choice of words here: there is no "understanding" going on, the model has the ability to predict tokens in both natural and artificial languages – and wikitext is just another artificial language.

These models have been trained on a lot of JavaScript and HTML already, and have certainly been fed at least some knowledge of Tiddlywiki, if not its documentation, as part of their training process. We know they have been fed mountains and mountains of the Internet At Large in order for them to be able to function at all. Providing the model with the core may change the weight of tokens that one might get in response to questions, but it would contribute very little to the process of teaching the model anything about responding to queries in general and how tokenization should be performed.

So if one were to train primarily on all the forums and the GitHub entries and the kernel, I'm not sure you'd have a sufficiently useful model for much of anything. A rough guideline is that it takes a minimum of 1 TB of text data to generate a reasonably functional model – ideally, that should be 1 TB of high-quality data, with the bulk of extraneous or false patterns removed.

Which raises the question: why not distill the important parts of the documentation (ideally one that covers the existing blind spots) as a reference for the agent to consult when answering (to weight its responses), and/or provide it a tiddlywiki instance to actually experiment on and draw its own conclusions?

That's what I'm trying to do; I will write a script to make use of each tiddler in the TW official repo.

I think wikitext is enough. I will not use the JS part, because an LLM won't understand the connections within the JS, but it will understand WikiText syntax. And we only need to help people write wikitext; we don't need to help developers write JS, that is GitHub Copilot's job. Roughly, the script would look like the sketch below.
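
A minimal sketch of that script; the path under editions/tw5.com/tiddlers is my assumption about where the docs tiddlers live in a local clone of the TiddlyWiki5 repo:

```js
// Sketch: collect wikitext tiddlers from a local clone of the TiddlyWiki5
// repo, skipping JS-typed tiddlers.
const fs = require("fs");
const path = require("path");

function* walkTidFiles(dir) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) yield* walkTidFiles(full);
    else if (entry.name.endsWith(".tid")) yield full;
  }
}

// .tid files are "field: value" header lines, a blank line, then the body.
function parseTid(raw) {
  const split = raw.indexOf("\n\n");
  const head = split === -1 ? raw : raw.slice(0, split);
  const fields = {};
  for (const line of head.split("\n")) {
    const colon = line.indexOf(":");
    if (colon > -1) fields[line.slice(0, colon)] = line.slice(colon + 1).trim();
  }
  return { fields, text: split === -1 ? "" : raw.slice(split + 2) };
}

const tiddlers = [];
for (const file of walkTidFiles("TiddlyWiki5/editions/tw5.com/tiddlers")) {
  const { fields, text } = parseTid(fs.readFileSync(file, "utf8"));
  if (fields.type === "application/javascript") continue; // wikitext only
  tiddlers.push({ title: fields.title, text });
}
```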

That must be human-generated. So in the next step, I will invite TW users to review materials derived from "tiddlers in the TW official repo".

And I will fine-tune a pretrained model, so only a few MB of material are needed.
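
For scale, here is a sketch of what one training record might look like; the field names follow a common instruction-tuning convention and are not necessarily the pipeline's actual schema:

```js
// One illustrative fine-tuning record (field names are an assumption,
// not the pipeline's real format).
const record = {
  instruction: "Write WikiText that lists all tiddlers tagged 'TableOfContents' as links.",
  input: "",
  output: "<$list filter=\"[tag[TableOfContents]]\">\n<$link/>\n</$list>",
};
```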

I’m confused by this part, @linonetwo – I have not had any problem getting an LLM to “understand” (again, this word choice obfuscates rather than clarifies) a connection between wikitext and javascript. Perhaps I’m misunderstanding the context in which you are speaking?

If anything, I’ve found it difficult to constrain LLMs to only using JS or Wikitext, when I’m in a situation that requires one of them and not the other – though it is usually relatively simple to clarify the situation and get things back on-rails.

Oh, I mean understanding JS requires a larger context. To explain the JS widget mechanism, a model needs to understand the widget base class and the parser mechanism. That is too complex; even Cursor and Copilot can't do it well, so I'm not going to challenge it.

And there won't be enough people to help me review JS-related materials. For WikiText materials, I believe there will be enough reviewers.

I see. I don't use Cursor or Copilot, but I find GPT-4o and Claude 3.5 Sonnet fairly capable with JS (though these obviously have huge context windows).

Do you have a pretrained model picked out already?

If the LLM doesn't see the whole widget base class and its related usage, it will only produce rhetoric and platitudes, which will result in a sillier fine-tuned model.

For the pretrained model I will use Qwen 2.5 Coder, the best open-source code model recently.

I'm still writing the JS to call the OpenAI API. And I have played too many games recently (xStarbound), so the progress is behind schedule…
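
For anyone curious, a minimal sketch of that kind of script, assuming Node 18+ for the global fetch; the model name and prompt wording are placeholders, not the pipeline's actual choices:

```js
// Sketch: ask a chat model to explain one tiddler's wikitext, via the
// OpenAI chat completions endpoint.
async function explainTiddler(tiddler) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini", // placeholder model choice
      messages: [
        { role: "system", content: "You explain TiddlyWiki WikiText syntax, one construct at a time." },
        { role: "user", content: `Explain the WikiText in this tiddler:\n\n${tiddler.text}` },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```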

Feel free to ignore if my questions become burdensome enough to distract from Starbound :wink: I’m just curious and enthusiastic to discuss such topics.

Let me summarize my understanding at this point:

  • The idea here is that you would like to create a “coding” chatbot that would have expertise in TW wikitext?
  • You believe that you would not need too much additional training data because you know that most modern models already have extensive training in HTML, Markdown, and wikitext.
  • You believe additional training is superior to just taking a model and giving it reference material because you’re trying to prioritize models that have a more limited “context window” (I assume so they are more accessible to a wider audience).
  • To the same end, you are looking at models that are smaller and opensource.

Per our conversation on the inclusion of javascript, I think we are having some language difficulties that led to my misunderstanding, so I will outline for clarity, in case others want to casually follow –

by this I took you to mean “context” in an operational sense, which is to say, “the context window.” This made little sense to me because training data does not constitute “context.” Upon reflection, I take you to be using a more colloquial version of the word “context,” as in, whether a model has training in JS or not.

Along a similar line, I understood you to be using the word "see" here in terms of a context window. You can understand why this too was confusing to me: a model that was trained on javascript would not need to "see" the entire widget class in the context window to generate code.

Since training knowledge is static and does not shift around like context does, I assume you are instead using “see” to mean something like “If the LLM does not have sufficient training in the javascript widget class it is more likely to hallucinate, because Javascript is so much more complex than wikitext”

This may be; I think javascript certainly presents more opportunities for hallucinations (instances where an LLM generates plausible-sounding but incorrect information) to occur – this is where "context" really becomes relevant to this discussion, to my eye: if a user is asking for a huge output, the model has many more opportunities to lose context and for things to go off-rails. Writing a widget often does take much more space than generating a table or something like that. However, this is different from the model's fundamental understanding of JavaScript or any other programming language - it's more about the challenge of maintaining accuracy over extended generations.

That said, however, I do think if you wanted a model to not even make an attempt at javascript, one would need to explicitly forbid it in the system instructions (see the sketch after this list) – because

  1. these models do not work by ‘learning’ this language or that language in isolation - rather, they learn statistical patterns across all text they’ve been trained on, allowing them to generate appropriate responses based on probability distributions, whether that’s in JavaScript, wikitext, or natural language, and

  2. Most modern LLMs already have extensive exposure to javascript from their training, so these capabilities are built-in.
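
A sketch of what such a constraint might look like; the wording is invented for illustration, not a tested prompt:

```js
// Illustrative system instruction forbidding JS output.
const messages = [
  {
    role: "system",
    content:
      "You are a TiddlyWiki WikiText assistant. Answer only with WikiText " +
      "(widgets, filters, macros). Never emit JavaScript; if a task seems to " +
      "require a JS widget, say so and suggest a WikiText alternative instead.",
  },
  { role: "user", content: "Make a button that creates a new tiddler tagged 'Task'." },
];
```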

Also known as WikiScript; see how people are having trouble in this aspect: Thinking about visual filter builder - #11 by linonetwo

I think a code model can understand what users want to build, and write "WikiScript" for them. I believe a code model contains the ability to understand feature requirements…

When it is writing React or Vue! There are many open-sourced materials about React code, so it can do it without understanding React's internals. But when writing a TW widget class, I think the training material is not enough; I guess it has to understand the whole thing from the ground up to write a good app.
Just a guess, so maybe it will work simply using materials from the TW core. But I think it won't; we need more JS widget materials, and that is expensive.

And there are many advanced techniques, like calling WikiText in JS and calling JS in WikiText; we don't have enough materials yet, so its hallucinations might be severe, like trying to use a non-existent API to do so.
For calling WikiText actions in pure WikiText, on the other hand, I think we have enough materials in the TW core.
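
For reference, this is the kind of boundary-crossing I mean. $tw.wiki.renderText is a real core API (the snippet assumes it runs inside a TiddlyWiki environment where the $tw global exists), but treat the whole thing as a sketch:

```js
// "Calling WikiText in JS": render a wikitext string from inside JS code.
const html = $tw.wiki.renderText(
  "text/html",            // output MIME type
  "text/vnd.tiddlywiki",  // input is wikitext
  "''bold'' and [[A Link]]"
);

// The pure-WikiText counterpart the core has plenty of examples for:
// an action widget wired to a button, no JS involved.
const wikitextOnly = `
<$button>
<$action-setfield $tiddler="HelloThere" $field="text" $value="updated"/>
Update
</$button>`;
```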

And we always recommend writing UIs and plugins using wikitext. So a model that can write JS widgets has low ROI. This is just a choice.

Anyway, let's do this one step at a time; I also want to learn LLM training from this project, for a new career.

Am excited to hear about and follow your journey, @linonetwo, I also have aspirations to build a deep learning system in the coming years, and share your enthusiasm for its career possibilities :slight_smile:

I don't know how realistic it is to get autoregressive transformers to build an entire anything – despite the pretty impressive examples we've seen developers showing off. I think the future of the tool is likely going to be more comparable to the development of ergonomic hammer design. We want models that are statistically less likely to lead people down the wrong path and will take over a significant amount of busywork, not necessarily models that replace developers.

Having used ChatGPT fairly successfully at work over the last 3 months (private sub, so I'm "anonymizing" and scrubbing company data out of my examples), I am coming back to thinking about how to leverage the tools with TW5.

One thing I think the model lacks context for is the abstract design patterns that turn wikitext into rendered HTML. With examples from https://tiddlywiki.com/dev/ organized as an initial prompt (or CustomGPT instructions), I think well-crafted WikiText and JavaScript prompts would generate useful results. Maybe.
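
Something like the sketch below, perhaps; the example pairs are hand-written rather than taken from the dev docs, and the rendered HTML is simplified (real output carries extra attributes):

```js
// Sketch: seed the prompt with a few wikitext -> HTML pairs of the sort
// the dev docs describe, so the model learns the mapping by example.
const fewShot = [
  { role: "system", content: "You translate TiddlyWiki WikiText into the HTML it renders to." },
  { role: "user", content: "''bold''" },
  { role: "assistant", content: "<strong>bold</strong>" },
  { role: "user", content: "! Heading" },
  { role: "assistant", content: "<h1>Heading</h1>" },
  { role: "user", content: "//italic// and [[HelloThere]]" },
];
```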