Feel free to ignore this if my questions become burdensome enough to distract from Starbound; I’m just curious and enthusiastic to discuss such topics.
Let me summarize my understanding at this point:
- The idea here is that you would like to create a “coding” chatbot with expertise in TW wikitext?
- You believe you would not need much additional training data, because most modern models already have extensive training in HTML, Markdown, and wikitext.
- You believe additional training (fine-tuning) is superior to simply handing a model reference material at inference time, because you are prioritizing models with a more limited “context window” (I assume so they are more accessible to a wider audience).
- To the same end, you are looking at models that are smaller and open-source. (A rough sketch of what that fine-tuning might look like follows this list.)
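To make the fine-tuning idea concrete, here is a minimal sketch assuming the Hugging Face transformers and datasets libraries. The base model name is just a placeholder for whatever small open-source model you settle on, and the single training example is illustrative only; a real run would need a much larger set of wikitext question/answer pairs.

```python
# A minimal fine-tuning sketch, assuming the Hugging Face transformers and
# datasets libraries. The base model name is a placeholder, not a recommendation.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "EleutherAI/pythia-410m"  # placeholder: any small open-source causal LM

tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:        # many causal LMs ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# One illustrative wikitext Q/A pair; a real run needs a much larger set.
examples = Dataset.from_dict({"text": [
    "Q: Make a two-column wikitext table.\n"
    "A: |!Name |!Value |\n|foo |1 |",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = examples.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wikitext-bot",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```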
Per our conversation on the inclusion of JavaScript, I think we are having some language difficulties that led to my misunderstanding, so I will outline things for clarity, in case others want to casually follow along:
By this I took you to mean “context” in an operational sense, which is to say, the “context window.” This made little sense to me, because training data does not constitute “context.” Upon reflection, I take you to be using a more colloquial sense of the word “context,” as in whether a model has training in JS or not.
Along similar lines, I understood you to be using the word “see” here in terms of the context window. You can understand why this too was confusing to me: a model that was trained on JavaScript would not need to “see” the entire widget class in the context window to generate code.
Since training knowledge is static and does not shift around the way context does, I assume you are instead using “see” to mean something like “if the LLM does not have sufficient training in the JavaScript widget class, it is more likely to hallucinate, because JavaScript is so much more complex than wikitext.”
This may be so. JavaScript certainly presents more opportunities for hallucinations (instances where an LLM generates plausible-sounding but incorrect output) to occur, and this is where “context” really becomes relevant to this discussion, to my eye: if a user asks for a huge output, the model has many more opportunities to lose context and go off the rails. Writing a widget often takes much more space than generating a table. However, this is different from the model’s fundamental understanding of JavaScript or any other programming language; it is more about the challenge of maintaining accuracy over extended generations.
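One practical way around that failure mode, sketched below, is to break a large request into small, self-contained generations so no single call has to hold the whole widget in its context window. The `llm` callable here is a hypothetical stand-in for whatever completion API is actually used, not any particular library.

```python
# A sketch of chunked generation, assuming a generic completion function.
# `llm` is a hypothetical stand-in for whatever model API is actually used.
from typing import Callable, List

def build_widget_in_parts(part_specs: List[str],
                          llm: Callable[[str], str]) -> str:
    """Request each piece of a large output separately, so no single
    generation has to keep the whole widget in the context window."""
    parts = []
    for spec in part_specs:
        # Each prompt is self-contained: it restates only the interface the
        # piece must satisfy, rather than accumulating all prior output.
        parts.append(llm("Write only this part of the widget:\n" + spec))
    return "\n\n".join(parts)

# Hypothetical usage, where my_model_call wraps a real completion endpoint:
# source = build_widget_in_parts(
#     ["the constructor", "the render() method", "the refresh() method"],
#     my_model_call)
```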
That said, I do think that if you wanted a model to not even attempt JavaScript, you would need to explicitly forbid it in the system instructions, because
- these models do not work by ‘learning’ this language or that language in isolation; rather, they learn statistical patterns across all the text they’ve been trained on, allowing them to generate appropriate responses based on probability distributions, whether in JavaScript, wikitext, or natural language, and
- most modern LLMs already have extensive exposure to JavaScript from their training, so these capabilities are built in.
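For what that might look like in practice, here is a hedged sketch of such a system instruction, using the common role/content chat-message convention; the exact wording of the prompt is just an illustration, not a tested recipe.

```python
# A hedged sketch of forbidding JavaScript outright via the system
# instruction, using the common role/content chat-message format.
messages = [
    {"role": "system", "content": (
        "You are a TW wikitext assistant. Respond only with wikitext. "
        "Never write JavaScript, even if asked; instead, explain that "
        "JavaScript is out of scope for this assistant."
    )},
    {"role": "user", "content": "Write me a JS widget that sorts a table."},
]
```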