Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

btp@kbin.social · edited 11 months ago

The Hobbyist@lemmy.zip · 11 months ago

This is not the case for language models. Computer vision models are trained over many epochs, sometimes hundreds (an epoch being one pass over all training samples), whereas a language model is often trained for just one epoch, or in some instances 2 to 5. Learning so much from tokens seen so few times is actually quite impressive. Language models are great learners. That said, some studies argue that language models are in effect compression algorithms scaled to the extreme, so in that regard it might not be that impressive after all.

j4k3@lemmy.world · edited 11 months ago

How many times do you think the same data appears once a model is trained on as many datasets as OpenAI is using now? Even unintentionally, some overlap is inevitable. I expect something like data related to OpenAI researchers to recur many times. If nothing else, redundancy across foreign-language data could cause overtraining. Most data is likely machine-curated at best.
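
To make the point concrete, here is a rough sketch of how overlap between datasets raises the number of times a model sees the same text, even in a nominal one-epoch run. Everything in it is an assumption for illustration: the toy corpora, the `normalize` and `effective_exposures` helpers, and exact-match hashing as the notion of "same data". Real deduplication pipelines usually rely on fuzzier near-duplicate matching, and nothing here reflects how OpenAI actually curates data.

```python
from __future__ import annotations

import hashlib
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def effective_exposures(datasets: dict[str, list[str]], epochs: int = 1) -> Counter:
    """Count how many times each distinct document is seen during training.

    A document that appears k times across the merged datasets is seen
    k * epochs times, even if the run is nominally "one epoch".
    """
    counts: Counter = Counter()
    for docs in datasets.values():
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            counts[digest] += epochs
    return counts


# Hypothetical toy corpora; names and contents are made up for illustration.
datasets = {
    "web_crawl": ["The quick brown fox.", "Contact: jane@example.com"],
    "forum_dump": ["the quick  brown fox.", "Something unique."],
    "code_comments": ["The Quick Brown Fox."],
}

exposures = effective_exposures(datasets, epochs=1)
most_seen = max(exposures.values())
print(f"distinct documents: {len(exposures)}, max exposures: {most_seen}")
```

In this toy case the same sentence sits in all three sources, so a single pass over the merged data still shows it to the model three times.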