When German journalist Martin Bernklau typed his name and location into Microsoft's Copilot to see how his articles would be picked up by the chatbot, the answers horrified him. Copilot's results asserted that Bernklau was an escapee from a psychiatric institution, a convicted child abuser, and a conman preying on widowers. For years, Bernklau had served as a courts reporter, and the AI chatbot had falsely blamed him for the crimes whose trials he had covered.
The accusations against Bernklau weren't true, of course, and are examples of generative AI's "hallucinations." These are inaccurate or nonsensical responses to a prompt provided by the user, and they're alarmingly common. Anyone attempting to use AI should always proceed with great caution, because information from such systems needs validation and verification by humans before it can be trusted.
But why did Copilot hallucinate these terrible and false accusations?
"Hallucinations" is the wrong word. To the LLM there's no difference between reality and "hallucinations", because it has no concept of reality or of what's true and false. All it knows is what word should probably come next. The "hallucination" exists only in the mind of the reader. The LLM did exactly what it was supposed to do.
They're bugs. Major ones. Fundamental flaws in the program. People with a vested interest in "AI" rebranded them as hallucinations in order to downplay the fact that they have a major bug in their software and they have no fucking clue how to fix it.
It's an inherent negative property of the way they work. It's a problem, but not a bug, any more than the result of a car hitting a tree at high speed is a bug.
Calling it a bug suggests it's something unexpected that can be fixed. As far as we know it can't be fixed, and it is expected behavior. Same as the car analogy.
The only thing we can do is raise awareness and mitigate.
You're attempting to redefine "bug."
From a software testing point of view, a correctly coded realization of an erroneous algorithm is a defect (a bug). It fails validation (a test for fitness for use) rather than verification (a test that the code correctly implements the erroneous algorithm).
This kind of issue arises not only with LLMs, but with any software that includes some kind of model within it. The provably correct realization of a crap model is still crap.
It actually can be fixed. Answers have a measurable accuracy, something like how confident the statistical model is in the answer. That's why some questions get consistent answers while others don't.
The fix is not that hard; it's a matter of having the chatbot answer "I don't know" when the confidence in an answer isn't high enough. It's pretty similar to what the chatbot does when you ask it to make you a bomb: it just hijacks the answer calculated by the model and says a predefined answer instead.
But it makes the AI look bad, so most publicly available models just answer anything, even if they are not confident about it. Also, your reaction to an incorrect answer is used to train the model further, so it's not even efficient for them to stop the hallucinations in their product. But it can be done.
Models used by companies usually have a higher confidence threshold and answer "I don't know" if they don't have enough statistical support for a particular answer.
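For what it's worth, here's a minimal sketch of that thresholding idea in Python. Everything in it is made up for illustration (the `answer_with_fallback` helper, the log-probability values, the 0.80 cutoff); a real chatbot would layer this on top of an API that actually exposes per-token log-probabilities:

```python
import math

def answer_with_fallback(tokens_with_logprobs, threshold=0.80, fallback="I don't know."):
    """Return the generated text only if the model's average per-token
    probability clears a confidence threshold; otherwise refuse."""
    if not tokens_with_logprobs:
        return fallback
    # Geometric mean of token probabilities = exp(mean of log-probabilities).
    avg_logprob = sum(lp for _, lp in tokens_with_logprobs) / len(tokens_with_logprobs)
    confidence = math.exp(avg_logprob)
    if confidence >= threshold:
        return "".join(tok for tok, _ in tokens_with_logprobs)
    return fallback

# Invented examples: one confident completion, one shaky one.
confident = [("Paris", -0.05)]                 # probability ~0.95
shaky = [("Berlin", -1.2), (", maybe", -2.0)]  # probabilities ~0.30 and ~0.14
print(answer_with_fallback(confident))  # -> "Paris"
print(answer_with_fallback(shaky))      # -> "I don't know."
```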
This has been tried; it helps, but it's not enough by itself. It's one of the mitigation steps I was thinking of. And companies do work very hard to reduce hallucinations, just look at Microsoft's newest thing.
From that article:
The hydrogen-from-water thing is simply wrong, if it is supposed to mean that hallucinations are just an unavoidable part of generative LLM technology that cannot be solved.
They are not inherent to the technology. They are a product of a lack of control over the statistical output, of prioritizing any answer over no answer.
As with any statistics, you have a confidence in how true something is based on your data. It's just a matter of putting the threshold higher or lower.
If you ask an easy question like "What is the capital of France?" you won't ever get a hallucination, because every model will have that answer with very high confidence. You just have to make it so that if that level of confidence is not reached, the model defaults to an "I don't know" answer. But, once again, this will make chatbots seem very dumb, as they will answer "I don't know" a lot.
The problem here is the amount of data and the efficiency of the model. To get a usable general-purpose model with a confidence threshold high enough not to hallucinate, at today's model efficiency it would need to be a humongous model, too big and with too much training data even for big tech. So either we go that big, we try to improve efficiency (which is proving very hard for general models), or we do both. Time will tell, but I'm quite confident that we will reach a general-use model without hallucinations sooner or later.
I think you misunderstand how LLMs work. The model doesn't have a confidence; it's not like it looks at its data and says "hmm, yes, most sources say Paris is the capital of France, so that's the answer". It "just" puts weights on the next token depending on its internal statistics, then one of those tokens is picked, and the process starts anew.
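To make that concrete, here's a toy sketch of the loop being described. The "model" is just a hand-written table of invented token weights, which is the point: at no step does anything check whether the chosen token is true.

```python
import random

# Toy next-token distributions (entirely made up) to illustrate the loop:
# weight the candidate tokens, draw one, append it, repeat.
toy_model = {
    "The capital of France is": {" Paris": 0.92, " Berlin": 0.05, " Lyon": 0.03},
    "The capital of France is Paris": {".": 0.97, ", I think": 0.03},
}

def generate(prompt, max_steps=2):
    text = prompt
    for _ in range(max_steps):
        dist = toy_model.get(text)
        if dist is None:
            break
        tokens, weights = zip(*dist.items())
        # The "decision" is just a weighted draw; nothing verifies facts.
        text += random.choices(tokens, weights=weights, k=1)[0]
    return text

print(generate("The capital of France is"))  # usually ends in "Paris.", occasionally not
```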
Teaching the model to say "I don't know" helps a bit, and was lauded as "The Solution" a year or two ago, but it turns out it didn't really help that much. Then you got the grounded approach, RAG, CoT, and so on, all with the goal of making the LLM more reliable. None of them solves the problem, because, as the PhD said, it's inherent in how LLMs work.
And no, local LLMs aren't better; they're actually much worse, and the big companies are throwing billions at trying to solve this. And no, it's not because "that makes the LLM look dumb" that they haven't solved it.
Early on I was looking into making a business of providing local AI to businesses, especially RAG. But no model I tried, even with the documents being part of the context, came close to reliable enough. They all hallucinated too much. I still check this out now and then out of my own interest, and while it's become a lot better, it's still a big issue. Which is why you see it in the news again and again.
This is the single biggest hurdle for the big companies in turning their AIs from a curiosity and an assistant to a human into a full-fledged autonomous knowledge system they can sell to customers. You bet your dangleberries they're trying everything they can to solve this.
And if you think you have the solution that every researcher, developer, and machine-learning engineer has missed, then please go prove it and collect some fat checks.
What do you think "weight" is?
It is, simplifying, the amount of data that says "The capital of France is Paris". It doesn't need to understand anything; it just has to stop the process if the statistics don't provide enough to continue with confidence. If the data is all over the place and you have several instances of "The capital of France is Berlin/Madrid/Milan", that is measurable compared to all the data saying it is Paris. No need for any kind of "understanding" of the meaning of the individual words, just measuring confidence in what the next word should be.
Back a couple of years, when we played with small neural networks playing Mario, you could see the internal process in real time, as there were not that many layers. It was evident how the process and the levels of confidence changed depending on how deep the training was. Here it is just orders of magnitude bigger, but nothing impossible to overcome, whatever some people pretend in order to sell.
An alternative way to measure confidence is to just run the same question several times and check whether the answers are equivalent.
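As a sketch of that "run it several times" idea (the `ask` callback, the agreement cutoff, and the stand-in flaky model below are all invented for illustration):

```python
from collections import Counter
import random

def self_consistency(ask, question, n=5, min_agreement=0.6):
    """Ask the same question n times and only trust the answer when enough
    of the samples agree; otherwise fall back to "I don't know"."""
    answers = [ask(question) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / n >= min_agreement else "I don't know."

# Stand-in "model" that answers inconsistently (made up for the example).
def flaky_model(question):
    return random.choice(["Paris", "Paris", "Paris", "Berlin"])

print(self_consistency(flaky_model, "What is the capital of France?"))
```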
That PhD is a PhD in scaremongering about technology, so they're not an authority on anything here.
IDK what you did, but SLMs don't really hallucinate that much, if at all, especially if they are trained on good datasets.
As I said, the solution is not in my hands, as it involves improving the efficiency or the amount of data. Efficiency has issues, as current techniques seem unable to improve it beyond a certain level. And more data is, obviously, costly.
You can call that confidence if you want, but it has very little to do with how "sure" the model is.
Actually, it would be "the confidence of token 'Th' is 0.95, the confidence of 'S' is 0.32, the confidence of …" and so on for each possible token; many LLMs have around 16k-32k token vocabularies. Most will be at or near 0. So you pick 'Th', and then the token 'e' will probably be very high next, then a space token, then… Anyway, the confidence of the word "Paris" won't appear until far into the generation.
Now, there is some overseeing logic in a way: if you ask what the capital of a non-existent country is, it'll say there's no such country. But is that because it understands that it doesn't know, or because the training data has enough examples of such questions that it has the statistical basis for writing out such an answer?
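Here's roughly what that looks like, with invented numbers (the completion and its log-probabilities are made up to illustrate the point, not pulled from any real model):

```python
import math

# Invented per-token log-probabilities for the completion
# "The capital of France is Paris." (every number here is made up).
completion = [("The", -0.02), (" capital", -0.03), (" of", -0.01),
              (" France", -0.02), (" is", -0.01), (" Paris", -0.08), (".", -0.02)]

for token, logprob in completion:
    print(f"{token!r:12} p = {math.exp(logprob):.2f}")

# Most of the high-probability tokens are grammatical glue; the factual token
# " Paris" is just one more draw, and a fluent but false completion can look
# exactly this confident token by token.
```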
I assume by SLM you mean smaller LLMs, like for example Mistral 7B and Llama 3.1 8B? Well, those were the kind of models I tried for local RAG.
Well, it was before llama3, but I remember trying mistral, mixtral, llama2 70b, command-r, phi, vicuna, yi, and a few others. They all made mistakes.
I especially remember one case where a product manual had this text: "If the same or a newer version of <product> is already installed on the computer, then the <product> installation will be aborted, and the currently installed version will be maintained." The question was "What happens if an older version of <product> is already installed?" and every local model answered that that version will be kept and the installation will be aborted.
When trying OpenAI's latest model at that time, I think GPT-4, it got it right. In general, about 1 in 5-7 answers to RAG-backed questions were wrong, depending on the model and type of question. I could usually reword the question to get the correct answer, but to do that you kind of already have to know the answer is wrong, which defeats the whole point of it.
This article is an example where statistical confidence doesn't help. The model has lots of data, so it likely has high confidence, but it has no understanding of the nature of the relations in that data.
I recently built an application where we indicated the confidence of the model's output. In some scenarios, the high-confidence output had even more mistakes than the low-confidence output.
OK, so describe how you control that output so that hallucinations don't occur. Does the anti-hallucination training set exceed the size of the original LLM's training set? How is it validated? If it's validated by human feedback, then how much of that validation feedback is required, and how do you know that the feedback is not being used to subvert the model rather than to train it?
It's not a bug, just a negative side effect of the algorithm. This is what happens when the LLM doesn't have enough data points to answer the prompt correctly.
It can't be programmed out like a bug; rather, a human needs to intervene and flag the answer as false, or the LLM needs more data to train on. The dozens of articles this guy wrote aren't enough for the LLM to get that he's just a reporter. The LLM needs data that explicitly says that this guy is a reporter who reported on those trials. And since no reporter starts their articles with "Hi, I'm John Smith the reporter, and today I'm reporting on…", that data is missing. LLMs can't draw conclusions from the context.
Well, it's not lying, because the AI doesn't know right from wrong. It doesn't know that it's wrong; it doesn't have the concept of right or wrong, or true or false.
For the LLM, the hallucinations are just the result of combining statistics and producing the next word, as you say. From the LLM's "point of view", they're as real as everything else it knows.
So what else can it be called? The closest concept we have is when the mind hallucinates.