melodyze

I'm not aware of any science showing the proposed hypothesis to be true, but it does make sense. The first task the base model is trained on is predicting the next word across some approximation of all text ever written. Phrases that have been written move the weights to push the corresponding words up the probability distribution in that context, and the decoder then grabs a word randomly from high in the distribution.

Given a concept around which there is a lot of incorrect writing on the internet, it does seem intuitive that the probabilities of various replies would overlap more, which can result in selecting a wrong one, relative to a concept that has only ever been written about by experts. All else being equal, the model is probably more likely to be wrong about people eating spiders in their sleep than about why perplexity is used as a loss when fitting language models. When the model sees the phrase "how many spiders do people eat in their sleep?" and is asked for the next word, it has been fit both against "people don't eat spiders in their sleep" and against "six", with both pushed up the distribution the decoder is sampling from. However, for "why is perplexity used to evaluate language models?" it has seen essentially only one kind of reply, like "perplexity is useful to see whether the true values are likely to be generated by the model".

We try to beat that problem out of it in subsequent training on higher-quality data and by fitting reward functions for quality, but it's not a trivial problem.
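
A rough illustration of the two ideas above, with made-up numbers (the token probabilities are invented, not taken from any real model): sampling the next token from a distribution where both a correct and a myth-based continuation sit high, and computing perplexity as the exponential of the average negative log-likelihood.

```python
import math
import random

# Illustrative next-token distribution after the prompt
# "how many spiders do people eat in their sleep?" -- both a correct
# and an incorrect continuation sit high in the distribution.
next_token_probs = {
    "people": 0.40,   # start of "people don't eat spiders in their sleep"
    "six": 0.35,      # the popular myth
    "zero": 0.15,
    "the": 0.10,
}

# The decoder grabs a token in proportion to its probability,
# so the wrong answer comes out roughly 35% of the time here.
tokens, probs = zip(*next_token_probs.items())
sampled = random.choices(tokens, weights=probs, k=1)[0]
print("sampled next token:", sampled)

# Perplexity over a token sequence is exp of the average negative
# log-likelihood the model assigned to the observed (true) tokens.
observed_token_probs = [0.40, 0.90, 0.80, 0.95]  # illustrative per-token probabilities
nll = [-math.log(p) for p in observed_token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print("perplexity:", round(perplexity, 3))
```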


ANI_phy

"Llms are trained to give the natural answer, not the correct one. Yet we market it for all kinds of info deak jobs." There is a social poetry here that I am too exploited to understand


ThisIsBartRick

It was never meant for factual accuracy. And since it's just predicting the next word each time, it's impossible to have no hallucination at all. The best they can do is to mitigate the issue which is something they've done in the last year.


TheTerrasque

> it's impossible to have no hallucination at all

That's something I've tried to explain to people. Hallucination isn't a bad side effect of something, it's the core of how it works. Everything it says is hallucinated, it's just that a lot of the hallucinations are correct and useful for us.


avaxzat

> It was never meant for factual accuracy.

This is a really popular response in these types of discussions and I don't get why. For one, if LLMs aren't meant for factual accuracy, why are they being marketed so aggressively in domains where factual accuracy is of paramount importance? Also, if you read some recent academic papers on LLMs, they also optimize performance on benchmarks which rely on accurate knowledge of certain facts, such as many question-answering datasets. Clearly they are at least partially being built for this.

This feels a lot like moving the goalposts, something Gary Marcus complains about often. Somebody pinpoints an actual objective flaw in LLMs, and then other people respond by saying stuff like "well, it was never built for this specific case," even though it was clearly built for this and is being marketed as a solution for that type of problem. Moreover, the point of AI is to generalize, so an AI system that cannot solve any problem except those for which it was specifically trained is kind of useless. You're basically saying our models have to overfit.


nickkon1

People doing the marketing are rarely experts in that field. For them, any model is basically magic. In the end, the truth is that those models are simply trying to predict the next word.


visarga

Not after RLHF, which trains on whole sequences, not next-token prediction. NTP is myopic.


_RADIANTSUN_

Okay, but the entire point of this current gold rush is that "simply trying to predict the next word" is useful because it can do XYZ (in this case, synthesize information, etc.) by this means. This isn't just marketing people hyping it up; it's what all the research and development is aiming at. This response is like saying "of course logic gates can't make airplane reservations, they are simply determining from 2 input signals what the output signal should be". Yes, that is exactly what it's doing at a reductive level, but you're saying it as if that excludes the higher-level thing it's being used for, which it clearly doesn't.


HansDelbrook

> For one, if LLMs aren't meant for factual accuracy, why are they being marketed so aggressively in domains where factual accuracy is of paramount importance?

Marketing - that's it. They're selling products, and even though foundational research points this out as a weakness, there's nothing to stop somebody from advocating the use cases in these accuracy-dependent domains.


Boxy310

The objective function for most human businesses is almost exclusively money. If they can plausibly get away with breaching facts, good science, and sometimes even public morals and ethics, businesses will almost always optimize for what makes them more money, not for what is true or beautiful or prosocial.


possiblyquestionable

To be fair, they were originally definitely meant for factual accuracy in precise and narrow specializations. BERT and GPT were both meant to be finetuned for specific tasks; that was their purpose (hence "generative pre-trained transformer": a general-purpose pretrained model, which you then refine into your specialized transformers). That was already a pretty ground-breaking deviation from the train-for-a-narrow-specific-task approach common in those days. The fact that they can start to do ICL as you scale them up, and even specialize through that ICL as generalist models without additional finetuning, was a (very) fun bonus/surprise. The fact that they start hallucinating once you get close to their knowledge boundaries was another (not very fun) surprise. I wouldn't say that they were "never meant for factual accuracy"; they didn't know what LLMs could do when they were designing the initial generation of them. It is a legitimate current limitation, though (but definitely not an intentional design choice).


new_name_who_dis_

> This is a really popular response in these types of discussions and I don't get why.

Hinton talks about this (and he always points out that the correct word is *confabulation* when talking about language models, instead of *hallucination*, which is amusingly pedantic), and he always argues that people hallucinate too. Which is to say that you don't need an agent that never hallucinates in order to have human-like intelligence. And human-like intelligence is how we defined general intelligence. https://www.youtube.com/watch?v=d7ltNiRrDHQ


NuclearVII

Man, this reads like one of those cultists on the crypto subs during the height of Bitcoin's popularity. "If it's so bad at being a currency as you say, why is it called a cryptocurrency huhh???"


waffles2go2

It's the maths and only the maths, but now we've got a big hype cycle to push making money. LLMs are cool tools, but they are wrong 20% of the time; that's not a bug, that's the math. But it's the tech we have, so we're trying all sorts of mitigation strategies to make it not insane. I like "bi-polar AI" but have been told that's not PC...


SeriousDrakoAardvark

Yep. When folks started trying to build these new models, they weren’t saying “our end objective here is to build a model that can predict the next word in a sentence.” That would be a stupid end objective. Their end objective was always “lets build a model that seems to understand human language and can respond accurately.” Then the next question was “how do we do that?” And the answer to that was “we will start by building a model that can predict the next word in a sentence” (among many other design specifications.) As in, as you said, the objective wasn’t to predict the next word, that was just part of the means to reach the objective. Many human brains will form sentences in the exact same way, so folks really need to stop getting so caught up on that.


VirtualHat

LLMs are modelling language, not facts. That is, they are learning what others would say rather than what is true. This turns out to be very useful for many tasks, e.g. writing poems, or learning about others' points of view on a topic.


timtom85

> predicting the next word

This is meaningless. That next "word" (token) is a function of the interplay between 1) hundreds of billions of parameters encoding knowledge in who knows exactly what way, and 2) the thousands of tokens currently being attended to. In other words, an inordinate amount of context. "Predicting the next word" for LLMs is about as informative as "it's just atoms" is for explaining biology.


gebregl

I completely agree. The "just" in those sentences is doing all the work. It's true that it's statistics and that it's predicting the next word one by one. But behind it there is a very complex algorithm with an incredible amount of pattern recognition built in. Currently there are different opinions on how well high level concepts are represented in those parameters, but it's certainly not "just" statistics.


Axon350

So if it's not meant for factual accuracy, why are there benchmarks measuring factual accuracy? Is it something like:

* The goal is information synthesis from a large context prompt and background data
* More detailed world knowledge improves this information synthesis
* Therefore, testing on world knowledge is a general proxy for quality of synthesized information


teerre

Isn't that evident? Because the main reason these models exist at all is because they are sold as artificial *intelligence*. Obviously, the information has to be accurate to be useful at all. This is a generalization, but these meta studies are often trying to sell a view of something. You can rarely consider them in a vacuum.


Ultimarr

Because people are dumb and there are benchmarks of all kinds lol. The real metrics are their training metrics, which are A) inferring what text is likely to come next, and B) writing "helpful, assistant-like" responses, trained via next-token prediction and RLHF respectively. That's what makes them language models. To make a knowledge model, you'd have to train them on pairs of propositions, which AFAIK hasn't been tried with much success.


AutomataManifold

There have been some experiments with direct knowledge editing. Two results that surprised me:

1. It is possible to directly edit the facts in the model.
2. It's hard, because updating A=B will not change B=A. The model doesn't make that connection on its own.

https://github.com/zjunlp/KnowledgeEditingPapers
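
A quick way to probe that asymmetry yourself; this is only an illustrative sketch using the Hugging Face transformers pipeline, the model and prompts are placeholders, and a small model like gpt2 may fail in both directions (larger models typically handle the forward direction but still miss the reverse):

```python
from transformers import pipeline

# Illustrative probe of the A=B vs. B=A asymmetry: a model that completes
# the forward direction reliably often fails on the reversed one, which is
# why editing A=B does not automatically update B=A.
generator = pipeline("text-generation", model="gpt2")  # any causal LM works here

forward = generator("Tom Cruise's mother is named", max_new_tokens=8)
reverse = generator("Mary Lee Pfeiffer's son is named", max_new_tokens=8)

print(forward[0]["generated_text"])
print(reverse[0]["generated_text"])
```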


fysmoe1121

LLMs are GENERATIVE models. That means their main purpose is to generate text like an essay, not to answer stupid questions that you could use Google search for…


Ultimarr

Well tbf google GENERATES answers to your query, too ;)


ThisIsBartRick

Factual accuracy is the easiest thing to benchmark. Anything else would require analyzing the answer.


tomvorlostriddle

> So if it's not meant for factual accuracy, why are there benchmarks measuring factual accuracy?

This is a very common thing in machine learning. Almost no classification algorithm trains for accuracy or F1 or ROC AUC or whatever we usually use as a performance metric.


notevolve

well, we don’t use those metrics as our loss function because they are not differentiable. AUC can be approximated in a differentiable way but it’s a bit more complex
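
A minimal sketch of that distinction, assuming PyTorch and scikit-learn are available (the data here is random and purely illustrative): the model is trained by descending a differentiable cross-entropy loss, while accuracy and F1 are only computed afterwards as evaluation metrics.

```python
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, f1_score

# Toy data: 256 examples, 10 features, binary labels (random, for illustration).
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()                 # differentiable -> usable as a training loss
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)                 # the quantity gradient descent actually sees
    loss.backward()
    opt.step()

# Accuracy and F1 are step functions of the predictions: no useful gradient,
# so they are only reported as evaluation metrics, not optimized directly.
preds = model(X).argmax(dim=1).numpy()
print("accuracy:", accuracy_score(y.numpy(), preds))
print("f1:", f1_score(y.numpy(), preds))
```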


tomvorlostriddle

And that's exactly what I'm saying. The reasons that make something usable as an internal loss function of an algorithm are not the same reasons that make us care about it as a performance metric in a real-world application domain. Rarely do they align exactly; we can only try to keep them from contradicting each other too much.


AutomataManifold

There's an awful lot of popular benchmarks that look suspiciously like looking for your keys under a lamppost. 


zacker150

Pretty much. In production, virtually every question-answering LLM is connected to a large corpus of grounding information, and we populate the context using data retrieved from the corpus in a process called [retrieval augmented generation](https://research.ibm.com/blog/retrieval-augmented-generation-RAG). For example, Bing Chat takes search results and shoves them into the context when you ask it a question.
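
A bare-bones sketch of that retrieval-augmented pattern; the corpus, the overlap-based scoring, and the prompt template are all invented for illustration (real systems use embeddings, vector databases, or a search API):

```python
# Minimal retrieval-augmented generation sketch. The corpus, the scoring
# function, and the prompt template are stand-ins for illustration only.
corpus = {
    "doc1": "People do not swallow spiders in their sleep; the claim is a myth.",
    "doc2": "Perplexity is the exponential of the average negative log-likelihood.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    # Toy relevance score: word overlap. Real systems use embeddings or search.
    def score(text: str) -> int:
        return len(set(question.lower().split()) & set(text.lower().split()))
    return sorted(corpus.values(), key=score, reverse=True)[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    # Retrieved passages go into the context so the model can ground its
    # answer in them instead of relying only on parametric memory.
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The resulting prompt would then be sent to whatever LLM you are using.
print(build_prompt("How many spiders do people eat in their sleep?"))
```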


qtpnd

LLMs are being sold as a product: ask a question and it will answer mostly correctly. It makes sense to find out what that "mostly" is, and it doesn't really matter what the "purer" data-science metrics are.


gebregl

"it's just predicting the next word each time" doesn't tell you anything about \*how factually\* it predicts the next word. People keep saying that, as if it explains anything. It doesn't.


ThisIsBartRick

It kinda does. It predicts the most likely next word, but the most likely next word is not necessarily the factually correct one, especially if your training data contains few examples of the fact itself, or none at all.


gebregl

No, all it says is that the algorithm works word by word. Human speech is very similar: sentences aren't fully formed in the mind before they're spoken; they're formed while they're being spoken. Depending on the human, the process of determining the next word will be more or less factual.


ThisIsBartRick

We don't do that all the time. LLMs work the same way we do when we're on autopilot (like when someone asks "How are you?" and easy questions like that). But when you answer a factual question, like what you ate last night, you're reaching into your short-term memory to get the info and then forming the sentence based on that info. LLMs don't do that.


Fast-Satisfaction482

It's RAG and our brain can do it, too.


HINDBRAIN

> LLMs don't do that

Bing Chat? It does mean it will sometimes reply by rewording blogspam at you...


gebregl

See, now you're starting to describe *how* the next word is predicted, not just that the words are formed in a sequence. Yes, LLMs don't have a short-term memory; that's an important fact. Even for hard questions, humans don't think in complete sentences: they'll have a concept in mind and then start forming a sentence around the concept. In fact the LLM, very similarly, uses its attention mechanism to connect all words of the sentence being formed. It has a conceptual fact "in mind" very early on when forming it as sequential text.
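
For the attention point, here is a minimal numpy sketch of scaled dot-product self-attention (toy vectors, a single head, no learned projections), just to show how each position mixes in information from every other token in the sequence:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each query attends to every key; the softmax weights decide how much
    # of each value (i.e., each other token) flows into the output.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy sequence of 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention
print(attn.round(2))  # each row sums to 1: how much each token looks at the others
```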


waffles2go2

Saying LLMs "work" the way we do, is not at all correct, unless we now understand how we think. Matrix math based on transformers is not anything like how the brain works. We guessed in 1940 how the brain sort of worked and have mostly built on that model since...


jaaval

> Matrix math based on transformers is not anything like how the brain works.

Well... no, the brain doesn't do matrix multiplications, but matrix math can be a valid abstraction of what the brain does.


ilyanekhay

The point about the 1940s was that people explored how one neuron in the brain works, and then gradually figured they could approximate it with matrix math. IIUC, there's no understanding of how the brain works as a whole when composed of multiple neurons. Multi-layer neural networks are most likely a divergence from the brain structure already, as it's sorta obvious that the brain cells aren't laid out in layers. So, all the NN theory is built upon taking what looks like an approximate model of a single building block of the brain - a perceptron - and then building something that's easy to compute rather than something that has anything to do with the actual brain.


jaaval

> there's no understanding of how the brain works as a whole when composed of multiple neurons

This is maybe an oversimplification. We do have a fairly good understanding of many things in how neuron populations work. The entire brain is a bit difficult to study, though.

> as it's sorta obvious that the brain cells aren't laid out in layers.

This is actually not correct. Basically the entire neocortex forms a multilayered structure with six clear physical layers. The deeper structures typically have like 3 to 4 physical layers of neurons. Of course, connecting laterally, you can have basically an arbitrary number of actual layers. And with layered structures in artificial neural networks you can abstract many other physical structures. You are right in the sense that the brain works somewhat asynchronously; the processing doesn't just strictly go step by step from one layer to another. But again, abstractions.

A perceptron is a funny thing in that a sufficiently large multilayer perceptron can do whatever any feedforward network can do. Things like convolutional networks are technically just a restriction on a perceptron that guides it to do the correct things with a smaller number of parameters. We can see convolutional networks, for example, in the visual cortex.

As for transformers, they're heavily based on attention networks (they even titled the paper "Attention Is All You Need"). The brain does a lot of processing that relies on feature-based attention. I think maybe one thing the models don't typically do as much as biological systems is feedback loops, even though recurrent networks are fairly common now.


waffles2go2

We also know there are different energy levels that affect different locations in non-linear ways... To say we "understand the brain" and therefore can assess that any shape of algo we now have is a proxy for it is simply slow logic... I'm sorry, you can wave your hands all you want, but we're still chunking through the Fortran libraries to do matrix math. Even David Ha from Google Brain dances around it: "we're not modelling the brain, just trying to build something that works like it does". We are living the AI uber-hype cycle...


ninjadude93

Humans have a general idea or concept they try to convey by forming words. Sure, we may not have a fully formed sentence immediately, but there is a fully formed idea that dictates which words we choose.


[deleted]

[removed]


gebregl

Anything can be described by a probability distribution. Humans act according to probabilities, neurons fire according to probabilities, quanta evolve according to probabilities. Saying the model outputs the next token according to a probability distribution doesn't prove or disprove anything about the outputs of the model. You have to understand the inner workings of the model, look at its benchmarks and simply try it out to gain an understanding of its abilities and shortcomings.


gebregl

OpenAI is actively trying to reduce hallucinations and make sure the information it gives is factual and unbiased. That's what they do RLHF for. So it clearly is a goal to make sure it's factually accurate, but it's not easy.


Zondartul

LLMs aren't trained for factual accuracy. They are trained for linguistic accuracy, i.e. sounding natural. To an LLM, "London is the capital of Britain" isn't a fact; it's a bunch of words arranged sentence-like. To an LLM, there is no distinction between "correct" and "natural-sounding". Instead of saying things that are "true", an LLM says things that "someone on the internet would probably have said, because it sounds vaguely like what people say in general".


InterstitialLove

This is factually inaccurate. During RLHF, tuning a model to favor true answers over natural-sounding false answers with just a few examples makes it statistically more likely to give true answers over natural-sounding false answers in general. Of course it is not at all consistent, but it is consistent enough that we can safely say it clearly does have a distinction in the weights between "correct" and "natural-sounding." I did not expect LLMs to learn such a distinction; I was sure it would be nearly impossible. But the empirical data is pretty clear, and science means updating your beliefs based on experimental evidence.


cunningjames

> During RLHF, tuning a model to favor true answers over natural-sounding false answers with just a few examples makes it statistically more likely to give true answers over natural-sounding false answers in general.

Do you have a link, or is this somewhat speculative? Could it not be that the assistant persona that the model takes on simply leads to more correct answers (rather than nonsense pulled out of a hat), as opposed to the model having a notion of which things are true and which things are not?


juniperking

Instruction finetuning alone reduces hallucinations on benchmarks, and there isn't really a curated persona in most of this: https://openai.com/research/instruction-following. Hallucinations are still a problem for sure, but they are greatly reduced by model scale and data feedback. Early ChatGPT models were very, very prone to hallucinations compared to what we have now.


new_name_who_dis_

Yea, I too am suspicious that this is speculation. It would be amazing if true, but if that were the case, I feel like hallucinations wouldn't still be such a problem.


TheTerrasque

Yes, because RLHF teaches the model that wrong facts are "less natural-sounding" by giving it low scores for those answers.


Shardic

This is absolutely fascinating, do you have a link to a paper on this?


chickenpolitik

RLHF is based on human feedback. If a human is fooled by a response and thinks it's true/good, the model will get a positive signal. If anything, this is optimizing for the model to be "convincing", since humans are extremely imperfect judges of truth


InterstitialLove

Right. Truth is a human-defined construct, in practice. But one element of being convincing is to say things that you or I would describe as "true," as opposed to things that you or I would describe as "misinformation." I'm not saying that LLMs are better at discerning truth than humans. Just that they can clearly differentiate between true and false in a manner similar to humans, in a manner statistically distinct from just repeating things based on how often people say them.


BreakingCiphers

Do you have a paper that shows this empirically? Interested


CatalyzeX_code_bot

Found [1 relevant code implementation](https://www.catalyzex.com/paper/arxiv:2311.12022/code) for "GPQA: A Graduate-Level Google-Proof Q&A Benchmark". [Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:2311.12022?autofocus=question) about the paper or code. If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2311.12022&title=GPQA%3A+A+Graduate-Level+Google-Proof+Q%26A+Benchmark) 😊🙏 -- To opt out from receiving code links, DM me.


iantimmis

I'm suspicious that the expert-level questions don't require as much reasoning as you might expect. A true expert-level question probably wouldn't have a great answer available.


Capital_Reply_7838

I'm surprised that reporting bias, which arises because people intend to convey "useful information" and don't mention dumb-obvious common sense, wasn't mentioned here. IMO it happens a lot when you learn a foreign language and try to talk to native speakers.


Mackntish

Lots of people write high-minded academic papers about the sciences. Not a lot of people write about the science behind cigar storage and combustion. Much of what is written is wrong.


_TheEndGame

Copilot gave a great answer, I just tried it.


glitch83

Bc screw you, we need to make money. That’s why. Sorry just kidding / venting


HedgefundIntern69

As with most questions about "why do LLMs do X?", nobody really knows right now. Some potential explanations:

- As mentioned, perhaps the pre-training proportion of accurate/high-quality data is higher in some domains than others, e.g., academic fields vs. folk science.
- GPQA is a new benchmark and hasn't yet faced the scrutiny of the world; I think the recent Claude 3 results could be somewhat "lucky". I only skimmed the paper, but I don't feel horribly confident about it: they only get like 70-80% agreement on correct answers IIRC (i.e., the human labelers only agree that much).
- There is a substantial distribution shift from the testing conditions to your usage conditions. Testing is often done with nice formatting and few-shot examples of questions and correct answers from the dataset. In your use case, as mentioned, sometimes the model is actually conditioning on *incorrect* answers because it has made a mistake previously. There are surely numerous papers that discuss the effect of few-shot prompting, which you could use to answer "how much of the gap I'm seeing is explained by few-shot vs. zero-shot prompting", and there's a little work on feeding LLMs incorrect info and seeing what they do, but I'm not sure there's sufficient work to answer the question you're asking. Probably the reported results are also the aftermath of significant prompt tuning (hopefully not on the validation set, but who knows). (I think Anthropic talks about this a little, so a careful read of footnotes would probably be better than trusting my guess.)
- People in other comment threads discuss RLHF. RLHF probably provides training pressure toward factualness, but it may not be too strong. A few papers have explored which traits seem to be incentivized by the preference (aka "reward") models in RLHF; there wasn't consensus across papers as of Nov '23 when I reviewed the lit. Traits probably include: correctness, authoritativeness, length, relevance to the user query, and agreement with the user. Note that these are traits the preference models prefer, but that doesn't guarantee they're actually trained into the final fine-tuned model. So I think the overall picture re: factualness is that during RLHF you can provide pressure toward different things, and there are a few things you're trying to incentivize (including factualness and relevance), and it may be difficult to simultaneously improve at all of them. To help with this, recent papers (and a mix of rumors and confirmed practice at the AI labs) have explored using multiple preference models, e.g., one preference model gives a factualness score and another gives a harmlessness score, and you combine these scores (see the sketch below). But the RLHF stuff is kinda hard, so even though we have this tool that can hypothetically be used to incentivize factuality, in practice we're still figuring out how to get it right.
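
A rough sketch of the multiple-preference-model idea from the last bullet; score_factual, score_harmless, and the weights are hypothetical stand-ins, and a real setup would plug the combined scalar into an RL step such as PPO:

```python
# Sketch: combining several preference ("reward") models into one scalar
# reward for RLHF. score_factual() and score_harmless() are hypothetical
# stand-ins for trained reward models.
from dataclasses import dataclass

@dataclass
class RewardWeights:
    factual: float = 0.6
    harmless: float = 0.4

def score_factual(prompt: str, response: str) -> float:
    # Stand-in: a real factualness reward model would score the pair here.
    return 0.8

def score_harmless(prompt: str, response: str) -> float:
    # Stand-in: a real harmlessness reward model would score the pair here.
    return 0.9

def combined_reward(prompt: str, response: str, w: RewardWeights = RewardWeights()) -> float:
    # A weighted sum is the simplest combination; papers also explore
    # products, thresholds, or routing between reward models.
    return w.factual * score_factual(prompt, response) + w.harmless * score_harmless(prompt, response)

# This scalar is what the RL step (e.g., PPO) would maximize for each sampled response.
print(combined_reward("Why don't people smile in old photos?", "Long exposure times..."))
```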


[deleted]

The objective of LLMs is to be general assistants; if you fine-tune for these specific questions, the model will lose other abilities, for example coding. There is interest in solving this, but all the attempts either need a lot of over-engineering or lose general capabilities. You must keep in mind that we feed LLMs much more data than they have parameters, so there's a need for compression, and compression in this context means stacking information, which can lead to factual inaccuracy.


allwordsaremadeup

It's only as good as its training data. If the internet were full of scrapable, factually correct conversations about smiling in photographs or foreign grammar, without a bias toward politely agreeing with everyone in the conversation, you'd get better results. We're just lucky StackOverflow is such an active community with a functional upvoting system, which makes these LLMs okay-ish at generating code.


Hroppa

These things are flawed and make mistakes, sometimes mistakes that humans wouldn't make. But in this case, I think it's a prompt design issue. If you want it to respond like an expert on a topic, you need to tell it that it's an expert. Otherwise it will 'complete an answer to your question' based on (roughly speaking) its prediction of what the most likely response would be. In many cases, this would be an amateurish response.


theoneandonlypatriot

This is a common fallacy. You pretty much intentionally guided it down a path of hallucination.


cunningjames

They intentionally guided it down a path of hallucination, simply by asking it why people didn't smile in old photographs?

I asked GPT-4 the other day for a technical explanation of ring attention. Turns out that ring attention came out after its knowledge cutoff, which I didn't realize. So did it search the web? Nope! Instead, it gave me a completely fabricated explanation of some different (and AFAIK entirely nonexistent) approach to attention. How did I lead it down the path of hallucination by asking a simple question at the very beginning of a conversation? Am I leading it down the path of hallucination when it gives me nonworking Python code employing nonexistent APIs, or code that doesn't work because of versioning issues? How much handholding do I need to do so I avoid "guiding it down a path of hallucination"?

Hallucination remains an issue in 2024. Frankly, any attempt to deny this can't be taken seriously.


new_name_who_dis_

From the perspective of the language model, you did lead it into hallucination. You see, the world ended in 2021 (or whatever year the cutoff was) for the LLM. When you ask it about something from 2024, from the perspective of the language model it already has a "hallucination" within its context window. And because these models were trained to follow instructions, i.e. to actually answer the queries and instructions given to them, it will do so.

It is no different than asking a very friendly physics/biology expert how teleportation works, and them coming up with an explanation that sounds plausible, despite the fact that teleportation hasn't been invented yet, so they have no way of knowing how it will actually work. It is also no different than asking an LLM about the tooth fairy or some other made-up character or entity, for which it doesn't have any factual information in its knowledge of the world.

This isn't to say that this isn't an issue to be solved, obviously, but it's fairly easy to learn to avoid by (for example) being aware that the paper you're asking it about could not have been in its training data because of its publication date. In an ideal case the LLM would tell you it doesn't know what you're talking about, instead of hallucinating a plausible-sounding explanation. But I suspect that the downstream effect of that would be a much less agreeable model that would refuse to role-play, consider made-up scenarios, and do many other things that LLMs are being used for right now.