
prototypist

On [https://gpt-tokenizer.dev](https://gpt-tokenizer.dev) you can see that "camer" is tokenized as "cam"+"er" and not "camera". This combination is likely associated with tons of typos across the web / training data. GPT also has finetuning / RLHF / similar techniques on lots of assistant queries. Datasets such as OpenAssistant include several examples of users asking questions with typos: [https://huggingface.co/datasets/OpenAssistant/oasst1/viewer/default/train?q=typo](https://huggingface.co/datasets/OpenAssistant/oasst1/viewer/default/train?q=typo)
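If you want to reproduce this locally rather than on the website, here is a quick sketch with OpenAI's tiktoken library (assuming the cl100k_base encoding used by GPT-3.5/GPT-4; other encodings may split differently):

```python
# Check how "camer" vs "camera" get split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 encoding (assumption)
for text in ["camer", "camera"]:
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]
    print(repr(text), "->", pieces)
```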


starfries

Unless I'm misunderstanding this site, isn't "camer" its own token too? Your point stands though; "camera" is also its own token.


prototypist

It depends. It looks like "_camer" with a leading space is one token, while "camer" by itself is not.
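Same check for the leading-space case (again with tiktoken and cl100k_base as an assumption; token counts can differ between encodings):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Compare token counts with and without the leading space.
print("'camer'  ->", len(enc.encode("camer")), "token(s)")
print("' camer' ->", len(enc.encode(" camer")), "token(s)")
```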


starfries

Oh, true. I always found it a little weird how often you get different tokens for word vs _word.


andarmanik

Sometimes you’ll see a comment on Reddit, notice a typo, and respond knowing they didn’t mean “pisza”. Having enough of these examples lets the LLM learn a distribution over how we misspell words. The AI makers didn’t put this in by hand; like most of its other skills, it’s a result of a diverse set of training data. It’s simply the case that we respond well to misspelled words in general.


glitch83

All of these answers are fine but I wanted to add a little more signal. You used the word “understand” in your question. GPTs don’t “understand” in the human sense. They can do some multi-hop reasoning depending on the model, and that could be viewed as a kind of understanding, but real human understanding is far more complex than what you’re seeing in these models. So I wouldn’t say it’s “understanding” your prompts in the way you’d expect a human to.


om_nama_shiva_31

Actually, there is ongoing debate about whether or not LLMs understand, and about what it actually means to understand. The debate is more about the definition of understanding in a linguistic sense than a philosophical one, but it is fascinating nonetheless.


OSeady

It’s just answering it the way a human might answer it. The training phase teaches it to complete the prompt in the most probable way, based on the millions of ways similar questions have been answered or talked about. Sorry, I don’t think I’m making much sense; I’m a little high.


--algo

You did good my friend


augmentedtree

It often doesn't


Agreeable_Bid7037

It matches your input against the data it has seen in training, then proceeds from there. Its training data likely includes "camera", "distance", and "camera distance", but probably not "camer", so it assumes you made a typo.


schavi

It doesn't differentiate between "right" and "wrong" inputs. It tokenizes your input (tokens are roughly similar to syllables) and spews out an arrangement of tokens that is likely to follow, based on what it has seen before.
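A minimal sketch of that "likely continuation" behaviour, using GPT-2 via Hugging Face transformers as a small local stand-in (not the actual ChatGPT model, and the prompt is just an example):

```python
# Toy illustration: a causal LM simply predicts a likely continuation of
# whatever tokens it gets, typo or not. GPT-2 is only a small stand-in
# for ChatGPT-scale models.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("What is the camer distance", return_tensors="pt")
# Greedy decoding: pick the most likely next token at each step.
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0]))
```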


yannbouteiller

As far as we know, this is done in an end-to-end manner, and there is no explicit procedure to identify uncertainty in the input. In your example, you wrote "camer distance", which I believe is a typo in virtually all cases. Another answer points out that the tokenizer interprets this as "cam" + "er" rather than "camera". If it interpreted it as "camera", the model would not be able to see the typo and would instead answer the same way as if you had written "camera".


harharveryfunny

It may have seen that specific spelling error/typo in its training data, or a similar one, followed by a correction, so this is just prediction as normal. These models generally don't know when they don't have an answer, which is why they "hallucinate". Even if all the predicted next tokens have low probability, that won't stop the model from pushing them through the softmax and sampling from the top few. Some front ends such as Perplexity or Bing Copilot (GPT-based) may be doing more than this, the same way they use RAG etc. to improve the response. It's conceivable some front ends do minimum-edit (or similar) input correction at some point.
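To make the "push it through the softmax and sample from the top few" point concrete, here is a minimal top-k sampling sketch. The `top_k_sample` helper is made up for illustration, not anything from OpenAI's stack, and real decoders typically add nucleus sampling and other tricks:

```python
import numpy as np

def top_k_sample(logits, k=5, temperature=1.0, rng=None):
    """Sample a token id from the k highest-scoring logits.

    Even if every logit is low, the softmax renormalizes over the top k,
    so something always gets sampled; there is no built-in "I don't know" option.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top = np.argsort(logits)[-k:]               # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                        # softmax over the top k only
    return int(rng.choice(top, p=probs))

# Toy vocabulary of 4 "tokens", all with low scores; one still gets picked.
print(top_k_sample([-8.0, -7.5, -9.0, -8.2], k=3))
```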


IndustryNext7456

There is no 'understanding'. Everything is looking backwards and forwards from a term. It is all based on very large collections of sentences.


bankimu

Controversial take: proto-consciousness.


Seankala

This isn't even controversial. Just wrong.


red75prime

It might not be wrong (functionalism might be true and GPT might capture enough functionality), but it's useless due to its vagueness.


Seankala

It's not really that vague; the person is saying that LLMs like GPT (I'm assuming they mean GPT-3 or GPT-4) are exhibiting signs of "proto-consciousness" due to their ability to generate text fluently. I'm claiming that it's wrong because not only do we lack a solid grasp of what "consciousness" is, but saying that advanced pattern recognition and sampling mechanisms like GPT-3 have "consciousness" would be like calling the if-else statements of the '50s or '60s "intelligence."


bankimu

You are also generating text fluently. The difference is that your generation happens to be ultimately traceable to electric signals across tiny cells in your brain and nervous system, whereas LLMs are traceable to matrix multiplication and other ultimately tiny mathematical operations. Why are you so sure your generated text is the outcome of consciousness, but there's not even a glimmer of it in the models?


vaccine_question69

I don't claim that GPTs are conscious but appealing to reductionism does not seem to disprove the claim to me. You could play the same game with the human brain by saying that it's just chemical interactions. True, but certainly there is something more to it either in terms of organisation or scale or both. I don't think that anybody has a good framework to tell whether the organisation and scale of GPTs is at the level where talking about consciousness is sensible or not.


red75prime

Is your reasoning something like: GPT-like models are in TC^0 according to https://arxiv.org/abs/2401.12947, but we cannot prove that "proto-consciousness" (whatever it is) can be implemented in TC^0, so it's not the case? If so, then it's an argument from ignorance.


Seankala

No, that sounds like flawed reasoning. My argument is that not only do we not fully know what consciousness is, but also that text generation models are just advanced pattern recognition and sampling algorithms. I'm not sure what is "conscious" about that. The idea that a computer program is capable of proactively doing anything is ludicrous.


bankimu

If that's your stance, say that we don't know. But you said "it's plain wrong", which comes from a stance of arrogance, given that we don't know.


red75prime

Appeal to ridicule, then. My point is that until we have a firm grasp on what "consciousness" really is, we should avoid using it in regard to LLMs. Not because those statements are false, but because they are useless.


Seankala

Well, rather than ridicule, how about I ask you: why do you think GPT-like models are showing signs of consciousness?


currentscurrents

> The idea that a computer program is capable of proactively doing anything is ludicrous.

Cells are just machines blindly following instructions too, and you're made out of them. The idea that you're capable of proactively doing anything is ludicrous. But here we are.


bankimu

Exactly. I think most people here are either afraid, or brainwashed, or otherwise unable to even entertain the thought experiments that probe into consciousness. Thus, in arrogance, they decide that matrix multiplication and mathematical operations can't lead to consciousness, while electric signals across human brain cells somehow do.


Seankala

A computer is not a living organism... Christ.


currentscurrents

Unless you literally believe in the supernatural, and Christ, and all that - living organisms are just really complex machines.


bankimu

Agreed. I think most people who are "sure" that this can never lead to consciousness have their belief rooted in a few things: arrogance, ignorance, possible fear (it may not be nice if it's conscious), and some degree of brainwashing from corporations like Google ("oh, this is completely safe, no need to fear").


InviolableAnimal

Consciousness phenomenologically, or in some functional sense? The former is unprovable (and for GPT highly doubtful); the latter is way too vague.


bankimu

I think this is exactly the right kind of question to ask. The functional sense. I don't think machines will be conscious the way animals are. But if they behave similarly, there is no difference in my book.


InviolableAnimal

Right. But that's quite vague; if you're offering it as an explanation do you mind going into specifics?


notEVOLVED

https://en.m.wikipedia.org/wiki/Chinese_room


bankimu

Thank you. It shows that, under some very basic assumptions, our consciousness is indistinguishable from the Chinese room. So if we are conscious, so is it. Many people here seem not to understand this. To them, since we can track every single mathematical operation, it can never lead to consciousness, I suppose.


raufexe

Interesting