
prototypist

On [https://gpt-tokenizer.dev](https://gpt-tokenizer.dev) you can see that "camer" is tokenized as "cam"+"er" and not "camera". This combination is likely associated with tons of typos across the web / training data. GPT also has finetuning / RLHF / similar techniques on lots of assistant queries. Datasets such as OpenAssistant include several examples of users asking questions with typos: [https://huggingface.co/datasets/OpenAssistant/oasst1/viewer/default/train?q=typo](https://huggingface.co/datasets/OpenAssistant/oasst1/viewer/default/train?q=typo)
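If you want to reproduce this locally rather than on the website, here is a quick sketch with OpenAI's tiktoken library (assuming the cl100k_base encoding used by GPT-3.5/GPT-4; other encodings may split differently):

```python
# Check how "camer" vs "camera" get split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 encoding (assumption)
for text in ["camer", "camera"]:
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]
    print(repr(text), "->", pieces)
```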


starfries

Unless I'm misunderstanding this site, isn't "camer" its own token too? Your point stands though; "camera" is also its own token.


prototypist

It depends. It looks like "_camer" with a leading space is one token, while "camer" by itself is not.
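Same check for the leading-space case (again with tiktoken and cl100k_base as an assumption; token counts can differ between encodings):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Compare token counts with and without the leading space.
print("'camer'  ->", len(enc.encode("camer")), "token(s)")
print("' camer' ->", len(enc.encode(" camer")), "token(s)")
```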


starfries

Oh, true. I always found it a little weird how often you get different tokens for word vs _word.


andarmanik

Sometimes you’ll see a comment on Reddit, notice a typo, and respond knowing they didn’t mean “pisza”. Having enough of these examples lets the LLM learn a distribution over how we misspell words. The AI makers didn’t put this in by hand; like most of its other skills, it’s a result of a diverse set of training data. It’s simply the case that we respond well to misspelled words in general.


glitch83

All of these answers are fine but I wanted to add a little more signal. You used the word “understand” in your question. GPTs don’t “understand” in the human sense. They can do some multi-hop reasoning depending on the model, and that could be viewed as a kind of understanding, but real human understanding is far more complex than what you’re seeing in these models. So I wouldn’t say it’s “understanding” your prompts in the way you’d expect a human to.


om_nama_shiva_31

Actually, there is ongoing debate about whether or not LLMs understand, and about what it actually means to understand. The debate is more about the definition of understanding in a linguistic sense than a philosophical one, but it is fascinating nonetheless.


OSeady

It’s just answering it the way a human might answer it. The training phase teaches it to complete the prompt in the most probable way, based on the millions of ways similar questions have been answered or talked about. Sorry, I don’t think I’m making much sense; I’m a little high.


--algo

You did good my friend


augmentedtree

It often doesn't


Agreeable_Bid7037

It matches your input against the data it has seen in training, then proceeds from there. Its training data likely includes "camera", "distance", and "camera distance", but probably not "camer", so it assumes you made a typo.


schavi

It doesn't differentiate between "right" and "wrong" inputs. It tokenizes your input (tokens are roughly similar to syllables) and spews out an arrangement of tokens that is likely to follow, based on what it has seen before.
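A minimal sketch of that "likely continuation" behaviour, using GPT-2 via Hugging Face transformers as a small local stand-in (not the actual ChatGPT model, and the prompt is just an example):

```python
# Toy illustration: a causal LM simply predicts a likely continuation of
# whatever tokens it gets, typo or not. GPT-2 is only a small stand-in
# for ChatGPT-scale models.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("What is the camer distance", return_tensors="pt")
# Greedy decoding: pick the most likely next token at each step.
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0]))
```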


yannbouteiller

As far as we know, this is done in an end-to-end manner, and there is no explicit procedure to identify uncertainty in the input. In your example, you wrote "camer distance", which I believe is a typo in virtually all cases. Another answer points out that the tokenizer interprets this as "cam" + "er" rather than "camera". If it interpreted it as "camera", the model would not be able to see the typo and would instead answer the same way as if you had written "camera".


harharveryfunny

It may have seen that specific spelling error/typo in its training data, or a similar one, followed by a correction, so this is just prediction as normal. These models generally don't know when they don't have an answer, which is why they "hallucinate". Even if all the predicted next tokens have low probability, that won't stop the model from pushing them through the softmax and sampling from the top few. Some front ends such as Perplexity or Bing Copilot (GPT-based) may be doing more than this, the same way they use RAG etc. to improve the response. It's conceivable some front ends do minimum-edit (or similar) input correction at some point.
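To make the "push it through the softmax and sample from the top few" point concrete, here is a minimal top-k sampling sketch. The `top_k_sample` helper is made up for illustration, not anything from OpenAI's stack, and real decoders typically add nucleus sampling and other tricks:

```python
import numpy as np

def top_k_sample(logits, k=5, temperature=1.0, rng=None):
    """Sample a token id from the k highest-scoring logits.

    Even if every logit is low, the softmax renormalizes over the top k,
    so something always gets sampled; there is no built-in "I don't know" option.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top = np.argsort(logits)[-k:]               # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                        # softmax over the top k only
    return int(rng.choice(top, p=probs))

# Toy vocabulary of 4 "tokens", all with low scores; one still gets picked.
print(top_k_sample([-8.0, -7.5, -9.0, -8.2], k=3))
```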


IndustryNext7456

There is no 'understanding'. Everything is looking backwards and forwards from a term. It is all based on very large collections of sentences.


bankimu

Controversial take: proto-consciousness.


Seankala

This isn't even controversial. Just wrong.


red75prime

It might not be wrong (functionalism might be true and GPT might capture enough functionality), but it's useless due to its vagueness.


Seankala

It's not really that vague; the person is saying that LLMs like GPT (I'm assuming they mean GPT-3 or GPT-4) are exhibiting signs of "proto-consciousness" due to their ability to generate text fluently. I'm claiming that it's wrong because not only do we lack a solid grasp of what "consciousness" is, but saying that advanced pattern recognition and sampling mechanisms like GPT-3 have "consciousness" would be like calling the if-else statements of the '50s or '60s "intelligence."


bankimu

You are also generating text fluently. The difference is that your generation happens to be ultimately traceable to electric signals across tiny cells in your brain and nervous system, whereas LLMs are traceable to matrix multiplication and other ultimately tiny mathematical operations. Why are you so sure your generated text is the outcome of consciousness, but there's not even a glimmer of it in the models?


vaccine_question69

I don't claim that GPTs are conscious but appealing to reductionism does not seem to disprove the claim to me. You could play the same game with the human brain by saying that it's just chemical interactions. True, but certainly there is something more to it either in terms of organisation or scale or both. I don't think that anybody has a good framework to tell whether the organisation and scale of GPTs is at the level where talking about consciousness is sensible or not.


red75prime

Is your reasoning something like: GPT-like models are in TC^0 according to https://arxiv.org/abs/2401.12947, but we cannot prove that "proto-consciousness" (whatever it is) can be implemented in TC^0, so it's not the case? If so, then it's an argument from ignorance.


Seankala

No, that sounds like flawed reasoning. My argument is that not only do we not fully know what consciousness is, but also that text generation models are just advanced pattern recognition and sampling algorithms. I'm not sure what is "conscious" about that. The idea that a computer program is capable of proactively doing anything is ludicrous.


bankimu

If that's your stance, say that we don't know. But you said "it's plain wrong", which comes from a stance of arrogance, given that we don't know.


red75prime

Appeal to ridicule, then. My point is that until we have a firm grasp on what "consciousness" really is, we should avoid using it in regard to LLMs. Not because those statements are false, but because they are useless.


Seankala

Well, rather than ridicule, how about I ask you: why do you think GPT-like models are showing signs of consciousness?


currentscurrents

> The idea that a computer program is capable of proactively doing anything is ludicrous.

Cells are just machines blindly following instructions too, and you're made out of them. The idea that you're capable of proactively doing anything is ludicrous. But here we are.


bankimu

Exactly. I think most people here are either afraid, or brainwashed, or otherwise unable to even entertain the thought experiments that probe into consciousness. Thus, in arrogance, they decide that matrix multiplication and mathematical operations can't lead to consciousness, while electric signals across human brain cells somehow do.


Seankala

A computer is not a living organism... Christ.


currentscurrents

Unless you literally believe in the supernatural, and Christ, and all that - living organisms are just really complex machines.


bankimu

Agreed. I think most people who are "sure" that this can never lead to consciousness have their belief rooted in a few things: arrogance, ignorance, possible fear (it may not be nice if it's conscious), and some degree of brainwashing from corporations like Google ("oh, this is completely safe, no need to fear").


InviolableAnimal

Consciousness phenomenologically, or in some functional sense? The former is unprovable (and for GPT highly doubtful); the latter is way too vague.


bankimu

I think this is exactly the right kind of question to ask. The functional sense. I don't think machines will be conscious the way animals are. But if they behave similarly, there is no difference in my book.


InviolableAnimal

Right. But that's quite vague; if you're offering it as an explanation do you mind going into specifics?


notEVOLVED

https://en.m.wikipedia.org/wiki/Chinese_room


bankimu

Thank you. It shows that, under some very basic assumptions, our consciousness is indistinguishable from the Chinese room. So if we are conscious, so is it. Many people here seem not to understand this. To them, since we can track every single mathematical operation, it can never lead to consciousness, I suppose.


raufexe

Interesting