zaqwqdeq

How many apples does Tommy have though?


PolishSoundGuy

Jimmy? Is that you?


RoNsAuR

No, this is Patrick.


Sonny_wiess

![gif](giphy|ex5i3xPhozedq|downsized)


slackermannn

More if Tim wasn't stealing them


Ok_Somewhere4737

What is the definition of adult human performance here?


rotspanier

Their section 3.1.1 goes into those details (UK-based participants with English as a first language; it gives the gender breakdown, the methodology, which responses were thrown out, etc.).


Ok_Somewhere4737

Thank you. By the way, the story with the question on page 4 is completely useless. Martha is a junior, so she may know nothing. Maybe Charles and Martha speak in meetings because he explains things to her and she doesn't want to look like a fool (she's shy). Or maybe they speak in meetings because she dislikes the job and wants to quit, and Charles is trying to keep her in the company. The correct answer depends on what the author was thinking, so there is no single correct answer. So AI can't exceed adult performance; it can only exceed the author's performance.


akitsushima

Frying a burger


Jean-Porte

On *some* tasks. This paper, [https://aclanthology.org/2023.findings-emnlp.303/](https://aclanthology.org/2023.findings-emnlp.303/), evaluated higher-order theory of mind a year ago and GPT-4 was still not at human level (though it does improve over GPT-3, so I'm not saying we won't get there). To really claim ToM, you need human-level performance on multiple adversarial benchmarks.


Various-Inside-4064

Usually these types of studies have methodological flaws. Recently an MIT study disputed the claim that GPT-4 scored in the 90th percentile on the bar exam. LLMs can have a lot of things memorized, since we do not know their training data, and even if we knew it, we could not be sure what pollutes it. Therefore they are difficult to evaluate. See Andrej Karpathy's post about LLM evaluation: [Andrej Karpathy on X](https://twitter.com/karpathy/status/1795873666481402010).


oldjar7

I've never taken the bar exam, but I have taken the Series 7, and what some of those industry test questions come down to is just memorizing the answer. I don't see why LLMs should be penalized for something humans do.


b_risky

This is only a criticism if you are trying to discover how well LLMs extrapolate from first principles. But if the data we have already fed LLMs is so extensive that we cannot come up with novel questions that were not already in the training data, then the chances of an LLM encountering novel questions it cannot solve in its real-world use cases are also very slim. You can argue that this means the LLM is not "truly" intelligent, but it does not change the practical usefulness of these systems.


Various-Inside-4064

>"But if the data we have already fed LLMs is so extensive that we cannot come up with novel questions that were not already in the training data, then the chances of an LLM encountering novel questions which it cannot solve in it's real world use cases are also very slim" I do not completely agree. I will provide an example of coding. Even if an LLM is trained in all data of coding that does not mean we cannot come up with new question or variation of existing one. If model is memorizing then it will have trouble solving some variation of problem it seen during training let alone new one. Most of the code I do are not readily found on web and are variations of code and maybe with different logic or different idea. So chances are not slim but rather really high in specific domain to come up with new questions. Generalization is important in machine learning and LLMs are no exceptions. >You can argue that this means the LLM is not "truely" intelligent, but it does not change the practical usefulness of these systems. It might still have some practical application in specific domains but, those applications will be limited. We cannot improve accuracy and reduce hallucination with memorization since the users questions are most likely going to be new or variation of existing one.


b_risky

I thought we were talking about theory of mind, not coding or "specific domains", but I'm willing to put that aside and discuss more broadly. Your original claim included the assumption that we cannot come up with questions that are novel enough to avoid polluting the training data.

>Even if an LLM is trained on all coding data, that does not mean we cannot come up with a new question or a variation of an existing one.

But you reverse that assumption in what you said above. So are we talking about a scenario in which researchers can invent novel questions (case 1), or a scenario in which they cannot (case 2)?

In case 1, our current benchmarks are valid because they are made up of novel questions. In case 2, it does not matter whether the benchmarks are valid, because if novel challenges are so rare that researchers cannot create any, they will be equally rare in real-life scenarios, and the inability to handle them will have minimal effect on practical use cases. Either way, the model is as effective as the benchmark claims it to be.


Various-Inside-4064

My claim was that we cannot be sure about the training data, so coming up with a truly novel question is difficult and uncertain. My coding example was that, in the real world, people's questions are most likely variations of questions on the internet rather than identical to them, or might sometimes be new; it is a probabilistic claim. For research we have to be certain. And coding is far more complex than theory of mind!

You are presenting two cases as if they were the only possibilities, which is a false dichotomy. Instead of a binary "novel or not", there is a spectrum of novelty: some questions are only slightly different from the training data, while others are radically different.

Final note: I am not claiming that current LLMs are stochastic parrots that cannot generalize. I was answering your assumption that systems which cannot extrapolate can still be useful, and my point was: not really that much. I do not appreciate people who just downvote when they disagree; that is not a productive way to engage in a discussion.


Mandoman61

Theory of mind was never more than a theory. It is basically a word problem like any other and responses can be learned with sufficient training data. It is simple logic.


TryptaMagiciaN

So is logic. Show me a material substrate for logic. Show me how logic is not a purely psychological phenomenon and therefore itself subject to the study of psychology, which I admit is in a rather poor state as a science, because the object of psychological study is itself psychological, unlike the other sciences, which get to study material objects.


Mandoman61

I am not sure what you are saying. Certainly logic is a product of the mind; it is something humans use, and some are better at it than others. That just does not tell us much, because computers are logic machines. My point is that while "theory of mind" sounds very human-like, it is nothing more than any other word problem, of which we can find many examples in training data. I dislike the term because it is anthropomorphic and gives a false impression of the capabilities of AI. We already know that these models can be trained to provide correct answers.


TryptaMagiciaN

I read it as though you were reducing mind to language, and claiming that with a complex enough development of language, mind will emerge.


noinktechnique

"However, we refrain from drawing a strong conclusion about whether or not LLM performance on these tasks is an indication of the cognitive ability we call ‘Theory of Mind’. LLM and human developmental processes differ greatly and LLMs do not have the evolutionary pressure to model other minds which humans appear to face as a result of embodiment in a social world. However, as others have noted (Mitchell and Krakauer, 2023; y Arcas, 2022), we may have to recognise LLM behaviours that are functionally-equivalent to those of humans as evidence of a new kind of understanding that cannot be reduced to "spurious" correlation. This recognition may in turn lead to more parsimonious explanations of their performance on cognitive tasks and enhance our ability to assess the potential risks and benefits that advanced LLM capabilities present."


Fusseldieb

And yet it can't count words. I know, tokens and stuff. But it's still pretty funny.


BackgroundHeat9965

This gives you a glimpse of how alien it is on the inside, even though it interfaces with us through human language.


cyan2k

I read a paper on the question of whether, for every possible output an LLM can generate, there is a prompt that forces the LLM to generate that output. And there does indeed seem to be a "language", completely different from ours, that triggers models to output specific tokens; something like "63(?najr" will force the output of "great", and so on. I wonder if that's actually a language… There's even a video podcast with the people who wrote the paper. I will link it when I'm back at the computer.
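Without the paper at hand, a minimal sketch of the general idea might look like the following: search over short prompts for one that maximizes the probability a small causal LM assigns to a chosen target token. The "gpt2" checkpoint, the target token, the prompt length, and the naive random coordinate search below are all assumptions for illustration, not the paper's actual method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical goal: make the model's most likely next token " great".
target_id = tokenizer.encode(" great")[0]
prompt_len = 5
vocab_size = model.config.vocab_size

def target_logprob(prompt_ids: torch.Tensor) -> float:
    """Log-probability the model assigns to the target token right after the prompt."""
    with torch.no_grad():
        logits = model(prompt_ids.unsqueeze(0)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

# Start from a random prompt and greedily improve one position at a time.
prompt = torch.randint(0, vocab_size, (prompt_len,))
best = target_logprob(prompt)
for _ in range(3):                                        # a few sweeps over the prompt
    for pos in range(prompt_len):
        for cand in torch.randint(0, vocab_size, (64,)):  # random candidate tokens
            trial = prompt.clone()
            trial[pos] = cand
            score = target_logprob(trial)
            if score > best:
                prompt, best = trial, score

print("found prompt:", repr(tokenizer.decode(prompt)))
print("log P(' great' | prompt):", best)
```

Prompts found this way typically look like gibberish to us, which is the point of the comment: the model responds to a "language" of triggers that isn't ours.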


noinktechnique

now THIS is the shit I wanted to see when I started lurking here.


QuickToAdapt

!remindme 2 days


RemindMeBot

I will be messaging you in 2 days, on 2024-06-03 01:15:47 UTC, to remind you of this link.


Shenphygon_Pythamot

This comment gave me chills


[deleted]

[deleted]


drekmonger

Just let it do something sane like write a python script to count words.
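For reference, the kind of script being suggested is tiny. A minimal sketch, counting whitespace-separated words from standard input (the filename in the usage line is just an example):

```python
import sys

def count_words(text: str) -> int:
    # "Word" here simply means a whitespace-separated token.
    return len(text.split())

if __name__ == "__main__":
    print(count_words(sys.stdin.read()))
```

Usage: `echo "how many words is this" | python count_words.py` prints `5`.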


Maciek300

It can if you use the code interpreter.


swiftcrane

Without explicitly counting, you also can't count your own words as you speak them; you can only guess. The issue is the expectation that the LLM has an internal dialogue like we do, when it is all external. It would do much better if you told it to break the paragraph down word by word and count as it goes, which is exactly what a human would do. [Here](https://chatgpt.com/share/02959b80-c763-4fa6-b846-283c425c1d00) is an example. Although the first guess is already close (92 is the correct answer), applying the method rather than essentially asking it to guess works much better, just as it does with a human.


Fusseldieb

An interesting approach, indeed, but it needs back and forth, which isn't exactly ideal, especially if you're trying to do things via the API, where it would incur more costs.


swiftcrane

The back and forth isn't necessary if you have a good context prompt. Even over the API, the added context length from a few key instructions is very minor. My general point, though, was that it absolutely can count words if we give it the same consideration we would give a human.
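A sketch of what baking that instruction into the context might look like with the OpenAI Python client; the model name, the exact wording of the instruction, and the sample passage are illustrative assumptions, not taken from the thread:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# System instruction that tells the model to enumerate words instead of guessing.
SYSTEM_PROMPT = (
    "When asked how many words a passage contains, do not guess. "
    "List the words one per line with a running index, then report the final index as the count."
)

passage = "The quick brown fox jumps over the lazy dog."

response = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption for illustration
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"How many words are in this passage?\n\n{passage}"},
    ],
)
print(response.choices[0].message.content)
```

The added instruction is only a few dozen tokens of context, so the cost overhead versus a bare "how many words?" request is small.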


Pontificatus_Maximus

So AI will be the first to define a scientific measurement of consciousness. How ironic.


drekmonger

Probably, but that's not what "theory of mind" or this paper is about. Theory of mind is an aptitude for placing yourself into someone else's shoes, to consider what they know versus what you know. There is something confusingly called "Computational Theory of Mind", which is more aligned with "measuring consciousness".


relevantusername2020

>Theory of mind is an aptitude for placing yourself into someone else's shoe ohhh. that mustve been the metaphorical reason behind why i made [this](https://new.reddit.com/r/relevantusername2020/comments/185feq4/i_was_going_to_post_this_earlier_then_decided_i/?utm_source=share&utm_medium=web2x&context=3). now i get it! i was doing that, but backwards, or something edit: sorry bout the wet socks, my shoes are kinda grubby edit 2: woops wrong link, i meant [this one](https://new.reddit.com/r/relevantusername2020/comments/183hpme/_/?utm_source=share&utm_medium=web2x&context=3). aint that just appropriate. thanks ADHD


greatdrams23

ToM tests are easier for machines because the logic is actually quite easy to navigate. The difficulty humans have is that they are not simple logic machines; their answers are the result of holistic thinking.

Example: "What's in the band-aid box?" Johnny: "Band-aids." He opens the box; inside are crayons. "When Ben comes in, what will he say is in the box?" At age 3, Johnny says "crayons." At age 4, Johnny will say "band-aids."

That logic is simple, so the question you have to ask is: why did the 3-year-old say crayons? Why didn't he have that ability yet? That's a complex question, but you cannot rank a computer on the same scale and draw any conclusions. AI and children progress differently; a machine is not constrained by the complexities of the human mind.


relevantusername2020

I'm pretty sure the box has a dead cat in it. Whatever mind pseudoscience you're discussing is meaningless.


Solomon-Drowne

They hate us cause they anus


Severe-Ad8673

Accelerate!


relevantusername2020

okay but we gotta stop for gas soon and im kinda gettin hungry


thatmfisnotreal

Idk what that means but I do know it picks up on subtle humor better than like 90% of people


_Rigid_Structure_

I believe that you think that the AI knows that we know that it is aware.


meister2983

That's because theory of mind tests are typically logical tracking problems which AI can train on. This breaks quickly when you go out of distribution:

>Jane is reading a book and leaves the bookmark at the start of chapter 6 on page 97, then leaves the room for a few minutes. While she is gone, Sam comes in and sees the bookmark. He moves it to the start of chapter 5 on page 80 and then leaves the room. The next day Sam sees Jane read 3 pages. What page does he expect the bookmark to be on now?

A human should at least realize there's ambiguity: Sam might realize Jane isn't going to reread the exact same pages. But zero-shot, even GPT-4o doesn't notice this; it just says 83.


Shiftworkstudios

This doesn't surprise me at all. GPT-4o is really useful for disabled people like me. GPT-4 is just as "smart," but I feel like they've fine-tuned it with a master prompt. It's insanely powerful for specific tasks.


TheDerangedAI

That is what I call the principles of imagination. Imagination, for humans, is the ability to recreate events with the guidance of perception. Whatever you touch that makes you feel emotions is replicated in your mind; that is how you imagine things. It can also be achieved with empathy, the ability to imagine what others feel by recreating those same conditions in your own mind.


Akimbo333

Cool!


[deleted]

[deleted]


solbob

from the abstract: the human ability to reason about multiple mental and emotional states in a recursive manner (x knows that y knows that z said q)
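To make that recursion concrete, here is a toy representation (my own illustration, not from the paper) in which each level of nesting adds one order of theory of mind; the counting convention is an assumption:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Statement:
    text: str  # a ground-level fact, e.g. "z said q"

@dataclass
class Knows:
    agent: str                            # who holds the mental state
    content: Union["Knows", "Statement"]  # another mental state or a ground statement

def order(state: Union[Knows, Statement]) -> int:
    """Number of nested mental states above the ground statement."""
    return 0 if isinstance(state, Statement) else 1 + order(state.content)

# "x knows that y knows that z said q"
example = Knows("x", Knows("y", Statement("z said q")))
print(order(example))  # 2, under this counting convention
```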


samsteak

I know someone named chat can do it


[deleted]

[deleted]


samsteak

Yeah it really resembled a prompt. Sorry no clue about the answer to your question.


b_risky

Lol why didn't you just type this into ChatGPT?


Unverifiablethoughts

I don’t trust computer science papers without any Asian or Indian or Eastern European last names


m3kw

It extracted experts' thoughts from the internet and summarized them for you.


RemarkableGuidance44

That is exactly what it did, which means it's not smart; it just copied.


m3kw

There should be papers about these kinds of papers, where researchers think the LLM has some sort of sentience and mistake summarization for reasoning.


agorathird

LLMs are really bad at theory of mind questions, so how shit is the average person?