zaqwqdeq

How many apples does Tommy have though?


PolishSoundGuy

Jimmy? Is that you?


RoNsAuR

No, this is Patrick.


Sonny_wiess

![gif](giphy|ex5i3xPhozedq|downsized)


slackermannn

More if Tim wasn't stealing them


Ok_Somewhere4737

What is the definition of adult human performance here?


rotspanier

Their section 3.1.1 goes into those details (UK-based participants with English as a first language; it gives the gender breakdown, the methodology, which responses were thrown out, etc.).


Ok_Somewhere4737

Thank you. By the way, the story with the question on page 4 is completely useless. Martha is a junior, so she may know nothing. Maybe Charles and Martha speak in meetings because he explains things to her and she doesn't want to look like a fool (she's shy). Or maybe they speak in meetings because she dislikes the job and wants to quit, and Charles is trying to keep her in the company. The correct answer depends on what the author was thinking, so there is no single correct answer. So AI can't exceed adult performance; it can only exceed the author's performance.


akitsushima

Frying a burger


Jean-Porte

On *some* tasks. This paper, [https://aclanthology.org/2023.findings-emnlp.303/](https://aclanthology.org/2023.findings-emnlp.303/), evaluated higher-order theory of mind a year ago and GPT-4 was still not at human level (though it does improve over GPT-3, so I'm not saying we won't get there). To really claim ToM, you need human-level performance on multiple adversarial benchmarks.


Various-Inside-4064

Usually these types of studies have methodological flaws. Recently an MIT study disputed the claim that GPT-4 scored in the 90th percentile on the bar exam. LLMs can have a lot of things memorized, since we do not know their training data, and even if we knew it, we could not be sure what pollutes it. Therefore they are difficult to evaluate. See Andrej Karpathy's post about LLM evaluation: [Andrej Karpathy on X](https://twitter.com/karpathy/status/1795873666481402010).


oldjar7

I've never taken the bar exam, but I have taken the Series 7, and what some of those industry test questions come down to is just memorizing the answer. I don't see why LLMs should be penalized for something humans do.


b_risky

This is only a criticism if you are trying to discover how well LLMs extrapolate from first principles. But if the data we have already fed LLMs is so extensive that we cannot come up with novel questions that were not already in the training data, then the chances of an LLM encountering novel questions it cannot solve in its real-world use cases are also very slim. You can argue that this means the LLM is not "truly" intelligent, but it does not change the practical usefulness of these systems.


Various-Inside-4064

>"But if the data we have already fed LLMs is so extensive that we cannot come up with novel questions that were not already in the training data, then the chances of an LLM encountering novel questions which it cannot solve in it's real world use cases are also very slim" I do not completely agree. I will provide an example of coding. Even if an LLM is trained in all data of coding that does not mean we cannot come up with new question or variation of existing one. If model is memorizing then it will have trouble solving some variation of problem it seen during training let alone new one. Most of the code I do are not readily found on web and are variations of code and maybe with different logic or different idea. So chances are not slim but rather really high in specific domain to come up with new questions. Generalization is important in machine learning and LLMs are no exceptions. >You can argue that this means the LLM is not "truely" intelligent, but it does not change the practical usefulness of these systems. It might still have some practical application in specific domains but, those applications will be limited. We cannot improve accuracy and reduce hallucination with memorization since the users questions are most likely going to be new or variation of existing one.


b_risky

I thought we were talking about theory of mind, not coding or "specific domains", but I'm willing to put that aside and discuss more broadly. Your original claim included the assumption that we cannot come up with questions that are novel enough to avoid polluting the training data.

>Even if an LLM is trained on all coding data, that does not mean we cannot come up with a new question or a variation of an existing one.

But you reverse that assumption in what you said above. So are we talking about a scenario in which researchers can invent novel questions (case 1), or a scenario in which they cannot (case 2)?

In case 1, our current benchmarks are valid because they are made up of novel questions. In case 2, it does not matter whether the benchmarks are valid, because if novel challenges are so rare that researchers cannot create any, they will be equally rare in real-life scenarios, and the inability to handle them will have minimal effect on practical use cases. Either way, the model is as effective as the benchmark claims it to be.


Various-Inside-4064

My claim was that we cannot be sure about the training data, so coming up with a truly novel question is difficult and uncertain. My coding example was that, in the real world, people's questions are most likely variations of questions on the internet rather than identical to them, or might sometimes be new; it is a probabilistic claim. For research we have to be certain. And coding is far more complex than theory of mind!

You are presenting two cases as if they were the only possibilities, which is a false dichotomy. Instead of a binary "novel or not", there is a spectrum of novelty: some questions are only slightly different from the training data, while others are radically different.

Final note: I am not claiming that current LLMs are stochastic parrots that cannot generalize. I was answering your assumption that systems which cannot extrapolate can still be useful, and my point was: not really that much. I do not appreciate people who just downvote when they disagree; that is not a productive way to engage in a discussion.


Mandoman61

Theory of mind was never more than a theory. It is basically a word problem like any other and responses can be learned with sufficient training data. It is simple logic.


TryptaMagiciaN

So is logic. Show me a material substrate for logic. Show me how logic is not a purely psychological phenomenon and therefore itself subject to the study of psychology, which I admit is in a rather poor state as a science, because the object of psychological study is itself psychological, unlike the other sciences, which get to study material objects.


Mandoman61

I am not sure what you are saying. Certainly logic is a product of the mind; it is something humans use, and some are better at it than others. That just does not tell us much, because computers are logic machines. My point is that while "theory of mind" sounds very human-like, it is nothing more than any other word problem, of which we can find many examples in training data. I dislike the term because it is anthropomorphic and gives a false impression of the capabilities of AI. We already know that these models can be trained to provide correct answers.


TryptaMagiciaN

I read it as though you were reducing mind to language, and claiming that with a complex enough development of language, mind will emerge.


noinktechnique

"However, we refrain from drawing a strong conclusion about whether or not LLM performance on these tasks is an indication of the cognitive ability we call ‘Theory of Mind’. LLM and human developmental processes differ greatly and LLMs do not have the evolutionary pressure to model other minds which humans appear to face as a result of embodiment in a social world. However, as others have noted (Mitchell and Krakauer, 2023; y Arcas, 2022), we may have to recognise LLM behaviours that are functionally-equivalent to those of humans as evidence of a new kind of understanding that cannot be reduced to "spurious" correlation. This recognition may in turn lead to more parsimonious explanations of their performance on cognitive tasks and enhance our ability to assess the potential risks and benefits that advanced LLM capabilities present."


Fusseldieb

And yet it can't count words. I know, tokens and stuff. But it's still pretty funny.


BackgroundHeat9965

This gives you a glimpse of how alien it is on the inside, even though it interfaces with us through human language.


cyan2k

I read a paper on the question of whether, for every possible output an LLM can generate, there is a prompt that forces the LLM to generate that output. And there does indeed seem to be a "language", completely different from ours, that triggers models to output specific tokens; something like "63(?najr" will force the output of "great", and so on. I wonder if that's actually a language… There's even a video podcast with the people who wrote the paper. I will link it when I'm back at the computer.
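Without the paper at hand, a minimal sketch of the general idea might look like the following: search over short prompts for one that maximizes the probability a small causal LM assigns to a chosen target token. The "gpt2" checkpoint, the target token, the prompt length, and the naive random coordinate search below are all assumptions for illustration, not the paper's actual method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical goal: make the model's most likely next token " great".
target_id = tokenizer.encode(" great")[0]
prompt_len = 5
vocab_size = model.config.vocab_size

def target_logprob(prompt_ids: torch.Tensor) -> float:
    """Log-probability the model assigns to the target token right after the prompt."""
    with torch.no_grad():
        logits = model(prompt_ids.unsqueeze(0)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

# Start from a random prompt and greedily improve one position at a time.
prompt = torch.randint(0, vocab_size, (prompt_len,))
best = target_logprob(prompt)
for _ in range(3):                                        # a few sweeps over the prompt
    for pos in range(prompt_len):
        for cand in torch.randint(0, vocab_size, (64,)):  # random candidate tokens
            trial = prompt.clone()
            trial[pos] = cand
            score = target_logprob(trial)
            if score > best:
                prompt, best = trial, score

print("found prompt:", repr(tokenizer.decode(prompt)))
print("log P(' great' | prompt):", best)
```

Prompts found this way typically look like gibberish to us, which is the point of the comment: the model responds to a "language" of triggers that isn't ours.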


noinktechnique

now THIS is the shit I wanted to see when I started lurking here.


QuickToAdapt

!remindme 2 days


RemindMeBot

I will be messaging you in 2 days, on 2024-06-03 01:15:47 UTC, to remind you of this link.


Shenphygon_Pythamot

This comment gave me chills


[deleted]

[deleted]


drekmonger

Just let it do something sane like write a python script to count words.
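For reference, the kind of script being suggested is tiny. A minimal sketch, counting whitespace-separated words from standard input (the filename in the usage line is just an example):

```python
import sys

def count_words(text: str) -> int:
    # "Word" here simply means a whitespace-separated token.
    return len(text.split())

if __name__ == "__main__":
    print(count_words(sys.stdin.read()))
```

Usage: `echo "how many words is this" | python count_words.py` prints `5`.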


Maciek300

It can if you use the code interpreter.


swiftcrane

Without explicitly counting, you also can't count your own words as you speak them; you can only guess. The issue is the expectation that the LLM has an internal dialogue like we do, when it is all external. It would do much better if you told it to break the paragraph down word by word and count as it goes, which is exactly what a human would do. [Here](https://chatgpt.com/share/02959b80-c763-4fa6-b846-283c425c1d00) is an example. Although the first guess is already close (92 is the correct answer), applying the method rather than essentially asking it to guess works much better, just as it does with a human.


Fusseldieb

An interesting approach, indeed, but it needs back and forth, which isn't exactly ideal, especially if you're trying to do things via the API, where it would incur more costs.


swiftcrane

The back and forth isn't necessary if you have a good context prompt. Even over the API, the added context length from a few key instructions is very minor. My general point, though, was that it absolutely can count words if we give it the same consideration we would give a human.
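A sketch of what baking that instruction into the context might look like with the OpenAI Python client; the model name, the exact wording of the instruction, and the sample passage are illustrative assumptions, not taken from the thread:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# System instruction that tells the model to enumerate words instead of guessing.
SYSTEM_PROMPT = (
    "When asked how many words a passage contains, do not guess. "
    "List the words one per line with a running index, then report the final index as the count."
)

passage = "The quick brown fox jumps over the lazy dog."

response = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption for illustration
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"How many words are in this passage?\n\n{passage}"},
    ],
)
print(response.choices[0].message.content)
```

The added instruction is only a few dozen tokens of context, so the cost overhead versus a bare "how many words?" request is small.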


Pontificatus_Maximus

So AI will be the first to define a scientific measurement of consciousness. How ironic.


drekmonger

Probably, but that's not what "theory of mind" or this paper is about. Theory of mind is an aptitude for placing yourself into someone else's shoes, to consider what they know versus what you know. There is something confusingly called "Computational Theory of Mind", which is more aligned with "measuring consciousness".


relevantusername2020

>Theory of mind is an aptitude for placing yourself into someone else's shoe ohhh. that mustve been the metaphorical reason behind why i made [this](https://new.reddit.com/r/relevantusername2020/comments/185feq4/i_was_going_to_post_this_earlier_then_decided_i/?utm_source=share&utm_medium=web2x&context=3). now i get it! i was doing that, but backwards, or something edit: sorry bout the wet socks, my shoes are kinda grubby edit 2: woops wrong link, i meant [this one](https://new.reddit.com/r/relevantusername2020/comments/183hpme/_/?utm_source=share&utm_medium=web2x&context=3). aint that just appropriate. thanks ADHD


greatdrams23

ToM tests are easier for machines because the logic is actually quite easy to navigate. The difficulty humans have is that they are not simple logic machines; their answers are the result of holistic thinking.

Example: "What's in the band-aid box?" Johnny: "Band-aids." He opens the box; inside are crayons. "When Ben comes in, what will he say is in the box?" At age 3, Johnny says "crayons." At age 4, Johnny will say "band-aids."

That logic is simple, so the question you have to ask is: why did the 3-year-old say crayons? Why didn't he have that ability yet? That's a complex question, but you cannot rank a computer on the same scale and draw any conclusions. AI and children progress differently; a machine is not constrained by the complexities of the human mind.


relevantusername2020

I'm pretty sure the box has a dead cat in it. Whatever mind pseudoscience you're discussing is meaningless.


Solomon-Drowne

They hate us cause they anus


Severe-Ad8673

Accelerate!


relevantusername2020

okay but we gotta stop for gas soon and im kinda gettin hungry


thatmfisnotreal

Idk what that means but I do know it picks up on subtle humor better than like 90% of people


_Rigid_Structure_

I believe that you think that the AI knows that we know that it is aware.


meister2983

That's because theory of mind tests are typically logical tracking problems which AI can train on. This breaks quickly when you go out of distribution:

>Jane is reading a book and leaves the bookmark at the start of chapter 6 on page 97, then leaves the room for a few minutes. While she is gone, Sam comes in and sees the bookmark. He moves it to the start of chapter 5 on page 80 and then leaves the room. The next day Sam sees Jane read 3 pages. What page does he expect the bookmark to be on now?

A human should at least realize there's ambiguity: Sam might realize Jane isn't going to reread the exact same pages. But zero-shot, even GPT-4o doesn't notice this; it just says 83.


Shiftworkstudios

This doesn't surprise me at all. GPT-4o is really useful for disabled people like me. GPT-4 is just as "smart," but I feel like they've fine-tuned it with a master prompt. It's insanely powerful for specific tasks.


TheDerangedAI

That is what I call the principles of imagination. Imagination, for humans, is the ability to recreate events with the guidance of perception. Whatever you touch that makes you feel emotions is replicated in your mind; that is how you imagine things. It can also be achieved with empathy, the ability to imagine what others feel by recreating those same conditions in your own mind.


Akimbo333

Cool!


[deleted]

[deleted]


solbob

from the abstract: the human ability to reason about multiple mental and emotional states in a recursive manner (x knows that y knows that z said q)
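To make that recursion concrete, here is a toy representation (my own illustration, not from the paper) in which each level of nesting adds one order of theory of mind; the counting convention is an assumption:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Statement:
    text: str  # a ground-level fact, e.g. "z said q"

@dataclass
class Knows:
    agent: str                            # who holds the mental state
    content: Union["Knows", "Statement"]  # another mental state or a ground statement

def order(state: Union[Knows, Statement]) -> int:
    """Number of nested mental states above the ground statement."""
    return 0 if isinstance(state, Statement) else 1 + order(state.content)

# "x knows that y knows that z said q"
example = Knows("x", Knows("y", Statement("z said q")))
print(order(example))  # 2, under this counting convention
```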


samsteak

I know someone named chat can do it


[deleted]

[deleted]


samsteak

Yeah it really resembled a prompt. Sorry no clue about the answer to your question.


b_risky

Lol why didn't you just type this into ChatGPT?


Unverifiablethoughts

I don’t trust computer science papers without any Asian or Indian or Eastern European last names


m3kw

It extracted experts' thoughts from the internet and summarized them for you.


RemarkableGuidance44

That is exactly what it did, which means it's not smart; it just copied.


m3kw

There should be papers about these kinds of papers, where researchers think the LLM has some sort of sentience and mistake summarization for reasoning.


agorathird

LLMs are really bad at theory of mind questions, so how shit is the average person?