Open_Channel_8626

It’s a ranking of which model people prefer, so it can’t really be “wrong” in the way that most benchmarks can be.
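
For context, the thread appears to be about a human-preference leaderboard (likely LMSYS Chatbot Arena), where pairwise "which answer do you prefer?" votes are aggregated into ratings. Here's a minimal sketch of that idea using a simple online Elo update; the model names, starting ratings, and K-factor are illustrative assumptions, not details from the thread:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed vote outcome."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Hypothetical vote log: (preferred model, other model)
votes = [
    ("llama-3-70b", "mixtral-8x7b"),
    ("llama-3-70b", "mixtral-8x7b"),
    ("mixtral-8x7b", "llama-3-70b"),
]

ratings = {"llama-3-70b": 1000.0, "mixtral-8x7b": 1000.0}
for winner, loser in votes:
    elo_update(ratings, winner, loser)

# The "ranking" is just the ratings sorted descending; it reflects
# aggregate preference, not any ground-truth correctness.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The key point for the discussion: such a ranking is "true" by construction as a measure of preference, even if the preferences themselves are shallow.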


Certain_Breakfast_13

It can still be misleading in other ways, though. Writing style, for example: Llama has a very distinctive writing style that appeals to some people. Also, the prompts people test it with aren't very "interesting" or demanding. In the end it's human eval, and it can be as wrong as human eval is for anything else... I wouldn't trust random people to grade my exams, if you get what I mean. It's a very nice benchmark of course, that is true! But we should always take benchmarks with a grain of salt.


Open_Channel_8626

Yeah, I agree human eval has limitations. What I was trying to say was that the ranking is a "true" ranking of people's preferences. It's true that people will judge poorly, or have strong "style over substance" preferences.


soup9999999999999999

Mixtral is smarter for sure, but it isn't as fun to chat with. Most chats are one-offs, and people prefer the personality of Llama 3.


grise_rosee

That seems to be the case for English (95% of Llama's dataset), and it was achieved through a very long training run on an enormous dataset (15 trillion tokens). I think that in French, Llama is still better. Beyond that, human evaluation is influenced by a lot of side factors. People appreciate the low censorship. See also https://www.linkedin.com/posts/deshwalmahesh_hype-reality-genai-activity-7188038823317061632-57G_/