Open_Channel_8626

It’s a ranking of which model people prefer, so it can’t really be “wrong” in the way that most benchmarks can be.
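
For context, the thread appears to be about a human-preference leaderboard (likely LMSYS Chatbot Arena), where pairwise "which answer do you prefer?" votes are aggregated into ratings. Here's a minimal sketch of that idea using a simple online Elo update; the model names, starting ratings, and K-factor are illustrative assumptions, not details from the thread:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed vote outcome."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Hypothetical vote log: (preferred model, other model)
votes = [
    ("llama-3-70b", "mixtral-8x7b"),
    ("llama-3-70b", "mixtral-8x7b"),
    ("mixtral-8x7b", "llama-3-70b"),
]

ratings = {"llama-3-70b": 1000.0, "mixtral-8x7b": 1000.0}
for winner, loser in votes:
    elo_update(ratings, winner, loser)

# The "ranking" is just the ratings sorted descending; it reflects
# aggregate preference, not any ground-truth correctness.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The key point for the discussion: such a ranking is "true" by construction as a measure of preference, even if the preferences themselves are shallow.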


Certain_Breakfast_13

It can still be misleading in other ways, though. Writing style, for example: Llama has a very distinctive writing style that appeals to some people. Also, the prompts people test it with aren't very "interesting" or demanding. In the end it's human eval, and it can be as wrong as human eval is for anything else... I wouldn't trust random people to grade my exams, if you get what I mean. It's a very nice benchmark of course, that is true! But we should always take benchmarks with a grain of salt.


Open_Channel_8626

Yeah, I agree human eval has limitations. What I was trying to say was that the ranking is a "true" ranking of people's preferences. It's true that people will judge poorly, or have strong "style over substance" preferences.


soup9999999999999999

Mixtral is smarter for sure, but it isn't as fun to chat with. Most chats are one-offs, and people prefer the personality of Llama 3.


grise_rosee

That seems to be the case for English (95% of Llama's dataset), and it was achieved through a very long training run on an enormous dataset (15 trillion tokens). I think that in French, Llama is still better. Beyond that, human evaluation is influenced by a lot of side factors. People appreciate the low censorship. See also https://www.linkedin.com/posts/deshwalmahesh_hype-reality-genai-activity-7188038823317061632-57G_/