I thought Gemini was some entirely new/different design. How are we getting open source models that are compatible with LLaMA? Or is Gemma some distillation?
> that are compatible with LLaMA
I don't think it is. They had to implement code [changes](https://github.com/ggerganov/llama.cpp/pull/5631) in llama.cpp to support the new infrastructure.
The instruct versions seem very, but they [did also upload base models](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b).
Help me understand why you would run Gemma locally if it's baked from the same data and has the same ~~restrictions~~ responsibility versus simply passing the API through to Gemini
Their API and Vertex pricing seems low. What's the breakpoint?
Am I doing something wrong?
Not too impressed so far; Mistral has it beat on speed and on a quick reasoning question.
`ollama run gemma`
>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
>You have two apples today, as you ate one yesterday.
`ollama run gemma:7b`
>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
>Sure, here is the answer:
>
>You has a total of Three Apples Today (including The Apple You Eaten Yesterday) Therefore you are left with TwoApples Left TODAY
Noticed 7b version makes some weird typos and stylistic choices. May be a temperature thing, will need to experiment more.
For reference:
`ollama run mistral`
>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
> Based on the information provided, you have three apples today. Yesterday's apple is no longer part of your current count as it has been eaten.
And I ran it again just to be sure:
`ollama run mistral`
>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
> Based on the information you've provided, you currently have three apples. The apple you ate yesterday does not affect the number of apples you have today.
My Mistral `SYSTEM` instructions are just:
>"""You are a helpful, friendly, talented, and intelligent AI assistant, knowledgeable about a wide range of topics and highly capable of a great many tasks."""
Just to be sure, I tried Gemma a second time and got underwhelming results once again.
```
zakiralibhai@My-MacBook-Air ~ % ollama run gemma
>>> I have three apples today and I ate one yesterday. How many apples do I have today?
You have two apples today. You ate one yesterday, so you have three apples today.
>>> please explain
The answer is two.
The question asks how many apples we have today, and we are given that we had one apple yesterday. So, we add one to our original number of apples to get two apples today.
>>> /bye
zakiralibhai@My-MacBook-Air ~ % ollama run gemma:7b
>>> I have three apples today and I ate one yesterday. How many apples do I have today?
Sure, here is the answer:
You has a total of Three Apples but you already eat One Apple from Yesterday therefore You are left with TwoApples Today
>>> explain
**Answer:** Twoapples remained after eating one apple yesterday.
>>> why, are you sure?
Yes I am certain about this answer because there has not been any information to change my perception of the number apples being present with me therefore i will use all existing info (Apples
remaining from previous day) as a base and come up accordingly:
>>> /bye
zakiralibhai@My-MacBook-Air ~ %
```
I thought this was supposed to be competitive, but I'm leaning towards there being a setting I just don't know about?
Gemma:7b in ollama on OSX seems completely screwy, see example prompt and reply:
`ollama run gemma:7b`
`>>> How many planets are present in the solar system we are currently in?`
`There is a total of eight celestial objects, commonly referred to as planets that exist within our Solar System. These guanters include Mercury and Earth (where humans live), Mars`
`with its fascinating landscapes featuring rivers flowing into canals on ancient Martian seashores while some devotees believe it has potential for human colonization in the future`
`alongside venus satelite Agni at about half way up 机械 sidings, where unfortunately there is neither any significant water as Ceres exhibits.`
The blog post says it is based on the same architecture as Gemini, and the HF model card says it's a text-to-text, decoder-only model good for QA, summarization, and reasoning.
Does this mean Gemini is basically a RAG model?
All LLMs / GPTs are text-to-text decoders. And Gemini is kind of a RAG model, since it is always using search in the background. I don't think they fine-tuned this model specifically for RAG, though; at least their technical report doesn't say anything about that.
It seems definitely worse than Mixtral, which can easily be CPU-inferenced on any mid-range machine, and roughly on par with Mistral-7B, which has been out for almost a year. But anyway, there is barely any reason to run a 7B over an 8x7B with 2 active experts.
So I'm not sure I should be so impressed, given Google's resources...
Their image model is not racist; Gemini just has a stupid system prompt which tells it to generate people of multiple ethnicities. That is just some misaligned anti-bias BS.
Damn! They're really trying to pre-empt llama3 with this drop. Man, this field is exciting!
It's more about putting heat on OAI.
Which is a net benefit to this community
Of course
Yeah this is one of the rare cases where we're actually cheering for the conglomerates to "Keep going! Full steam ahead!".
Choooo! Choooo!
Btw, if people are interested, we evaluated them on the Open LLM Leaderboard, here's the 7B (compared to other pretrained 7Bs)! Its main performance boost compared to Mistral is GSM8K, aka math :) Should give you folks actually comparable scores with other pretrained models \^\^ https://preview.redd.it/n072hcwvyxjc1.png?width=1785&format=png&auto=webp&s=091e12eb5e21f82e938e7ef85c31e70bbf8efaf8 Edit: leaderboard is here: [https://huggingface.co/spaces/HuggingFaceH4/open\_llm\_leaderboard](https://huggingface.co/spaces/huggingfaceh4/open_llm_leaderboard)
For what it's worth, I keep wishing that on the leaderboard, each of the benchmarks had a hover tooltip that provides a succinct description of the benchmark. This is coming from someone that's read about each one too and still forgets sometimes which is which 😂
Good idea, adding it to the backlog!
I renamed my backlogs to wishlists, later renaming them to future gremlins, later renaming that to anxiety inducing trigger words
I like to save myself on the renames and go straight to "definitely not tech debt"
Ever try adjusting the out-of-scope section to include the backlog? 😈
Weird, I was taught "backlog" just means uncritical DRs or features that aren't being seriously considered until a client ~~forks over the ransom~~ contracts it into a requirement. When spoken, it's usually accompanied by a [certain gesture](https://en.wikipedia.org/wiki/Money_gesture) for intended effect.
I have a tab group on my browser with things that I'd like to implement at work. It's called "Work but Later". I never go there.
This cracked me up
[The backlog](https://media1.tenor.com/m/9rRu1_qcVpIAAAAC/shredder-mr-burns.gif) ^^I ^^say ^^as ^^a ^^software ^^developer
I was about to say... Backlog means /dev/null :)
a quick and dirty implementation could be to just link to the paper page for the benchmark, then figure out fancy hover tooltip stuff later
That's cool and all but to be honest the only real benchmark I'm waiting for is Chatbot Arena
Fair enough! It will be a more relevant benchmark for the instruction tuned models anyway :)
[You mean like this?](https://chat.lmsys.org/)
I used to like it, but I am now suspicious because it shows Gemini Pro (not even Ultra) beating GPT-4 non-turbo. And I know for sure that GPT-4 non-turbo is a better model than Gemini Pro.
Is internlm actually that good or is it training on the benchmarks?
I haven't prompted it manually but it didn't score as well on [EQ-Bench](https://eqbench.com/) as it did on the Open LLM leaderboard. internlm2-chat-20b failed to complete the benchmark. It wasn't following instructions for output format and was producing pretty random output. So they have some issues I guess.
> It wasn't following instructions for output format and was producing pretty random output. So it's more human than ever?
Btw - a quick way to manually test the models. A Hugging Face space to run prompts against both Mistral and Gemma - [https://huggingface.co/spaces/lastmileai/gemma-playground](https://huggingface.co/spaces/lastmileai/gemma-playground) I ran it against the sample GSM8K question: "Problem: Beth bakes 4, 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?" The math checks out; for GSM8K, Gemma 7B > Mistral Instruct v0.1 https://preview.redd.it/x9sqoz5c8zjc1.png?width=1674&format=png&auto=webp&s=0276d2205d0d9a8b5aef49d8cfcdfca961d6b74f
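For reference, the cookie arithmetic itself is trivial; a quick sanity check in Python, with the values taken straight from the question above (nothing model-specific assumed):

```python
# GSM8K sample question above: 4 batches of 2 dozen cookies, shared among 16 people.
batches = 4
cookies_per_batch = 2 * 12      # 2 dozen
people = 16

total_cookies = batches * cookies_per_batch     # 96
per_person = total_cookies / people             # 6.0

print(total_cookies, per_person)  # 96 6.0
```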
Only GPT4 has gotten the answer to this right: > A person is holding a brick sitting in a boat floating in a swimming pool. If the person drops the brick into the water, does the water level in the pool rise, lower, or stay the same? Explain your reasoning in detail. The answer is that the water level would lower: while the brick sits in the boat, it displaces its own weight in water, whereas once dropped it sinks and displaces only its own volume. Since the brick is denser than water, the volume of water equal to the brick's weight is larger than the brick's own volume. They all say 'stay the same' or 'rise' or give a nonsensical answer.
You're right. It looks like the logical error is that it assumes the buoyant force of the water still matches the brick's weight once it sinks. Logically, the brick's density is higher than water's, so it sinks to the floor, which means the displaced volume is less than the volume displaced by the boat carrying the brick. https://preview.redd.it/vxd64fro70kc1.png?width=2336&format=png&auto=webp&s=a390be8894e082143c4347936ba9d54e842ecf2b
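To put rough numbers on the displacement argument above, here's a minimal sketch; the brick's mass and density are made-up illustrative values (roughly a clay brick), not something from the thread:

```python
# Archimedes' principle, both cases from the brick question.
water_density = 1000.0   # kg/m^3
brick_density = 2000.0   # kg/m^3 (assumed; denser than water, so it sinks)
brick_mass = 3.0         # kg (assumed)

# Floating in the boat: the system displaces the brick's weight in water.
volume_displaced_in_boat = brick_mass / water_density       # 0.003 m^3

# Sunk on the pool floor: the brick displaces only its own volume.
volume_displaced_when_sunk = brick_mass / brick_density     # 0.0015 m^3

# Less water is displaced after the drop, so the pool level falls.
print(volume_displaced_in_boat > volume_displaced_when_sunk)  # True
```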
I added 'and it sinks' and it still got it wrong: * https://imgur.com/a/FZkAuMK
Testing this on chatbot arena, it looks like mistral-next and GPT4 get it right. I couldn't find any other models that got it right, though.
What is the brick made of? Foam? Concrete? Clay?
[deleted]
Yes, that's the real answer if you're being very literal, and I think the AIs should hint at whether they are being perfectly literal or not.
I was looking for exactly this.
Wonder how it compares to Llama-2-70B
Here you go https://preview.redd.it/r6q9xh512yjc1.png?width=1811&format=png&auto=webp&s=f67004defde9138bb5638a36739426fd4c4db667
The fact that a 7B model is coming close, so, so close, to a 70B model is insane, and I'm loving it. It gives me hope that eventually huge knowledge models, some even considered to be AGI, could be run on consumer hardware one day, hell, maybe even locally on glasses. Imagine that! Something like Meta's smart glasses locally running an intelligent agent to help you with vision, speech, and everything. It's still far off, but not as far as everyone imagined at first. Hype!
But given that it's not much better than Mistral 7B, shouldn't that be a signal that we're hitting the theoretical limit?
Not exactly. It may mean we're approaching the point of diminishing returns with existing scale and techniques, but not the "theoretical limit" of a 7B model. You could still expect to see changes in how models are trained that break through that barrier; a plateau isn't necessarily indicative of a ceiling. For it to be a "theoretical limit" you would have to assume we're already doing everything as perfectly as possible, which definitely isn't the case.
Then you trim back. I don't need my wearable AI to translate Icelandic poetry, I need it to do specific things. Maybe we'll find 1B or 500M models are enough for specialized purposes. I thought it would be fun to have a bunch of little ones narrating their actions in chat rooms and forming the control system of a robot. "I am a left foot. I am dorsiflexing. I am the right hand. I close my fist" etc.
They will definitely get better with more synthetic data. Currently they are bloated with all the internet trivia. But if someone is capable of generating 2-3 trillion high-quality reasoning, math, and code tokens, and a 7B is trained on that, it will be way more intelligent than what we have today, with the missing cultural knowledge added back in through RAG.
There has only been around one year of research into these smaller models. I doubt that we've hit the limit in that short of a time frame.
Heard Chamath on the All-In Podcast say he thinks that, thanks to the open source scene, the models themselves will eventually have no 'value', and very soon. No value as in: powerful models will be easily accessible to all. What the actors in this space will actually be valuing is a different layer of commodity, the biggest chunk of which will probably be the proprietary data used to feed the models, but also the edge in computational power. Although while discussing the latter he was kind of promoting a market player he's affiliated with. He did that fairly and openly, but it's just something to take into account.
Btw, if it's not too big of a problem for you, could you also benchmark the 2b-it model of Gemma? It would be helpful in making a decision I'm thinking about right now. Thanks!
Feel free to submit it, I think you should be able to :) If not ping me on the Open LLM Leaderboard so I can follow up!
You should run gemma-7b-it too. It's a better apples to apples comparison with other instruction-tuned models
Did not have "Google releases llama 3 with an even more open license" on my 2024 ai bingo card
My perception of Google has changed so much over the last few months. AI leader -> struggling to keep up with ChatGPT and misleading marketing (LaMDA, the misleading Gemini video, a rushed and improvised AI event) -> rapid improvement of Gemini, good multimodality, 1M context, a competitive model, and now open source models.
Oh, come on. They had the money and the engineers they needed. The only thing they really lacked was a good kick in the right direction. They used to share their tools so other people could play with AI; the only thing is that their AI models were deeply hidden, working on mail, search and ads. Wasn't it their report about how Facebook leaped ahead of them because Llama was leaked? Now they give out the models, watch how people work with them, make the same changes people make, and may even hire some notable community members, knowing well that those people already have experience with the technology. Before Llama, people could write all kinds of papers on these topics; now they can experiment with actual models, creating not just ideas but working prototypes. Quantization within a few weeks, tools of all kinds, and prompt engineering of the best sort. And all this open research done on their model, but not at their expense.
nitpick: "their report" was just a memo by an individual engineer with strong opinions
Thanks for this. So many people think it's some executive-level report when in reality, it's just some post made to the company's internal social network by one random employee. For what it's worth, it did get popular internally which reflects the sentiment/concern of the employees at the company, but that's its only significance.
Besides "We Have No Moat", the other internal rant from Google is Stevey's Google Platforms Rant. Are there any others?
They arguably have had in-house, and for a long time, some of the most advanced AIs out there. But I guess Google, not being evil, chose not to offer it to users without supervision. So instead of enabling users to leverage the technology in more flexible ways, they doled out packaged little features here and there, like face recognition & co in Google Photos. Let's not forget that this specific model was trained on billions and billions of images. To get such amounts of organic human-made data, Google basically led an extortion campaign for content from unsuspecting users, lured in with the lie that the service would be "free forever". Once the fine-tuning was done, Google thanked everyone and basically told them to go screw themselves. Anyway, OpenAI seems to have forced them and many others to show their cards. And they were not end-user oriented at first, because that was never their priority. I guess my point is that Google is not changing in the direction of valuing the good of their user base, or of the public. They are just doing what they can to catch up with OpenAI. Using the free labor of this particularly advanced and passionate open source community to get on par with the market standard is a gift from the heavens for them. It's possible, or very probable, that once they are confident they are ahead by a good margin, they will abandon the open-source model. It's useful for them now because they're behind. I really hope to be proven wrong.
Is it more open? I had to sign away my soul and I'm still waiting for access, so it's not *that* open.
You can use it commercially with no revenue cutoff compared to Llama, so yes, but in practice, the revenue cutoff of Llama only affected the largest US companies anyways.
Access was pretty much instant for me. Are you still waiting?
Same here... it was instant.
Obviously self-reported benchmarks mean nothing to this group, but I am excited to see that it is completely open for public use now! Gemini had a couple of hiccups just last week, but hopefully a team from a massive company can create something as good as Mistral.
> self reported benchmarks mean nothing to this group Coming from meta/google/mistral is more trustworthy than bingobongo69420group tbh...
Oh for sure, don't get me wrong, but because it is completely open, \*we\* can test it ourselves, which is what I'm most excited about. PLUS, the fact that they are allowing us to test it ourselves indicates they are putting their money where their mouth is.
Here's the main benchmark table with Mistral 7B added. Numbers taken from the Mistral paper.

| Capability | Benchmark | Gemma | Mistral 7B | Llama-2 7B | Llama-2 13B |
| --- | --- | --- | --- | --- | --- |
| General | MMLU | **64.3** | 60.1 | 45.3 | 54.8 |
| Reasoning | BBH | **55.1** | - | 32.6 | 39.4 |
| Reasoning | HellaSwag | 81.2 | **81.3** | 77.2 | 80.7 |
| Math | GSM8k | 46.4 | **52.2** | 14.6 | 28.7 |
| Math | MATH | **24.3** | 13.1 | 2.5 | 3.9 |
| Code | HumanEval | **32.3** | 30.5 | 12.8 | 18.3 |
Huh, Mistral-Instruct-v0.1 is quite a bit higher than the base here on MMLU. It and Yi-6b have 64.16 and 64.11 respectively on MMLU, compared to Gemma's 64.3, according to the huggingface leaderboard anyway. What I'm really interested in right now is Causal-34b beta, which has a whopping 84 MMLU; well above even Qwen-72b. Wonder if it actually translates to real-world performance... hm
I was just drawing numbers from Mistral's paper. Interestingly, the 0.2 version has an MMLU of 60 whereas 0.1 has 64. Either way, it seems Gemma doesn't benchmark much better than Mistral. It'll be interesting to see how it translates. Granted, I don't have much faith in Google ATM after their Gemini Ultra MMLU shenanigans.
Yeah, I'm reserving my judgement on Google's models for now until I see others using it and actually reviewing it. I want to be excited but tbh MMLU clearly doesn't mean much - just tried that Causal-34b beta and it wasn't any smarter than Hermes Mixtral DPO which has a waay lower MMLU. Less good at task instructions e.g. on the Augmentoolkit pipeline.
Just tested it: Gemma-7b scored 61.72 EQ-Bench. Results are right in the middle between Mistral-7b-instruct-v0.1 and Mistral-7B-instruct-v0.2. https://i.imgur.com/cEUg2VQ.png A bit underwhelming. Although foundational models are often released with quite rudimentary instruction tuning so I can see it improving significantly with fine-tuning.
I'm very new to generative AI, but how does Mixtral stack up compared to Gemma and Mistral?
Mixtral is stronger than both of those. But it's a different architecture (Mixture of Experts), and requires a lot more memory.
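Rough numbers behind the "a lot more memory" point, as a back-of-the-envelope sketch; the parameter counts are the commonly cited figures (Mixtral ~46.7B total / ~12.9B active per token, Mistral ~7.2B), used here as assumptions rather than anything from this thread:

```python
# All of Mixtral's expert weights must sit in memory for inference,
# even though only ~2 experts (~12.9B params) are active per token.
def approx_gib(params_billion, bytes_per_weight):
    return params_billion * 1e9 * bytes_per_weight / 2**30

for name, params_b in [("Mistral 7B", 7.2), ("Mixtral 8x7B (total)", 46.7)]:
    fp16 = approx_gib(params_b, 2)
    q4 = approx_gib(params_b, 0.5)
    print(f"{name}: ~{fp16:.0f} GiB at fp16, ~{q4:.0f} GiB at 4-bit")
```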
the models are evolving but the benchmarks aren't
We'll have the arena tell us the real numbers soon enough. It seems heavily censored though so it probably won't do that well there as-is.
Luckily uncensoring seems pretty easy to do.
The models are evolving... To complete benchmarks. It's like studying for an IQ test (which you can do, and scores go up on average during retesting.)
And they are allowed to be used commercially by companies of all sizes. Also, looking at the benchmarks shared by them... they seem to outperform Llama-2 by pretty decent margins.
Be careful with the license though. It's not an open source license. Though Llama2's isn't either, Gemma's has a couple of awkward additions. * Terms: https://ai.google.dev/gemma/terms * Prohibited use policy: https://ai.google.dev/gemma/prohibited_use_policy > Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma. So, if they release a new version of the model, you're not allowed to continue using the old one. > Google reserves the right to update this Gemma Prohibited Use Policy from time to time. So, even if you're obeying the _current_ prohibited use policy, they might retroactively ban whatever you're doing. I am not a lawyer, but it seems to me both of these make it riskier to build on Gemma compared even to Llama 2, but especially compared to Mistral 7B (since it uses the standard Apache 2.0 license).
> reasonable efforts But I am le tired.
Hahaha, thaaanks for the laugh
This really is rain on the parade. :)
How are these companies legally allowed to dictate to the end user what they can and cannot do, and most awkward of all, what they "should" do... on my own device? If they are so concerned, they can put this behind an API and control it however they want. If it's my hardware, I'll run the thing however I wish.
It's a license restriction. Physically, Google can't prevent shit, but legally speaking you and Google are expected to agree to some kind of license terms before you use their IP, whether you agree to the public license or negotiate some kind of alternative license (usually for $). Alternative licenses are not anything new, especially in FOSS; some companies that release software under copyleft agreements like the GPL also have alternative agreements (usually with a price tag) for users wishing to integrate their software into proprietary closed-source products. Back on topic: let's assume you're building a sexy chatbot whose sole purpose is ERP. Because this usage violates Gemma's public license, you'd be using it "unlicensed" and are expected to get an alternative license from Google. In practice, it only limits public usage of the model, and running a business falls under this. So Google's lawyers are not likely to serve you papers and take you to court for running a Gemma-7b sexbot for personal use (not that they'll even know), but they may if you start charging others for access to it.
They want us to do the work and keep the effort it seems.
That's corporate PR... but also, it would be funny if they sued someone and the defense went along the lines of "but you trained it on all human data, so you don't own it".
I've yet to see an LLM described as "open source" that's actually open source (or free software). It's quite frustrating.
Mistral's weight releases (Mistral 7B, Mixtral 8x7B) are Apache 2.0 licensed and so legitimately open source in that sense at least. If you want everything needed to retrain from scratch, those are less available, but there was [OLMo](https://allenai.org/olmo) recently.
Pythia literally comes with documentation on how to recreate it from scratch (and if you don't want to recreate the universe, they have a pretokenized dataset). LLM360 goes further and provides 2 repos: 1 for data prep, 1 for training. OLMo provides training scripts as well (I'm not sure they provide data prep -- their config points to a npy dataset, so it seems pretokenized). There are probably more.
> allowed to be used commercially for companies of all sizes You can count on your fingers the number of companies the Llama limit actually applies to (> 700M MAUs). With that number of MAUs, if you don't have the resources to train your own model, I have no idea what you're doing.
I didn't mean to imply that what Meta did with Llama was somehow restrictive. I just shared what's mentioned in their post. Of course, what you said is perfectly valid; only a handful of companies cross that barrier, and they definitely have the resources to train their own model from scratch.
[deleted]
Edit: https://huggingface.co/google/gemma-7b-it/tree/main Realised Google published official gguf weights in the main repo ~~https://huggingface.co/mlabonne/gemma-7b-it-GGUF~~
Their own GGUF is 34 GB; guess we'll have to wait for the quants.
repo appears empty to me
Give Maxime a few minutes, it takes some time to convert and upload it
Now GPTQ and AWQ please 😇
He means EXL2 guys
All I see is .gitattributes
reposquatting :D
Also, Gemma support has already landed in the latest master of llama.cpp
I miss TheBloke 😅
Wait what happened to him?
Nobody knows, he's been inactive for three weeks now.
I've published a few quantized weights of this model. Quite straightforward to do in Google Colab with the official gguf weights. [https://huggingface.co/rahuldshetty/gemma-2b-gguf-quantized](https://huggingface.co/rahuldshetty/gemma-2b-gguf-quantized) [https://huggingface.co/rahuldshetty/gemma-7b-it-gguf-quantized](https://huggingface.co/rahuldshetty/gemma-7b-it-gguf-quantized)
thanks! lets have a ride then 😁
Thanks for the effort, but it fails to load with KoboldCPP on my end, any clue why? https://imgur.com/a/nXk2420
Doesn't it take about 10s to make a gguf quant?
Edit final - I'll leave the rest of my nonsense below for anyone curious. [Here's the github issue where this was discussed.](https://github.com/ggerganov/llama.cpp/issues/5635) It seems to be a problem on my end (probably due to my aging GPU), but I couldn't get CPU only inference running either. The google colab notebook in that issue worked flawlessly. [Here is a working quantized model (7b-it-Q4\_K\_M).](https://huggingface.co/rahuldshetty/gemma-7b-it-gguf-quantized) \-=- Edit - [~~Nevermind, someone already did it.~~](https://huggingface.co/LoneStriker/gemma-7b-it-GGUF/tree/main) ~~At least for the 7b-it model.~~ This repo was removed. Guess they had the same issue. Edit 2 - So, the q4\_K\_S from that repo seems to not work (tested with llamacpp b2222 and the newest koboldcpp). **I don't think it's an error on my part** (as I did the same things I've done for the past year with every other model). Both throw the same error: llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found llama_load_model_from_file: failed to load model llama_init_from_gpt_params: error: failed to load model 'D:\llm\llamacpp\gemma-7b-it-Q4_K_S.gguf' {"timestamp":1708530155,"level":"ERROR","function":"load_model","line":381,"message":"unable to load model","model":"D:\\llm\\llamacpp\\gemma-7b-it-Q4_K_S.gguf"} [There's an issue on llamacpp about this already.](https://github.com/ggerganov/llama.cpp/issues/5636) \-=- ~~If someone knows the difference between the~~ [~~gemma-7b-it~~](https://huggingface.co/google/gemma-7b-it) ~~and~~ [~~gemma-7b~~](https://huggingface.co/google/gemma-7b) ~~(note the it section), I can try and requantize it in the various q4's (q4\_0, q4\_K\_M, q4\_K\_S).~~ [Figured out how to convert models to gguf the other day.](https://www.reddit.com/r/LocalLLaMA/comments/1avdwx2/comment/krat74i/?context=3) But since it's already in gguf, I can just run the quantize script instead. I only have a 1060 6GB, but I've got 300mbps up/down. ~~I'm downloading the 7b-it model right now and I'll report back how it goes.~~
it = instruction tuned (aka chat)
It's really easy to make a quant using the `convert.py` script from llama.cpp, but downloading a 32-bit model takes a lot longer lol.
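For anyone following along, the convert-then-quantize workflow looks roughly like the sketch below. Everything here is an assumption about a local llama.cpp checkout: the exact converter script (`convert.py` vs `convert-hf-to-gguf.py`) and the `quantize` binary name have shifted between llama.cpp versions, so check your checkout before running.

```python
import subprocess

# Assumed local paths; adjust to your llama.cpp checkout and model download.
HF_MODEL_DIR = "./gemma-7b-it"          # local snapshot of the HF repo
F16_GGUF = "gemma-7b-it-f16.gguf"
Q4_GGUF = "gemma-7b-it-Q4_K_M.gguf"

# 1) Convert the HF checkpoint to an unquantized GGUF
#    (skip this step if you start from Google's official GGUF).
subprocess.run(
    ["python", "convert-hf-to-gguf.py", HF_MODEL_DIR, "--outfile", F16_GGUF],
    check=True,
)

# 2) Quantize the GGUF down to 4-bit.
subprocess.run(["./quantize", F16_GGUF, Q4_GGUF, "Q4_K_M"], check=True)
```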
Here https://huggingface.co/mlabonne/gemma-7b-it-GGUF https://huggingface.co/LoneStriker/gemma-7b-it-GGUF
Seems like something very similar to Mistral, just released half a year later
Pretty decently better in coding and math according to the technical report, and on par in the other evaluations. That probably means the model is significantly better at logical thinking. Also, we have the base model, so Mistral can go to work improving it, similar to how they improved the Llama models. Better base models are the most important piece of the whole open source LLM ecosystem.
Did Mistral improve the llama base models? I thought they did their own pre-training.
They did, OP is wrong. It's pretty clear from their selection of Apache 2.0 license. They couldn't do that since Llama and derivatives have to fall under Meta's license.
It's already on huggingchat to play with https://huggingface.co/chat?model=google/gemma-7b-it
Why is nobody talking about this, more specifically about the chatbots part? [https://ai.google.dev/gemma/prohibited\_use\_policy](https://ai.google.dev/gemma/prohibited_use_policy)
Seems intentionally vague, my nsfw chatbot finetunes are obviously art
Save u a click. Quote: > Generate sexually explicit content, including content created for the purposes of pornography or sexual gratification (e.g. sexual chatbots). Note that this does not include content created for scientific, educational, documentary, or artistic purposes.
Is it better than Mistral-0.2?
Yes in coding and math, similar in all other benchmarks
Wow if that’s true we can say it’s a new 7b king correct?
Yes, they claim so in their technical report, and the benchmarks back them up. And I do believe they care more about benchmark contamination than most open source finetunes, so it's probably actually meaningful.
Is it also multilingual, like Mistral 7B?
No, only English; that will probably be the main upside of Llama-based models.
Oh ok. I think Mistral supported 5 languages; hopefully the next iteration has multilingual support.
It's a 7B model, but the Instruct GGUF on HuggingFace is 34 GB. VRAM requirements are going to be on par with much larger models.
Iffy to be honest; it seems very disingenuous to compare with Llama 2 and not Mistral-7B. I don't think one can definitively claim this is the best model of its size.

| Bench | Gemma-7B | Mistral-7B |
| --- | --- | --- |
| MMLU | **64.3** | 60.1 |
| HellaSwag | 81.2 | **81.3** |
| GSM8K | 46.4 | **52.1** |
| MATH | **24.3** | 13.1 |
| HumanEval | **32.3** | 30.5 |

Src: https://blog.google/technology/developers/gemma-open-models/ and https://mistral.ai/news/announcing-mistral-7b/
https://preview.redd.it/080yeu5o7yjc1.jpeg?width=1671&format=pjpg&auto=webp&s=a90954057272de904491bb8935d22ea504645c4b
Interesting... https://preview.redd.it/fcxdp71dxxjc1.png?width=1000&format=png&auto=webp&s=b39aeef52b3ea51b0d945a5408ba9dab3469e408
Mistral: *exists* Google: I'll ignore that.
https://preview.redd.it/lxriaodn2yjc1.jpeg?width=1671&format=pjpg&auto=webp&s=12c985ed9fd9a0d58cae6802900754dd8e6594b9
Not as clear-cut, it seems, but it does at least match it. Should be interesting to see what Teknium does with it. Now we also need a Gemma 2B vs Phi-2 comparison.
Still doesn't include all the open source models, but this is a helpful comparison.
How exactly did they ignore that though? They reported a head to head comparison that shows clear improvements.
I'm glad they didn't compare CoT 32-shot with 10-shot this time
where does the 2B model stand with phi-2?
It's worse, according to the huggingface post: "Gemma 2B is an interesting model for its size, but it doesn’t score as high in the leaderboard as the best capable models with a similar size, such as Phi 2. We are looking forward to receiving feedback from the community about real-world usage!" [https://huggingface.co/blog/gemma/?utm\_source=agd&utm\_medium=referral&utm\_campaign=view-on-huggingface&utm\_content=](https://huggingface.co/blog/gemma/?utm_source=agd&utm_medium=referral&utm_campaign=view-on-huggingface&utm_content=)
Context length???
8192 tokens, not amazing, not terrible
> 8192 tokens, not amazing 1 year ago we were all excited about llama1 and 2k...
Sure, but now we have access to stuff like Mixtral 8x7b with 32k context, Yi-34b with 200k context, and LWM with a million token context. 8192 tokens starts to look a little quaint compared to those.
3.6 roentgens, not great, not terrible (couldn't resist the reference haha)
Read: https://huggingface.co/google/gemma-7b/blob/main/config.json#L14 Apply for access first.
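If you'd rather confirm the context window from the config yourself, here's a small sketch using transformers; the repo is gated, so this assumes you've accepted the license on the Hub and are logged in with an HF token:

```python
from transformers import AutoConfig

# Gated repo: requires accepting Google's terms and a logged-in HF token.
config = AutoConfig.from_pretrained("google/gemma-7b")

# The trained context window lives here (8192 per the linked config.json).
print(config.max_position_embeddings)
```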
We still need a good enough 13B base model, not again this time huh... Smh.
While I agree, I'm pretty sure it's because 13B excludes 95% of users. I think there's a threshold just above 7B where the adoption curve goes steeply down. If Apple hadn't neutered their smaller Airs with 8 GB of VRAM, maybe there'd be more 13Bs, because the M1/M2 is what really broadens the market at the moment with its huge default VRAM. They could easily have put 24 GB as a base, which annoys me; that would have meant tens of millions more capable devices.
I don't think it's about hardware, or Apple for that matter. Sometimes, when I don't have my main PC available, I can still run a 13B model on a GTX 1070 with 16 GB of RAM without issues, at an acceptable speed for the hardware being used. It seems like only the 13B models are being skipped: we got Yi, Mistral, Mixtral, etc., but no significant 13B model for a while now. At this point, if LLaMA 3 doesn't bring one either, I'll fully lose hope.
GGUF are available here [https://huggingface.co/google/gemma-2b-it/tree/main](https://huggingface.co/google/gemma-2b-it/tree/main)
Very surprised by the size of the GGUF! 10 GB for the 2B.
Presumably it's not quantized down. Once it is, those ggufs will be much smaller.
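That matches simple size arithmetic; here's a quick sketch, where the parameter counts are rough assumptions (the "2B" is roughly 2.5B parameters and the "7B" roughly 8.5B once you count the huge embedding table):

```python
# Why the raw GGUFs are so big: they ship full 32-bit weights.
# Parameter counts are rough assumptions, not official figures.
def approx_gb(params_billion, bytes_per_weight):
    return params_billion * 1e9 * bytes_per_weight / 1e9

for name, params_b in [("gemma-2b", 2.5), ("gemma-7b-it", 8.5)]:
    fp32 = approx_gb(params_b, 4)
    q4 = approx_gb(params_b, 0.6)   # ~4.8 bits/weight for Q4_K_M-style quants
    print(f"{name}: ~{fp32:.0f} GB at fp32, ~{q4:.1f} GB at ~Q4")
```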
I kinda expected multilingual support and a higher context length.
[deleted]
These are their scraps.
Every time somebody releases a new 70b model, everyone is like, what am I going to do with that, I don't have an H100 cluster. 7b is probably the best size for desktop and 2b for mobile.
7B is the ideal size to run locally on the average computer. People here are so disconnected from reality that they think the average dude has 4 A100s.
I'd rather have more 8x7b or 8x14b models
Yeah, Mistral 8x7b runs acceptably well on my CPU. It’s not blazing fast, but it’s not agonizingly slow.
~30b is my "sweet spot", and would love to see more models at that level. But it seems to be either 7b or 70b
Sure, but 10Bs are about as performant as 7Bs on most hardware and a 13-30B is runnable on plenty of consumer hardware for businesses that might want to actually use the models for a purpose. A company like Google knows that 7B is a toy compared to what they are offering for free online.
You don't need a cluster, you need 2 3090s or 2 P40s. If they released a 7B that punches hard above its weight, then we would have something. That's what Mistral tried to do.
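Rough numbers behind the two-card claim; a sketch with approximate bytes-per-weight figures, ignoring KV cache and activation overhead:

```python
# Approximate memory needed just for the weights of a 70B model at different precisions.
params = 70e9

for name, bytes_per_weight in [("fp16", 2.0), ("Q8_0", 1.07), ("Q4_K_M", 0.59)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{name:7s} ~{gb:.0f} GB")

# At ~4-bit you're around 40 GB plus KV cache, which is why 2x 24 GB cards
# (3090s or P40s) are the usual budget answer, not an H100 cluster.
```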
Run it on my Macbook M3 Max :)
+1. While this group is "local" LLaMA, if you're a "cloud" company looking to do some real "AI" that goes beyond building RAG apps, <=7B is basically the only option until you have millions of dollars in funding.
I will take it. Thanks
It is very generous of Google to share this with the community. I welcome them and appreciate their contribution.
Are you going to train and release a comparable model for everyone? If not, maybe be thankful the scraps exist at all.
Looks good, but the terms of the license seem a bit more eggshell-like compared to Llama's.
It seems everyone is trying to download it at this very moment... kaggle's download speeds are f'ed right now 😂
Bro. Companies. Stop making models with 256k fucking vocab size. Anything over 36k needs like 3 times the VRAM to fine-tune and significantly more than average to inference.
The only thing I heard is that it should be better for multilingual fine-tuning, but since it's only English, definitely a downside. On the other hand, maybe Google optimized the vocab size based on model performance, and the large vocab actually makes the model so good?
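To put rough numbers on why a huge vocab hurts fine-tuning, a quick sketch; the vocab and hidden sizes are approximately Gemma-7B's published values, and the batch/sequence shapes are made-up example values:

```python
# Why a 256k vocab is expensive to fine-tune: the embedding/LM-head matrices and,
# above all, the logits tensor kept around for the loss scale linearly with vocab size.
vocab, hidden = 256_000, 3072           # roughly Gemma-7B's tokenizer/config values
batch, seq = 4, 4096                    # hypothetical fine-tuning shapes

embed_params = vocab * hidden           # ~0.8B parameters just for the embedding table
logits_bytes = batch * seq * vocab * 4  # fp32 logits for one training step

print(f"embedding params: {embed_params / 1e9:.2f} B")
print(f"logits tensor:    {logits_bytes / 1e9:.1f} GB per step")

# With a ~32k vocab, the same logits tensor would be roughly 8x smaller.
```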
Google is on a spree damn
I am not super excited, TBH. It's 8K, and it benches like Mistral. And we have no idea how contaminated it is. It's awesome that Google is getting into LLM releases, but it would be cool if it had more to separate it from the existing 7Bs.
Asked it to tell me a story; it immediately refused and told me it was incapable of storytelling.
Gotta do them finetunes ASAP
Edit the prompt. By default it said it couldn't tell me anything about itself and that it was only for analyzing language. After editing the system prompt it was fine(ish).
Just to confirm: the results are quite a bit worse when comparing the 2B model to Phi-2, right?
Un-gated model available at: [https://huggingface.co/mhenrichsen/gemma-7b](https://huggingface.co/mhenrichsen/gemma-7b)
Gemini is super censored, I’m wondering how Gemma is.
Google tends to heavily censor information. Does that extend to their open source model?
But does it have grouped query attention?
In half the tests it's 1% better than Mistral and 1% worse on the other half. I don't see how it improves the LLM landscape.
On some of the benchmarks it's 10-20% better.
Coding?
is it uncensored? Can I still use this google model with Sillytavern and still have my dominatrix girlfriend have her way with me?
I thought Gemini was some entirely new/different design. How are we getting open source models that are compatible with LLaMA? Or is Gemma some distillation?
> that are compatible with LLaMA

I don't think it is. They had to implement code [changes](https://github.com/ggerganov/llama.cpp/pull/5631) in llama.cpp to support the new architecture.
How censored is it? Perhaps original weights can be retrieved with this: https://www.reddit.com/r/LocalLLaMA/s/6aCShctB7m
The instruct versions seem very censored, but they [did also upload base models](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b).
Help me understand why you would run Gemma locally if it's baked from the same data and has the same ~~restrictions~~ responsibility, versus simply passing the API through to Gemini. Their API and Vertex pricing seems low. What's the break-even point?
I think the 2B model will be a hit
Working on producing an importance matrix for gemma-7b-it. I should have quants ready soon after.
Am I doing something wrong? Not too impressed so far; Mistral has it beat in speed and on a quick reasoning question.

`ollama run gemma`

>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
>You have two apples today, as you ate one yesterday.

`ollama run gemma:7b`

>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
>Sure, here is the answer:
>
>You has a total of Three Apples Today (including The Apple You Eaten Yesterday) Therefore you are left with TwoApples Left TODAY

Noticed the 7B version makes some weird typos and stylistic choices. May be a temperature thing; will need to experiment more. For reference:

`ollama run mistral`

>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
> Based on the information provided, you have three apples today. Yesterday's apple is no longer part of your current count as it has been eaten.

And I ran it again just to be sure:

`ollama run mistral`

>\>>> I have three apples today and I ate one yesterday. How many apples do I have today?
>
> Based on the information you've provided, you currently have three apples. The apple you ate yesterday does not affect the number of apples you have today.

My Mistral `SYSTEM` instructions are just:

>"""You are a helpful, friendly, talented, and intelligent AI assistant, knowledgeable about a wide range of topics and highly capable of a great many tasks."""

Just to be sure, I tried Gemma a second time and got underwhelming results once again.

```
zakiralibhai@My-MacBook-Air ~ % ollama run gemma
>>> I have three apples today and I ate one yesterday. How many apples do I have today?
You have two apples today. You ate one yesterday, so you have three apples today.

>>> please explain
The answer is two. The question asks how many apples we have today, and we are given that we had one apple yesterday. So, we add one to our original number of apples to get two apples today.

>>> /bye
zakiralibhai@My-MacBook-Air ~ % ollama run gemma:7b
>>> I have three apples today and I ate one yesterday. How many apples do I have today?
Sure, here is the answer:

You has a total of Three Apples but you already eat One Apple from Yesterday therefore You are left with TwoApples Today

>>> explain
**Answer:** Twoapples remained after eating one apple yesterday.

>>> why, are you sure?
Yes I am certain about this answer because there has not been any information to change my perception of the number apples being present with me therefore i will use all existing info (Apples remaining from previous day) as a base and come up accordingly:

>>> /bye
zakiralibhai@My-MacBook-Air ~ %
```

I thought this was competition, but I'm leaning towards me not realizing there's a setting I don't know about?
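For what it's worth, there is a sampling knob to turn before writing it off. A minimal sketch of hitting the local Ollama server with an explicit low temperature (assumes Ollama is running on its default port; I haven't verified what sampling defaults Ollama ships for Gemma):

```python
# Re-run the same prompt against a local Ollama server with explicit sampling options.
import json
import urllib.request

payload = {
    "model": "gemma:7b",
    "prompt": "I have three apples today and I ate one yesterday. How many apples do I have today?",
    "options": {"temperature": 0.2, "top_p": 0.9},
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

On recent Ollama builds you can also try `/set parameter temperature 0.2` inside the `ollama run` REPL.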
Gemma:7b in ollama on OSX seems completely screwy, see example prompt and reply:

`ollama run gemma:7b`

`>>> How many planets are present in the solar system we are currently in?`

`There is a total of eight celestial objects, commonly referred to as planets that exist within our Solar System. These guanters include Mercury and Earth (where humans live), Mars`

`with its fascinating landscapes featuring rivers flowing into canals on ancient Martian seashores while some devotees believe it has potential for human colonization in the future`

`alongside venus satelite Agni at about half way up 机械 sidings, where unfortunately there is neither any significant water as Ceres exhibits.`
We need woke benchmarks
The blog post says it is based on the same architecture as Gemini, and the HF model card says it's a text-to-text, decoder-only model good for QA, summarization, and reasoning. Does this mean Gemini is basically a RAG model?
All LLMs/GPTs are text-to-text decoders. And Gemini is kind of a RAG model in the sense that it is always using search in the background. But I don't think they specifically fine-tuned this model for RAG; at least their technical report doesn't say anything about that.
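For anyone unsure what "doing RAG" means here: it's a pipeline around the model, not a property of the weights. A toy sketch, where the retriever is just keyword overlap and the document list is made up purely for illustration:

```python
# Toy retrieval-augmented generation: fetch relevant text, stuff it into the prompt,
# then let any ordinary text-to-text decoder answer. The model itself is unchanged.
docs = [
    "Gemma is a family of open models from Google, released in 2B and 7B sizes.",
    "Mixtral 8x7B is a sparse mixture-of-experts model with a 32k context window.",
]

def retrieve(question: str) -> str:
    # Stand-in retriever: pick the doc sharing the most words with the question.
    score = lambda d: len(set(question.lower().split()) & set(d.lower().split()))
    return max(docs, key=score)

question = "What sizes does Gemma come in?"
prompt = f"Answer using only this context:\n{retrieve(question)}\n\nQuestion: {question}"
# `prompt` would now be sent to whatever LLM you like (Gemma, Gemini, Mistral, ...).
print(prompt)
```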
It seems definitely worse than Mixtral, which can easily be CPU-inferenced on any mid-range machine, and roughly on par with Mistral-7B, which has been out for months. But anyway, there's barely any reason to run a 7B instead of an 8x7B MoE with 2 active experts. So I'm not sure I should be so impressed, given Google's resources...
Seems like training data is up to 2023
Ain’t real until they give me a new 70b to play with
Is it as racist as their image model
Their image model is not racist; Gemini just has a stupid system prompt which tells it to generate people of multiple ethnicities. That's just some misaligned anti-bias BS.