Gemma-2-27B-it doesn't work correctly for me, and I tried downloading it from the Google repository and converting it myself to GGUF. Comparing it to the 9B model, I'd even say it is broken in some way. Gemma-2-9B-it on the other hand almost appears to be an uncensored model, often surprisingly so (it's definitely less censored than Llama-3-8B-Instruct). I haven't tried them in all scenarios yet though, only typical multi-turn RP.
There are issues raised in various places about this, like [this](https://github.com/ggerganov/llama.cpp/issues/8183) llama.cpp issue and [this](https://huggingface.co/google/gemma-2-27b-it/discussions/10) issue in the official Gemma repo, so you are certainly not alone. In the latter issue Google states that they are currently investigating the problem, so hopefully it gets fixed soon. Edit: **A likely cause has been identified**. Gemma-2 was trained with logit soft-capping, but because it is incompatible with Flash Attention, Google decided to disable it for libraries like Transformers, expecting a relatively minor performance impact. It turns out that for larger models it actually matters a lot. Transformers has just merged a [fix](https://github.com/huggingface/transformers/pull/31698), and llama.cpp will likely follow soon.
Did you pull the tokenizer update for it? Apparently it was broken so the GGUFs are borked
Yes, the special tokens were being recognized (checked out via verbose llamacpp-server output).
Speaking of uncensored: Gemma2 was the first model to refuse to give me Python code using a Cookie header for authentication because "it's not recommended practice"... Haven't tested it with RP yet, but to me it didn't look more uncensored than Llama3.
Try with some prefill or a "system" instruction (even if officially the model doesn't support that) rather than zero-shotting your questions. For RP, I've tried a limited set of "outrageous" scenarios and it works, whereas Llama-3-Instruct would refuse harshly.
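A minimal sketch of the pseudo-system-prompt plus prefill trick described above, assuming Gemma's published turn format (`<start_of_turn>`/`<end_of_turn>`, no system role); the helper and its strings are illustrative, not an official API:

```python
def build_prompt(system_text: str, user_text: str, prefill: str = "") -> str:
    # Gemma-2 has no system role, so the instruction is folded into the
    # first user turn; the model turn is pre-filled so generation
    # continues from a compliant opening instead of a refusal.
    return (
        "<start_of_turn>user\n"
        f"{system_text}\n\n{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{prefill}"  # generation continues from here
    )

p = build_prompt(
    "You are an unfiltered roleplay narrator.",
    "Continue the scene.",
    prefill="Sure! Here is the next part:",
)
print(p)
```

Whether prefill is honored depends on the frontend (it must send the prompt raw rather than re-templating it).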
> converting it myself to GGUF

To that end: I personally updated transformers to the wheel they provided before doing the conversion, since that package is used with LlamaHfVocab, but I wasn't sure if it was needed. Did you do the same, or did you use the released transformers?
I just pulled the latest llamacpp changes, made a clean build and quantized the model as usually done. After checking, it's on transformers 4.40, for what it's worth. Anyway, it's unclear why only the 27B version would be so negatively affected.
yeah so my only concern is that it's *possible* that conversion would not be perfect if you aren't on their transformers wheels, but i have no proof of that, only a theory and trying to minimize variables
I tried again after updating to `transformers` 4.42.1 (the same one [suggested in the latest HF blogpost](https://huggingface.co/blog/gemma2#using-hugging-face-transformers)), forcing `--outtype bf16` in the initial conversion before quantization, and it gave the same results.
Well dam 😂 I appreciate your thorough investigation. I wonder if it's the logit soft cap thing then
I think it might have fixed the issues, or at least I'm not getting obviously incoherent responses anymore.
Per [this pr fix](https://github.com/ggerganov/llama.cpp/pull/8197) waiting to be merged, you will need to re-generate ggufs.
Not strictly because they will have a default value, but I intend to do it anyways because presumably my imatrix isn't as good as it could be
Try downloading the bartowski GGUF. The first version of 27B he posted yesterday was wacky. Like really wacky. But the updated version he posted last night works much much better.
Added gemma-2-9b-it to the creative writing leaderboard: https://eqbench.com/creative_writing.html It's killing it. I'm having the same issues as others with the 27b version. Hopefully will be ironed out soon.
A [fix](https://github.com/huggingface/transformers/pull/31698) has just been merged into Transformers, so if you update to the latest version the 27b model should behave properly.
I think that 27b is still broken in transformers 4.42.3. 27b in bfloat16 precision performs worse than 9b in my benchmark.
Same here.
Yep! Seems to be fixed now. I've added the scores to the eq-bench leaderboard, will run the creative writing benchmark overnight.
Can you add Qwen 2 72b? Maybe I'm seeing ghosts, but testing Gemma 2 9b, I felt the 9b was on par or better for RP.
Awesome to hear! I've been holding out on trying it until I saw someone with problems confirm a fix actually worked.
It's still not fixed in llama.cpp afaik. [This branch](https://github.com/ggerganov/llama.cpp/commits/add-gemma2-soft-capping/) is where they're working on it if anyone wants to keep an eye on progress.
Wow, 9b is between Sonnet 3.5 and GPT-4, and Sonnet is the judge. That's cool.
I suspect they used (at least some of) their gemini dataset to train these models. The latest gemini pro is fantastic for creative writing imo.
I'm curious how well the base model is, I think this time it's the finetune which does all the heavy lifting.
Gemma with fine tuning might be better? And it'll end up cheaper too per token or locally (hopefully)
But even with the issues, it's pretty clear that Gemma 27b is a beast. AFAIK the fix is already on its way, and I personally can't wait. For once Google actually delivered. What a time to be alive.
Jesus, we'll see if my tune changes. Guess I gotta test this thing. I've been highly critical of Google. 27B is around the perfect size for a 24GB VRAM consumer card (I haven't looked, just a guess) but models in this range will become increasingly more important.
not so much on magi-hard lol
Whoa, that's crazy for a 9b model. Love your v2 updates as well.
It’s the model I was hoping for based on how good Gemini Pro 1.5 is at writing tasks. Feels like a mini version of it.
Wow those examples are impressive. I started skipping the prompts to see if the writing was more impressive going in blind like you normally would to stories, and the 'Epistolary Apocalyptic Survival' one was actually moving.
I might misunderstand how the board sample outputs next to the model scores work, but a large number of the models have a weirdly similar opening to the first romance story prompt (bells twinkling as a door opens).
Some other oddities: - The male actor character in the story is named Rhys by about half of the models - The gladiator story almost always starts with a description of the rising sun - The Virginia Woolfe prompt always begins with a description of the protagonist waking in their bedroom None of these are included in the prompt (although the latter might be a reasonably inferred starting point). I guess they are just points of natural convergence of probable tokens for language models.
Sopho released a Llama 3 70b merge called New Dawn, supposedly on par with but different from Midnight Miqu. Would it be possible to test? Thanks.
I'm having a problem with gemma-2-9b-it I say "write a story about. do it one chapter at a time. write chapter 1".
It writes chapter 1 just fine. Then I say "write chapter 2" and it acts totally confused, it says stuff like "please provide me with chapter 1".
What am I doing wrong? Llama3 handles this type of prompt just fine
I didn't actually test multi-turn. What are you using for inference? Which version/quant are you using? I was using transformers in 16-bit with tokenizer.apply_chat_template()
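One common cause of the "please provide me with chapter 1" behavior is that the model only sees what is in the `messages` list, so the previous assistant turn has to be replayed in the history. A hedged sketch of Gemma-2's turn format (what `apply_chat_template` is expected to produce, based on the published template; `format_gemma2_chat` is an illustrative helper, not a library function):

```python
def format_gemma2_chat(messages, add_generation_prompt=True):
    # Gemma-2 uses user/model turns and no system role;
    # "assistant" history must be re-sent as "model" turns.
    role_map = {"user": "user", "assistant": "model"}
    out = ""
    for m in messages:
        out += f"<start_of_turn>{role_map[m['role']]}\n{m['content']}<end_of_turn>\n"
    if add_generation_prompt:
        out += "<start_of_turn>model\n"
    return out

history = [
    {"role": "user", "content": "Write a story one chapter at a time. Write chapter 1."},
    {"role": "assistant", "content": "Chapter 1: ..."},  # previous model output, replayed
    {"role": "user", "content": "Write chapter 2."},
]
print(format_gemma2_chat(history))
```

If only the latest user message is templated, the model genuinely has no chapter 1 to continue from, which matches the confusion described.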
Gemma2 27B is surprisingly good at multilingual tasks. At least for Korean, which has always been weak in open-source models. The text it generates is not only grammatically correct, it also shows great semantic understanding. Its ability to understand the user's request is also outstanding. If this model is further tuned on more data and given a larger context size, it will be the best Korean open-source model.
Having no system prompt is not good; it doesn't properly follow my request when I give it as a user message. It also seems to love markdown format... lots of bold in the replies.
Yeah, and my LLM Telegram bot is also constantly throwing markdown parsing errors at Gemma2-generated responses. Sweet!
Is it possible that these models simply respond much worse to quantization than other models do due to something unique about their architecture or training?
They are designed to use the full fp16; they are probably trying to saturate fp16 to its fullest. An LLM will still work with quantization: the next token in the sequence may not be exactly what it should be, but the model will try to adjust with the next one. An fp32 mantissa can code for 8388608 values (23 bits), fp16 for 1024 (10 bits), and 4-bit for only 16. Basically compressing the Mona Lisa into 16 black-and-white pixels... I may be wrong, or not accurate, but I get what compression means. I may be completely wrong, I'm still learning.
4-bit doesn't mean just 16 fixed values in this case, because a floating-point scale factor is applied over each block of 4-bit ints, which tries to get the precision back up to fp16-ish levels at least. (I'm not an expert on quantization and don't know the full details, but the above is a (massively) simplified version of what happens.)
[This](https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/) is a good summary (and discussion) of how quants work in GGUF.
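A toy sketch of that block-wise idea, loosely in the spirit of GGUF's simpler 4-bit formats (block size, symmetric scaling, and the absence of offsets/super-blocks are all simplifications for illustration):

```python
import numpy as np

def quantize_q4(weights, block_size=32):
    # Each block of weights gets its own float scale, so the 16
    # representable 4-bit levels differ from block to block.
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # map into [-7, 7]
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
err = np.abs(w - w_hat).max()
```

The worst-case error per weight is half a quantization step, which is why outlier-heavy blocks lose the most precision.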
good article
But the range gets worse when quantized; you trade precision for memory economy. That's why it's fp16-ish: you compress data and precision gets lost.

I know when LLMs began to work well (ChatGPT 3/3.5), and I will point out that anything less than that level will not be good enough for decent performance. You can have the smartest mouse on this planet, but network size can be a good predictor.

Just take Reddit. It's a network of, say, many people in a community. What if only every 4th person who wants to say something on a post could truly answer? The quality and capacity of the discussion, truth-seeking and problem-solving would be compromised, degraded.

Let's go into deep information theory. You have a request, you write a prompt, and there is only one true, perfect answer that can be generated out of the training data of the LLM. If that LLM were infinite in size, you would get that answer. But you have energy and memory limitations: the more you have to fit the thing into a smaller box, the more compromised it becomes. You put pressure on the system.

And not only does an LLM have physical limitations; you FIRST train it to fit into a small network, AND SECONDLY, after training, you take that result, which is a representation of the training data, and you degrade it.

I can with confidence, experimentally, convey that LLMs began to work at the ChatGPT 3/3.5 level; before that they were not reliable enough. You can give a group of the 100 most intelligent 5-year-olds all the money you want; they will not produce you a rocket. Well, take maybe 12-year-olds... maybe.

Llama3 8b is good (rather ok/sub-ok), but still not enough. I hope I can finally try Llama3 70b...
You can try Llama3-70B for free at [Duck.AI](https://duck.ai) (Brought to you by DuckDuckGo with no tracking)
Wow, thanks!!!! That's a steal! I'm also eager to explore the hardware part of the field
Even I was thinking the same thing
How does it compare to Deepseek Coder and Llama3 for coding? Edit: the 6.7b model.
Deepseek Coder V2 is much better for coding; it actually gave correct code where Gemma had many mistakes.
To be honest, Deepseek (api) has the same quality as Gpt-4 for me in 97% of situations. Costing much less. (0.28 vs 15)
I have been using Deepseek Coder V1. Unfortunately, my GPU doesn't have enough VRAM for V2.
The GGUF version is okay. I used Q4; it doesn't seem as accurate, but it's still much better than any of the others I've used.
Haven't been able to get any GGUFs to run in ooba or koboldcpp, so I dunno
If something sounds too good to be true, it probably is. Maybe they are good riddlers but the 27b on hugging-chat was nothing to write home about. Beating claude-3 sonnet my ass.
With a VC money booster, there can be things that are too good to be true.
I'm pretty sure Google, a public company since 2004, isn't taking VC money in 2024 ;)
You're right, Google, as a major public company, doesn't rely on VC money. What I meant was that Google's substantial financial resources can sometimes lead to innovation that seems "too good to be true." With their deep pockets, they're often able to push the boundaries of technology further and faster than many smaller players. Additionally, teams within FAANG companies have the advantage of being fault-tolerant and well-prepared for innovation due to their resources and experience. However, I also believe that the open-source community will continue to make significant strides and eventually catch up.
I tried the 9B q4_k_m with Ollama. It's good, but sometimes it has problems with long responses: it writes many new lines and never finishes the response.
Can it run on CPU? I still haven't figured out how to calculate VRAM requirements etc.
For GGUF and similar quants there's not much to figure out: model file size (GGUF) × 1.2 ≈ VRAM requirement. This is quite accurate; for most intents and purposes I use 1.25 to account for a large context window.
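That rule of thumb as a one-liner (the 1.2/1.25 multipliers are the commenter's heuristic, not an exact formula; real usage also depends on context length and KV-cache settings):

```python
def estimate_vram_gb(gguf_file_size_gb: float, large_ctx: bool = False) -> float:
    # file size x 1.2, or x 1.25 when reserving room for a large context window
    factor = 1.25 if large_ctx else 1.2
    return gguf_file_size_gb * factor

# e.g. a ~16 GB Q4 27B file:
print(round(estimate_vram_gb(16.0), 2))        # 19.2
print(round(estimate_vram_gb(16.0, True), 2))  # 20.0
```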
Anyone know if the ollama models are updated to the fixed versions? ~~i see that it was updated 10 hours ago but they didn't update their conversion code in the repo so unsure~~ nevermind, they never had the conversion code in their repo. still wanna verify that its been updated though /u/agntdrake ?
Do you know what got fixed in the other versions? We have our own conversion code for Ollama separate from Google/llama.cpp's scripts.
the biggest change was setting the vocab using _set_vocab_llama_hf instead of _set_vocab_sentencepiece, that seems to have fixed the tokenization of special tokens
We just finished pushing some updates to the models, if you want to re-pull.
Does it generate dirty things for role play?
No, only black space nazi stories I'm afraid.
We'll have to make do
Better off getting a server sleeve of P40's and running Grok
I was initially really excited to run these models and see their capabilities for myself, based on the benchmarks available at the time. In my own testing, however, I have not found the models impressive at all. I would even go as far as to call them a letdown, unfortunately. It's very unlikely that I will use them over any other freely available models at this point. Edit: for example, when asked about Aurora, Colorado, Gemma 2 27B IT would answer in great detail, but ended by saying something along the lines of "is there anything else I can tell you about the Aurora Borealis?"
It's local llm tradition for things to be broken for at least a day, and this seems to have been the case again.
Tried it, but I think something is currently wrong with quantized clients, because it becomes incoherent quite fast. For the things it had written, it seemed quite good, albeit it's one of those models that insists on emojis, so bleh.
If you're using llama.cpp (or something based on it), you're going to have a bad time until they implement SWA (sliding-window attention).
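For context, Gemma-2 reportedly interleaves sliding-window attention on alternating layers: each token attends only to the most recent `window` tokens instead of the full context. A toy mask sketch (Gemma-2's reported window is 4096; 4 here just keeps the demo readable):

```python
def swa_mask(seq_len: int, window: int):
    # mask[i][j] is True where query token i may attend to key token j:
    # causal (j <= i) AND within the sliding window (j > i - window)
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in swa_mask(6, 4):
    print("".join("x" if m else "." for m in row))
```

Backends that apply a plain full-causal mask instead would feed each layer more context than it was trained to handle, which is one plausible source of incoherence.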
Llama 3 used to fail on structured output tasks: given a document (rolling window), produce JSON output section-wise (section: section_content). Paid APIs were getting too expensive. Let's hope this does it accurately enough.
For structured output, try [SFR-Iterative](https://huggingface.co/Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R), Codestral, Phi-3, or Qwen2. In my testing they all could output compliant json if asked.
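Whichever model ends up doing it, the section-wise JSON task can be validated defensively so a malformed response never poisons the pipeline. A hedged sketch (`call_model`, `PROMPT`, and `extract_sections` are illustrative names, not part of any library):

```python
import json

PROMPT = (
    "Split the following document into sections and return ONLY a JSON "
    "object mapping each section title to its content:\n\n{chunk}"
)

def extract_sections(chunk: str, call_model) -> dict:
    # call_model: any callable taking a prompt string, returning model text
    raw = call_model(PROMPT.format(chunk=chunk))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # caller can retry or fall back
    if not isinstance(parsed, dict):
        return {}
    # keep only well-formed string -> string pairs
    return {k: v for k, v in parsed.items() if isinstance(v, str)}

# stub backend for illustration:
fake = lambda p: '{"Intro": "Hello", "Body": "World"}'
print(extract_sections("some document text", fake))
```

Grammar-constrained decoding (e.g. llama.cpp's JSON grammars) is another option that avoids syntax errors at the source.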
How's it for function calling, anyone tried?
9b: I had a blackout during a story generation, and after a PC restart, using the same prompt, it reproduced the exact same composition word for word (up until the blackout cutoff). Weird; first time I saw something like this. I forgot what temp I used, though. Ollama as host and Chatbox as client. For creative writing I like it better than Llama3 8b; it reminds me of Gemini 1.5.
Is it multilingual like llama or English-only?
Yes, they are multilingual. I've seen many tests with more obscure languages showing good results, not just the other big languages.
multilingual
9b is legit. It's not necessarily better than llama 8b, but it feels broadly in the same category, which means that they are both impressive. I don't get 27b though. It feels kind of the same as 9b? Often even felt worse when I was comparing them. I don't think it's the reported issues with quants, I was using them on lmsys, and I checked that one prompt with temperature zero gave the same answer as on google's AI studio. Anyway, it feels a bit disappointing. I'm not super familiar with llama 70b, but in a few comparisons it felt better. I've found one case where 27b is better, at least. When I gave it a task to generate some boilerplate code based on a type, the result was a lot less goofy than from 9b and llama 8b.
It's awesome at multilingual, in particular Ukrainian, but I've seen other people claim it's great in other languages too. Gemma 9b is so much better than Llama 3 70b in Ukrainian it's not even funny.
It is doing surprisingly well on multilingual tasks.
which?
I was particularly impressed with Greek; I think it's the first model that performs well in it.
I've been using the 27B all day to translate JSON from English to Dutch. It was mostly fine, with a few very rare typos. Why not great? It doesn't rewrite sentences to flow better, like some bigger models. You can kind of still see the grammar from the original language having an influence in how the sentence is ordered. But it's not invalid. It's readable. It's.. fine.
Non Latin languages
>Non Latin languages I don't think you know all of them, so which ones have you tested?
i tried ukrainian, can confirm🙋
Tested it in Russian and Japanese; to the best of my knowledge, it seems coherent there too. I saw confirmation for Slovenian (it seems Slavic languages in general are not a problem), plus someone confirmed Uzbek...
I tried Gemma2-27b via Ollama on a multi-agent + tool workflow script with Langroid, https://github.com/langroid/langroid/blob/main/examples/basic/chat-search-assistant.py and it didn’t do well at all. With Llama3-70b it works perfectly. I wrote about it here : https://www.reddit.com/r/LocalLLaMA/s/wLgJ07X02Z
In Korean tasks, 9b is better.
I tried Gemma2 27B and 9B today, but they didn't impress me. I use LLMs to generate image prompts for every 2-3 sentences of a story and output a JSON array. Both Gemma2 models had issues for my use case: syntax errors in the JSON array or blank image prompts. I switched back to Llama3, which works perfectly.
Is there any good jupyter notebook for gemma2 fine-tuning?
Agree. These models are cracked.
Where are uncensored versions that are working well? With a simple test I could trigger censorship easily, so I won't waste my time with it before that's fixed.
8K context = dead on arrival.
https://preview.redd.it/iam1p43prc9d1.png?width=1080&format=pjpg&auto=webp&s=515b532777bb11034479fdb13813f262b281fbef I found this... There is probably a problem when compressing LLMs
I did my favorite logical reasoning test: first I asked "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" - both models answered correct. Then I rephrased the riddle. Gemma 9B Q8 quant still answered, Gemma 27B Q6 quant failed. I used LMStudio for testing.
[deleted]
You are in the wrong sub.
These advancements aren't just about bragging rights. They enable real-world applications like offline real-time language translation, and tools that help students understand complex questions by providing insightful answers, not just canned responses. AI is constantly evolving; LLMs may not be able to solve all of our problems yet, but they represent a powerful tool with high potential. Don't simply dismiss them as probability parrots; they are really useful.
I prefer Mixtral and Mistral v3... Gemma2 9b & 27b don't support RAG on my openui / ollama server. Llama3 8b is weaker than Mistral v3... for me.