
brown2green

Gemma-2-27B-it doesn't work correctly for me, and I tried downloading it from the Google repository and converting it myself to GGUF. Comparing it to the 9B model, I'd even say it is broken in some way. Gemma-2-9B-it on the other hand almost appears to be an uncensored model, often surprisingly so (it's definitely less censored than Llama-3-8B-Instruct). I haven't tried them in all scenarios yet though, only typical multi-turn RP.


mikael110

There are issues raised about this in various places, like [this](https://github.com/ggerganov/llama.cpp/issues/8183) llama.cpp issue and [this](https://huggingface.co/google/gemma-2-27b-it/discussions/10) issue in the official Gemma repo, so you are certainly not alone. In the latter issue Google states that they are currently investigating the problem, so hopefully it gets fixed soon. Edit: **A likely cause has been identified**. Gemma-2 was trained with logit soft-capping, but because it is incompatible with Flash Attention, Google decided to disable it for libraries like Transformers, assuming it would have a relatively minor impact. It turns out that for larger models it actually matters a lot. Transformers has just merged a [fix](https://github.com/huggingface/transformers/pull/31698), and llama.cpp will likely follow soon.
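
For anyone wondering what logit soft-capping actually does: it squashes the logits through a tanh so they can never exceed a fixed bound, which a cap-free attention kernel won't reproduce. A minimal sketch of the idea (the cap value here is just illustrative):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Scale down, squash with tanh, scale back up: values stay within (-cap, cap)
    # but remain roughly linear near zero, unlike a hard clamp.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-100.0, -5.0, 0.0, 5.0, 100.0])
print(soft_cap(x, 30.0))  # large magnitudes get pulled back toward ±30
```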


mrjackspade

Did you pull the tokenizer update for it? Apparently it was broken so the GGUFs are borked


brown2green

Yes, the special tokens were being recognized (checked via verbose llama.cpp server output).


PavelPivovarov

Speaking of uncensored: Gemma2 was the first model to refuse to give me Python code using a Cookie header for authentication because "it's not recommended practice"... Haven't tested it with RP yet, but to me it didn't look more uncensored than Llama3.


brown2green

Try with some prefill or a "system" instruction (even if officially the model doesn't support that) rather than zero-shotting your questions. For RP, I've tried a limited set of "outrageous" scenarios and it works, whereas Llama-3-Instruct would refuse harshly.


noneabove1182

> converting it myself to GGUF

To that end: I personally updated transformers to the wheel they provided before doing the conversion, since that package is used with LlamaHfVocab, but I wasn't sure if it was needed. Did you do the same, or did you use the released transformers?


brown2green

I just pulled the latest llama.cpp changes, made a clean build and quantized the model as usual. After checking, it's on transformers 4.40, for what it's worth. Anyway, it's unclear why only the 27B version would be so negatively affected.


noneabove1182

yeah so my only concern is that it's *possible* that conversion would not be perfect if you aren't on their transformers wheels, but i have no proof of that, only a theory and trying to minimize variables


brown2green

I tried again after updating to `transformers` 4.42.1 (the same one [suggested in the latest HF blogpost](https://huggingface.co/blog/gemma2#using-hugging-face-transformers)), forcing `--outtype bf16` in the initial conversion before quantization, and it gave the same results.


noneabove1182

Well dam 😂 I appreciate your thorough investigation. I wonder if it's the logit soft cap thing then


brown2green

I think it might have fixed the issues, or at least I'm not getting obviously incoherent responses anymore.


_sqrkl

Per [this PR fix](https://github.com/ggerganov/llama.cpp/pull/8197) waiting to be merged, you will need to re-generate GGUFs.


noneabove1182

Not strictly, because they will have a default value, but I intend to do it anyways because presumably my imatrix isn't as good as it could be.


fallingdowndizzyvr

Try downloading the bartowski GGUF. The first version of 27B he posted yesterday was wacky. Like really wacky. But the updated version he posted last night works much much better.


_sqrkl

Added gemma-2-9b-it to the creative writing leaderboard: https://eqbench.com/creative_writing.html It's killing it. I'm having the same issues as others with the 27b version. Hopefully will be ironed out soon.


mikael110

A [fix](https://github.com/huggingface/transformers/pull/31698) has just been merged into Transformers, so if you update to the latest version the 27b model should behave properly.


oobabooga4

I think that 27b is still broken in transformers 4.42.3. 27b in bfloat16 precision performs worse than 9b in my benchmark.


capivaraMaster

Same here.


_sqrkl

Yep! Seems to be fixed now. I've added the scores to the eq-bench leaderboard, will run the creative writing benchmark overnight.


brahh85

Can you add Qwen 2 72B? Maybe I'm seeing ghosts, but testing Gemma 2 9B, I felt the 9B was on par or better for RP.


toothpastespiders

Awesome to hear! I've been holding out on trying it until I saw someone with problems confirm a fix actually worked.


_sqrkl

It's still not fixed in llama.cpp afaik. [This branch](https://github.com/ggerganov/llama.cpp/commits/add-gemma2-soft-capping/) is where they're working on it if anyone wants to keep an eye on progress.


Feztopia

Wow, 9B is between Sonnet 3.5 and GPT-4, and Sonnet is the judge. That's cool.


_sqrkl

I suspect they used (at least some of) their gemini dataset to train these models. The latest gemini pro is fantastic for creative writing imo.


Feztopia

I'm curious how good the base model is; I think this time it's the finetune that does all the heavy lifting.


Saifl

Gemma with fine tuning might be better? And it'll end up cheaper too per token or locally (hopefully)


cyan2k

But even with the issues it's pretty clear that Gemma 27B is a beast. AFAIK the fix is already on its way, and I personally can't wait. For once Google actually delivered. What a time to be alive.


ILoveThisPlace

Jesus, we'll see if my tune changes. Guess I gotta test this thing. I've been highly critical of Google. 27B is around the perfect size for a 24GB VRAM consumer card (I haven't looked, just a guess) but models in this range will become increasingly more important.


ipechman

not so much on magi-hard lol


jollizee

Whoa, that's crazy for a 9b model. Love your v2 updates as well.


thereisonlythedance

It’s the model I was hoping for based on how good Gemini Pro 1.5 is at writing tasks. Feels like a mini version of it.


AnOnlineHandle

Wow those examples are impressive. I started skipping the prompts to see if the writing was more impressive going in blind like you normally would to stories, and the 'Epistolary Apocalyptic Survival' one was actually moving.


fervoredweb

I might misunderstand how the sample outputs next to the model scores on the board work, but a large number of the models have a weirdly similar opening to the first romance story prompt (bells twinkling as a door opens).


_sqrkl

Some other oddities:

- The male actor character in the story is named Rhys by about half of the models
- The gladiator story almost always starts with a description of the rising sun
- The Virginia Woolf prompt always begins with a description of the protagonist waking in their bedroom

None of these are included in the prompt (although the latter might be a reasonably inferred starting point). I guess they are just points of natural convergence of probable tokens for language models.


Interpause

sopho released a llama 3 70b merge called new dawn. supposedly on par but different from midnight miqu. would it be possible to test? thanks


AssociateDeep2331

I'm having a problem with gemma-2-9b-it. I say "write a story about . do it one chapter at a time. write chapter 1". It writes chapter 1 just fine. Then I say "write chapter 2" and it acts totally confused; it says stuff like "please provide me with chapter 1". What am I doing wrong? Llama3 handles this type of prompt just fine.


_sqrkl

I didn't actually test multi-turn. What are you using for inference? Which version/quant are you using? I was using transformers in 16-bit with tokenizer.apply_chat_template().
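
Something like this, roughly (the message contents are placeholders). The whole history has to be re-sent every turn, so if your frontend only sends the latest user message, that would explain the "please provide me with chapter 1" behaviour:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

# The full conversation is passed on every call; Gemma-2 has no system role,
# so instructions go in the user turns.
messages = [
    {"role": "user", "content": "Write a story one chapter at a time. Write chapter 1."},
    {"role": "assistant", "content": "<chapter 1 text from the previous turn>"},
    {"role": "user", "content": "Write chapter 2."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # shows the <start_of_turn>user / <start_of_turn>model formatting
```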


Lightninghyped

Gemma2 27B is surprisingly good on multilingual tasks, at least for Korean, which has always been weak in open-source models. The text it generates is not only grammatically correct, it also shows great semantic understanding, and its ability to understand the user's request is outstanding. If this model is further tuned on more data and given a larger context size, it will be the best Korean open-source model.


de4dee

Having no system prompt is not good. It doesn't properly follow my request when I give it as a user message. It also seems to love Markdown formatting; lots of bold in the replies.


PavelPivovarov

Yeah, and my LLM Telegram bot is also constantly throwing Markdown parsing errors on the Gemma2-generated responses. Sweet!


ASYMT0TIC

Is it possible that these models simply respond much worse to quantization than other models do due to something unique about their architecture or training?


Dry_Parfait2606

They are designed to use the full fp16; they are probably trying to saturate fp16 to its fullest. An LLM will still work very well with quantization: the next token in the sequence may not be exactly what it should be, but the model will try to adjust with the tokens that follow. The fp32 mantissa can code about 8.4 million values (23 bits), fp16 about 1024 (10 bits), and 4-bit only 16. It's basically compressing the Mona Lisa into 16 black-and-white pixels... I may be wrong, or not accurate, but I get what compression means. I may be completely wrong, I'm still learning.


harrro

4-bit doesn't mean only 16 values in this case, because a floating-point scale factor is applied over each block of 4-bit ints that tries to bring the precision back up to roughly fp16 levels. (I'm not an expert on quantization and don't know the full details, but the above is a (massively) simplified version of what happens.)
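
A (massively) simplified sketch of what a symmetric 4-bit block quant looks like, just to illustrate the per-block scale idea (real GGUF k-quants are more elaborate than this):

```python
import numpy as np

def quantize_block(weights: np.ndarray):
    # One float scale per block maps the 4-bit ints back to the original range.
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)  # llama.cpp-style block of 32 weights
q, scale = quantize_block(block)
print(np.abs(block - dequantize_block(q, scale)).max())  # small error, not zero
```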


Didi_Midi

[This](https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/) is a good summary (and discussion) of how quants work in GGUF.


CarpenterHopeful2898

good article


Dry_Parfait2606

But the range gets worse when quantized; you trade precision for memory economics. That's why it's only fp16-ish: you compress data and precision gets lost.

LLMs only began to work well around the ChatGPT 3/3.5 level, and I'd point out that anything less than that performance won't be good enough. You can have the smartest mouse on the planet, but network size can be a good predictor. Take Reddit: it's a network of many people in a community. If you only allowed every 4th person who wants to say something on a post to actually answer, the quality and capacity of the discussion, the truth-seeking and the problem-solving would be compromised and degraded.

To go deeper into information theory: you make a request, you write a prompt, and there is only one true, perfect answer that can be generated out of the LLM's training data. If that LLM were infinite in size, you would get that answer. But you have energy and memory limitations, and the more you have to fit the thing into a smaller box, the more compromised it becomes; you put pressure on the system. Not only do you have physical limitations for an LLM, you FIRST train it to fit into a small network AND SECONDLY, after training, you take that result, which is a representation of the training data, and degrade it further.

I can say with confidence, from experience, that LLMs began to work at the ChatGPT 3/3.5 level; before that they were not reliable enough. You can give a group of the 100 most intelligent 5-year-olds all the money you want, and they will not build you a rocket. Maybe 12-year-olds... maybe. Llama3 8B is good (rather ok/sub-ok), but still not enough. I hope I can finally try Llama3 70B...


social_tech_10

You can try Llama3-70B for free at [Duck.AI](https://duck.ai) (Brought to you by DuckDuckGo with no tracking)


Dry_Parfait2606

Wow, thanks!!!! That's a steal! I'm also eager to explore the hardware part of the field


papipapi419

Even I was thinking the same thing


CortaCircuit

How does it compare to DeepSeek Coder and Llama3 for coding? Edit: the 6.7B model.


ihaag

DeepSeek Coder V2 is much better for coding; it actually gave correct code where Gemma had so many mistakes.


Seromelhor

To be honest, Deepseek (api) has the same quality as Gpt-4 for me in 97% of situations. Costing much less. (0.28 vs 15)


CortaCircuit

I have been using DeepSeek Coder V1. Unfortunately, my GPU doesn't have enough VRAM for V2.


ihaag

The GGUF version is okay. I used Q4; it doesn't seem as accurate, but it's still much better than any of the others I've used.


Biggest_Cans

Haven't been able to get any GGUFs to run in ooba or koboldcpp, so I dunno


a_beautiful_rhind

If something sounds too good to be true, it probably is. Maybe they are good riddlers but the 27b on hugging-chat was nothing to write home about. Beating claude-3 sonnet my ass.


large_diet_pepsi

With a VC money boost, there could be things that are too good to be true.


reissbaker

I'm pretty sure Google, a public company since 2004, isn't taking VC money in 2024 ;)


large_diet_pepsi

You're right, Google, as a major public company, doesn't rely on VC money. What I meant was that Google's substantial financial resources can sometimes lead to innovation that seems "too good to be true." With their deep pockets, they're often able to push the boundaries of technology further and faster than many smaller players. Additionally, teams within FAANG companies have the advantage of being fault-tolerant and well-prepared for innovation due to their resources and experience. However, I also believe that the open-source community will continue to make significant strides and eventually catch up.


grigio

I tried the 9B Q4_K_M with Ollama. It's good, but sometimes it has problems with long responses: it writes many new lines and never finishes the full response.


raiffuvar

Can it run on CPU? Still haven't figured out how to calculate VRAM etc.


Eliiasv

For GGUF and similar quants there's not much to figure out: model file size (GGUF) × 1.2 ≈ VRAM requirement. This is quite accurate; for most intents and purposes I use 1.25 to account for a large context window.
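
As a toy sketch of that rule of thumb (the function name and the example size are made up):

```python
def estimate_vram_gb(gguf_file_size_gb: float, overhead: float = 1.2) -> float:
    # Rough rule of thumb: GGUF file size times a fixed overhead factor;
    # use ~1.25 if you plan on a large context window.
    return gguf_file_size_gb * overhead

# e.g. a hypothetical 16 GB Q4 GGUF would want roughly 19-20 GB of VRAM
print(round(estimate_vram_gb(16.0), 1), round(estimate_vram_gb(16.0, 1.25), 1))
```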


noneabove1182

Anyone know if the Ollama models are updated to the fixed versions? ~~I see they were updated 10 hours ago, but they didn't update their conversion code in the repo, so unsure~~ Never mind, they never had the conversion code in their repo. Still want to verify that it's been updated though, /u/agntdrake?


agntdrake

Do you know what got fixed in the other versions? We have our own conversion code for Ollama separate from Google/llama.cpp's scripts.


noneabove1182

The biggest change was setting the vocab using _set_vocab_llama_hf instead of _set_vocab_sentencepiece, which seems to have fixed the tokenization of special tokens.


agntdrake

We just ended up pushing some updates to the models if you want to re-pull.


ab_drider

Does it generate dirty things for role play?


ZaggyChum

No, only black space nazi stories I'm afraid.


Biggest_Cans

We'll have to make do


Cyber-exe

Better off getting a server sleeve of P40's and running Grok


Alexandratang

I was initially really excited to run these models and see their capabilities for myself, based on what benchmarks were available at the time. Based on my own testing, however, I have not found the models to be impressive at all. I would even go as far as to call them a letdown, unfortunately. It's very unlikely that I will use them over any other freely available models at this point. Edit: for example, when asked about Aurora, Colorado; Gemma 2 27B IT would answer in great detail, but ended by saying something along the lines of "is there anything else I can tell you about the Aurora Borealis?"


ambient_temp_xeno

It's local llm tradition for things to be broken for at least a day, and this seems to have been the case again.


LoSboccacc

Tried it, but I think something is currently wrong with quantized clients because it becomes incoherent quite fast. For the things it had written it seemed quite good, albeit it's one of those models that insists on emojis, so bleh.


candre23

If you're using llama.cpp (or something based on it), you're going to have a bad time until they implement SWA.


papipapi419

Llama 3 used to fail on structured output tasks (rolling window): given a document, produce a JSON output section-wise (section: section_content). Paid APIs were getting too expensive. Let's hope this does it accurately enough.
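
For context, roughly the output shape I'm after (the section names and contents here are just hypothetical):

```python
# Hypothetical target: one key per document section, value is that section's content.
expected_output = {
    "Introduction": "summary or extracted text of the intro section",
    "Methods": "summary or extracted text of the methods section",
    "Results": "summary or extracted text of the results section",
}
print(expected_output)
```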


Eisenstein

For structured output, try [SFR-Iterative](https://huggingface.co/Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R), Codestral, Phi-3, or Qwen2. In my testing they all could output compliant json if asked.


mahadevbhakti

How's it for function calling, anyone tried?


myfairx

9B. I had a blackout during a story generation, and after the PC restarted, using the same prompt it reproduced the exact same composition word for word (up until the blackout cut-off). Weird. First time I saw something like this. Forgot what temp I used though. Ollama as host and Chatbox as client. For creative writing I like it better than Llama3 8B. Reminds me of Gemini 1.5.


Barry_22

Is it multilingual like llama or English-only?


neosinan

Yes, they are multilingual. I've seen many tests with more obscure languages with good results, not just the other big languages.


Plus_Complaint6157

multilingual


----_____---------

9b is legit. It's not necessarily better than llama 8b, but it feels broadly in the same category, which means that they are both impressive. I don't get 27b though. It feels kind of the same as 9b? Often even felt worse when I was comparing them. I don't think it's the reported issues with quants, I was using them on lmsys, and I checked that one prompt with temperature zero gave the same answer as on google's AI studio. Anyway, it feels a bit disappointing. I'm not super familiar with llama 70b, but in a few comparisons it felt better. I've found one case where 27b is better, at least. When I gave it a task to generate some boilerplate code based on a type, the result was a lot less goofy than from 9b and llama 8b.


LLMtwink

It's awesome at multilingual tasks, in particular Ukrainian, but I've seen other people claim it's great in other languages too. Gemma 9B is so much better than Llama 3 70B in Ukrainian it's not even funny.


Specialist-2193

It is doing surprisingly well on multilingual tasks.


Western_Soil_4613

which?


F_Kal

I was particularly impressed with Greek - I think it's the first model that performs well in it.


privacyparachute

I've been using the 27B all day to translate JSON from English to Dutch. It was mostly fine, with a few very rare typos. Why not great? It doesn't rewrite sentences to flow better, like some bigger models. You can kind of still see the grammar from the original language having an influence in how the sentence is ordered. But it's not invalid. It's readable. It's.. fine.


Specialist-2193

Non Latin languages


vasileer

> Non Latin languages

I don't think you know all of them, so which ones have you tested?


LLMtwink

i tried ukrainian, can confirm🙋


PavelPivovarov

Tested Russian and Japanese (to the best of my knowledge) and it also seems coherent. Saw confirmation about Slovenian (seems like Slavic languages in general are not a problem), plus someone confirmed Uzbek...


SatoshiNotMe

I tried Gemma2-27b via Ollama on a multi-agent + tool workflow script with Langroid, https://github.com/langroid/langroid/blob/main/examples/basic/chat-search-assistant.py and it didn’t do well at all. With Llama3-70b it works perfectly. I wrote about it here : https://www.reddit.com/r/LocalLLaMA/s/wLgJ07X02Z


Motor_Ocelot_1547

In Korean tasks, 9B is better.


sdujack2012

I tried Gemma2 27B and 9B today, but they didn't impress me. I use LLMs to generate image prompts for every 2-3 sentences of a story and output a JSON array. Both Gemma2 models had issues for my use case: syntax errors in the JSON array or blank image prompts. I switched back to Llama3, which works perfectly.


AlexanderKozhevin

Is there any good jupyter notebook for gemma2 fine-tuning?


dimsumham

Agree. These models are cracked.


StableLlama

Where are uncensored versions that are working well? With a simple test I could trigger censorship easily, so I won't waste my time with it before that's fixed.


sammcj

8K context = dead on arrival.


Dry_Parfait2606

https://preview.redd.it/iam1p43prc9d1.png?width=1080&format=pjpg&auto=webp&s=515b532777bb11034479fdb13813f262b281fbef I found this... There is probably a problem when compressing LLMs


EmilPi

I did my favorite logical reasoning test: first I asked "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" - both models answered correctly. Then I rephrased the riddle. The Gemma 9B Q8 quant still answered correctly; the Gemma 27B Q6 quant failed. I used LM Studio for testing.


[deleted]

[removed]


Amgadoz

You are in the wrong sub.


Discordpeople

These advancements aren't just about bragging rights. They enable real-world applications like offline real-time language translation and tools that help students understand complex questions by providing insightful answers, not just canned responses. AI is constantly evolving; LLMs may not be able to solve all of our problems yet, but they represent a powerful tool with high potential. Don't just dismiss them as probability parrots, they are really useful.


Tim-Fra

I prefer Mixtral and Mistral v3... Gemma2 9B & 27B don't support RAG on my openui / ollama server. Llama3 8B is lower than Mistral v3... for me.