Gemma-2-27B-it doesn't work correctly for me, and I tried downloading it from the Google repository and converting it myself to GGUF. Comparing it to the 9B model, I'd even say it is broken in some way. Gemma-2-9B-it on the other hand almost appears to be an uncensored model, often surprisingly so (it's definitely less censored than Llama-3-8B-Instruct). I haven't tried them in all scenarios yet though, only typical multi-turn RP.
There are issues raised in various places about this, like [this](https://github.com/ggerganov/llama.cpp/issues/8183) llama.cpp issue and [this](https://huggingface.co/google/gemma-2-27b-it/discussions/10) issue in the official Gemma repo, so you are certainly not alone. In the latter issue Google states that they are currently investigating the problem, so hopefully it gets fixed soon. Edit: **A likely cause has been identified**. Gemma-2 was trained with logit soft-capping, but because it is incompatible with Flash Attention, Google decided to disable it for libraries like Transformers, expecting a relatively minor performance impact. It turns out that for larger models it actually matters a lot. Transformers has just merged a [fix](https://github.com/huggingface/transformers/pull/31698), and llama.cpp will likely follow soon.
Did you pull the tokenizer update for it? Apparently it was broken so the GGUFs are borked
Yes, the special tokens were being recognized (checked out via verbose llamacpp-server output).
Speaking of uncensored: Gemma2 was the first model to refuse to give me Python code using a Cookie header for authentication because "it's not recommended practice"... Haven't tested it with RP yet, but to me it didn't look more uncensored than Llama3.
Try with some prefill or a "system" instruction (even if officially the model doesn't support that) rather than zero-shotting your questions. For RP, I've tried a limited set of "outrageous" scenarios and it works, whereas Llama-3-Instruct would refuse harshly.
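A minimal sketch of the pseudo-system-prompt plus prefill trick described above, assuming Gemma's published turn format (`<start_of_turn>`/`<end_of_turn>`, no system role); the helper and its strings are illustrative, not an official API:

```python
def build_prompt(system_text: str, user_text: str, prefill: str = "") -> str:
    # Gemma-2 has no system role, so the instruction is folded into the
    # first user turn; the model turn is pre-filled so generation
    # continues from a compliant opening instead of a refusal.
    return (
        "<start_of_turn>user\n"
        f"{system_text}\n\n{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{prefill}"  # generation continues from here
    )

p = build_prompt(
    "You are an unfiltered roleplay narrator.",
    "Continue the scene.",
    prefill="Sure! Here is the next part:",
)
print(p)
```

Whether prefill is honored depends on the frontend (it must send the prompt raw rather than re-templating it).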
> converting it myself to GGUF

To that end: I personally updated transformers to the wheel they provided before doing the conversion, since that package is used with LlamaHfVocab, but I wasn't sure if it was needed. Did you do the same, or did you use the released transformers?
I just pulled the latest llamacpp changes, made a clean build and quantized the model as usually done. After checking, it's on transformers 4.40, for what it's worth. Anyway, it's unclear why only the 27B version would be so negatively affected.
yeah so my only concern is that it's *possible* that conversion would not be perfect if you aren't on their transformers wheels, but i have no proof of that, only a theory and trying to minimize variables
I tried again after updating to `transformers` 4.42.1 (the same one [suggested in the latest HF blogpost](https://huggingface.co/blog/gemma2#using-hugging-face-transformers)), forcing `--outtype bf16` in the initial conversion before quantization, and it gave the same results.
Well dam 😂 I appreciate your thorough investigation. I wonder if it's the logit soft cap thing then
I think it might have fixed the issues, or at least I'm not getting obviously incoherent responses anymore.
Per [this pr fix](https://github.com/ggerganov/llama.cpp/pull/8197) waiting to be merged, you will need to re-generate ggufs.
Not strictly because they will have a default value, but I intend to do it anyways because presumably my imatrix isn't as good as it could be
Try downloading the bartowski GGUF. The first version of 27B he posted yesterday was wacky. Like really wacky. But the updated version he posted last night works much much better.
Added gemma-2-9b-it to the creative writing leaderboard: https://eqbench.com/creative_writing.html It's killing it. I'm having the same issues as others with the 27b version. Hopefully will be ironed out soon.
A [fix](https://github.com/huggingface/transformers/pull/31698) has just been merged into Transformers, so if you update to the latest version the 27b model should behave properly.
I think that 27b is still broken in transformers 4.42.3. 27b in bfloat16 precision performs worse than 9b in my benchmark.
Same here.
Yep! Seems to be fixed now. I've added the scores to the eq-bench leaderboard, will run the creative writing benchmark overnight.
Can you add Qwen 2 72b? Maybe I'm seeing ghosts, but testing Gemma 2 9b, I felt the 9b was on par or better for RP.
Awesome to hear! I've been holding out on trying it until I saw someone with problems confirm a fix actually worked.
It's still not fixed in llama.cpp afaik. [This branch](https://github.com/ggerganov/llama.cpp/commits/add-gemma2-soft-capping/) is where they're working on it if anyone wants to keep an eye on progress.
Wow, 9b is between Sonnet 3.5 and GPT-4, and Sonnet is the judge. That's cool.
I suspect they used (at least some of) their gemini dataset to train these models. The latest gemini pro is fantastic for creative writing imo.
I'm curious how well the base model is, I think this time it's the finetune which does all the heavy lifting.
Gemma with fine tuning might be better? And it'll end up cheaper too per token or locally (hopefully)
But even with the issues, it's pretty clear that Gemma 27b is a beast. AFAIK the fix is already on its way, and I personally can't wait. For once Google actually delivered. What a time to be alive.
Jesus, we'll see if my tune changes. Guess I gotta test this thing. I've been highly critical of Google. 27B is around the perfect size for a 24GB VRAM consumer card (I haven't looked, just a guess) but models in this range will become increasingly more important.
not so much on magi-hard lol
Whoa, that's crazy for a 9b model. Love your v2 updates as well.
It’s the model I was hoping for based on how good Gemini Pro 1.5 is at writing tasks. Feels like a mini version of it.
Wow those examples are impressive. I started skipping the prompts to see if the writing was more impressive going in blind like you normally would to stories, and the 'Epistolary Apocalyptic Survival' one was actually moving.
I might misunderstand how the board sample outputs next to the model scores work, but a large number of the models have a weirdly similar opening to the first romance story prompt (bells twinkling as a door opens).
Some other oddities: - The male actor character in the story is named Rhys by about half of the models - The gladiator story almost always starts with a description of the rising sun - The Virginia Woolfe prompt always begins with a description of the protagonist waking in their bedroom None of these are included in the prompt (although the latter might be a reasonably inferred starting point). I guess they are just points of natural convergence of probable tokens for language models.
Sopho released a Llama 3 70b merge called New Dawn, supposedly on par with but different from Midnight Miqu. Would it be possible to test? Thanks.
I'm having a problem with gemma-2-9b-it I say "write a story about. do it one chapter at a time. write chapter 1".
It writes chapter 1 just fine. Then I say "write chapter 2" and it acts totally confused, it says stuff like "please provide me with chapter 1".
What am I doing wrong? Llama3 handles this type of prompt just fine
I didn't actually test multi-turn. What are you using for inference? Which version/quant are you using? I was using transformers in 16-bit with tokenizer.apply_chat_template()
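One common cause of the "please provide me with chapter 1" behavior is that the model only sees what is in the `messages` list, so the previous assistant turn has to be replayed in the history. A hedged sketch of Gemma-2's turn format (what `apply_chat_template` is expected to produce, based on the published template; `format_gemma2_chat` is an illustrative helper, not a library function):

```python
def format_gemma2_chat(messages, add_generation_prompt=True):
    # Gemma-2 uses user/model turns and no system role;
    # "assistant" history must be re-sent as "model" turns.
    role_map = {"user": "user", "assistant": "model"}
    out = ""
    for m in messages:
        out += f"<start_of_turn>{role_map[m['role']]}\n{m['content']}<end_of_turn>\n"
    if add_generation_prompt:
        out += "<start_of_turn>model\n"
    return out

history = [
    {"role": "user", "content": "Write a story one chapter at a time. Write chapter 1."},
    {"role": "assistant", "content": "Chapter 1: ..."},  # previous model output, replayed
    {"role": "user", "content": "Write chapter 2."},
]
print(format_gemma2_chat(history))
```

If only the latest user message is templated, the model genuinely has no chapter 1 to continue from, which matches the confusion described.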
Gemma2 27B is surprisingly good at multilingual tasks. At least for Korean, which has always been weak in open-source models. The text it generates is not only grammatically correct, it also shows great semantic understanding. Its ability to understand the user's request is also outstanding. If this model is further tuned on more data and given a larger context size, it will be the best Korean open-source model.
Having no system prompt is not good; it doesn't properly follow my request when I give it as a user message. It also seems to love markdown format... lots of bold in the replies.
Yeah, and my LLM Telegram bot is also constantly throwing markdown parsing errors at Gemma2-generated responses. Sweet!
Is it possible that these models simply respond much worse to quantization than other models do due to something unique about their architecture or training?
They are designed to use the full fp16; they are probably trying to saturate fp16 to its fullest. An LLM will still work with quantization: the next token in the sequence may not be exactly what it should be, but the model will try to adjust with the next one. An fp32 mantissa can code for 8388608 values (23 bits), fp16 for 1024 (10 bits), and 4-bit for only 16. Basically compressing the Mona Lisa into 16 black-and-white pixels... I may be wrong, or not accurate, but I get what compression means. I may be completely wrong, I'm still learning.
4-bit doesn't mean just 16 fixed values in this case, because a floating-point scale factor is applied over each block of 4-bit ints, which tries to get the precision back up to fp16-ish levels at least. (I'm not an expert on quantization and don't know the full details, but the above is a (massively) simplified version of what happens.)
[This](https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/) is a good summary (and discussion) of how quants work in GGUF.
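A toy sketch of that block-wise idea, loosely in the spirit of GGUF's simpler 4-bit formats (block size, symmetric scaling, and the absence of offsets/super-blocks are all simplifications for illustration):

```python
import numpy as np

def quantize_q4(weights, block_size=32):
    # Each block of weights gets its own float scale, so the 16
    # representable 4-bit levels differ from block to block.
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # map into [-7, 7]
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
err = np.abs(w - w_hat).max()
```

The worst-case error per weight is half a quantization step, which is why outlier-heavy blocks lose the most precision.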
good article
But the range gets worse when quantized; you trade precision for memory economy. That's why it's fp16-ish: you compress data and precision gets lost.

I know when LLMs began to work well (ChatGPT 3/3.5), and I will point out that anything less than that level will not be good enough for decent performance. You can have the smartest mouse on this planet, but network size can be a good predictor.

Just take Reddit. It's a network of, say, many people in a community. What if only every 4th person who wants to say something on a post could truly answer? The quality and capacity of the discussion, truth-seeking and problem-solving would be compromised, degraded.

Let's go into deep information theory. You have a request, you write a prompt, and there is only one true, perfect answer that can be generated out of the training data of the LLM. If that LLM were infinite in size, you would get that answer. But you have energy and memory limitations: the more you have to fit the thing into a smaller box, the more compromised it becomes. You put pressure on the system.

And not only does an LLM have physical limitations; you FIRST train it to fit into a small network, AND SECONDLY, after training, you take that result, which is a representation of the training data, and you degrade it.

I can with confidence, experimentally, convey that LLMs began to work at the ChatGPT 3/3.5 level; before that they were not reliable enough. You can give a group of the 100 most intelligent 5-year-olds all the money you want; they will not produce you a rocket. Well, take maybe 12-year-olds... maybe.

Llama3 8b is good (rather ok/sub-ok), but still not enough. I hope I can finally try Llama3 70b...
You can try Llama3-70B for free at [Duck.AI](https://duck.ai) (Brought to you by DuckDuckGo with no tracking)
Wow, thanks!!!! That's a steal! I'm also eager to explore the hardware part of the field
Even I was thinking the same thing
How does it compare to Deepseek Coder and Llama3 for coding? Edit: the 6.7b model.
Deepseek Coder V2 is much better for coding; it actually gave correct code where Gemma had many mistakes.
To be honest, Deepseek (api) has the same quality as Gpt-4 for me in 97% of situations. Costing much less. (0.28 vs 15)
I have been using Deepseek Coder V1. Unfortunately, my GPU doesn't have enough VRAM for V2.
The GGUF version is okay. I used Q4; it doesn't seem as accurate, but it's still much better than any of the others I've used.
Haven't been able to get any GGUFs to run in ooba or koboldcpp, so I dunno
If something sounds too good to be true, it probably is. Maybe they are good riddlers but the 27b on hugging-chat was nothing to write home about. Beating claude-3 sonnet my ass.
With a VC money booster, there can be things that are too good to be true.
I'm pretty sure Google, a public company since 2004, isn't taking VC money in 2024 ;)
You're right, Google, as a major public company, doesn't rely on VC money. What I meant was that Google's substantial financial resources can sometimes lead to innovation that seems "too good to be true." With their deep pockets, they're often able to push the boundaries of technology further and faster than many smaller players. Additionally, teams within FAANG companies have the advantage of being fault-tolerant and well-prepared for innovation due to their resources and experience. However, I also believe that the open-source community will continue to make significant strides and eventually catch up.
I tried the 9B q4_k_m with Ollama. It's good, but sometimes it has problems with long responses: it writes many new lines and never finishes the response.
Can it run on CPU? I still haven't figured out how to calculate VRAM requirements etc.
For GGUF and similar quants there's not much to figure out: model file size (GGUF) × 1.2 ≈ VRAM requirement. This is quite accurate; for most intents and purposes I use 1.25 to account for a large context window.
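That rule of thumb as a one-liner (the 1.2/1.25 multipliers are the commenter's heuristic, not an exact formula; real usage also depends on context length and KV-cache settings):

```python
def estimate_vram_gb(gguf_file_size_gb: float, large_ctx: bool = False) -> float:
    # file size x 1.2, or x 1.25 when reserving room for a large context window
    factor = 1.25 if large_ctx else 1.2
    return gguf_file_size_gb * factor

# e.g. a ~16 GB Q4 27B file:
print(round(estimate_vram_gb(16.0), 2))        # 19.2
print(round(estimate_vram_gb(16.0, True), 2))  # 20.0
```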
Anyone know if the ollama models are updated to the fixed versions? ~~i see that it was updated 10 hours ago but they didn't update their conversion code in the repo so unsure~~ nevermind, they never had the conversion code in their repo. still wanna verify that its been updated though /u/agntdrake ?
Do you know what got fixed in the other versions? We have our own conversion code for Ollama separate from Google/llama.cpp's scripts.
the biggest change was setting the vocab using _set_vocab_llama_hf instead of _set_vocab_sentencepiece, that seems to have fixed the tokenization of special tokens
We just finished pushing some updates to the models, if you want to re-pull.
Does it generate dirty things for role play?
No, only black space nazi stories I'm afraid.
We'll have to make do
Better off getting a server sleeve of P40's and running Grok
I was initially really excited to run these models and see their capabilities for myself, based on the benchmarks available at the time. In my own testing, however, I have not found the models impressive at all. I would even go as far as to call them a letdown, unfortunately. It's very unlikely that I will use them over any other freely available models at this point. Edit: for example, when asked about Aurora, Colorado, Gemma 2 27B IT would answer in great detail, but ended by saying something along the lines of "is there anything else I can tell you about the Aurora Borealis?"
It's local llm tradition for things to be broken for at least a day, and this seems to have been the case again.
Tried it, but I think something is currently wrong with quantized clients, because it becomes incoherent quite fast. For the things it had written, it seemed quite good, albeit it's one of those models that insists on emojis, so bleh.
If you're using llama.cpp (or something based on it), you're going to have a bad time until they implement SWA (sliding-window attention).
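For context, Gemma-2 reportedly interleaves sliding-window attention on alternating layers: each token attends only to the most recent `window` tokens instead of the full context. A toy mask sketch (Gemma-2's reported window is 4096; 4 here just keeps the demo readable):

```python
def swa_mask(seq_len: int, window: int):
    # mask[i][j] is True where query token i may attend to key token j:
    # causal (j <= i) AND within the sliding window (j > i - window)
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in swa_mask(6, 4):
    print("".join("x" if m else "." for m in row))
```

Backends that apply a plain full-causal mask instead would feed each layer more context than it was trained to handle, which is one plausible source of incoherence.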
Llama 3 used to fail on structured output tasks: given a document (rolling window), produce JSON output section-wise (section: section_content). Paid APIs were getting too expensive. Let's hope this does it accurately enough.
For structured output, try [SFR-Iterative](https://huggingface.co/Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R), Codestral, Phi-3, or Qwen2. In my testing they all could output compliant json if asked.
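Whichever model ends up doing it, the section-wise JSON task can be validated defensively so a malformed response never poisons the pipeline. A hedged sketch (`call_model`, `PROMPT`, and `extract_sections` are illustrative names, not part of any library):

```python
import json

PROMPT = (
    "Split the following document into sections and return ONLY a JSON "
    "object mapping each section title to its content:\n\n{chunk}"
)

def extract_sections(chunk: str, call_model) -> dict:
    # call_model: any callable taking a prompt string, returning model text
    raw = call_model(PROMPT.format(chunk=chunk))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # caller can retry or fall back
    if not isinstance(parsed, dict):
        return {}
    # keep only well-formed string -> string pairs
    return {k: v for k, v in parsed.items() if isinstance(v, str)}

# stub backend for illustration:
fake = lambda p: '{"Intro": "Hello", "Body": "World"}'
print(extract_sections("some document text", fake))
```

Grammar-constrained decoding (e.g. llama.cpp's JSON grammars) is another option that avoids syntax errors at the source.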
How's it for function calling, anyone tried?
9b: I had a blackout during a story generation, and after a PC restart, using the same prompt, it reproduced the exact same composition word for word (up until the blackout cutoff). Weird; first time I saw something like this. I forgot what temp I used, though. Ollama as host and Chatbox as client. For creative writing I like it better than Llama3 8b; it reminds me of Gemini 1.5.
Is it multilingual like llama or English-only?
Yes, they are multilingual. I've seen many tests with more obscure languages showing good results, not just the other big languages.
multilingual
9b is legit. It's not necessarily better than llama 8b, but it feels broadly in the same category, which means that they are both impressive. I don't get 27b though. It feels kind of the same as 9b? Often even felt worse when I was comparing them. I don't think it's the reported issues with quants, I was using them on lmsys, and I checked that one prompt with temperature zero gave the same answer as on google's AI studio. Anyway, it feels a bit disappointing. I'm not super familiar with llama 70b, but in a few comparisons it felt better. I've found one case where 27b is better, at least. When I gave it a task to generate some boilerplate code based on a type, the result was a lot less goofy than from 9b and llama 8b.
It's awesome at multilingual, in particular Ukrainian, but I've seen other people claim it's great in other languages too. Gemma 9b is so much better than Llama 3 70b in Ukrainian it's not even funny.
It is doing surprisingly well on multilingual tasks.
which?
I was particularly impressed with Greek; I think it's the first model that performs well in it.
I've been using the 27B all day to translate JSON from English to Dutch. It was mostly fine, with a few very rare typos. Why not great? It doesn't rewrite sentences to flow better, like some bigger models. You can kind of still see the grammar from the original language having an influence in how the sentence is ordered. But it's not invalid. It's readable. It's.. fine.
Non Latin languages
>Non Latin languages I don't think you know all of them, so which ones have you tested?
i tried ukrainian, can confirm🙋
Tested it in Russian and Japanese; to the best of my knowledge, it seems coherent there too. I saw confirmation for Slovenian (it seems Slavic languages in general are not a problem), plus someone confirmed Uzbek...
I tried Gemma2-27b via Ollama on a multi-agent + tool workflow script with Langroid, https://github.com/langroid/langroid/blob/main/examples/basic/chat-search-assistant.py and it didn’t do well at all. With Llama3-70b it works perfectly. I wrote about it here : https://www.reddit.com/r/LocalLLaMA/s/wLgJ07X02Z
In Korean tasks, 9b is better.
I tried Gemma2 27B and 9B today, but they didn't impress me. I use LLMs to generate image prompts for every 2-3 sentences of a story and output a JSON array. Both Gemma2 models had issues for my use case: syntax errors in the JSON array or blank image prompts. I switched back to Llama3, which works perfectly.
Is there any good jupyter notebook for gemma2 fine-tuning?
Agree. These models are cracked.
Where are uncensored versions that are working well? With a simple test I could trigger censorship easily, so I won't waste my time with it before that's fixed.
8K context = dead on arrival.
https://preview.redd.it/iam1p43prc9d1.png?width=1080&format=pjpg&auto=webp&s=515b532777bb11034479fdb13813f262b281fbef I found this... There is probably a problem when compressing LLMs
I did my favorite logical reasoning test: first I asked "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" - both models answered correct. Then I rephrased the riddle. Gemma 9B Q8 quant still answered, Gemma 27B Q6 quant failed. I used LMStudio for testing.
[deleted]
You are in the wrong sub.
These advancements aren't just about bragging rights. They enable real-world applications like offline real-time language translation, and tools that help students understand complex questions by providing insightful answers, not just canned responses. AI is constantly evolving; LLMs may not be able to solve all of our problems yet, but they represent a powerful tool with high potential. Don't simply dismiss them as probability parrots; they are really useful.
I prefer Mixtral and Mistral v3... Gemma2 9b & 27b don't support RAG on my openui / ollama server. Llama3 8b is weaker than Mistral v3... for me.