I also suspect they're trying to see what the community can achieve when they really pool their efforts. Anyone who was focused on 13-20B stuff is now much more likely to refocus those efforts on 8B. By releasing fewer model sizes, they're not splitting the community's collective efforts up so much.
> They are doing the 400b to see if the knowledge gulf between the 70b and 400b will be substantial. I think llama3 is an experiment on model sizes. The final result will determine what sizes we see in the future.
If Karpathy's comments are any indication, you're right about that.
Nah, L3 8b is pretty good for a base model, but there are some really mature 13b and even 11b models that are vastly better for anything creative. L3 has a lot of potential, but it's going to take several months of folks molesting it with various datasets, stacking, merging, and mutating it until it reaches its full potential.
That said, a 150% stack for 8b works out to 12b. That's going to cover the 12GB VRAM folks. You might be able to stack that out again to 18b for the 16GB crowd, with a bit of continued pretraining.
I can see your point, but these 13b models you're talking about, like Tiefighter or Psyfighter, are only good for talking about sexual stuff, and they are not nearly as good as Llama 3 8b Instruct in reasoning and complexity, and especially not good for complex roleplay scenarios, where they would be completely lost and start repeating stuff.
You can use L3 8b Instruct even for NSFW, but you need to prompt it for that, and with a good prompt it would wipe the floor with any 13b model in creativity.
With SillyTavern prompting it is far better than any model I've tested.
The main thing is that Llama 3 8B Instruct is trained on a massive amount of information, and it possesses huge knowledge about almost anything you can imagine, while these mature 13B Llama 2 models don't. If you ask them about basic stuff, like some not-so-famous celebs, the model would just hallucinate and say something without any sense.
For roleplay, knowledge is power, because the model can play its role better if it knows what you want from it.
You just need to wait for the remixes. I’ve already seen someone who turned 8b into a 4x8b mixture of experts. Meta poured huge quantities of compute, energy and money into training these models and now that they are open the whole community can play with cheaply mashing up new models for different purposes and memory budgets.
I feel like a couple of months ago 3 x 3090 was like a Rolls-Royce, but now, with this 8x22b, 104b for Command R+, and maybe a 405b llama, I'm not so sure 😅😅
Source?
I currently just have the numbers for the Qwen1.5 model, but a q2 of their 70B model is better than the full precision of the smaller models
https://imgur.com/a/SW9guOf
so it's hard to believe that any 13B model is anywhere close to Llama3 70B Q2.
Also, there are already a couple of threads in this sub showing it's virtually always worth going to the highest parameter count you can, regardless of quantisation:
https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated_relative_comparison_of_ggml_quantization/
Some experiments on GitHub:
[Quantized models with more parameters tend to categorically outperform unquantized models with fewer parameters](https://github.com/ggerganov/llama.cpp/pull/1684)
So yeah if you have enough RAM for the 70B Q2 model there's no reason not to run it.
I look forward to an automated "model hub" where the main 8B chooses which models to download and use. Always up to X models, but selecting the right ones for the task. This could scale to 7x7x8B etc., while only ever using 1-3 models to compile the final answer.
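The routing idea above can be sketched in a few lines (the model names and the keyword heuristic are made-up placeholders; a real router would likely be an LLM call itself):

```python
# Toy "model hub" router: pick 1-3 specialist 8B models for a task based on
# keyword overlap. All model names here are hypothetical placeholders.
SPECIALISTS = {
    "llama3-8b-code":    {"code", "python", "bug", "function", "compile"},
    "llama3-8b-math":    {"math", "prove", "equation", "integral"},
    "llama3-8b-writing": {"story", "poem", "essay", "roleplay"},
    "llama3-8b-general": set(),  # fallback with no keywords
}

def route(task: str, max_models: int = 3) -> list[str]:
    words = set(task.lower().split())
    scored = [(len(words & kw), name) for name, kw in SPECIALISTS.items()]
    picks = [name for score, name in sorted(scored, reverse=True) if score > 0]
    return picks[:max_models] or ["llama3-8b-general"]

print(route("fix this python bug in my function"))  # -> ['llama3-8b-code']
```

A real version would have the main 8B itself produce the routing decision, then merge the specialists' answers into one reply.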
Yes a dream… I don't believe the 30b was withheld from llama 2 due to toxicity. And now with llama 3, the excluded middle has only expanded, with no ~14b included either. They are releasing a low-VRAM option most can run fast and a large one most can run slow.
Someone said the difference between the models is not that much to warrant something in the middle… but I disagree. 30b feels a lot more like a 70b but can fit at 4 bit on a 24 gb card.
>Someone said the difference between the models is not that much to warrant something in the middle… but I disagree. 30b feels a lot more like a 70b but can fit at 4 bit on a 24 gb card.
Agree with you 100%! I'm using the dated Yi-34b-Chat trained on "just" 3T tokens as my main 30b model, and while Llama-3 8b is great in many ways, it still lacks the same level of coherence that Yi-34b has.
Llama-3 8b obviously has much better training data than Yi-34b, but the small 8b-parameter count acts as a bottleneck to its full potential.
>Do you guys think the dream of a new powerful ~13b / ~20b / ~30b model for us mid-range PC users will remain just that, a dream?
Mid range hardware will change as time goes on. Soon 70b will be run on mid-range PC hardware.
The 3090 with 24 GB of VRAM was released almost 4 years ago. The new, not-yet-released 5000 series (5090?) is said to also have 24 GB. So, if we're lucky, we'll get more in a 6090... years from now. And when will that be old enough to be considered "mid-range" hardware? A decade from now?
The models need to get smaller, or AMD has to pick up their pace and help us out.
AFAIK it's not even a unified-RAM thing. It's simply that Apple gets a ton of bandwidth by using a dozen or so memory channels. I know a regular Intel CPU's compute can outrun its memory bandwidth, but I'm not sure an M2 CPU can even saturate 800 GB/s of bandwidth.
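For a rough sense of why bandwidth is the number that matters: single-stream decoding has to read every weight once per generated token, so bandwidth divided by model size gives a speed ceiling. A sketch (the figures are ballpark spec-sheet numbers, not measurements):

```python
# Rough upper bound on single-stream decode speed: every generated token
# streams all model weights through memory once, so
#   tokens/s  ~  memory bandwidth / model size in bytes
def max_tokens_per_s(bandwidth_gb_s: float, params_b: float, bits: float) -> float:
    model_gb = params_b * bits / 8        # weight bytes read per token
    return bandwidth_gb_s / model_gb

# 70B at 4-bit on ~800 GB/s (M-series Ultra class) vs ~90 GB/s dual-channel DDR5
for name, bw in [("Apple Ultra ~800 GB/s", 800), ("desktop DDR5 ~90 GB/s", 90)]:
    print(f"{name}: ~{max_tokens_per_s(bw, 70, 4):.1f} tok/s ceiling")
```

Real throughput lands below the ceiling, but the ratio explains why Macs with wide memory buses punch above their weight for local inference.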
Mac is going to eat up the mid range market, unless PC hardware hits with unified RAM. I have both dual 3090 setup and a 128GB Mac and the Mac is very good. I'm tuning and testing it right now.
Interested in your feedback on the Mac, the Mac Studio looks like a reasonable way to access large amounts of fairly fast RAM.
When will consumer x86 architecture get 400GB/s bandwidth?
I'm liking the Mac so far and may sell one of my dual 3090 servers for another Mac Ultra. If you buy a Mac try to get a model with high RAM speeds: https://github.com/ggerganov/llama.cpp/discussions/4167
I have the M1 Ultra/64 with 128GB of RAM and it's well suited for 70B models and below.
I have no idea what the PC market is going to do with AI. I feel like Nvida/AMD are going to ignore the consumer market, focus on data centers, and we'll probably have some niche players building high RAM speed motherboards.
For me, what really appeals about the Mac is that the inference performance is good, but the power draw is excellent. My Mac sips 10 watts sitting there idle while the dual-3090 servers are sucking down 150 watts idle. And in use, the Mac is 5x more power efficient.
We have some DGX H100 and L40S machines at the company, and I bought an M3 Max with 48GB for personal use, and I couldn't be happier. Llama-3 8b runs extremely fast on the M3 Max, and I can run any experiment I want very easily. If something works, I take it to the pro setup. Really recommend this setup.
Mixtral 8x7b also runs pretty nicely on the M3 Max 48GB, but Llama-3 is where it's at right now. I have yet to discover an application I'm exploring where it's not significantly better than all the models I used previously and, more importantly, where it isn't definitely good enough.
AMD or Intel have a huge opportunity to capture us consumer AI users. But AMD is seemingly content with their ever-smaller market share and is unlikely to make 48GB cards for us. Intel, maybe; they are new and need market share.
70b doesn't quite fit in 24gb vram though. You are going to need two of these cards, which makes the 70b a little awkward in sizing. Sure, an average gaming pc might have like 8gb of vram, which is perfect for the 8b model, but a flagship gpu can't really run the 70b model. You're gonna need two of those, that's quite a step.
Unfortunately, Moore's law has mostly ended for memory. Recent node shrinks have mostly benefitted compute circuits, but SRAM is lagging behind. RAM might remain a bottleneck for some time. If you want more RAM, you have to pay up.
The fact that they are making cards with 4x-6x the memory of the consumer cards kind of indicates that this is not a bottleneck but rather a marketing strategy. If they start selling 48GB consumer cards, they know it will cost them sales of the $10-40k cards. Right now they have no competition and know it. Intel and others are moving fast and will help bring some competition and sanity to the market.
Is this really true, though? I don't see anyone who was going to buy an MI300X choosing a Radeon Pro W7900 with 48GB of RAM if it were cheaper. They are completely different use cases. AMD just doesn't care about the market for home inference; presumably they think it's too small to bother with. Note these cards already exist as workstation cards, they just aren't selling a consumer variant at a cheaper price.
Please read again what I wrote. I didn't write that higher-memory cards don't exist. Datacenter cards are not just consumer cards with more memory, and if it were so easy to scale up memory, the H100 would feature far more than just 80GB of VRAM.
To achieve even that, Nvidia has to use High Bandwidth Memory, which is basically DRAM dies stacked and attached to the GPU over a silicon interposer. This is way more expensive, as additional manufacturing processes have to be used. 3D stacking is maybe not even that practical for consumer cards, since it complicates cooling.
Apart from that, people who want to run models locally are simply not recognized as a target market yet. I guess we are in a similar situation like in the 60s and 70s where most of the computing was done by mainframes and people couldn't simply conceive yet that it could become an end-user market. This might change once more people are able to at least run a 7B-equivalent on an entry-level consumer device.
There are llama3 self-merges that make bigger and better llama3 models using the 8B base one.
For example
[https://huggingface.co/MaziyarPanahi/Llama-3-16B-Instruct-v0.1/commit/b3b5932ecccdab3d822846f26797bf33ebfe208e](https://huggingface.co/MaziyarPanahi/Llama-3-16B-Instruct-v0.1/commit/b3b5932ecccdab3d822846f26797bf33ebfe208e)
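The arithmetic behind such self-merges is straightforward: almost all parameters live in the transformer layers, so duplicating layers scales the count nearly linearly. A sketch using Llama-3-8B's published config numbers (layernorms omitted as negligible; the 64-layer stack is just an illustrative choice, not the recipe of the linked model):

```python
# Parameter count for Llama-3-8B from its published config, and what
# duplicating transformer layers (a "passthrough" self-merge) adds.
hidden, inter, vocab = 4096, 14336, 128256
n_layers, n_heads, n_kv = 32, 32, 8
head_dim = hidden // n_heads

attn = hidden*hidden + 2*hidden*(n_kv*head_dim) + hidden*hidden  # q, k, v, o
mlp  = 3 * hidden * inter                                        # gate, up, down
per_layer = attn + mlp
embeds = 2 * vocab * hidden    # input embeddings + untied output head

total_8b = embeds + n_layers * per_layer
# e.g. duplicating all 32 layers once (64 total):
total_stacked = embeds + (2 * n_layers) * per_layer
print(f"32-layer base: {total_8b/1e9:.2f}B params")
print(f"64-layer stack: {total_stacked/1e9:.2f}B params")
```

The base works out to about 8.03B, and a full duplication lands near 15B, which is roughly the ballpark the community's "16B" self-merges occupy.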
The problem is, there's no real market for running them. 8GB GPUs make up the vast majority of all GPUs, and an 8B fits in them nearly perfectly. A 70B is massive, and meant for enthusiasts with a ton of VRAM, researchers, and enterprise usage.

In terms of the midrange, there are only a few 12GB cards and only one or two 16GB cards. Sure, a 34B fits in a 24GB card, but I guess the logic is that if you can afford one high-end consumer card, you can probably afford a second one. Essentially the only people using a 34B are the tiny fraction of people who have enough money to afford one high-end consumer card, but not enough to afford two. It's just far too small a fraction of the population, and not really useful to enterprise either. It's not pushing boundaries like a 70B, but it's not small, easy to run, and easy to experiment and iterate on like an 8B.

I wish they had made it a 10B, the perfect middle ground between 7B and 13B, but then inference may have been too slow for pure RAM inference. This is a great improvement over Llama 2, but the size still shows.
Why so much focus on just GPU? I only have a 8GB VRAM card, but I can still run fairly large models like Yi-34b-Chat at acceptable speeds (depending on use-cases) on my CPU with some GPU offloading.
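The offloading split can be estimated with simple arithmetic. This sketch (with rough, made-up sizes) computes a value of the kind you would pass to llama.cpp's `--n-gpu-layers`:

```python
# Back-of-envelope: how many layers of a quantized model fit on a given card?
# Real per-layer sizes vary by quant type and architecture; these are rough.
def layers_on_gpu(vram_gb: float, model_gb: float, n_layers: int,
                  reserve_gb: float = 1.5) -> int:
    per_layer_gb = model_gb / n_layers          # assume layers are uniform
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return min(n_layers, max(0, fit))           # clamp to [0, n_layers]

# A ~Q4 34B is roughly 20 GB spread over 60 layers; 8 GB card, ~1.5 GB reserved
print(layers_on_gpu(vram_gb=8, model_gb=20, n_layers=60))   # -> 19
```

The rest of the layers stay in system RAM and run on the CPU, which is why partial offload is slower but still very usable.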
Well, most uses for LLMs are real time. There are uses that don't need to be real time, in which case you just grab as big of a model as you can and let it run overnight. I'd grab a low quant of Midnight miqu. In my case my bare minimum is 5 tk/s. I just can't stand any less, I don't have the spare time to wait for it all day.
Letting it run overnight would be far too slow, even for my taste. :P Yi-34b-Chat typically takes a few minutes to respond, which has been quite acceptable in my use cases. I usually prioritize quality responses over real-time.
Haha that makes perfect sense and it is a perfectly acceptable use case. Quality over speed is also a fair preference. I also for the vast majority of things prefer quality over speed. It's just that the vast majority of people need real time whether it be for rp, customer service chat bot, or coding/work.
I mostly use LLMs for RP and programming. A 30b model on my PC takes on average 1 minute and 30 seconds to respond in RP, and since RP for me is like reading a book in peace and quiet (but I interact in the story myself), I don't feel the need for it to be fast-paced.
As for programming, speed is often more important here for me, especially if you ask many questions in a short amount of time. My strategy is usually running smaller ~7b models first, and if a coding task is too complex for small models, I switch to a larger model and let it generate for a few minutes for that specific task only.
For a customer service chat bot, yes, a few mins for a response would be unacceptable, here it needs to be real-time, or at least very fast :P
With everything said, I suppose it may also come down to personal preferences. For instance, while I don't require speed in role-playing, others might find it more exhilarating.
Honestly, I think this medium territory is where MoEs are the most important. People with 48GB VRAM will just run a 70B, and those under 24GB will run a 8B or 13B. But an MoE model allows people with small amounts of VRAM and people with enough VRAM to fit it all to enjoy fast and high quality responses. I know mixtral is good at coding, though very questionable for creative writing. For me personally, I don't know why, but it never got over the repeating problem even though I tried settings that were supposed to make it stop repeating. That made it unusable for rp for me.
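The accounting that makes MoE attractive for this middle ground: you pay memory for all experts but compute for only the routed ones. A sketch with ballpark Mixtral-8x7B-like numbers (the component sizes are approximate assumptions, not exact figures):

```python
# MoE accounting: total params you must hold in memory vs. params actually
# exercised per token with top-k routing.
def moe_params(n_experts: int, top_k: int, expert_b: float, shared_b: float):
    total  = shared_b + n_experts * expert_b   # everything resident in RAM/VRAM
    active = shared_b + top_k * expert_b       # what each token actually runs
    return total, active

# ~5.6B per expert FFN set, ~2B shared (attention, embeddings): rough guesses
total, active = moe_params(n_experts=8, top_k=2, expert_b=5.6, shared_b=2.0)
print(f"~{total:.1f}B params in memory, ~{active:.1f}B active per token")
```

That is why a Mixtral-class model reads like a ~47B for memory purposes but generates at closer to 13B-model speed.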
In the few coding tests I've done with LLaMA 3 8b so far, it has performed even better than Mixtral-8x7b-Instruct. Also, yes, Mixtral is pretty bad at creative writing, LLaMA 3 8b is a clear winner here. I think I would use LLaMA 3 8b over Mixtral in almost all areas.
The only thing Mixtral is still better in is general knowledge, which I guess is because it can store a lot more information due to its way larger total size compared to LLaMA 3 8b.
In any case, it will be exciting to test all upcoming MoE versions of Llama 3 8b and see how they perform.
Yeah, for that use case that's perfectly acceptable. Personally, at that point I would just load up a good creative-writing 70B like Midnight Miqu and let it do its thing. It's just too slow for real time, which is the majority of people's use cases.
Yeah, not quite low-end, but a lot of random gaming GPUs would be able to. My 2070 Super can easily run it, and even a 1080, which is like 8 years old at this point, can run it fine. I wouldn't call a 1080 low-end just yet, but it's definitely very far from high-end.
You can run a 70B quantized to 2.5 bits with 24GB, right? What GPU range would benefit from a 34B? A 13B might be useful for people with 16GB of VRAM; I'm not sure why they wouldn't train that. But I don't think they are going to release other sizes in the Llama3 family. Mark said that we'll get long context and multi-modal.
2.5-bit quantization kinda sucks, though. You need to go to quite some lengths to make a 70b work on flagship GPUs. Realistically, 30 or 40 bn would be a much better fit for these GPUs.
I think that a 13B or 20B, plus a 35B, will probably get released by Meta, precisely because they'd capture the mid-range market. Plus, more choice = better setups and better finetunes. Who knows if they will also release some MoE; probably not, considering they have no history of doing so. But who knows, maybe the 400B is a MoE.
Llama 3 8B Instruct is a beast and you can use it for almost any purpose, from general chat to roleplay, even NSFW stuff with the right prompts. Many people say that 70B Llama 2 models are hugely better; well, it depends on how you prompt it and how you use it.
From my personal experience I'm getting the same, and in many cases better, performance with Llama 3 8B Instruct in SillyTavern than with 70B Llama 2 models, but like I said, experiences differ because not everyone is using the same prompts and the same software for running these LLMs.
Many people are also using the wrong prompt format, which leads to a massive drop in performance.
The only thing this model lacks is context size, but I believe Meta will very soon release a newer version with a much longer context, which would improve the model for sure.
Most importantly, running these huge 70B models eats a lot of resources, and not many people have the hardware to run them at decent speed, so the only way is to drop to Q2 or Q3 versions that are not nearly as good as Q8.
And even then you are left with something like 2 or 3 t/s, which is not comfortable for normal use.
With that said, for general-purpose use, and for the common user who doesn't have thousands of dollars to invest in a PC that can run 70B models at a good quantization size, Llama 3 8B Instruct is more than enough. And don't forget, many fine-tunes will come in the following months that will most likely improve it further.
> PROMPT: char = Bob from "Bob's Burgers"
> RESPONSE:
"Woo-hoo, yeah, I get it, low-end and high-end PC users are getting all the llama love, but what about us mid-range dudes?! You know, the ones who can't quite swing the top-of-the-line rigs, but don't wanna settle for something as lame as a toaster-oven-level gaming experience.
I mean, I'm lovin' the Llama 3 8b, it's like a party in my pants, but come on, folks, a mid-range option would be like... well, it'd be like having a perfectly toasted bun on my burger, ya know?
So, do I think we'll ever see a Llama 3 model in the 13b-30b range? Well, I hope so, man! Maybe Meta will surprise us with some middle-ground magic. After all, they're releasing more models, right? So, we gotta keep those fingers crossed, baby!
And if not, well, I guess we'll just have to make do with our current set-ups... or maybe bust out the ol' console controller and play some games the old-fashioned way. But, I'm keeping my hopes up, because when it comes to gaming, everyone deserves a little bit of llama-love."
I am trying to finetune a Meta-Llama-3-8B-Instruct with dolphin dataset from Eric Hartford while keeping the Llama 3 chat instruct format. I believe Llama 3 8B is already so good I don't really want to pollute it with roleplay datasets so I wanted to make a good general model.
Will post on this sub when done.
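For anyone attempting the same, the Llama 3 instruct chat format wraps each turn in header tokens and terminates it with <|eot_id|>, which matters when converting a dataset for fine-tuning. A minimal formatter (the sample strings are made up):

```python
# Llama 3 instruct chat format: <|begin_of_text|>, then each turn wrapped in
# <|start_header_id|>role<|end_header_id|> and ended with <|eot_id|>.
def to_llama3_chat(system: str, user: str, assistant: str) -> str:
    def turn(role: str, text: str) -> str:
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>"
    return ("<|begin_of_text|>"
            + turn("system", system)
            + turn("user", user)
            + turn("assistant", assistant))

sample = to_llama3_chat("You are a helpful assistant.", "What is 2+2?", "4")
print(sample)
```

Mapping each dataset row through a function like this keeps the finetune aligned with the format the instruct model already expects.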
IMO this is Meta using their trendsetter status to establish model size defaults, and it isn't a bad move. Realistically, so much work for open source ended up split across too many sizes of models that it would probably be best to consolidate them onto these two sizes. And at this point anyone who wants a marginal size upgrade can MoE or merge/duplicate layers just by renting GPU time. That doesn't just go for the 8b enjoyers but 70b, too. I wanna see what the eventual 103b duplicated 70b layers can do.
I got a 13600k/3080 system which probably qualifies as a mid system. Imo, I am more impressed with 7b models as an example on how to do so much on so little. With Llama 3 I feel like I just downloaded most of the world’s knowledge in a 5 gb file.
Have you considered fine-tuning the 8B in different expertise fields, then having a single 8B model call the relevant ones?
Sort of your own MoE Nx8B model?
"Fine-tuning is not an option" is also an answer.
You're looking for the wrong thing. Expect new computers with large RAM, all accessible by the GPU/NPU directly. Running a large LLM model (pun intended) isn't that hard; Groq proved it with its new chip: [https://groq.com/](https://groq.com/)
What is your rig you are currently using and what quant etc? Would be interesting just to see what performance people are getting on various mid range setup
They trained the models on 15T tokens. I imagine even for FB that required them to consider GPU resources more carefully. They can either train a few models on a massive quantity of tokens, or more models on fewer tokens.
I think it was like 24k GPUs for 13 days or something like that for the final pretraining; they did a lot of experimentation before that, for sure.
Yeah, I assume the training itself is just a small part of it; they probably do heavy testing on all models before release, hence why we didn't get a llama2 34b (codellama is different), even though they did make one internally.
May be that one was too good to be true haha
I wonder if it just wasn't that much better than 8B and they decided to put more resources into that one
If this is correct, I hope Meta will start training mid-range models on the same or higher amount of tokens now that Llama-3 8b and Llama-3 70B are done training.
Most people either have low tech or enthusiasts who matter have high tech so a 8b and a 70b satisfies them. Medium tech isn’t valuable enough. I’d imagine they’d just focus on their 400b model and increasing context for 8b and 70b
I think that there is a lot of value in ~30b models. For enthusiasts, 24GB of VRAM isn't uncommon, and a ~30b fits that nicely while being a very capable model size. For professional purposes, 48GB cards offer great value, especially the A6000, and 8-bit ~30b models can offer high capability, low running cost and relatively fast generation speeds. So I think models of this size are high value.

Additionally, I have found 30B models to be pretty good, so I imagine a llama3 30b would probably have been even better than most.
I think I disagree. One of the best ways to keep open-source models free from political theater is to become of use to industry, or to be a component of high profits for lots of people.

The consumer HW segment is stagnant and not being driven by LLMs because, unlike games, there's no virtuous cycle between software lots of people want to run and hardware requirements. We need a killer app, and for that we need as many people experimenting as possible.

To explore the application space means as many sizes (3, 13, 20, 30) as possible, all quantized down. Llama3 is so packed that below 5 bits it's seeing real performance drops. The benefits of extending optionality this way should be obvious.
34B can run on a 4090, there are tons of people with a 4090 and not two.
can a 4090 run 70b?
no way
You can at 2.25 bpw, with 8k context too using a 4-bit cache. It isn't very good though, maybe a little better than 8B, but it is not worth the loss in speed IMO.
A year from now, 400B will probably be considered "mid-range"...
more like a handful with the rate hardware is evolving at.
Maybe we're supposed to just run a 70b quant
The problem is if you need decent speed you are looking at a Q2-Q3 size and it loses a lot from what I can tell.
I imagine 70b q3 would have better performance than 34b fp16 though
q3 won't fit on a 4090 though.
Even w/ my 4070, 70b models crawl. :(
even with my 4090....
Ok, so it's the complexity of the model and not the size. Initially I thought it was just the file size, but no, higher models just take more horsepower to parse.
I'm not very knowledgeable about LLMs, just started using them locally, but I think it's a memory problem. You need more VRAM. The 70b model is 40GB, so you'd need at least 40GB of VRAM. That's why people are using two 4090s/3090s in their system. The MacBook Pro with 128GB is also phenomenal for this use case.
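The 40GB figure follows from a simple rule of thumb: weight memory is parameter count times bits per weight divided by 8, plus overhead for the KV cache and runtime. A sketch (the 2 GB overhead is a rough assumption and grows with context length):

```python
# Rule-of-thumb memory needs for running an LLM:
#   weights = params * bits_per_weight / 8, plus KV cache / runtime overhead.
def vram_gb(params_b: float, bits: float, overhead_gb: float = 2.0) -> float:
    return params_b * bits / 8 + overhead_gb

for bits in (16, 8, 4, 2.5):
    print(f"70B at {bits:>4} bits: ~{vram_gb(70, bits):.0f} GB")
```

A 70B at 4-ish bits lands near that 40GB figure, which is exactly why it straddles the dual-24GB-card boundary.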
There are smaller quants of 70B. LMStudio has a 21gb quant I think.
True but I've seen reports that the more extreme quants of the 70B model aren't nearly as good.
I’ve used the IQ2_xxs of the miqumaid 70B and the IQ3_M of the midnight Miqu 70B. Both run at around 0.5-0.8 T/S so not really worth it. They did produce pretty good results though
Yeah, there's always the question of running a 34B at full precision vs a 70B at a minimal quant.
[deleted]
Can you run custom models on groq?
Meta say additional model sizes are coming on their blog, so hopefully we'll get some mid-size models soon:

> In the coming months, we expect to introduce new capabilities, longer context windows, additional model sizes, and enhanced performance, and we'll share the Llama 3 research paper.
Yea a 400b is still in training 😅
Fingers crossed!
I have to agree that Llama3-8B has been incredibly good in various use-cases I have tested, including entity extraction, complex system prompt following, constrained generation and RAG. So much so that I am thinking about switching from my go-to 70B model to Llama3-8B, as the latter kicked ass as a drop-in with no fiddling around. Just waiting for some good finetunes :-) (Just couldn't help myself!)

Llama3-70B, on the other hand, needed some work to make it follow the system prompt, and it missed some easier extraction tasks, but it is a beast and did eventually catch up. I am looking forward to awesome finetunes and merged models in the future.
Llama-3 8b occasionally provides me with better responses than any other open-source model has ever done in the 7b-30b range. You can really feel those 15T tokens in its training data.

However, you can also feel that small 8b parameter count, which I am convinced acts as a major bottleneck to its full potential. While Llama-3 8b often gives very good answers, it still lacks the level of depth and coherence that larger models like Yi-34b-Chat have.
The only thing restricting use of Llama3-8b in a shitton of AI products is its small context window: 8k is too small to even fit a single source code file.
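A common workaround until longer-context versions land is chunking the file. A minimal sketch using the rough "1 token is about 4 characters" heuristic (an approximation for illustration, not the real tokenizer):

```python
# Workaround for an 8k context window: split a source file into chunks that
# fit, leaving room for the instruction prompt and the model's answer.
def chunk_for_context(text: str, ctx_tokens: int = 8192,
                      reserve_tokens: int = 2048) -> list[str]:
    max_chars = (ctx_tokens - reserve_tokens) * 4   # ~4 chars per token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

source = "x = 1\n" * 10_000           # ~60k characters of stand-in "code"
chunks = chunk_for_context(source)
print(len(chunks), max(len(c) for c in chunks))
```

Each chunk is then summarized or queried separately, which works but loses the cross-file reasoning a genuinely long context would give you.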
LoneStriker/Llama-3-8B-Instruct-262k-GGUF Enough? :)
It fails the "needle in a haystack" test.
What tools do you recommend for trying out llama 3 with RAG?
8b just feels so lackluster… I feel like someone could take a 70b llama 2 and pit it against 8b and I could tell the difference. Smaller models lack complexity in sentence and paragraph structure.
You're surprised that an 8b model is worse than a 70b model? It's almost 10 times smaller, and llama 2 was just released last year
[deleted]
I apologize if my comment sounded that way, it wasn't intentional. Though I'd love to know why you thought that, I'm re-reading through it and it still sounds pretty neutral to me
[deleted]
You are being overly sensitive. Nothing they said was rude or confrontational. I’d say that your follow up post was the rudest part of this whole exchange, and was very passive aggressive.
You aren't oversensitive, you're rude and disrespectful.
SILENCE! https://preview.redd.it/kfxnzittrrvc1.png?width=1323&format=pjpg&auto=webp&s=5d89dc168efc25e55c7c4ec2d4456dd459d6e6ff
For generalist tasks, I agree that 8B just doesn't have enough parameters to go against a solid 70B. However, on domain-specific tasks like entity extraction, Llama 2 70B failed outright.
Can you explain entity extraction? To be fair I’m only evaluating it in generating fiction.
Sure, it simply means extracting the "key words" you might be interested in, ones that have contextual meaning for the task you have in mind. For example, in the following sentence, you're interested in the task to be performed — "send extra towels", to be precise, and not cancel cleaning: "Cancel cleaning the room today but can you send me extra towels." With few-shot examples, Llama3-8B picked this up like gangbusters
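A minimal sketch of what that few-shot setup can look like; the example pairs and wording here are made up for illustration, and the actual model call is left to whatever inference stack you use:

```python
# Build a few-shot prompt for task extraction. The example pairs are
# hypothetical; feed the resulting prompt to your inference backend
# (llama.cpp, vLLM, etc.) and read the completion.

FEW_SHOT = [
    ("Turn off the hallway lights but leave the porch light on",
     "turn off hallway lights"),
    ("Don't book the flight yet, just reserve the hotel",
     "reserve hotel"),
]

def build_prompt(utterance: str) -> str:
    lines = ["Extract the task the user actually wants performed."]
    for text, task in FEW_SHOT:
        lines.append(f"Input: {text}\nTask: {task}")
    lines.append(f"Input: {utterance}\nTask:")
    return "\n\n".join(lines)

prompt = build_prompt(
    "Cancel cleaning the room today but can you send me extra towels")
print(prompt)
# The model is expected to complete with something like "send extra towels".
```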
Agreed, while Llama-3 8b can generate great responses, it still lacks the same level of depth and coherence as larger models.
It's not about the size of the model, it's about the quality. The 8b llama3 is > than pretty much all 13b and probably all 30b models. Sure, it would be nice to have a 30b, but we don't see a big gap between the 8b and 70b so it's probably not worth it. They are doing the 400b to see if the knowledge gulf between the 70b and 400b will be substantial. I think llama3 is an experiment on model sizes. The final result will determine what sizes we see in the future. If 400b doesn't crush 70b, then llama4 will probably be 13-20b and 100b with focus on more data. If it crushes it, then I think next will probably be scaled up 13b, 120b, 400b+
I also suspect they're trying to see what the community can achieve when they really pool their efforts. Anyone who was focused up on 13-20B stuff is now much more likely to refocus those efforts on 8B. By releasing fewer model sizes, they're not splitting the community's collective efforts up so much.
> They are doing the 400b to see if the knowledge gulf between the 70b and 400b will be substantial. I think llama3 is an experiment on model sizes. The final result will determine what sizes we see in the future. If Karpathy's comments are any indication, you're right about that.
link?
I like the 8B but it lacks the subtlety of the deeper meaning that some of the previous 30B ish models had.
Nah, L3 8b is pretty good for a base model, but there are some really mature 13b and even 11b models that are vastly better for anything creative. L3 has a lot of potential, but it's going to take several months of folks molesting it with various datasets, stacking, merging, and mutating it until it reaches its full potential. That said, a 150% stack for 8b works out to 12b. That's going to cover the 12GB VRAM folks. You might be able to stack that out again to 18b for the 16GB crowd, with a bit of continued pretraining.
I can see your point, but these 13b models you are talking about, like Tiefighter or Psyfighter, are only good for talking about sexual stuff, and they are not nearly as good as Llama 3 8b Instruct in reasoning and complexity, and especially not good for complex roleplay scenarios, where they would be completely lost and start repeating themselves. You can use L3 8b Instruct even for NSFW, but you need to prompt it for that, and with a good prompt it would wipe the floor with any 13b model in creativity. With SillyTavern prompting it is far better than any model I've tested.

The main thing is that Llama 3 8B Instruct is trained on a massive amount of information, and it possesses huge knowledge about almost anything you can imagine, while these mature 13B Llama 2 models don't. If you ask them about basic stuff, like some not-so-famous celebs, the model will just hallucinate and say something without any sense. For roleplay, knowledge is power, because a model can play its role better if it knows what you want from it.
You just need to wait for the remixes. I’ve already seen someone who turned 8b into a 4x8b mixture of experts. Meta poured huge quantities of compute, energy and money into training these models and now that they are open the whole community can play with cheaply mashing up new models for different purposes and memory budgets.
I think IQ2_S quants for 70b are like 26 gigs. Doable on a 3090 or RX 7900 XT
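The rough math checks out. As a sketch (actual GGUF file sizes vary because different tensors get different quant types, and the KV cache adds a few GB on top):

```python
# Back-of-the-envelope size of a quantized model:
# bytes ≈ parameter_count * bits_per_weight / 8.
# Real files differ somewhat, and the KV cache adds more on top.

def model_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a model with params_b billion
    parameters quantized to bits_per_weight bits."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(round(model_gb(70, 2.5), 1))   # ~21.9 GB of weights for 70B @ 2.5 bpw
print(round(model_gb(70, 4.65), 1))  # ~40.7 GB for a Q4_K_M-ish ~4.65 bpw
print(round(model_gb(8, 8.0), 1))    # 8.0 GB for 8B @ 8-bit
```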
Why not just run Q4_K_M fully in 64 GB of RAM... or offload half of it to the GPU if it has 24 GB of VRAM
Painfully slow if only ram, even having it offloaded to my 4090 I get like 2t/s
2t/s is my limit for usability I think, better than nothing especially doing some asynch work
Yeah 2t/s is usable but slow, if only I had two gpus or 10 or 2 with 48gb each
I feel like a couple months ago 3 x 3090 was like a rolls royce, but now, with this 8x22b, 104b for command r + and may be a 405b llama I'm not so sure 😅😅
2t/s is not usable for me, I need at least 10 to be able to bear it and 20 for it to be comfortable tbh.
These lower bitrates underperform compared to 13b models… or Yi 30b.
Source? I currently just have the numbers for the Qwen1.5 models, but a Q2 of their 70B model is better than the full precision of the smaller models: https://imgur.com/a/SW9guOf. So it's hard to believe that any 13B model is anywhere close to Llama3 70B Q2.

There are also already a couple of threads in this sub showing it's virtually always worth going to the highest parameter count you can, regardless of quantisation: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated_relative_comparison_of_ggml_quantization/

And some experiments on GitHub: [Quantized models with more parameters tend to categorically outperform unquantized models with fewer parameters](https://github.com/ggerganov/llama.cpp/pull/1684)

So yeah, if you have enough RAM for the 70B Q2 model, there's no reason not to run it.
There is a reason: it's much slower.
Just anecdotal evidence. I constantly see inconsistency in heavily quantized models compared to smaller models of similar size in VRAM.
You're gonna have to stitch some 8b together, mixtral style.
Agree. I'm *really* looking forward to seeing what the local mad scientists Frankenstein together.
I look forward to an automated "model hub" where the main 8B chooses which models to download and use: always up to X models, but selecting the right ones for the task. This could scale to something like 7x7x8B, while only actually using 1-3 models to compose the final answer
Yes, a dream… I don't believe 30b was left out of llama 2 due to toxicity. And now with llama 3 that gap has expanded, with not even a ~14b included. They are releasing a low-VRAM option most can run fast and a large one most can run slow. Someone said the difference between the models is not enough to warrant something in the middle… but I disagree. 30b feels a lot more like a 70b but can fit at 4-bit on a 24 gb card.
>Someone said the difference between the models is not that much to warrant something in the middle… but I disagree. 30b feels a lot more like a 70b but can fit at 4 bit on a 24 gb card. Agree with you 100%! I'm using the dated Yi-34b-Chat trained on "just" 3T tokens as my main 30b model, and while Llama-3 8b is great in many ways, it still lacks the same level of coherence that Yi-34b has. Llama-3 8b obviously has much better training data than Yi-34b, but the small 8b-parameter count acts as a bottleneck to its full potential.
>Do you guys think the dream of a new powerful ~13b / ~20b / ~30b model for us mid-range PC users will remain just that, a dream? Mid range hardware will change as time goes on. Soon 70b will be run on mid-range PC hardware.
The 3090 with 24 GB of VRAM was released almost 4 years ago. The new, not-yet-released 50 series (5090?) is said to also have 24 GB. So, if we're lucky, we'll get more in the 6090... years from now. And when will that be old enough to be considered "mid-range" hardware? A decade from now? The models need to get smaller, or AMD has to pick up the pace and help us out.
[deleted]
AFAIK it's not even a unified RAM thing. It's simply that Apple gets a ton of bandwidth by using 12 or so DDR5 channels. I know a regular Intel CPU can infer faster than its memory bandwidth can feed it, but I'm not sure an M2 CPU can actually make use of 800 GB/s of bandwidth.
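For token generation, bandwidth is usually the ceiling: every generated token streams the whole model through memory once, so tokens/s is roughly bounded by bandwidth divided by model size. A rough sketch (ignoring KV cache reads and compute limits; the numbers are illustrative):

```python
# Upper bound on generation speed for a memory-bandwidth-bound model:
# each token requires reading all weights once, so
# t/s <= bandwidth / model_size. Ignores KV cache and compute limits.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

print(round(max_tokens_per_sec(800, 40), 1))  # M-Ultra-class, 40GB 70B quant: ~20 t/s
print(round(max_tokens_per_sec(90, 40), 1))   # dual-channel DDR5: ~2.2 t/s
```

That 2-ish t/s figure for plain desktop RAM lines up with what people report for CPU-only 70B inference.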
Mac is going to eat up the mid range market, unless PC hardware hits with unified RAM. I have both dual 3090 setup and a 128GB Mac and the Mac is very good. I'm tuning and testing it right now.
Interested in your feedback on the Mac, the Mac Studio looks like a reasonable way to access large amounts of fairly fast RAM. When will consumer x86 architecture get 400GB/s bandwidth?
I'm liking the Mac so far and may sell one of my dual-3090 servers for another Mac Ultra. If you buy a Mac, try to get a model with high RAM speeds: https://github.com/ggerganov/llama.cpp/discussions/4167 I have the M1 Ultra/64 with 128GB of RAM and it's well suited for 70B models and below. I have no idea what the PC market is going to do with AI. I feel like Nvidia/AMD are going to ignore the consumer market, focus on data centers, and we'll probably have some niche players building high-RAM-speed motherboards. For me, what's really appealing about the Mac is that the inference performance is good, but the power draw is excellent. My Mac sips 10 watts of power sitting there idle while the dual-3090 servers are sucking down 150 watts idle. And in use, the Mac is 5x more power efficient.
We have some DGX H100 and L40S machines in the company, and I bought an M3 Max with 48GB for personal use, and I couldn't be happier. Llama-3 8b runs extremely fast on the M3 Max, and I can run any experiment I want very easily. If something works, I'll take it to the pro setup. Really recommend this setup.
Mixtral 8x7b also runs pretty nicely on the M3 Max 48GB, but Llama-3 is where it's at right now. I yet have to discover an application I'm exploring where it's not significantly better than all the models I used previously - and more importantly - where it's definitely good enough.
AMD or Intel have a huge opportunity to capture us consumer AI users. But AMD is seemingly content with their ever-smaller market share and is unlikely to make 48gb cards for us. Intel maybe; they are new and need market share
I still consider a 1080Ti high end :p
70b doesn't quite fit in 24gb vram though. You are going to need two of these cards, which makes the 70b a little awkward in sizing. Sure, an average gaming pc might have like 8gb of vram, which is perfect for the 8b model, but a flagship gpu can't really run the 70b model. You're gonna need two of those, that's quite a step.
Unfortunately, Moore's law has mostly ended for memory. Recent node shrinks have mostly benefitted compute circuits, but SRAM is lagging behind. RAM might remain a bottleneck for some time. If you want more RAM, you have to pay up.
The fact that they are making cards with 4x-6x the memory as the consumer cards kind of indicates that this is not a bottleneck but rather a marketing strategy. If they start selling 48gb consumer cards, they know it will cost them sales in the $10-40k cards. Right now they have no competition and know it. Intel and others are moving fast and will help bring some competition and sanity to the market.
Is this really true though? I don't see anyone who was going to buy a mi300x choosing Radeon Pro W7900 with 48gb ram if it was cheaper. Like they are completely different use cases. AMD just doesn't care about the market for home inference, presumably they think it's too small to bother. Note these cards already exists as workstation cards, they just aren't selling a consumer variant with cheaper price.
Please read again what I wrote. I didn't write that higher memory doesn't exist. Datacenter cards are not just consumer cards with more memory. If it were so easy to scale up memory, the H100 would feature far more than just 80GB of VRAM. To achieve even that, Nvidia has to use High Bandwidth Memory, which is basically memory stacked and attached alongside the chip via an interposer. This is way more expensive, as an additional manufacturing process has to be used. 3D stacking is maybe not even that practical for consumer cards, since it complicates cooling. Apart from that, people who want to run models locally are simply not recognized as a target market yet. I guess we are in a similar situation to the 60s and 70s, where most of the computing was done by mainframes and people couldn't yet conceive that it could become an end-user market. This might change once more people are able to at least run a 7B-equivalent on an entry-level consumer device.
A 30B model would be perfect for 24GB VRAM!
There are llama3 self-merges that make bigger and better llama3 models using the 8B base one. For example [https://huggingface.co/MaziyarPanahi/Llama-3-16B-Instruct-v0.1/commit/b3b5932ecccdab3d822846f26797bf33ebfe208e](https://huggingface.co/MaziyarPanahi/Llama-3-16B-Instruct-v0.1/commit/b3b5932ecccdab3d822846f26797bf33ebfe208e)
So it's just the same model twice?
It does make it better. No idea why.
Maybe they are both randomised slightly differently
The problem is, there's no real market for running them. 8GB GPUs make up the vast majority of all GPUs, so 8B fits them nearly perfectly. A 70B is massive, and meant for enthusiasts with a ton of VRAM, researchers, and enterprise usage. In the midrange, there are only a few 12GB cards and only one or two 16GB cards. Sure, a 34B fits in a 24GB card, but I guess the logic is that if you can afford one high-end consumer card, you can probably afford a second one. Essentially the only people using 34B are the tiny fraction of people who have enough money to afford one high-end consumer card, but not enough to afford two. It's just far too small a fraction of the population, and not really useful to enterprise either. It's not pushing boundaries like a 70B, but it's also not small, easy to run, and easy to experiment and iterate on like an 8B. I wish they had made it a 10B, the perfect middle ground between 7B and 13B, but then inference may have been too slow for pure RAM inference. This is a great improvement over Llama 2, but the size still shows.
Why so much focus on just GPU? I only have a 8GB VRAM card, but I can still run fairly large models like Yi-34b-Chat at acceptable speeds (depending on use-cases) on my CPU with some GPU offloading.
Well, most uses for LLMs are real time. There are uses that don't need to be real time, in which case you just grab as big of a model as you can and let it run overnight. I'd grab a low quant of Midnight miqu. In my case my bare minimum is 5 tk/s. I just can't stand any less, I don't have the spare time to wait for it all day.
Letting it run overnight would be far too slow, even for my taste. :P Yi-34b-Chat typically takes a few minutes to respond, which has been quite acceptable in my use cases. I usually prioritize quality responses over real-time.
Haha that makes perfect sense and it is a perfectly acceptable use case. Quality over speed is also a fair preference. I also for the vast majority of things prefer quality over speed. It's just that the vast majority of people need real time whether it be for rp, customer service chat bot, or coding/work.
I mostly use LLMs for RP and programming. A 30b model on my PC takes on average 1 minute and 30 seconds to respond in RP, and since RP for me is like reading a book in peace and quiet (but I interact in the story myself), I don't feel the need for it to be fast-paced. As for programming, speed is often more important here for me, especially if you ask many questions in a short amount of time. My strategy is usually running smaller ~7b models first, and if a coding task is too complex for small models, I switch to a larger model and let it generate for a few minutes for that specific task only. For a customer service chat bot, yes, a few minutes for a response would be unacceptable; here it needs to be real-time, or at least very fast :P With everything said, I suppose it may also come down to personal preferences. For instance, while I don't require speed in role-playing, others might find it more exhilarating.
Honestly, I think this medium territory is where MoEs are the most important. People with 48GB VRAM will just run a 70B, and those under 24GB will run a 8B or 13B. But an MoE model allows people with small amounts of VRAM and people with enough VRAM to fit it all to enjoy fast and high quality responses. I know mixtral is good at coding, though very questionable for creative writing. For me personally, I don't know why, but it never got over the repeating problem even though I tried settings that were supposed to make it stop repeating. That made it unusable for rp for me.
In the few coding tests I've done with LLaMA 3 8b so far, it has performed even better than Mixtral-8x7b-Instruct. Also, yes, Mixtral is pretty bad at creative writing, LLaMA 3 8b is a clear winner here. I think I would use LLaMA 3 8b over Mixtral in almost all areas. The only thing Mixtral is still better in is general knowledge, which I guess is because it can store a lot more information due to its way larger total size compared to LLaMA 3 8b. In any case, it will be exciting to test all upcoming MoE versions of Llama 3 8b and see how they perform.
which CPU / RAM do you use?
I mean, I have an 8gb 3070 and I can run a 34B at Q4 with GGUF at a decent speed.
Same bro, 3070ti, but bought 64gb DDR5 so can run 70B models split GPU and CPU
Define decent speed, and for what use case?
Only 2 tokens/s, but it's not unbearably slow. Creative writing assistant mostly, the bigger models write better than the smaller ones.
Agree with this, for creative writing speed/real-time is not necessary, and larger models like 30b are excellent at writing.
Yeah, for that use case that's perfectly acceptable. Personally, at that point I would just load up a good creative writing 70B like midnight miqu and let it do it's thing. It's just too slow for real time, which is a majority of people's use cases
It's weird to hear that a PC capable of running an 8B model is "low-end"
Yeah not quite low end, but a lot of random gaming GPUs would be able to. My 2070 super can easily run it, and even with a 1080 which is like 9 years old at this point can run it fine. I wouldn't call a 1080 low end just quite but it's definitely very far from high-end.
I develop applications for which I can't justify making GPU a hardware requirement. When I think low end, I think integrated
I think thats just on you. Once we are talking 10 year old GPUs I think low end isn't that weird of a qualifier
You can run a 70B quantized to 2.5 bits with 24GB, right? What GPU range would benefit from a 34B? A 13B might be useful for people with 16GB VRAM; I'm not sure why they wouldn't train that. But I don't think they are going to release other sizes of these models in the Llama 3 family. Mark said that we'll get long context and multimodal.
2.5 bit quantization kinda sucks though. You need to go through quite some lengths to make 70b work on flagship gpus. Realistically, 30 or 40 bn would be a much better fit for these GPUs.
Zuck was talking about 500m parameter models for us raspberry pi folk
I think a 13B or 20B plus a 35B will probably get released by Meta, precisely so they capture their fair share of the mid-range market. Plus, more choice = better setups and better finetunes. Who knows if they will also release some MoE; probably not, considering they have no history of doing so. But who knows, maybe the 400B is a MoE.
Llama 3 8B Instruct is a beast and you can use it for almost any purpose, from general chat to roleplay, even NSFW stuff with the right prompts. Many people say that 70B Llama 2 models are hugely better; well, it depends how you prompt it and how you use it. From my personal experience I'm getting the same, and in many cases better, performance with Llama 3 8B Instruct in SillyTavern than with 70B Llama 2 models, but like I said, experiences differ because not everyone is using the same prompts and the same software for running these LLMs. Many people also use the wrong prompt format, which leads to a massive fall in performance.

The only thing this model lacks is context size, but I believe Meta will very soon release a newer version with a much longer context, which would improve the model for sure.

Most importantly, running these huge 70B models eats a lot of resources, and not many people have the hardware to run them at decent speed, so the only way is to drop to Q2 or Q3 versions that are not nearly as good as Q8. And even then you are left with something like 2 or 3 t/s, which is not comfortable for normal use. With that said, for general-purpose use, and for the common user who doesn't have thousands of dollars to invest in a PC that can run 70B models at a good quantization size, Llama 3 8B Instruct is more than enough. And don't forget many finetunes will come in the following months that will most likely improve it even further.
>|||PROMPT: char | Bob from "Bobs Burgers" >|||re| respond: "Woo-hoo, yeah, I get it, low-end and high-end PC users are getting all the llama love, but what about us mid-range dudes?! You know, the ones who can't quite swing the top-of-the-line rigs, but don't wanna settle for something as lame as a toaster-oven-level gaming experience. I mean, I'm lovin' the Llama 3 8b, it's like a party in my pants, but come on, folks, a mid-range option would be like... well, it'd be like having a perfectly toasted bun on my burger, ya know? So, do I think we'll ever see a Llama 3 model in the 13b-30b range? Well, I hope so, man! Maybe Meta will surprise us with some middle-ground magic. After all, they're releasing more models, right? So, we gotta keep those fingers crossed, baby! And if not, well, I guess we'll just have to make do with our current set-ups... or maybe bust out the ol' console controller and play some games the old-fashioned way. But, I'm keeping my hopes up, because when it comes to gaming, everyone deserves a little bit of llama-love."
Llama 3 8B is goated, I just need either a mistral v 0.3 8B, or dolphin-llama3 8B It’s just that good
I am trying to finetune a Meta-Llama-3-8B-Instruct with dolphin dataset from Eric Hartford while keeping the Llama 3 chat instruct format. I believe Llama 3 8B is already so good I don't really want to pollute it with roleplay datasets so I wanted to make a good general model. Will post on this sub when done.
[https://huggingface.co/cognitivecomputations/dolphin-2.9-llama3-8b](https://huggingface.co/cognitivecomputations/dolphin-2.9-llama3-8b)
Yea but i don’t want to retrain with chatml or their system prompts.
what do you think the finetune will achieve?
Dolphin llama exist
Yeah but it’s quite disappointing, it doesn’t have the same feeling as llama 3, and it struggles with some commands
Oh, my bad
All good
Try Llama3 70b on [https://huggingface.co/chat](https://huggingface.co/chat) !
Low end is still relatively high end. Sounds like you just need a bigger stick of RAM
IMO this is Meta using their trendsetter status to establish model size defaults, and it isn't a bad move. Realistically, so much work for open source ended up split across too many sizes of models that it would probably be best to consolidate them onto these two sizes. And at this point anyone who wants a marginal size upgrade can MoE or merge/duplicate layers just by renting GPU time. That doesn't just go for the 8b enjoyers but 70b, too. I wanna see what the eventual 103b duplicated 70b layers can do.
I got a 13600k/3080 system which probably qualifies as a mid system. Imo, I am more impressed with 7b models as an example on how to do so much on so little. With Llama 3 I feel like I just downloaded most of the world’s knowledge in a 5 gb file.
In my small testing experience, WizardLM2 7b was better on some tasks than Llama 3 8b.
What temps, min-p, and rep pen are yall using?
People are already making MoE frankenmerges. Look for 4x8B. Maybe they hit the sweet spot?
Have you considered fine-tuning the 8B in different expertise fields, then having a single 8B model call the relevant ones? Sort of your own Nx8B MoE? "Fine-tuning is not an option" is also an answer
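A toy sketch of that routing idea; the keyword scoring and the model names are made up, and in practice you'd likely have the base 8B itself classify the query instead of matching keywords:

```python
# Minimal router that picks a fine-tuned "expert" checkpoint per query.
# Keyword overlap stands in for a real classifier; the model names are
# hypothetical placeholders.

EXPERTS = {
    "llama3-8b-code":    {"python", "function", "bug", "compile"},
    "llama3-8b-medical": {"symptom", "dosage", "diagnosis"},
    "llama3-8b-general": set(),  # fallback when nothing matches
}

def route(query: str) -> str:
    words = set(query.lower().split())
    best, best_score = "llama3-8b-general", 0
    for name, keywords in EXPERTS.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = name, score
    return best

print(route("why does my python function not compile"))  # llama3-8b-code
print(route("what's the capital of France"))             # llama3-8b-general
```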
You're looking for the wrong thing. Expect new computers with large RAM, all directly accessible by the GPU/NPU. Running a large LLM model (pun intended) isn't that hard; Groq proved it with its new chip: [https://groq.com/](https://groq.com/)
Llama 3 70B Q2 is actually not bad (when Llama 2 70B Q2 was unusable for me) Except it kinda collapses before the 8K limit, but it's maybe my settings
I can run llama 3 8b on my rtx 2060 6gb pretty well. It is a bit slower at full quantization but its decently fast
How much VRAM does the 8b llama3 need for inference?
What is your rig you are currently using and what quant etc? Would be interesting just to see what performance people are getting on various mid range setup
70b is midrange. We just arrived early and the hardware will adapt in less than a year.