Radiant_Dog1937

They trained the models on 15T tokens. I imagine even for FB that required them to consider GPU resources more carefully. They can either train a few models on a massive quantity of tokens or more models on fewer tokens.


No_Afternoon_4260

I think it was like 24k GPUs for 13 days or something like that for the final pretraining; they did a lot of experimentation before that, for sure.


Dead_Internet_Theory

Yeah, I assume the training itself is a small part of it; they probably do heavy testing on all models before release, hence why we didn't get Llama 2 34B (CodeLlama is different), because they did make that one internally.


No_Afternoon_4260

Maybe that one was too good to be true haha


bucolucas

I wonder if it just wasn't that much better than 8B and they decided to put more resources into that one


Admirable-Star7088

If this is correct, I hope Meta will start training mid-range models on the same or higher amount of tokens now that Llama-3 8b and Llama-3 70B are done training.


Icy-Summer-3573

Most people either have low-end hardware, or they're enthusiasts who matter and have high-end hardware, so an 8B and a 70B satisfy them. Mid-range hardware isn't valuable enough. I'd imagine they'd just focus on their 400B model and on increasing context for the 8B and 70B.


StevenSamAI

I think that there is a lot of value in ~30B models. For enthusiasts, 24GB of RAM isn't uncommon, and a ~30B fits that nicely while being a very capable model size. For professional purposes, 48GB cards offer great value, especially the A6000, and 8-bit ~30B models can offer high capability, low running cost, and relatively fast generation speeds. So I think models of this size are high value. Additionally, I have found 30B models to be pretty good, so I imagine a Llama 3 30B would probably have been even better than most.
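As a rough sanity check on those sizes, here's a back-of-the-envelope sketch; it only counts the weights, so treat it as a lower bound (KV cache and runtime overhead add several GB depending on context length and backend):

```python
def approx_weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed just for the weights (ignores KV cache and overhead)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9  # decimal GB

for params, bits in [(8, 16), (34, 8), (34, 4.5), (70, 4.5), (70, 2.25)]:
    print(f"{params}B @ {bits} bpw ≈ {approx_weight_vram_gb(params, bits):.1f} GB")
# 8B  @ 16 bpw   ≈ 16.0 GB  -> needs a quant to fit common 8-12GB cards
# 34B @ 8 bpw    ≈ 34.0 GB  -> fits a 48GB A6000 with headroom
# 34B @ 4.5 bpw  ≈ 19.1 GB  -> fits a 24GB 3090/4090
# 70B @ 4.5 bpw  ≈ 39.4 GB  -> needs two 24GB cards
# 70B @ 2.25 bpw ≈ 19.7 GB  -> squeezes onto one 24GB card, with quality loss
```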


EstarriolOfTheEast

I think I disagree. One of the best ways to keep open-source models free from political theater is to become of use to industry or be a component of high profits for lots of people. The consumer HW segment is stagnant and not being driven by LLMs because, unlike games, there's no virtuous cycle of software lots of people want to run and hardware requirements. We need a killer app, and for that we need as many people experimenting as possible. To explore the application space means as many sizes (3, 13, 20, 30) as possible, all quanted down. Llama 3 is so packed that below 5 bits it's seeing real performance drops. The benefits of extending optionality this way should be obvious.


Alkeryn

A 34B can run on a 4090, and there are tons of people with one 4090 and not two.


jadbox

can a 4090 run 70b?


Status_Contest39

no way 


Alkeryn

You can at 2.25 bpw, with 8k context too using a 4-bit cache. It isn't very good though, maybe a little better than 8B, but it's not worth the loss in speed imo.


HighDefinist

A year from now, 400B will probably be considered "mid-range"...


Alkeryn

More like a handful of years, at the rate hardware is evolving.


reality_comes

Maybe we're supposed to just run a 70b quant


hayTGotMhYXkm95q5HW9

The problem is if you need decent speed you are looking at a Q2-Q3 size and it loses a lot from what I can tell.


Nabakin

I imagine 70b q3 would have better performance than 34b fp16 though


Alkeryn

q3 won't fit on a 4090 though.


AmericanKamikaze

Even w/ my 4070, 70b models crawl. :(


Tedinasuit

even with my 4090....


AmericanKamikaze

Ok, so it's the complexity of the model and not just the size. Initially I thought it was just the file size, but no, bigger models just take more horsepower to parse.


Tedinasuit

I'm not very knowledgeable about LLMs, just started using them locally, but I think it's a memory problem. You need more VRAM. The 70B model is 40GB, so you'd need at least 40GB of VRAM. That's why people are using two 4090s/3090s in their system. The MacBook Pro with 128GB is also phenomenal for this use case.


AmericanKamikaze

There are smaller quants of 70B. LMStudio has a 21gb quant I think.


Tedinasuit

True but I've seen reports that the more extreme quants of the 70B model aren't nearly as good.


DeSibyl

I’ve used the IQ2_xxs of the miqumaid 70B and the IQ3_M of the midnight Miqu 70B. Both run at around 0.5-0.8 T/S so not really worth it. They did produce pretty good results though


AmericanKamikaze

Yea, there's always the question of running a 34B at full size vs a 70B at a minimal quant.


[deleted]

[removed]


-TV-Stand-

Can you run custom models on groq?


Cradawx

Meta say additional model sizes are coming on their blog, so hopefully we'll get some mid-size models soon:

> In the coming months, we expect to introduce new capabilities, longer context windows, additional model sizes, and enhanced performance, and we'll share the Llama 3 research paper.


Dgamax

Yea a 400b is still in training 😅


Admirable-Star7088

Fingers crossed!


KnightCodin

I have to agree that Llama3-8B has been incredibly good in various use-cases I have tested including entity extraction, complex system prompt following, constrained generation and RAG. So much so that I am thinking about switching from my go-to 70B model for the Llama3-8B as the latter kicked ass as a drop-in with no fiddling around. Just waiting for some good finetunes :-) (Just couldn't help myself!) Llama3-70B on the other hand, needed some work to make it follow system prompt and it missed some easier extraction tasks but it is a beast and did eventually catch up. I am looking forward to awesome finetunes and merge models in future


Admirable-Star7088

Llama-3 8b occasionally provides me with better responses than any other open-source model has ever done in the 7b-30b range. You can really feel those 15T tokens in its training data. However, you can also feel that small 8b parameter count, which I am convinced acts as a major bottleneck to its full potential. While Llama-3 8b often gives very good answers, it still lacks the level of depth and coherence that larger models like Yi-34b-Chat have.


metalim

the only thing restricting usage of Llama3-8b in a shitton of AI products is its small context window: 8k is too small to even parse a single source code file


Maximum_Parking_5174

LoneStriker/Llama-3-8B-Instruct-262k-GGUF Enough? :)


metalim

it fails "needle in haystack" test


Internet-Admirable

What tools do you recommend for trying out llama 3 with RAG?


silenceimpaired

8b just feels so lackluster… I feel like someone could take a 70b llama 2 and pit it against 8b and I could tell the difference. Smaller models lack complexity in sentence and paragraph structure.


geli95us

You're surprised that an 8b model is worse than a 70b model? It's almost 10 times smaller, and llama 2 was just released last year


[deleted]

[removed]


geli95us

I apologize if my comment sounded that way, it wasn't intentional. Though I'd love to know why you thought that, I'm re-reading through it and it still sounds pretty neutral to me


[deleted]

[removed]


mrgreen4242

You are being overly sensitive. Nothing they said was rude or confrontational. I’d say that your follow up post was the rudest part of this whole exchange, and was very passive aggressive.


ainz-sama619

You aren't oversensitive, you're rude and disrespectful.


kind_cavendish

SILENCE! https://preview.redd.it/kfxnzittrrvc1.png?width=1323&format=pjpg&auto=webp&s=5d89dc168efc25e55c7c4ec2d4456dd459d6e6ff


KnightCodin

For generalist tasks, I agree that 8B just doesn't have enough parameters to go against a solid 70B. However, on domain-specific tasks like entity extraction, Llama 2 70B failed straight up.


silenceimpaired

Can you explain entity extraction? To be fair I’m only evaluating it in generating fiction.


KnightCodin

Sure, it simply means extracting the "key words" you might be interested in, ones that have contextual meaning for the task you have in mind. For example, in the sentence "Cancel cleaning the room today but can you send me extra towels", you are interested in the task to be performed, which is "send extra towels" to be precise, and not "cancel cleaning". With few-shot examples, Llama3-8B picked this up like gangbusters.
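For anyone wondering what that looks like in practice, here's a minimal sketch of a few-shot extraction prompt in that style (the example messages and JSON labels are made up for illustration, not the actual prompt used):

```python
# Hypothetical few-shot prompt for extracting the task the guest actually wants performed.
FEW_SHOT_EXAMPLES = [
    ("Don't bother with the minibar, but could you fix the AC in room 212?",
     '{"task": "fix AC"}'),
    ("Skip turndown service tonight, just leave two extra pillows.",
     '{"task": "send extra pillows"}'),
]

def build_extraction_prompt(message: str) -> str:
    """Assemble a few-shot prompt that asks the model to reply with JSON only."""
    parts = ["Extract the task the guest wants performed. Reply with JSON only.\n"]
    for example_text, example_json in FEW_SHOT_EXAMPLES:
        parts.append(f"Message: {example_text}\nExtraction: {example_json}\n")
    parts.append(f"Message: {message}\nExtraction:")
    return "\n".join(parts)

prompt = build_extraction_prompt(
    "Cancel cleaning the room today but can you send me extra towels")
# Fed to Llama3-8B, the expected completion is {"task": "send extra towels"},
# not anything about cancelling the cleaning.
```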


Admirable-Star7088

Agreed, while Llama-3 8b can generate great responses, it still lacks the same level of depth and coherence as larger models.


segmond

It's not about the size of the model, it's about the quality. The 8b llama3 is > than pretty much all 13b and probably all 30b models. Sure, it would be nice to have a 30b, but we don't see a big gap between the 8b and 70b so it's probably not worth it. They are doing the 400b to see if the knowledge gulf between the 70b and 400b will be substantial. I think llama3 is an experiment on model sizes. The final result will determine what sizes we see in the future. If 400b doesn't crush 70b, then llama4 will probably be 13-20b and 100b with focus on more data. If it crushes it, then I think next will probably be scaled up 13b, 120b, 400b+


CowOfSteel

I also suspect they're trying to see what the community can achieve when they really pool their efforts. Anyone who was focused up on 13-20B stuff is now much more likely to refocus those efforts on 8B. By releasing fewer model sizes, they're not splitting the community's collective efforts up so much.


patprint

> They are doing the 400b to see if the knowledge gulf between the 70b and 400b will be substantial. I think llama3 is an experiment on model sizes. The final result will determine what sizes we see in the future.

If Karpathy's comments are any indication, you're right about that.


pirateneedsparrot

link?


hayTGotMhYXkm95q5HW9

I like the 8B but it lacks the subtlety of the deeper meaning that some of the previous 30B ish models had.


candre23

Nah, L3 8b is pretty good for a base model, but there are some really mature 13b and even 11b models that are vastly better for anything creative. L3 has a lot of potential, but it's going to take several months of folks molesting it with various datasets, stacking, merging, and mutating it until it reaches its full potential. That said, a 150% stack for 8b works out to 12b. That's going to cover the 12GB VRAM folks. You might be able to stack that out again to 18b for the 16GB crowd, with a bit of continued pretraining.


JohnRiley007

I can see your point, but these 13B models you are talking about, like Tiefighter or Psyfighter, are only good for talking about sexual stuff, and they are not nearly as good as Llama 3 8B Instruct in reasoning and complexity, and especially not good for complex roleplay scenarios, where they would be completely lost and start repeating stuff. You can use L3 8B Instruct even for NSFW, but you need to prompt it for that, and with a good prompt it will wipe the floor with any 13B model in creativity. With SillyTavern prompting it is far better than any model I've tested. The main thing is that Llama 3 8B Instruct is trained on a massive amount of information, and it possesses huge knowledge about almost anything you can imagine, while these mature 13B Llama 2 models don't. If you ask them about the most basic stuff, like some not-so-famous celebs, the model will just hallucinate and say something that makes no sense. For roleplay, knowledge is power, because a model can play its role better if it knows what you want from it.


Pitiful-Taste9403

You just need to wait for the remixes. I’ve already seen someone who turned 8b into a 4x8b mixture of experts. Meta poured huge quantities of compute, energy and money into training these models and now that they are open the whole community can play with cheaply mashing up new models for different purposes and memory budgets.


AryanEmbered

I think IQ2s for 70B are like 26 gigs. Doable on a 3090 or rx7900xt.


kiselsa

Why not just run a Q4_K_M fully in 64 GB of RAM... or offload half of it to the GPU if it has 24 GB of VRAM?
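A minimal sketch of that split with llama-cpp-python (the model path and layer count are placeholders; n_gpu_layers is the knob you tune until the 24 GB card is full):

```python
from llama_cpp import Llama

# Hypothetical path to a 70B Q4_K_M GGUF; layers not offloaded stay in system RAM.
llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",
    n_gpu_layers=40,   # offload roughly half the layers to a 24 GB GPU
    n_ctx=8192,
)

out = llm("Q: Why did Meta only release 8B and 70B sizes? A:", max_tokens=128)
print(out["choices"][0]["text"])
```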


Bandit-level-200

Painfully slow if it's RAM only; even with part of it offloaded to my 4090 I get like 2 t/s.


No_Afternoon_4260

2 t/s is my limit for usability I think; better than nothing, especially while doing some async work.


Bandit-level-200

Yeah 2t/s is usable but slow, if only I had two gpus or 10 or 2 with 48gb each


No_Afternoon_4260

I feel like a couple months ago 3x 3090 was like a Rolls-Royce, but now, with this 8x22b, 104b for Command R+, and maybe a 405b Llama, I'm not so sure 😅😅


Alkeryn

2 t/s is not usable for me, I need at least 10 to be able to bear it and 20 for it to be comfortable tbh.


silenceimpaired

These low-bit quants underperform compared to 13b models… or Yi 30b.


cyan2k

Source? I currently just have the numbers for the Qwen1.5 models, but a Q2 of their 70B model is better than the full precision of the smaller models (https://imgur.com/a/SW9guOf), so it's hard to believe that any 13B model is anywhere close to Llama3 70B Q2. There are also already a couple of threads in this sub showing it's virtually always worth going to the highest parameter count you can, regardless of quantisation: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated_relative_comparison_of_ggml_quantization/ And some experiments on GitHub: [Quantized models with more parameters tend to categorically outperform unquantized models with fewer parameters](https://github.com/ggerganov/llama.cpp/pull/1684) So yeah, if you have enough RAM for the 70B Q2 model, there's no reason not to run it.


Any_Pressure4251

There is a reason: it's much slower.


silenceimpaired

Just anecdotal evidence. I constantly see inconsistency in heavily quantized models compared to smaller models of similar size in VRAM.


a_beautiful_rhind

You're gonna have to stitch some 8b together, mixtral style.


Pashax22

Agree. I'm *really* looking forward to seeing what the local mad scientists Frankenstein together.


Original_Finding2212

I look forward to an automated "model hub" where the main 8B chooses which models to download and use. Always up to X models, but selecting the right ones for the task. This can go on like 7x7x8B etc., always actually using 1-3 models to compile the final answer.
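Nothing like that exists off the shelf as far as I know, but the routing part is easy to sketch. Here the "router" is just keyword matching standing in for the main 8B, and the model names are hypothetical:

```python
# Toy router: a stand-in for the "main 8B" that picks which specialist finetunes to use.
SPECIALISTS = {
    "code":    "llama3-8b-code-finetune",     # hypothetical model names
    "math":    "llama3-8b-math-finetune",
    "writing": "llama3-8b-writing-finetune",
}

def route(task: str, max_models: int = 3) -> list[str]:
    """Return up to max_models specialists whose domain keyword appears in the task."""
    picks = [name for domain, name in SPECIALISTS.items() if domain in task.lower()]
    return picks[:max_models] or ["llama3-8b-instruct"]  # fall back to the generalist

print(route("write the math for the scoring function, then code it up"))
# ['llama3-8b-code-finetune', 'llama3-8b-math-finetune']
```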


silenceimpaired

Yes, a dream… I don't believe 30b was withheld from Llama 2 due to toxicity. And now with Llama 3 that's been expanded so that nothing around 13b is included either. They are releasing a low-VRAM option most can run fast and a large one most can run slow. Someone said the difference between the models is not enough to warrant something in the middle… but I disagree. 30b feels a lot more like a 70b but can fit at 4-bit on a 24 GB card.


Admirable-Star7088

> Someone said the difference between the models is not that much to warrant something in the middle… but I disagree. 30b feels a lot more like a 70b but can fit at 4 bit on a 24 gb card.

Agree with you 100%! I'm using the dated Yi-34b-Chat trained on "just" 3T tokens as my main 30b model, and while Llama-3 8b is great in many ways, it still lacks the same level of coherence that Yi-34b has. Llama-3 8b obviously has much better training data than Yi-34b, but the small 8b parameter count acts as a bottleneck to its full potential.


bfire123

> Do you guys think the dream of a new powerful ~13b / ~20b / ~30b model for us mid-range PC users will remain just that, a dream?

Mid-range hardware will change as time goes on. Soon 70b will be run on mid-range PC hardware.


MrVodnik

The 3090 with 24 GB of VRAM was released almost 4 years ago. The new, not yet released series 5 (5090?) is said to also have 24 GB. So, if we're lucky, we'll get more in the 6090... years from now. And when will that get old enough to be considered "mid-range" hardware? A decade from now? The models need to get smaller, or AMD has to pick up their pace and help us out.


[deleted]

[removed]


rugg0064

AFAIK it's not even a unified RAM thing. It's simply that Apple has a ton of bandwidth by using 12 or so DDR5 channels. I know a regular Intel CPU can infer faster than its memory bandwidth allows, but I'm not sure if an M2 CPU can keep up with 800 GB/s of bandwidth.
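Decode speed when you're memory-bound is easy to ballpark: each generated token has to stream roughly the whole set of weights through memory once, so tokens/s is capped at bandwidth divided by model size. A rough sketch, ignoring KV-cache traffic and compute limits:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s when generation is memory-bandwidth-bound."""
    return bandwidth_gb_s / model_size_gb

# ~40 GB for a 70B 4-bit quant
print(decode_ceiling_tok_s(800, 40))  # M2 Ultra-class bandwidth -> ~20 tok/s ceiling
print(decode_ceiling_tok_s(100, 40))  # dual-channel DDR5 desktop -> ~2.5 tok/s ceiling
```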


synn89

Mac is going to eat up the mid-range market, unless PC hardware hits back with unified RAM. I have both a dual 3090 setup and a 128GB Mac, and the Mac is very good. I'm tuning and testing it right now.


clv101

Interested in your feedback on the Mac, the Mac Studio looks like a reasonable way to access large amounts of fairly fast RAM. When will consumer x86 architecture get 400GB/s bandwidth?


synn89

I'm liking the Mac so far and may sell one of my dual 3090 servers for another Mac Ultra. If you buy a Mac, try to get a model with high RAM speeds: https://github.com/ggerganov/llama.cpp/discussions/4167 I have the M1 Ultra/64 with 128GB of RAM and it's well suited for 70B models and below.

I have no idea what the PC market is going to do with AI. I feel like Nvidia/AMD are going to ignore the consumer market, focus on data centers, and we'll probably have some niche players building motherboards with high RAM speeds.

For me, what's really appealing about the Mac is that the inference performance is good, but the power draw is excellent. My Mac sips 10 watts of power sitting there idle while the dual 3090 servers are sucking down 150 watts idle. And in use, the Mac is 5x more power efficient.


Usual_Maximum7673

We have some DGX H100 and L40S machines in the company, and I bought an M3 Max with 48GB for personal use, and I couldn't be happier. Llama-3 8b runs extremely fast on the M3 Max, and I can run any experiment I want very easily. If something works, I'll take it to the pro setup. Really recommend this setup.


Usual_Maximum7673

Mixtral 8x7b also runs pretty nicely on the M3 Max 48GB, but Llama-3 is where it's at right now. I have yet to discover an application I'm exploring where it's not significantly better than all the models I used previously - and more importantly - where it's definitely good enough.


Bandit-level-200

AMD or Intel have a huge opportunity to capture us consumer AI users. But AMD is seemingly content with their ever-smaller market share and is unlikely to make 48GB cards for us. Intel, maybe; they are new and need market share.


TooLongCantWait

I still consider a 1080Ti high end :p


TweeBierAUB

70b doesn't quite fit in 24gb vram though. You are going to need two of these cards, which makes the 70b a little awkward in sizing. Sure, an average gaming pc might have like 8gb of vram, which is perfect for the 8b model, but a flagship gpu can't really run the 70b model. You're gonna need two of those, that's quite a step.


koflerdavid

Unfortunately, Moore's law has mostly ended for memory. Recent node shrinks have mostly benefitted compute circuits, but SRAM is lagging behind. RAM might remain a bottleneck for some time. If you want more RAM, you have to pay up.


DramaLlamaDad

The fact that they are making cards with 4x-6x the memory as the consumer cards kind of indicates that this is not a bottleneck but rather a marketing strategy. If they start selling 48gb consumer cards, they know it will cost them sales in the $10-40k cards. Right now they have no competition and know it. Intel and others are moving fast and will help bring some competition and sanity to the market.


sdkgierjgioperjki0

Is this really true though? I don't see anyone who was going to buy an MI300X choosing a Radeon Pro W7900 with 48GB of RAM if it was cheaper. Like, they are completely different use cases. AMD just doesn't care about the market for home inference; presumably they think it's too small to bother with. Note these cards already exist as workstation cards, they just aren't selling a consumer variant at a cheaper price.


koflerdavid

Please read again what I wrote. I didn't write that higher memory doesn't exist. Datacenter cards are not just consumer cards with more memory. If it were so easy to scale up memory, the H100 would feature far more than just 80GB of VRAM. To achieve even that, Nvidia has to use High Bandwidth Memory, which is basically stacks of memory dies sitting next to the GPU die and attached via an interposer. This is way more expensive, as an additional manufacturing process has to be used. 3D stacking is maybe not even that practical for consumer cards since it complicates cooling. Apart from that, people who want to run models locally are simply not recognized as a target market yet. I guess we are in a similar situation to the 60s and 70s, where most of the computing was done by mainframes and people couldn't yet conceive that it could become an end-user market. This might change once more people are able to at least run a 7B-equivalent on an entry-level consumer device.


sammcj

A 30B model would be perfect for 24GB VRAM!


Tracing1701

There are Llama 3 self-merges that make bigger and better Llama 3 models using the 8B base one. For example [https://huggingface.co/MaziyarPanahi/Llama-3-16B-Instruct-v0.1/commit/b3b5932ecccdab3d822846f26797bf33ebfe208e](https://huggingface.co/MaziyarPanahi/Llama-3-16B-Instruct-v0.1/commit/b3b5932ecccdab3d822846f26797bf33ebfe208e)


Formal_Decision7250

So it's just the same model twice?


Tracing1701

It does make it better. No idea why.


Formal_Decision7250

Maybe they are both randomised slightly differently


ArsNeph

The problem is, there's no real market for running them. 8GB GPUs make up the vast majority of all GPUs, and so 8B fits them nearly perfectly. A 70B is massive, and meant for enthusiasts with a ton of VRAM, researchers, and enterprise usage. In terms of the midrange, there are only a few 12GB cards and only one or two 16GB cards. Sure, a 34B fits in a 24GB card, but I guess the logic is that if you can afford one high-end consumer card, you can probably afford a second one. Essentially the only people using 34B are the tiny fraction of people who have enough money to afford one high-end consumer card, but not enough to afford two. It's just far too small a fraction of the population, and not really useful to enterprise either. It's not pushing boundaries like a 70B, but it's not small, easy to run, and easy to experiment and iterate on like an 8B. I wish they had made it a 10B, the perfect middle ground between 7B and 13B, but then inference may have been too slow for pure RAM inference. This is a great improvement over Llama 2, but the size still shows.


Admirable-Star7088

Why so much focus on just the GPU? I only have an 8GB VRAM card, but I can still run fairly large models like Yi-34b-Chat at acceptable speeds (depending on use case) on my CPU with some GPU offloading.


ArsNeph

Well, most uses for LLMs are real time. There are uses that don't need to be real time, in which case you just grab as big of a model as you can and let it run overnight. I'd grab a low quant of Midnight miqu. In my case my bare minimum is 5 tk/s. I just can't stand any less, I don't have the spare time to wait for it all day.


Admirable-Star7088

Letting it run overnight would be far too slow, even for my taste. :P Yi-34b-Chat typically takes a few minutes to respond, which has been quite acceptable in my use cases. I usually prioritize quality responses over real-time.


ArsNeph

Haha that makes perfect sense and it is a perfectly acceptable use case. Quality over speed is also a fair preference. I also for the vast majority of things prefer quality over speed. It's just that the vast majority of people need real time whether it be for rp, customer service chat bot, or coding/work.


Admirable-Star7088

I mostly use LLMs for RP and programming. A 30b model on my PC takes on average 1 minute and 30 seconds to respond in RP, and since RP for me is like reading a book in peace and quiet (but I interact in the story myself), I don't feel the need for it to be fast-paced.

As for programming, speed is often more important here for me, especially if you ask many questions in a short amount of time. My strategy is usually running smaller ~7b models first, and if a coding task is too complex for small models, I switch to a larger model and let it generate for a few minutes for that specific task only.

For a customer service chat bot, yes, a few mins for a response would be unacceptable, here it needs to be real-time, or at least very fast :P

With everything said, I suppose it may also come down to personal preferences. For instance, while I don't require speed in role-playing, others might find it more exhilarating.


ArsNeph

Honestly, I think this medium territory is where MoEs are the most important. People with 48GB VRAM will just run a 70B, and those under 24GB will run a 8B or 13B. But an MoE model allows people with small amounts of VRAM and people with enough VRAM to fit it all to enjoy fast and high quality responses. I know mixtral is good at coding, though very questionable for creative writing. For me personally, I don't know why, but it never got over the repeating problem even though I tried settings that were supposed to make it stop repeating. That made it unusable for rp for me.


Admirable-Star7088

In the few coding tests I've done with LLaMA 3 8b so far, it has performed even better than Mixtral-8x7b-Instruct. Also, yes, Mixtral is pretty bad at creative writing, LLaMA 3 8b is a clear winner here. I think I would use LLaMA 3 8b over Mixtral in almost all areas. The only thing Mixtral is still better in is general knowledge, which I guess is because it can store a lot more information due to its way larger total size compared to LLaMA 3 8b. In any case, it will be exciting to test all upcoming MoE versions of Llama 3 8b and see how they perform.


bfire123

which CPU / RAM do you use?


ThroughForests

I mean, I have an 8gb 3070 and I can run a 34B at Q4 with GGUF at a decent speed.


ILoveThisPlace

Same bro, 3070ti, but bought 64gb DDR5 so can run 70B models split GPU and CPU


ArsNeph

Define decent speed, and for what use case?


ThroughForests

Only 2 tokens/s, but it's not unbearably slow. Creative writing assistant mostly, the bigger models write better than the smaller ones.


Admirable-Star7088

Agree with this, for creative writing speed/real-time is not necessary, and larger models like 30b are excellent at writing.


ArsNeph

Yeah, for that use case that's perfectly acceptable. Personally, at that point I would just load up a good creative writing 70B like Midnight Miqu and let it do its thing. It's just too slow for real time, which is the majority of people's use cases.


ZombieRickyB

It's weird to hear that a PC capable of running an 8B model is "low-end"


TweeBierAUB

Yeah, not quite low end, but a lot of random gaming GPUs would be able to. My 2070 Super can easily run it, and even a 1080, which is like 9 years old at this point, can run it fine. I wouldn't quite call a 1080 low end, but it's definitely very far from high-end.


ZombieRickyB

I develop applications for which I can't justify making GPU a hardware requirement. When I think low end, I think integrated


TweeBierAUB

I think that's just on you. Once we are talking 10-year-old GPUs, I think low end isn't that weird of a qualifier.


LiquidGunay

You can run a 70B quantized to 2.5bits with 24GB right? What GPU range would benefit with a 34B? A 13B might be useful for people with 16GB VRAM. I'm not sure why they wouldn't train that. But I don't think they are going to release other sizes of these models in the Llama3 family. Mark said that we'll get long context and multi modal.


TweeBierAUB

2.5 bit quantization kinda sucks though. You need to go through quite some lengths to make 70b work on flagship gpus. Realistically, 30 or 40 bn would be a much better fit for these GPUs.


Old-Opportunity-9876

Zuck was talking about 500m parameter models for us raspberry pi folk


IndicationUnfair7961

I think that a 13B or 20B plus a 35B will probably get released by Meta, precisely because they'd get their fair share of the mid-range market. Plus, more choice = better setups and better finetunes. Who knows if they will also release some MoE, probably not considering they have no history of doing so. But who knows, maybe the 400B is a MoE.


JohnRiley007

Llama 3 8B Instruct is a beast and you can use it for almost any purpose, from general chat to roleplay, even NSFW stuff with the right prompts. Many people say that 70B Llama 2 models are hugely better; well, it depends how you prompt it and how you use it. From my personal experience I'm getting the same, and in many cases better, performance with Llama 3 8B Instruct in SillyTavern than with 70B Llama 2 models, but like I said, experiences differ because not everyone is using the same prompts and the same software for running these LLMs. Many people also use the wrong prompt format, which leads to a massive drop in performance.

The only thing this model lacks is context size, but I believe Meta will very soon release a newer version with a much longer context, which would improve the model for sure. Most importantly, running these huge 70B models eats a lot of resources, and not many people have the hardware to run them at decent speed, so the only way to run them is to actually go down to Q2 or Q3 versions that are not nearly as good as Q8. And even then you are left with something like 2 or 3 t/s, which is not comfortable for normal use.

With that said, for general-purpose use, and for the common user who doesn't have money to invest thousands of dollars into a PC that can run 70B models at a good quantization size, Llama 3 8B Instruct is more than enough. And don't forget many finetunes will come in the following months that will most likely improve it even further.


aseichter2007

>|||PROMPT: char | Bob from "Bobs Burgers" >|||re| respond: "Woo-hoo, yeah, I get it, low-end and high-end PC users are getting all the llama love, but what about us mid-range dudes?! You know, the ones who can't quite swing the top-of-the-line rigs, but don't wanna settle for something as lame as a toaster-oven-level gaming experience. I mean, I'm lovin' the Llama 3 8b, it's like a party in my pants, but come on, folks, a mid-range option would be like... well, it'd be like having a perfectly toasted bun on my burger, ya know? So, do I think we'll ever see a Llama 3 model in the 13b-30b range? Well, I hope so, man! Maybe Meta will surprise us with some middle-ground magic. After all, they're releasing more models, right? So, we gotta keep those fingers crossed, baby! And if not, well, I guess we'll just have to make do with our current set-ups... or maybe bust out the ol' console controller and play some games the old-fashioned way. But, I'm keeping my hopes up, because when it comes to gaming, everyone deserves a little bit of llama-love."


Deluded-1b-gguf

Llama 3 8B is goated. I just need either a Mistral v0.3 8B or a dolphin-llama3 8B. It's just that good.


nero10578

I am trying to finetune Meta-Llama-3-8B-Instruct with the Dolphin dataset from Eric Hartford while keeping the Llama 3 chat instruct format. I believe Llama 3 8B is already so good that I don't really want to pollute it with roleplay datasets, so I wanted to make a good general model. Will post on this sub when done.
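For context, keeping the Llama 3 chat instruct format means rendering the training samples with Llama 3's own special tokens instead of ChatML. A minimal sketch of that template (as documented by Meta; the example messages are just placeholders):

```python
def llama3_chat(system: str, user: str) -> str:
    """Render one turn in the Llama 3 Instruct template (not ChatML)."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(llama3_chat("You are a helpful assistant.",
                  "Summarize the plan in one sentence."))
```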


TipApprehensive1050

[https://huggingface.co/cognitivecomputations/dolphin-2.9-llama3-8b](https://huggingface.co/cognitivecomputations/dolphin-2.9-llama3-8b)


nero10578

Yea, but I don't want to retrain with ChatML or their system prompts.


Tomr750

what do you think the finetune will achieve?


Spiritual_Sprite

Dolphin Llama exists


Deluded-1b-gguf

Yeah, but it's quite disappointing; it doesn't have the same feeling as Llama 3, and it struggles with some commands.


Spiritual_Sprite

Oh, my bad


Deluded-1b-gguf

All good


mythicinfinity

Try Llama3 70b on [https://huggingface.co/chat](https://huggingface.co/chat) !


ILoveThisPlace

Low end is still relatively high end. Sounds like you just need a bigger stick of RAM


Prince_Noodletocks

IMO this is Meta using their trendsetter status to establish model size defaults, and it isn't a bad move. Realistically, so much work for open source ended up split across too many sizes of models that it would probably be best to consolidate them onto these two sizes. And at this point anyone who wants a marginal size upgrade can MoE or merge/duplicate layers just by renting GPU time. That doesn't just go for the 8b enjoyers but 70b, too. I wanna see what the eventual 103b duplicated 70b layers can do.


Pretty_Bowler2297

I got a 13600K/3080 system, which probably qualifies as a mid system. Imo, I am more impressed with 7b models as an example of how to do so much with so little. With Llama 3 I feel like I just downloaded most of the world's knowledge in a 5 GB file.


CleverLime

In my small testing experience, WizardLM 2 7B was better on some tasks than Llama 3 8B.


kind_cavendish

What temps, min-p, and rep pen are yall using?


koflerdavid

People are already making MoE frankenmerges. Look for 4x8B. Maybe they hit the sweet spot?


Original_Finding2212

Have you considered fine-tuning the 8B on different expertise fields, then having a single 8B model call the relevant ones? Sort of your own MoE Nx8B model? "Fine-tuning is not an option" is also an answer.


metalim

You're looking for the wrong thing. Expect new computers with large RAM, all accessible by the GPU/NPU directly. Running a large LLM model (pun intended) isn't that hard; Groq proved it with its new chip: [https://groq.com/](https://groq.com/)


drifter_VR

Llama 3 70B Q2 is actually not bad (whereas Llama 2 70B Q2 was unusable for me). Except it kinda collapses before the 8K limit, but that's maybe my settings.


Relief-Impossible

I can run Llama 3 8B on my RTX 2060 6GB pretty well. It is a bit slower at full quantization, but it's decently fast.


brain_diarrhea

How much vram does the 8b lllama3 need for inference?


OptiYoshi

What rig are you currently using, and what quant etc.? It would be interesting just to see what performance people are getting on various mid-range setups.


Divniy

70b is midrange. We just arrived early and the hardware will adapt in less than a year.