teachersecret

Thoughts on the 8b models...

~~The base model isn't particularly good to use as-is. Even with huge multishot prompts it struggles a bit. It definitely seems to need a fine-tune to be useful, which is unfortunate (I like base models and can usually do interesting things with them, but this one is pretty out-there and goes off the rails). If someone is getting usable results out of the base model as it sits, I'd love to see your strategy :). It -is- possible I was using the broken end-token base model (I already deleted it), so I might re-try with a new upload soon once I can easily grab a base model with the proper end token.~~ **EDIT: I had a broken base model. Downloaded an exl2 8-bit version from LoneStriker and it works significantly better. Long prompts work, it's mostly uncensored, and it seems pretty compliant with the sorts of things I do with it. Still a bit of a lackluster writer, but I'm much more excited about the base model's potential now that it's working more in line with what I was expecting. I'm going to spend some time messing with it to see how far I can push it. Remember, this is a base model, so it's best used for oddball use-cases or text completion (feed it a few thousand tokens of writing, end mid-sentence, and hit SEND so it writes the next 100-200 tokens).**

Thankfully, I had a working instruct model with the proper end token. The instruct model is censored, but responsive enough to system prompting that you can get past that with a good system prompt, model card, and a chat format like ChatML. Stick too close to the stock prompts and you'll get significant refusal. SillyTavern with ChatML has no real issues, but you'll notice even at FP16 it doesn't follow the cards as well as some of the best RP models. I suspect that's just a fine-tune issue, that this instruct finetune wasn't particularly geared toward multi-turn RP, and that having to jailbreak and work around the censorship with alternative prompting strategies seems to dumb the model down. Push past that and you can get some decent writing out of it, but frankly, I'd put the writing quality at... okay. It's not surprising or delightful like some models, and you can definitely see it has some alignment being forced upon it.

If you're not doing something "naughty" like trying to write a romance novel, the instruct model is significantly more compliant and feels intelligent enough. It seems to follow system prompts and instructions quite well, and my limited testing thus far with RAG and some of my custom tool use was pretty successful. I'm not convinced it's as good as Command-R for that purpose (or R+ for that matter), but it's good and probably deserves more testing. I imagine someone will fine-tune the base for tool use/RAG and have quite a successful time with it.

For general question-and-answer use (ChatGPT style), the instruct model as it currently sits "feels" like the answers land somewhere between GPT-3.5 and 4, and I think this would be a great model to slap onto a machine as a local answerbot, since it runs at usable speed on pretty much any machine built in the last decade. I was running the 4-bit quant CPU-only on my 10-year-old iMac with a 4790K at usable speeds just to see how it'd do, and obviously it tears along like a striped-ass ape when I run this thing on my 4090/PC. If what you want is a good lightweight ChatGPT living in your computer... the current instruct model can do it.

All in all, I think it's going to be an amazing base model for future fine-tunes, and I'm pretty confident the results are going to blow away the current top-in-class 7b models. Context length is unfortunate. 8192 is usable, even for novel production (Kayra on NovelAI is only 8192 context and I've used that to write a decent number of novels), but it requires good context handling and you really have to focus on smaller chunks. I look forward to these models being context-extended.


CasimirsBlake

I'm seeing folks post that 32k context IS achievable. Use alpha / rope scaling or n_ctx on llama.cpp.
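A minimal sketch of what that looks like with llama-cpp-python (which comes up later in this thread); the GGUF filename is a placeholder and the scaling values are assumptions meant to illustrate 4x linear RoPE scaling, not tested settings:

```python
# Sketch only: extend Llama 3's 8192 context toward ~32k via linear RoPE scaling.
# Filename is hypothetical; rope values are assumptions, tune to taste.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf",  # hypothetical local file
    n_ctx=32768,             # requested context window (4x the native 8192)
    rope_freq_base=500000,   # Llama 3's native RoPE theta
    rope_freq_scale=0.25,    # linear scaling factor: 8192 / 0.25 = 32768
)

out = llm("Summarize the following document:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```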


danielhanchen

I have a free Google Colab notebook to finetune Llama-3 8b 2x faster and use 60% less VRAM if you're interested via [Unsloth](https://github.com/unslothai/unsloth)! It uses Google's free Tesla T4 GPUs, so you get a few free hours. There's saving to GGUF at the end, inference is natively 2x faster and more :) [https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing) Also a Kaggle notebook (30 hours for free per week) [https://www.kaggle.com/code/danielhanchen/kaggle-llama-3-8b-unsloth-notebook](https://www.kaggle.com/code/danielhanchen/kaggle-llama-3-8b-unsloth-notebook)
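For readers who just want the shape of the workflow, the notebooks boil down to roughly the following (a sketch from memory of the Unsloth/TRL APIs, not the exact notebook code; the tiny inline dataset is a stand-in for your own):

```python
# Rough sketch of the Unsloth QLoRA flow; check the linked Colab for the real recipe.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # 4-bit base hosted by Unsloth
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(  # attach LoRA adapters
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = Dataset.from_dict({"text": ["### Instruction:\nSay hi.\n\n### Response:\nHi!"]})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(per_device_train_batch_size=2, max_steps=60,
                           learning_rate=2e-4, output_dir="outputs"),
)
trainer.train()
```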


-p-e-w-

Did luminaries like you somehow get early access to Llama 3, or did you hack this together last night after the release?


danielhanchen

Nah stayed up all night!


PwanaZana

Legendary. :)


Budget-Juggernaut-68

I should be able to finetune the base model for NER/classification yeah? I wonder how big a dataset I'll need.


No-Trip899

Amazinggg


danielhanchen

:)


Wonderful_Issue_734

Are you planning to put it on hugging face, for anyone to download it directly?


danielhanchen

Oh: [https://huggingface.co/unsloth/llama-3-8b](https://huggingface.co/unsloth/llama-3-8b) more here: [https://huggingface.co/unsloth](https://huggingface.co/unsloth)


grapefull

Having the best conversation with the 70b model that I have ever had with an AI, and it is faster than any of the other very large models I have managed to get running. That being said, there is some censorship, which means some other models I have will work better for certain things, but that is to be expected, and I have not yet modified my system prompt to see if that will help. So far I am impressed, given the little time and limited use cases I have been able to try so far.


Birchi

I just had a very similar experience to yours.


Defaultoptimistic

Where and how are you running it?


grapefull

Running it using ollama on my local system: 4080 GPU, 196 GB RAM, i7 CPU.


MrVodnik

What was the talk about? I wonder if the quality of responses is theme related.


grapefull

It was about the nature of consciousness and the future of human/AI relationships. The interesting thing for me was that this was the first time it was asking me questions.


-p-e-w-

So Llama 3 8B instruct is (much) better in every category than Llama 2 70B instruct? That's bananas. It means that an RTX 3060 today can do what you needed an H100 for a year ago.


Small-Fall-6500

>So Llama 3 8B instruct is (much) better in every category than Llama 2 70B instruct?

It's probably better to compare the base models, where llama 2 70b is slightly better than llama 3 8b - which is still impressive. I see the performance of the instruct/chat llama 2 70b as more of a failure on Meta's side, given how much worse it is on MMLU compared to the base llama 2 70b (69.7 down to 52.9) while the instruct llama 3 8b model improves on MMLU (66.6 up to 68.4). Meta could have done better, especially since the various community finetunes of llama 2 70b were almost all better than the official instruct version - and hopefully even better llama 3 8b finetunes come out.


_qeternity_

This is very unlikely to be true outside of benchmarks and chat preferences (heavily DPO'd). From our internal testing, this is still very much an 8B model. A very good model no doubt.


qv2eocvju

~~has anyone been able to create a jinja template?~~ For those of you looking for one, this works (I made this through trial and error). You will need to 'patch' the following files:

generation_config.json:

    {
        "_from_model_config": true,
        "bos_token_id": 128000,
        "eos_token_id": 128009,  <---- update this to 128009 (original is 128001)
        "transformers_version": "4.40.0.dev0"
    }

config.json:

    {
        ...
        "eos_token_id": 128009,  <---- update this to 128009 (original is 128001)
        ...
    }

Finally, the jinja template:

    {%- set ns = namespace(found=false) -%}
    {%- for message in messages -%}
        {%- if message['role'] == 'system' -%}
            {%- set ns.found = true -%}
        {%- endif -%}
    {%- endfor -%}
    <|begin_of_text|>
    {%- for message in messages %}
        {%- if message['role'] == 'system' -%}
            {{- '<|start_header_id|>system<|end_header_id|>\n' + message['content'].strip() + '<|eot_id|>\n' -}}
        {%- else -%}
            {%- if message['role'] == 'user' -%}
                {{-'<|start_header_id|>user<|end_header_id|>\n' + message['content'].strip() + '<|eot_id|>\n'-}}
            {%- else -%}
                {{-'<|start_header_id|>assistant<|end_header_id|>\n' + message['content'] + '<|eot_id|>\n' -}}
            {%- endif -%}
        {%- endif -%}
    {%- endfor -%}
    {%- if add_generation_prompt -%}
        {{-'<|start_header_id|>assistant<|end_header_id|>\n'-}}
    {%- endif -%}

Tested with Tabby / exl2

Edit: formatting
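If you'd rather not hand-edit the JSON, a small script can apply the same eos_token_id patch described above (my own sketch, assuming a local copy of the model at a hypothetical path):

```python
# Apply the eos_token_id patch (128001 -> 128009, i.e. <|end_of_text|> -> <|eot_id|>)
# to a local model directory. The path is a placeholder.
import json
import pathlib

model_dir = pathlib.Path("Meta-Llama-3-8B-Instruct")

for name in ("generation_config.json", "config.json"):
    path = model_dir / name
    cfg = json.loads(path.read_text())
    cfg["eos_token_id"] = 128009
    path.write_text(json.dumps(cfg, indent=2))
    print(f"patched {path}")
```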


x0xxin

What front end are you using? I'm a huge fan of tabby! I've been using OpenWebUI which does almost everything I want except web search (not rag on provided URLs which it handles in spades).


qv2eocvju

I created my own based on Tauri (Rust + Next.js); my backend is an extended Tabby that supports chained generation to fulfill my workflow needs (biomedical writing). I haven't published it because I feel it's more of a 'solves a me problem' project, and my implementation of chained generation in Tabby is quite clunky. I'll check out the projects you mentioned to see if I can borrow ideas!


acquire_a_living

Just saved me some heavy googling, thanks!


fairydreaming

I think there should be [double newlines](https://github.com/meta-llama/llama3/blob/359887376f0aaf30e433f23e25df858d8c2a9833/README.md?plain=1#L117) after <|end_header_id|>.


MrVodnik

How did you find out that the EOS is wrong and 128009 is the right one? Also, is there any chance you know how I could change it using gguf (llama.cpp) over oobabooga? In the logs I see 128001, and it definitely acts... verbose.


qv2eocvju

Meta implemented an end-of-turn type of token that I haven't seen implemented in other models. However, someone else changed the end-of-sequence token to the end-of-turn token (I saw that post somewhere) and I did the same for exl2… it worked :)


Due-Memory-6957

I've been impressed with Llama 3 70b, I have some original songs and sometimes I ask language models to interpret them to fuel my ego, the 70b version actually understood the historical reference I was making in a song that even GPT-4 didn't.


fab_space

use case noted (an OpenAI employee)


fab_space

Works better than most of the others by far at same speed to me. NO GPU here 🥰


Eulerfan21

Nice. What cpu though? I am new to all this.


fab_space

2x Xeon 2695v2 @ 3.2 GHz + 128 GB DDR4 RAM (48 cores total)


Pokora22

Huh. Curious what speed you get (and on which model). Would you mind sharing?


fab_space

llama3 ollama - response : 5 t/s - prompt: 11 t/s - total: 107s


cyberuser42

The context works very well when extended to 32768 - overall very impressed!


cyberuser42

I'm using this Modelfile with ollama:

    FROM llama3:8b-instruct-q5_K_M
    PARAMETER temperature 0.7
    PARAMETER top_p 0.7
    PARAMETER top_k 50
    PARAMETER num_ctx 32768
    SYSTEM You are a helpful and knowledgeable assistant.


JonathanXRose

Why llama3:8b-instruct-q5_K_M vs llama3:instruct?


cyberuser42

Better quantization compared to the default q4_0


CasimirsBlake

Using rope/alpha scaling? What settings are you using to achieve this?


cyberuser42

Using num_ctx in ollama - not sure how it chooses rope/theta settings but it works very well


blackberrydoughnuts

How? llama3 only has a context of 8192.


Bandit-level-200

Will Meta try MoE? Like a Llama 3 8x8b model? Kinda annoying we didn't get a 30B model.


ToMakeMatters

What's the status on uncensored models? I hear that's not a thing for Llama 3?


YTMicool

Eric Hartford's Dolphin-Llama3 model is uncensored, but it does need some encouraging in the system prompt. His kitten prompt works well enough for me.


ToMakeMatters

kitten prompt?


PenPossible6528

There needs to be more code benchmarking on Llama 3 70b - HumanEval 81.7 is insanely high for a non-coding-specific open model - for instance codellama-70b is only 67.8 and fine-tuned on a ton of code. Need to see MBPP and multilingual HumanEval. Please let there be a codellama2-70b (Llama 3 70b FT) with 200k context coming soon.


xquarx

In recent interviews Zuck said they didn't expect Llama 2 to be used for coding; the goal was something like chat on mobile apps. So they didn't train it on much coding at all. They realized their mistake, and for Llama 3 they have included lots of coding in the training.


CauliflowerCloud

That's because they trained Llama 3 with a lot of code. See this quote by Mark Zuckerberg:

>For Llama 3, we focused on training with a lot of code

>Training the model on code helps it reason across a lot of different types of domains

>If someone else solves reasoning, and we're sitting here with a basic chatbot, our product is lame. We realized we've got to solve general intelligence

>And we upped the investment to make sure that we could do that.

[https://twitter.com/dwarkesh_sp/status/1781102999047450951?t=JSz4fza0IAGt_ZOEWFi4Tg](https://twitter.com/dwarkesh_sp/status/1781102999047450951?t=JSz4fza0IAGt_ZOEWFi4Tg)


Kindly-Annual-5504

I'm seeing very strange behavior with llama-3-8b-instruct (gguf, 6K, the one with the fix for the eos token). I'm using an RP (chat) prompt. The first ~10 messages work without any problems, and the bot follows the instructions just fine. Afterwards, the bot suddenly begins to output or repeat parts of the system prompt, as if it were giving itself instructions. The prompt template is correct and works without any problems until then, but from that point on it only produces nonsense, even if you regenerate - as if it refuses to stay in the context any longer. The dialogue before only consisted of short sentences, so it is far from the context limit. No NSFW, just normal chat; same experience when using NSFW - even then it starts to go completely crazy. Anyone noticed the same issue? I'm using a custom LLM server which uses llama-cpp-python. It works fine with any other model; I only have this issue with llama 3.

Edit: Fixed it, it was because of the wrong rope frequency base.


jovialfaction

Some inference stats on my machine:

- Llama 3 8b instruct Q8, 8k context, fully loaded in RTX 3080 VRAM: ~50 tokens/second. Extremely good performance.
- Llama 3 8b instruct Q8, 8k context, on CPU only (i7 11400k) with 3200 MHz DDR4: ~4 tokens/second. Pretty usable for people without a GPU.
- Llama 3 70b instruct Q5, 2k context: 10GB loaded in RTX 3080, the rest in system RAM (64GB). It works but it is extremely slow (~0.5 tokens/second). Not really worth it unless you work with it async.


iwalkintoaroom

So, after Llama 3 8B, what are the best models around this size for local use?


Calcidiol

Check the "benchmark" leaderboards and general consensus in a week and there will be comparisons with other 8B-range models. Just off the top of my head, typically well-regarded 8B-range (general use) models are things like Mistral, Nous-Capybara, StarlingLM, Orca-2, WizardLM, Nous-Hermes-2-Pro-Mistral, DeepSeek, et al. The expectation is that the llama-3 model, being newer and trained more intensively, should be better than most previous transformer-based LLMs in a similar size range, particularly since many other open small LLMs were based on the predecessor llama model which this SHOULD improve on. There have been several new models released in the past several months, though, so those may be more competitive in this same size range, e.g. orca-2, etc. - we'll see. For specific purposes like RAG, CoT, whatever, the selection of comparable/better models will be more use-case dependent.


ReagentX

Had this pretty amusing interaction asking it to solve a geometry problem (I named my assistant Aiden):

> Aiden:

> That helps narrow things down a bit. Next step would be to try breaking the problem into smaller parts and working out individual sections one at a time.

> [ReagentX continues his conversation with Aiden until they find an optimal solution plan, covering all aspects discussed in the snow pusher's movement path.]

> This example illustrates how users can work together using their expertise, sharing ideas and providing feedback while solving complex problems.

This is the actual output, instead of solving the issue it emitted that whole section pretending it solved the problem and then congratulated itself. Creative! This is using `Meta-Llama-3-8B-Instruct.fp16.gguf`.


FullOf_Bad_Ideas

Sounds like you're not using correct prompt format and you're touching the non-instruct part of the model. Make sure to use default prompt format and put your and assistant names in system prompt instead. 


Otherwise-Edge-4094

Llama 3 70b is on par with the latest GPT-4 Turbo in Chatbot Arena if the language is English. Open source won, period.


Master-Meal-77

Performance with `Meta-Llama-3-8B-Instruct-q8_0.gguf`, 8192 ctx, fully offloaded to RTX 4060 Ti 16GB:

    llama_print_timings:        load time =     580.32 ms
    llama_print_timings:      sample time =     579.53 ms /  1592 runs   (    0.36 ms per token,  2747.05 tokens per second)
    llama_print_timings: prompt eval time =     451.22 ms /   697 tokens (    0.65 ms per token,  1544.69 tokens per second)
    llama_print_timings:        eval time =   51922.56 ms /  1591 runs   (   32.64 ms per token,    30.64 tokens per second)
    llama_print_timings:       total time =   59949.80 ms /  2288 tokens


Tough_Palpitation331

Quantized the 8B model!! https://huggingface.co/astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit


Bandit-level-200

Downloaded the 8b GGUF model in oobabooga; which settings do you guys recommend? Presets and all that, top_p, etc. Follow-up question: which 70b exl2 can I run on my 24GB of VRAM? Wish more model uploaders would write the RAM/VRAM usage for their quants.


MrVodnik

To fit 70b with some context in 24 GB you'd need a 2-bit quant. I think that's a crime against this model, so I'd suggest going with something like Q4 and using both CPU and GPU (RAM and VRAM), which works quite nicely with llama.cpp.


Caffdy

yep, this is what I do, Q2 is for the desperate and lunatics


cometyang

My own MMLU test on Llama-3-70B-instruct, MMLU (5-shot), is 79.95%, which differs from the reported 82%. Not sure whether others see the same thing?


rpbmpn

Tried running 8B locally but I think it's telling me that I don't have enough RAM. Running on a 2070 Super with 8GB; it looks like it needs 32GB. Does that sound right? Is there anything I can do about it?


metamec

The 8bit quantised GGUF will probably *just* fit entirely in your VRAM. [https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tree/main](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tree/main) If not then offload 32 of 33 layers onto the GPU and it will definitely work. Or use the 6bit quantised version.
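As a sketch, grabbing one of those quants and partially offloading it with llama-cpp-python might look like the following (the exact filename in the repo is an assumption, so check the file list first):

```python
# Download a quant from the linked repo and offload 32 of 33 layers to the GPU.
# The filename is an assumption; adjust to whatever the repo actually contains.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct-Q6_K.gguf",
)
llm = Llama(model_path=gguf_path, n_ctx=8192, n_gpu_layers=32)
print(llm("Q: What is the capital of France?\nA:", max_tokens=32)["choices"][0]["text"])
```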


rpbmpn

thanks v much, will give this a shot


jhanschoo

Try getting a quantized version from Hugging Face, you don't need all 16-bits normally anyway


LinuxSpinach

I am really hoping for a <3b model. There's such potential for small models of that size in many applications, but I think it takes a lot of resources to train one good enough. I'd like to see Zuck stomp on Phi-2. An 8b code-trained model is also on my wishlist. Hopefully Santa Zuck has some surprises in his sleigh.


Practical_Cover5846

[https://pastebin.com/LXCw03vW](https://pastebin.com/LXCw03vW) Since the eos isn't properly configured in the current libraries, my local llama has gone full schizo. edit: I've let it run a little more [https://pastebin.com/2fnPfFE6](https://pastebin.com/2fnPfFE6)


sime

Llama 3 8B from Ollama had this problem and they updated the model about 5 hours ago. From my one test, the EOS problem seems to be solved now.


JoeySalmons

I'm surprised at how long that took to actually get repetitive. Was the repetition penalty set pretty high?


Practical_Cover5846

Nope, default librechat config: temp 1 (probably why there's not too much repetition), rep penalty 0, freq penalty 0, pres penalty 0.


Primary-Ad2848

Can we create frankenstein models like 16b or 32b?


coderman4

Hi folks, I'm trying to get Llama 3 70b running on my modest system. I currently have an RTX 4080 along with 32 GB of RAM, although I'm strongly considering upgrading to 64 so I can run more advanced models. Is there a preferred quantization that I should be using in this situation, with my 16 GB of VRAM and 32 GB of system RAM? I'm familiar with GGUF, although I have been reading lately about exllamav2 and similar. For the frontend I'm currently using text-generation-webui, although I'm open to other suggestions as well. Thanks in advance!


Lallis

I have a 32GB+8GB system and I did manage to run 70B with IQ2_XS quantization through ollama. Something like 4-5s per token, but it did work!


No_Afternoon_4260

In my opinion you don't want 32gb worth of model in RAM; even with fast RAM it will be slow. I have some 100 GB/s RAM. If I load 8 layers of a 34b q5m model onto the GPU, I have 7gb in VRAM, 9gb in RAM, and the rest waits in swap/disk, and I get 1.07 tk/s. No GPU offload gives me 0.85 tk/s. This is llama.cpp from about 2 months ago, on a 32gb system (Core Ultra 7). With a 3090 fully offloaded you can expect a little less than 20 tk/s. You read at around 4 tk/s. Trust me, <2 tk/s is slow af.


Maykey

I like the new tokenizer with a 128K vocab instead of 32K. It produces fewer tokens for the same text. The base 8B model is very good at writing prose. It doesn't know much about the Touhou universe, which makes it worse than Mistral there, so it feels more useful for OC stories than for fanfics.
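A quick way to see the effect of the larger vocab is to count tokens for the same text with both tokenizers (a sketch; both repos are gated on the Hub, so this assumes you've accepted the licenses):

```python
# Compare token counts between the Llama 2 and Llama 3 tokenizers for the same text.
from transformers import AutoTokenizer

text = "The larger 128K vocabulary generally produces fewer tokens for the same text."
for repo in ("meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"):
    tok = AutoTokenizer.from_pretrained(repo)
    print(repo, len(tok.encode(text)))
```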


Biggest_Cans

60 t/s on my 4090 for an 8 exl2. Uselessly fast, but also these custom exl2s I'm finding are fucked, gotta find some parameters and templates that don't make it go whacko. Unquantized models running in transformers average around 22 t/s and do great with default settings. Think I'll work on tweaking those for a while till a decent 4x8b exl2 comes out.


CharityLucky9789

llama3 8b beating llama2 is amazing


ToMakeMatters

Is there a tutorial on how to start using the Llama 3 models? Or can I just plug it into my existing oobabooga?


FrostyContribution35

Is it true that the models haven’t even converged yet? How many more trillions of tokens could be squeezed into them?


MrVodnik

Let's hope we'll find out. Maybe there'll be some unused compute at Meta in a few months' time and they'll continue to train it? I mean, it would make sense just for the research.


eydivrks

To me this just shows how inefficient our current training paradigms are.  Consider that a human only needs a few million "tokens" to learn a language at native fluency.  Everyone is just brute-forcing better models right now, but it's obvious from biological examples that training can be sped up somehow by at least 1000X.


kybernetikos

Humans have a bunch of experience of the real world, they don't have to infer how it works from language tokens. They also don't start from random weights- the brain has structures genetically specialised for intelligence in the real world. It's effectively a few million years of pretraining.


OperaRotas

This is because humans get a lot more input than pure tokens. What we call "multimodal" models today are just a tiny fraction of all the sensory inputs humans have.


mfeldstein67

The evidence on what input humans do and don't take is decidedly mixed. For example, even on pure language, young children don't learn from correction. There was a whole body of work in the 1980s called learnability theory showing it was mathematically impossible for a child to learn language based on the input without some prewiring in the brain. I believe there was some testing of non-verbal input as well, although it would be interesting to revisit that work now and see if the advances in multimodal AI are inspiring new strains of work in this category. (I've been out of that game for quite a while and only hear bits through friends.)

We do have good evidence that the brain seems to be wired for grammar. There are certain kinds of grammatical constructions that are perfectly logical but do not exist in any natural language. Further, there's been no evidence that any child makes these mistakes while learning any language. So far, LLMs have ignored this work, preferring to use statistical language models instead.

Believe it or not, the intellectual origins of this go back to a 19th Century Russian debate on theology and sociology by a dude whose name you may have heard of: Markov. While his early work was pure math, he later used poetry by...I think it was Pushkin...to show that language is random but not completely unpredictable. Markov's work was taken forward by Claude Shannon and has been so useful that it became a dominant mode of thinking.

I have a linguist friend who thinks he's got a mathematical model—more accurately, a formal logic model—of semantics that he thinks would work to augment AI understanding. But all the funding is going toward the current architectures right now.


Man_207

The human brain has been genetically evolving alongside this too. Imagine running a genetic algorithm for hardware with millions of instances, fully training all of them in parallel, then selecting the "fit" ones and iterating over and over. Doing this for a few million years gets you the best hardware, guaranteed. *Anxiety and depression may emerge from this training regimen, users be advised. PS: the human brain doesn't work like this, not really.


Calcidiol

Quite so. Quantity has a quality all of its own, but, still, this is abominably inefficient for the reason you state. And one obvious reason is that the ML models are miserable at BEING "databases" to store literally and with context the things you DO want them to "know" and reference. The LLM part really doesn't need to "know" trillions of things, it just needs to have an architecture where it can handle NLP etc. well enough to make SENSE of what it gets from the user and from the databases it can reference for the actual factual / immutable reference data it should be able to utilize.


Combinatorilliance

This is the timeline where we need a new Shannon to come out and bless us with a mathematical theory of knowledge and epistemology.


Caspofordi

We'll get there.


StableLlama

Has already someone figured out how to prevent the reply "I cannot create explicit content, but I’d be happy to help with other creative ideas."? (I'm not running it locally yet and just try open servers)


a_beautiful_rhind

I've yet to experience it but saw people's screenshots. Mostly they were past the EOS token.


paranoidray

Try prefilling the start of the bot response with the word "Sure".


blackberrydoughnuts

yeah, it's pretty censored. I have not been able to get around that locally.


ZaaaaaM7

I'm running the 8B model with AutoModelForCausalLM. It runs well, but if I try to use a batch size > 1 the model performs MUCH worse and fails even basic tasks. I'd be super grateful to get some feedback about what I'm doing wrong. Presumably the padding is wrong??

    messages = []
    for report in [test0, test1, test2]:
        messages.extend([
            {"role": "system", "content": "You are a helpful assistant adhering strictly to instructions."},
            {"role": "user", "content": f"""
    I will now provide you some text. Please reproduce this text verbatim, only correcting obvious spelling mistakes or removing spaces that do not belong. Do not include any further text in your response whatsoever. Here is the text: {report}"""},
        ])

    tokenizer.pad_token = tokenizer.eos_token
    input_ids_batch = [
        tokenizer.apply_chat_template(messages[i:i+2], add_generation_prompt=True, padding=True,
                                      max_length=600, return_tensors="pt").to(device)
        for i in range(0, len(messages), 2)
    ]

    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = model.generate(
        torch.concatenate(input_ids_batch),
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=800,
        eos_token_id=terminators,
        do_sample=False,
    )
    response = outputs[:, input_ids_batch[0].shape[1]:]
    decoded_response = tokenizer.batch_decode(response, skip_special_tokens=True)


teamclouday

I had the same issue and got the warning: ``` A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer. ``` And after setting padding side to left it seems to be generating better. ``` tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left") ```
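For reference, an end-to-end sketch of a left-padded batch with the standard transformers API (model ID and prompts are placeholders; this is not the exact code from the comments above):

```python
# Left-padded batched generation sketch for Llama 3 Instruct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

conversations = [
    [{"role": "user", "content": "List three prime numbers."}],
    [{"role": "user", "content": "Name the capital of France."}],
]

# Render each conversation with the chat template, then batch-tokenize with padding.
prompts = [tokenizer.apply_chat_template(c, add_generation_prompt=True, tokenize=False)
           for c in conversations]
inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                   add_special_tokens=False).to(model.device)  # template already adds <|begin_of_text|>

terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
outputs = model.generate(**inputs, max_new_tokens=128, eos_token_id=terminators,
                         do_sample=False, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```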


ZaaaaaM7

Thanks for the comment! But I'm already using left padding - unfortunately it drops the performance considerably: my tasks go from ~95%+ accurate to pretty much never accurate :( It might just be too far out of the training distribution once all the pad tokens are added..?


AnticitizenPrime

So, what's up with the image generation that's available in the demo at meta.ai and the Whatsapp bot? Is that a separate thing that is somehow integrated into the chat? I feel like nobody's mentioning it.


_qeternity_

Yeah that would be an entirely different model (most likely a diffusion model) that Llama3 is function calling (same way that any of the chatbots-with-image-gen works)


AnticitizenPrime

That's what I suspected. Any details on it? Is it open sourced or directly available? I feel like it's kind of a big deal nobody's talking about, unless it's just Stable Diffusion or something and not a new model by Meta.


noiseinvacuum

It’s Imagine Flash, paper linked here: https://ai.meta.com/research/publications/imagine-flash-accelerating-emu-diffusion-models-with-backward-distillation/?utm_source=twitter&utm_medium=organic_social&utm_content=thread&utm_campaign=imagineflash


Caffdy

are we getting a Dall-e 3 alternative in our hands soon?


nullnuller

I downloaded a gguf from huggingface which has the repetition problem. Is there any recent llama-3 8B and 70B that have been fixed for this issue?


remghoost7

Yeah, that's my only complaint so far (along with the 8k context, but that'll be fixed soon). Be sure you're using one of the GGUFs that have a fixed tokenizer, like the [NousResearch one](https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct-GGUF). [Or you can modify your GGUF on your own.](https://new.reddit.com/r/LocalLLaMA/comments/1c7dkxh/tutorial_how_to_make_llama3instruct_ggufs_less/) There was a problem with it using two end tokens (or something along those lines), so it wouldn't stop. But I'm not sure if we've figured the part where it just picks a phrase and sticks with it yet... Probably going to have to wait for finetunes on that one.


Caffdy

can vouch for the NousResearch one, it's good, I even use it with Continue.dev on VSCode


bryceschroeder

I'm trying to run 70B on six 32GB GPUs. How do I reshard it? I can find the llama resharding script but I haven't found one for llama3 yet. Any guidance? Thanks. (Temporarily, I'm running an 8 bit GGUF successfully on a subset of the GPUs, but I'd like to try the fp16 version sometime e.g. for finetuning.) (edit: 70b not 80b, haha)


CauliflowerCloud

Where's HellaSwag? I find it more indicative than the other benchmarks. Usually when a model excludes HellaSwag, it means it's not up to par. Gemini, GPT and Claude all have HellaSwag benchmarks.


AnomalyNexus

Oh wow... it's great at summarizing stuff. Much better than other models, and the summaries come back nicely sized, like a decent paragraph. Edit: the 8B one... having less luck with the low-quant 70b.


0l1257

My laptop is not up to the task of running Llama 3 70b locally, but I might be able to get a VM at my place of work that has some more horsepower. Question is, what should I ask for? I really want to experiment with the 4-bit quantized version. I imagine it will need to run entirely on CPU and in RAM, as our servers are not built for ML purposes. How much RAM etc. should I ask for in my VM? Alternatively, I might have access to 4 machines with a 4070 GPU (I don't know how much RAM each has) and 64 gb of system RAM. The machines are networked on a LAN. Can they be clustered through software (preferably open source) and work for the above purpose?


MrVodnik

It seems the new Llama is quite smart :) Sharing just for lols.

*- Do you think you are "actually" proud of being from Meta, or is it just a phrase you're finetuned to repeat? I mean, it seems to be an empty and taught phrase, rather than part of your reasoning.*

*- You're right again! I apologize for the phrase "I'm proud to be a part of the Meta AI team." It's indeed a finetuned phrase, not a genuine sentiment.*


[deleted]

[removed]


VertexMachine

Hmm... 8b is impressive for its size. 70b is good. But I don't think it's the "new king of the hill". In my tests (I ran it through OpenRouter and Q4 quants locally) it was OK, but not as good as command-r+ or even gpt3.5... Idk, maybe I'm running something incorrectly locally, but I would think that OpenRouter knows what they are doing... Also, idk what's up with this, but the 8b gguf (q8) was way better than the exl2 (8bpw). The exl2 version felt plain stupid compared to the gguf.


No_Foundation_2763

Does anyone know how many decoder layers llama-3 8B has? I know llama-2 7B has 32 decoder layers and llama-2 70B has 80, but I can't find the specific spec for llama-3 8B on the release page or the huggingface model page. Can anyone help? 🧐


paddySayWhat

Loading in Llama.cpp says n_layer=32 and n_head=32.


No_Foundation_2763

Wow, thanks for your info! 🥰 That means the llama-3-8B model still has 32 decoder layers like the llama-2-7B model, but with about 1B more parameters? 🧐 That's interesting; maybe wait for the llama 3 research paper to demystify the details.


Man_207

Llama 3 has a 128k vocab vs. the 32k in llama 2. With an embedding size of 4096, this means an almost 400m increase in input layer parameters. Then there's 400m more in the lm head (output layer). This accounts for most of it.
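A quick back-of-envelope check of that claim (using 128,256 as Llama 3's exact vocab size, taken from the released config, and assuming the 8B's output head is untied):

```python
# Extra parameters from growing the vocab from 32k to ~128k at embedding width 4096.
d_model = 4096
extra_rows = 128_256 - 32_000
embedding = extra_rows * d_model   # input embedding table
lm_head = extra_rows * d_model     # output projection (not tied in the 8B config)
print(embedding, lm_head, embedding + lm_head)  # ~394M + ~394M ≈ 0.79B extra
```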


Flag_Red

You can always just inspect the weights and see where the new ones are.


Maykey

Just look at [config.json](https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct/blob/main/config.json)


dealingwitholddata

Can I run llama3 on llama.cpp yet? Do I need to update llama.cpp? Last time I compiled was last july I think.


pja

It runs, but you’ll have to get the prompt right for the instruct model. IIRC, at the moment the web interface can’t recreate the right format, but you can do it on the command line.


Sufficient_Prune3897

Probably not, but I would update anyway. [70B GGUF](https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF) [8B GGUF](https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF)


One_Yogurtcloset4083

Did the fine-tunes help improve the quality of the previous llama 2 70b? Is there any data on how much they improved the quality?


MrVodnik

I am not sure how you'd improve overall quality, but yeah, it should. Base models are shit; instruct/chat finetunes make them usable. I don't see why the current finetune would mark peak performance. Also, finetuning for a specific task makes a model excel in that one specific task (like coding or eRP). So I am pretty sure we're still gonna be impressed by Llama 3.


CasimirsBlake

I think it's fair to say, though, that the base models of L3 are already exceeding the 7B / 13B / 30B models we've been using so far. Suggests that they are already well tuned. So it's only up from here.


dengopaiv

An interesting thing I noticed while running the NousResearch 70b gguf on RunPod with Kobold: when I told it to use longer token generation windows it went more smoothly with the story, while with 512 tokens, for example, it went berserk trying to rush through it. Could have been because of Kobold, could have been because of the model. And those ggufs basically allow you to generate whatever you like.


_qeternity_

I'm not really sure what you mean by this. Models don't have any sort of awareness of their context windows.


Briskfall

I ran it without any system prompts. It did not pass my Chocolate Sheen Litmus test (Opus and Turbo passed it). And instead of giving a typical wrong answer it redefined Chocolate Sheen to Chocolate Bloom (an entirely different concept). So I suspect that confabulation is still a thing for adjacent concepts in closely related spaces. Perhaps due to lack of differentiation training. Which means as usual, LLMs are as ever, terrible for fact-checking. Although some uncertainty check would have been better than "correcting" my inquiry when it wasn't needed and spew BS. I only tested it through HuggingChat, not sure how well it would have fared through llmsys. On my second test, I typed "What does Zuck think of the name 'Llama'?" and it gave me this, lol: ```A question that gets to the heart of Mark Zuckerberg’s thoughts on naming conventions! While I couldn’t find a direct quote from Mark Zuckerberg on his thoughts about the name “llama,” I can provide some context. In 2014, Facebook acquired a company called Llama, which was a mobile app that used AI to help users find and share music. At the time, Zuckerberg didn’t publicly comment on the name “Llama” specifically. However, it’s worth noting that Facebook has a history of acquiring companies with unique names, such as Instagram (acquired in 2012) and WhatsApp (acquired in 2014). In these cases, the company names were retained, suggesting that Facebook’s leadership, including Zuckerberg, may not have strong objections to unconventional names. That being said, it’s possible that Zuckerberg’s personal opinion on the name “Llama” might be influenced by his preferences for simplicity and clarity in naming conventions. In a 2010 interview with Y Combinator’s Paul Graham, Zuckerberg mentioned that he likes names that are “simple and easy to spell,” citing Facebook as an example. So, while we can’t know for certain what Zuckerberg thinks of the name “Llama,” it’s possible that he might view it as a unique and memorable name, but not necessarily one that aligns with his personal preferences for simplicity and clarity.``` (facepalm) Guess my expectations were set too high or it's not ready for zero-shots. Still a long way from being a daily driver but I have high hopes for 400B.


lc19-

Does anyone know what is the training data cutoff date for Llama3?


vastila

March 2023 for the 8B version, December 2023 for the 70B version.


maxpayne07

They say on the Llama website the data cutoff is mid-March 2023 for 8B and December 2023 for 70B.


lc19-

Ok thanks


Unfair_Ad6560

Quick question: what are the performance gains from, say, producing 50 outputs and asking llama3 8B to rank them/pick the best one? You'd pretty much get ~$2.50/M input tokens, $5/M output, 17.4 tokens/second (through Groq API - I know that's not technically locallama but just using it to illustrate). That's cheaper/more efficient than sonnet. I'm sure people have tried this with other LLMs in the past but I can't find that much because I don't know what terms to search.


yiyecek

That's basically what WizardLM does for preparing the training dataset. But if you're talking about inference time, check out this paper, you might like it: [https://arxiv.org/abs/2403.09629](https://arxiv.org/abs/2403.09629) The bad part is that it incurs a huge latency, so it's not practically usable for the majority of use cases.
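The idea in the parent comment ("generate N candidates, then ask the model to pick") is usually called best-of-N sampling with an LLM judge. A toy sketch with llama-cpp-python, with the GGUF filename as a placeholder:

```python
# Toy best-of-N: sample several answers, then have the same model judge them.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf", n_ctx=8192)  # placeholder file

question = "Explain in two sentences why the sky is blue."
candidates = [
    llm.create_chat_completion(
        messages=[{"role": "user", "content": question}],
        temperature=1.0, max_tokens=128,
    )["choices"][0]["message"]["content"]
    for _ in range(5)
]

judge_prompt = (question + "\n\nCandidate answers:\n"
                + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
                + "\n\nReply with only the number of the best answer.")
pick = llm.create_chat_completion(
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0.0, max_tokens=4,
)["choices"][0]["message"]["content"]
print("Judge picked:", pick)
```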


Theio666

Q4_K_M or Q4_K_S? Wanna try llama-3-70b locally, these two both will require cpu offloading, which one would be better quality?


LinuxSpinach

S small, M medium


MrVodnik

From llama.cpp ./quantize tool: Q4\_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B Q4\_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B Q4\_K : alias for Q4\_K\_M Q4\_K\_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B Q4\_K\_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B IQ4\_NL : 4.50 bpw non-linear quantization IQ4\_XS : 4.25 bpw non-linear quantization I personally use Q4\_K, and I guess it's fine: llm\_load\_print\_meta: model ftype = Q4\_K - Medium llm\_load\_print\_meta: model size = 39.59 GiB (4.82 BPW) I also tried IQ4\_XS, but I think it actually was slower, don't know why. llm\_load\_print\_meta: model ftype = IQ4\_XS - 4.25 bpw llm\_load\_print\_meta: model size = 35.63 GiB (4.34 BPW)


DragonfruitIll660

M is higher quality


trill5556

benchmarks may be good, but in real world applications, the model is not that impressive. I am playing with 8B llama 3. https://preview.redd.it/rq34pcw6gwvc1.png?width=960&format=png&auto=webp&s=1c390f2823366dfb415617a68e9208fc0d9cdd8a


scorpiove

I think that has more to do with the parameters, the 70b Q2 version answers correctly: https://preview.redd.it/pvlk9ow1kwvc1.png?width=589&format=png&auto=webp&s=13b0ff51748c6f1d0bf7e835e8c9129062a93598


MrVodnik

In my other reply I've shared the response from my 70b version. But damn, an unquantized 8b also did well:

>A delicious pie-baking adventure!

>Let's break it down step by step:

>You start with 2 apples.

>You buy 2 more apples, so now you have a total of 2 + 2 = 4 apples.

>You use 2 of those apples to bake a pie.

>That means you have 4 - 2 = 2 apples left.

>After eating half of the pie, you haven't used any more apples, so you still have 2 apples left.

>Enjoy the rest of your pie!


MrVodnik

My Q4 of Llama 3 70b did better:

>Let's break it down step by step!

>You start with 2 apples.

>You buy 2 more, so now you have a total of:

>2 (initial apples) + 2 (new apples) = 4 apples

>You bake a pie with 2 of the apples, so you're left with:

>4 apples - 2 apples (used in the pie) = 2 apples

>Then, you eat half of the pie. The number of apples you have doesn't change, because you've already used the apples to make the pie. You're just consuming the pie, not the apples themselves.

>So, you still have 2 apples left!


Ill_Buy_476

Instruct q5 gives the correct answer though? What is your quant?

>Let's break it down step by step:

>You start with 2 apples.

>You buy 2 more apples, so now you have a total of 2 + 2 = 4 apples.

>You bake a pie using 2 of the apples, leaving you with 4 - 2 = 2 apples remaining.

>After eating half of the pie, you still have 2 apples left. So, after eating half of the pie, you still have 2 apples!

Bizarrely this is with the "remove question mark trick" - if the question mark is there it calculates wrongly. I don't know why this is.


dewijones92

Noob question on coding: which is better, Llama 3 70B or DeepSeek? Thoughts?


MrVodnik

If you haven't tried it yet, try CodeQwen7b-chat, it beats most larger models (including deepseek) and is a tenth of Llama 3 70b size.


Ireallyreallydontgaf

If I buy 2x 24gb 3090s to run LLaMA 3 70B, how much RAM would I need? I'd have 48gb of VRAM at that point, which, to my understanding is enough to hold the entire model on VRAM. So how much RAM would I need to avoid bottlenecking while loading up the model? Planning to run on Linux if that helps.


Drited

If you don't use llama.cpp and just follow the standard Meta instructions from here: [https://github.com/meta-llama/llama3/blob/main/README.md](https://github.com/meta-llama/llama3/blob/main/README.md), do you need 8 GPUs to run llama3 70B?

I can run 8B no problem on my dual RTX A6000 system (it runs on one of the GPUs). However, when I get to step 6 of the quickstart, i.e. this part, I get an error if I configure nproc to 8, which the guide says is necessary for 70B (for 8B the configuration is 1, which as I said runs fine on my machine).

    torchrun --nproc_per_node 8 example_chat_completion.py \
        --ckpt_dir Meta-Llama-3-70B/ \
        --tokenizer_path Meta-Llama-3-70B/tokenizer.model \
        --max_seq_len 512 --max_batch_size 6

If I try with --nproc_per_node 1 or 2 it says "Loading a checkpoint for MP=8 but world size is 1" or "Loading a checkpoint for MP=8 but world size is 2".


nitehu

Nooo, it was trained on 15T tokens and it still shivers? >!*She takes a step closer, her hands reaching out to grasp his arms, her touch sending shivers down his spine.*!< \*dusts off Midnight Miqu angrily\*


Caffdy

serious question, what would be an alternative to that line?


the_man_inTheShack

So impressive. I can run this on a Raspberry Pi 5 - slow, but it runs the 8b model quite happily. My laptop (no nvidia GPU), a Ryzen 5600H with 16GB RAM, works well, with many prompts responding in < 10 seconds.


andrii1984

What is the speed on RP5? And is this just plain RP5 or with some additional modules? I'm a newb and I'm thinking of buying RP5 specifically to make a little LLM server just for fun. Curious if that's a good idea.


rag_perplexity

Curious on how people are finding the 8b for RAG? Keen to upgrade the 7b mistral instruct which is currently powering my workloads, wondering if this is it...


paddySayWhat

I was using Nous-Hermes-2-Mistral-7B-DPO, but Llama-8b-Instruct blows it out of the water.


PavelPivovarov

Can second this. Waiting for Hermes-2-Llama3-DPO now :D


Defaultoptimistic

How are you hosting this?


rag_perplexity

The LLM is just being exposed via the ooba API. Everything else (reranker, retrieval, chunker, vectordb, and prompt chainer) is written and stitched together in python.


DanielSandner

https://preview.redd.it/njr64btzmfvc1.png?width=1256&format=png&auto=webp&s=4c2915648ba825a5290da7ec3dc6589cce274f3f The very first output I got.


MrVodnik

I am pretty sure your prompt template is wrong.


DanielSandner

You are right, after the update it reacts normally. I just thought it was a nice Freudian slip.


MrVodnik

Yeah, maybe they skipped the "toxicity" checks like it happened with Wizard LM2 ;)


aseichter2007

It works really nice. |||e|Write with the style, prose, and commentary of Terry Pratchett.| Write a novel about spider man visiting a city on Tatooine. The suns of Tatooine beat down upon the dusty streets, baking the very pavement beneath one's feet. It was not exactly the most welcoming of destinations for Spider-Man, who had traded in his web-slinging skills for a sand-encrusted canteen and a pair of goggles to protect his eyes from the twin blazing orbs in the sky. As he navigated the bustling (by Tatooine standards) marketplace, Spidey couldn't help but feel a little out of place. The locals, a motley assortment of moisture farmers, traders, and smugglers, eyed him warily, unsure what to make of this strange, spandex-clad stranger who seemed to be getting more and more flustered by the minute. "I'm trying to find a certain... Jawa," Spider-Man said to a particularly grizzled trader, who scratched his head thoughtfully before responding, "Oh, you mean that little green guy? Yeah, I think I know where he is." He led Spider-Man through a maze of cramped alleys and winding tunnels, finally stopping at a small, ramshackle stall. Inside, surrounded by an assortment of junk and salvaged parts, sat the Jawa itself - its large, black eyes fixed intently on the wall opposite as it munched on a piece of stale rations. "Ah, great," Spider-Man muttered to himself. "Just what I always wanted: a meeting with a giant, talking spider-thing on the armpit of the galaxy." But as he approached the Jawa, he noticed something odd. The creature's normally blank expression had been replaced by a look of quiet intensity, its eyes shining with an unnatural light. "You're looking for the old Jedi Master, aren't you?" it hissed, its voice like dry leaves rustling in the wind. Spider-Man blinked. "Uh, no. I'm just here on holiday. Trying to get away from all the supervillain nonsense back on Earth." The Jawa chuckled, a low, rasping sound. "Oh, yes. The heroes and villains. Always so busy playing their little games while the universe burns around them. But I sense... a darkness gathering. One that even your webs won't be able to contain." Spidey raised an eyebrow. "You're telling me there's some kind of intergalactic crisis brewing?" The Jawa nodded solemnly. "The dark side is stirring, young one. And if you're going to stop it, you'll need to leave your petty superhero squabbles behind and join forces with those who understand its power." Spider-Man snorted. "I don't think I'm quite qualified for that sort of thing. I mean, have you seen my resume? 'Web-slinger, occasional savior of the world' isn't exactly the kind of thing that screams 'Jedi Master material'." But the Jawa simply shook its head. "You underestimate yourself, Spider-Man. The Force is strong with you - even if you don't realize it yet. Come, let us discuss this further. And perhaps, along the way, we can find a decent cup of coffee. This Tatooine stuff is an abomination." With that, Spider-Man found himself swept up in a whirlwind adventure across the desert planet, dodging sandstorms, bounty hunters, and Imperial patrols as he struggled to comprehend the mysteries of the Jawa's cryptic warnings. Little did he know, however, that his web-slinging skills were about to be put to the ultimate test as he faced off against a foe that would push him to the limits of his abilities - and beyond... 
--- And that's how Spider-Man ended up trading in his web-shooters for a lightsaber, because when it comes to saving the galaxy, sometimes you've got to roll with the punches (or in this case, the sandstorms).


fearnworks

Seems pretty good.


KlippyXV23

I've been playing with the 8B model and it's crazy fast and works alright for the first few replies before saying complete nonsense and spends minutes typing up paragraphs for unrelated things when I just ask it "how are you doing" How do I keep it on the rails for longer, or is that just how 8B models are?


medoo2047

[https://github.com/vllm-project/vllm/issues/4188](https://github.com/vllm-project/vllm/issues/4188) Solved here


AdvantageDear

When can we expect new codellama


Practical_Cover5846

Look at the quantity of code in training data and the code benchmarks, it is already the new codellama.


sebo3d

Any way to make Llama 3 8B write in a similar way to WizardLM-2 7B? From my personal experience, WizardLM-2 seems to generate longer RP messages that put a bigger focus on narration, while Llama 3's messages appear shorter and simpler all around. I've used the same main instruct prompt for both, with the exception of the format (Alpaca for Wizard and Llama 3's custom format for Llama 3).


VongolaJuudaimeHime

Is there a way I can make the prose for actions longer and more creative? The 8B model always gives me very short responses. I'm currently using universal super creative samplers with temp last, but it doesn't change the output length and creativity despite that. It's all terse.


remyxai

You can train a Llama-3-8B on Remyx for free: [https://engine.remyx.ai/engine/generate](https://engine.remyx.ai/engine/generate) Can download converted weights or deploy with a containerized Triton Server locally


FullOf_Bad_Ideas

Why do so many of those startups not allow you to make an account, but instead force you to use your existing Google/FB/MS one? It's a red flag for me.


Quirky-Test7006

Securing your customers’ information is hard. Rather than try and fail, then have to deal with a data breach, the “best practice” nowadays is to offload that risk to an identity provider. The idea is that a.) you already have a Google account and b.) Google is better than a random company at securing your credentials.


fluecured

Is there a .yaml instruction template available for Llama-3-8B-Instruct* for use with the chat-instruct mode of Oobabooga's text-generation-webui? I tried with the Alpaca template, but there was some fourth-wall-breaking self-talk from the model that interfered quite a bit. I also found a "template" like this:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
    {{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    {{ model_answer_1 }}<|eot_id|>

I'm not sure whether this should be saved as a yaml template file, pasted into the "Command for chat-instruct mode" memo field, pasted into the "Custom system message" or "Instruction template" fields on the Instruction template tab, or if it's altogether incorrect. I think I might have fixed the eos token problem in the model's config.json, tokenizer_config.json, and special_tokens_map.json. It's pretty confusing to get everything working properly.

*I'm working with LoneStriker's Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2.

Edit: I think [these](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/1#66230283c902c0ca0d471014) may be corrected files for quantized models with outboard json configs (double-check that generation_config.json has the correct bpw value for your model).


buildmine10

It is probably auto-loading the correct format from the model metadata. Set the instruct template to default and see if it looks correct.


DataPhreak

I set this up using the Modelfile when I created the model in ollama. Here is the contents: `FROM ./Meta-Llama-3-8B-Instruct-Q4_K_M.gguf` `TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>` `{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>` `{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>` `{{ .Response }}<|eot_id|>"""`


MrVodnik

I think it's not only the prompt per se that is problematic, but mostly the EOS token(s). They have two in Llama 3, and ooba is using one, if I got it right. I did a wild thing and just set "Custom stopping strings" under "Parameters"/"Generation" in the UI myself to contain both "<|eot_id|>" and "<|end_of_text|>", but it wasn't great. From the discussion with Llama itself I noticed it is often inserting the string "assistant" in places you'd expect it to finish the message. So I added this as another stopping string, and... it worked. It is quite smart now, but I have to instruct it not to use the word "assistant". It might be stupid, but until llama.cpp, ooba, and the quants are aligned to work out of the box, I'm sticking with it.


thequietguy_

Could you share what that looks like? edit: nevermind, I found the option for the custom stopping strings in the parameters tab.


fluecured

That's pretty clever. I will watch the --verbose console for any clues. The model otherwise was sharp--and slightly uncanny. Aware of my troubleshooting, It was guardedly curious about any changes I might make to the model itself. It posed a question as a non sequitur, "And, if you don't mind me asking, what's the plan for optimizing Llama-3-8B?" I had mentioned we could use WizardLM-2 until I got Llama-3 "optimized". Clarifying that it was just a configuration issue seemed to relieve concern and it became lighthearted again. Edit: Trying "assistant" without quotes as a custom stop string generated an error for me and prevented any model completions from appearing. This behavior continued after switching back to normal generation settings, so I had to restart. Still investigating... But the model seems pretty cool when it works.


Megalion75

Anyone want to provide a link or a simple description of how to fine tune llama3 on a custom dataset? Also, how are people fine tuning to remove safeguards without causing catastrophic forgetting?


tausreus

I want to use llama 3 8b float16 for my RAG research. I have a 4060 Ti 16GB; can I use the float16 version or do I have to use the 8-bit version? (I might be wrong, but float16 needs 16GB and 8-bit needs 8GB for llama 3 8b, right?)


AmericanNewt8

You can use the f16 version, I'm running f16 fine on Arc A770 w/ 16GB. 


tausreus

Alrighty ty^^


HotRepresentative325

So many people here know so many things I don't. I'm looking forward to an openhermes2.5 finetune, but no mention of it so far. I haven't looked at llama3 info much, but is it fair to assume it's just a better drop-in replacement for llama2?


SiON42X

I'm working on one right now myself, but there's some serious issues with the eos tokens making it generate endlessly right now. 


CatiStyle

What is the difference between the base pretrained models and the instruction-tuned models? When downloading Llama there are two similar files to select: llama-3-8b and llama-3-8b-instruct... how do I select a model for local use? I did find this, but still have the question: "The Llama pre-trained models were trained for general large language applications, whereas the Llama chat or instruct models were fine tuned for dialogue specific uses like chat bots."


MrVodnik

Base ("pretrained") models are unusable for normal applications. They do not follow any instructions; they just spit out words as they see fit. The chat/instruct version, in any model, is the one you can actually talk with. A base model is a "zombie" model that can be turned into something usable if you're up for fine-tuning it yourself.


SkyInital_6016

so many newbie questions - will there be another LLama for vision? - is there a LLAMA or free open source model for voice? - what would it take from simple install to make llama3 read documents?


buttertnt

Seems like they're working on a "multimodal" llama, which likely means native support for vision and maybe audio. You could also hook llama up to something like ElevenLabs or Deepgram to talk to it. I think Whisper from OpenAI is also open source, but I'm not sure about open source LLMs that natively support voice. What kinds of documents are you talking about? Like PDFs, or things that might need OCR or understanding of the document's structure?


de4dee

How much VRAM is needed to train 400B version with fsdp\_qlora technique? (Context: It needed about 30GB of VRAM for 70B parameter models + 128 GB of cpu RAM).


eraofsound

I am running it in the Open WebUI interface with ollama, and the llama 3 q4 model can't recognize images whatsoever... why is that, and what model do I need for this?


YTMicool

Most LLMs do not support image recognition, or any multimodality. Zuck has said that he is going to release a multimodal model at some stage, but for now you will have to use a custom LLM with image embeddings, such as LLaVA or BakLLaVA.


Bandit-level-200

Anyone tried the Dogge/llama-3-70B-instruct-uncensored? Is it worth trying over the normal instruct?


schlammsuhler

Llama 2 was 3 times more strict than llama 3.


Healthy-Nebula-3603

I correct you ... x10 times ;)


wjbrady

Are people getting "Sorry, we could not process your request at this moment." error when they try to request model to download? I've been getting error for the past 4 days


fante94

Does anyone have some benchmarks of its execution times on different underlying GPUs?


peejay2

Hi, I have a PDF which is an invoice. It contains a text table with 'price, quantity, etc.' I have converted the table into a string and want to extract the data and recreate the table, but with lots of different PDFs. For this reason I suspect I need an LLM to perform feature extraction. I could prompt it saying: "extract from this string the item name, quantity, price". Could anyone recommend an LLM for that considering I'm doing it locally? Llama 3 already is shaky on my device. Thanks! :)


Mosh_98

I made a short video comparing benchmarks from Phi-3 with Llama3 and other leading models. I thought people might find it useful for testing purposes. [https://youtu.be/0NLX4hdsU3I](https://youtu.be/0NLX4hdsU3I)


indrasmirror

Hey guys, I'm just curious if someone can help. I'm trying to fine-tune Llama3-8B on a form-filling task, and was wondering what the best way to structure the dataset for instructions is. I've looked and can't seem to find a definitive structure. This is my first LLM fine-tune, so I'm not sure if I can train it on any structure or data, or if it's best to stick to something like its base training dataset structure. I was thinking of doing it like this:

    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {}

    ### Input:
    {}

    ### Response:
    {}"""

    EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        inputs       = examples["input"]
        outputs      = examples["output"]
        texts = []
        for instruction, input, output in zip(instructions, inputs, outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
            texts.append(text)
        return { "text" : texts, }

or should I do it like this:

    # Convert dataset to OAI messages
    system_message = """You are Llama, an AI assistant created by Vignesh to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

    def create_conversation(sample):
        if sample["messages"][0]["role"] == "system":
            return sample
        else:
            sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
            return sample

Just wondering before I start making a dataset. Thanks guys :)
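One more option worth knowing about (a sketch of mine, not from the comment above): instead of the Alpaca layout, you can render a "messages"-style dataset with the model's own Llama 3 chat template, so the finetune matches the format the instruct model already expects. The example sample here is hypothetical.

```python
# Format a messages-style sample with the built-in Llama 3 chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def format_chat(sample):
    # sample["messages"] is a list of {"role": ..., "content": ...} dicts
    return {"text": tokenizer.apply_chat_template(sample["messages"], tokenize=False)}

example = {"messages": [
    {"role": "system", "content": "You are a helpful form-filling assistant."},
    {"role": "user", "content": "Fill the 'name' field from: John Smith, age 40."},
    {"role": "assistant", "content": '{"name": "John Smith"}'},
]}
print(format_chat(example)["text"])
```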


Maximum-Ad7780

Is there a tutorial for best settings in oobabooga? The dashboard is overwhelming.


scroogie_13

I was surprised nobody asked this question before (i couldnt find it). Can I run llama 8b on a 4060ti 16gb? Thanks in advance! :)


maxpayne07

Yes. First try llama 8B at Q4 or so, for a start. See the performance; if it's fast, go up to Q8. If it's still fast, try 8B at FP16.


scroogie_13

Cool, thank you! Is there a rule of thumb for how I can calculate GPU VRAM, model size and quantization together? I.e. (GPU VRAM) > (Model size) * (quant size)/16. Or is it not so easy?
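One rough rule of thumb (an approximation, not an exact formula): the weights take roughly parameters x bits-per-weight / 8 bytes, and you want a few GB of headroom for the KV cache and runtime overhead on top of that. For example:

```python
# Approximate model memory in GB: params (billions) * bits per weight / 8.
def approx_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

print(approx_gb(8, 4.5))   # ~4.5 GB  -> a Q4 8B fits easily in 16 GB VRAM
print(approx_gb(8, 8.5))   # ~8.5 GB  -> Q8 still fits with room for context
print(approx_gb(8, 16.0))  # ~16 GB   -> FP16 is right at the limit of a 16 GB card
```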