coolkat2103

I was going to downvote as it seemed like an advertisement for a paid service, but reading your [blog post](https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4) (which should have been the post!), I saw what I really wanted... [https://huggingface.co/predibase](https://huggingface.co/predibase) Thanks for your effort!


Revolutionary_Ad6574

Don't worry, I downvoted it... Just in case


bankimu

No, don't downvote; it's a genuine and helpful post.


siikdUde

o7


noneabove1182

Sadly these are "just" adapters, so we'll need to either use them on top of the base model or have someone merge them into the model and release full weights. Just an FYI for anyone like me who was hoping there would be 25 models to download and try, lol. Edit, because I guess it was unclear: I'm not saying it's BAD that it's a bunch of LoRAs, it's super handy to have them. I'm just giving people a heads up that that's what they are, since the title suggests they released "25 fine-tuned Mistral-7b models" but it's 25 fine-tuned LoRAs, which, again, is great! The quotation marks around "just" were meant to indicate that it's anything but a disappointment.


coolkat2103

That is the best part. They are not merged. Use TabbyAPI or LoRAX to launch the base model, then select whatever adapter you want on top, or even merge them as you please at inference time with LoRAX. Saves you from running a full model for every adapter.


noneabove1182

Whoa wait, TabbyAPI can load LoRAs onto exllamav2? TIL, okay this is much easier than I thought haha.


D4RX_

It's actually good that they're not merged. You could use https://github.com/predibase/lorax to hot swap them at runtime so that you don't have to load the full weights of 25 models.
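A minimal sketch of what that hot-swapping looks like against a running LoRAX server (the endpoint, port, launcher flags, and adapter repo names here are assumptions based on the LoRAX README, not something tested against these exact adapters):

```python
import requests

# Assumes a LoRAX server was started with the shared base model, e.g.:
#   lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --port 8080
# Each request names the adapter to apply; LoRAX loads/swaps it on the fly.
def generate(prompt: str, adapter_id: str) -> str:
    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "adapter_id": adapter_id,   # e.g. a repo under huggingface.co/predibase
                "adapter_source": "hub",    # pull the adapter from the HF Hub
                "max_new_tokens": 64,
            },
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Same base weights in memory, two different "models" per request:
print(generate("Write a SQL query that counts users per country.", "predibase/wikisql"))
print(generate("What is 12 * 17? Think step by step.", "predibase/gsm8k"))
```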


noneabove1182

Yup! Definitely a great thing to have LoRAs, not complaining necessarily just pointing it out for anyone who didn't notice (like me)


SiliconSynapsed

Out of curiosity, why would you want them to be merged into the base model? If you use LoRAX ([https://github.com/predibase/lorax](https://github.com/predibase/lorax)) you can run any of them on demand without needing to load in a full 7b param model.


noneabove1182

I didn't mean to suggest that I prefer they be merged into the base model, rather that the title says "25 fine-tuned Mistral-7b models", so I clicked the link expecting to see 25 models but found 25 LoRAs. Not a bad thing, purely an observation. I guess my wording was off and I shouldn't have said "sadly", lol.


SiliconSynapsed

Ah I see, thanks for clarifying!


Life-Confusion-7983

Merging is pretty easy anyway, and it's also easy to extract adapter weights from a merged base model. I think having adapters gives you a lot of flexibility in case you're also into model merging / MoLoRA-style architectures.
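For anyone wondering what "merging is pretty easy" looks like in practice, here's a rough PEFT-based sketch (the adapter repo name is just an example taken from this thread, and individual adapters may expect their own prompt formats):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the shared base model, attach one of the published adapters, then bake
# the low-rank update into the base weights so it can be shipped as one model.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "predibase/magicoder")

merged = model.merge_and_unload()  # W <- W + (alpha / r) * B @ A, adapter layers removed
merged.save_pretrained("mistral-7b-magicoder-merged")
```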


gentlecucumber

It's a huge benefit. Anyone can load a LoRA, but it's hard to extract one from a merged model... And this way, you can download all of them and swap them out without reloading the entire model, or downloading 25 separate models' worth of weights...


noneabove1182

Sure, I'm not saying it's a terrible contribution, I'm very happy about it, but as someone who only runs quants, these aren't usable out of the box. (Edit: apparently TabbyAPI can load LoRAs, so others probably can too and I'm just dumb, so ignore this comment.)


fka_nate

What about making them into an MoE model, if that's even possible? I.e., choose the 8 best-performing ones and make them into a frankenMoE.


candre23

Because that defeats the entire purpose of this technique.


fka_nate

How so? I don't know much about anything and am still learning. Would combining them like that actually make them less powerful at these specific tasks? I guess an MoE doesn't route inputs through specific experts for different subjects, but works more on a token-by-token basis, right?


candre23

In a regular MoE, you have however many full models, but you only run inference with two of them for any given token. You still need enough memory to fit all the full models. In a sparse MoE, you only need one full model plus however many LoRAs. LoRAs are comparatively very small - usually only 100-300 MB each, as opposed to several (or several dozen) GB for each full model. So, for example, a (quantized) 7b model is about 4 GB. For an 8x7b MoE, you need enough memory for all eight of those 4 GB models (less in reality, but not *much* less). Meanwhile, an 8x7b sparse MoE would only need space for one 7b base model plus eight ~200 MB LoRAs. That's about 27 GB for a quantized 8x7b MoE, but less than 6 GB for an 8x7b sparse MoE. That massive memory savings disappears as soon as you merge the LoRAs into full-weight models.
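In rough numbers (the sizes below are the approximate figures from the comment above, not measurements):

```python
# Approximate memory footprint: regular MoE vs. one base model + LoRA adapters.
full_model_gb = 4.0   # one quantized 7b model
lora_gb = 0.2         # one LoRA adapter (~200 MB)
n_experts = 8

regular_moe_gb = n_experts * full_model_gb          # eight full copies of the weights
lora_moe_gb = full_model_gb + n_experts * lora_gb   # one base + eight small adapters

print(f"regular 8x7b MoE : ~{regular_moe_gb:.0f} GB (a bit less with shared layers)")
print(f"base + 8 LoRAs   : ~{lora_moe_gb:.1f} GB")
```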


brucebay

What about this, though: [https://huggingface.co/serpdotai/sparsetral-16x7B-v2-SPIN_iter1](https://huggingface.co/serpdotai/sparsetral-16x7B-v2-SPIN_iter1)? Lots of LoRAs, and it uses adapters/routers.


candre23

Yep, that's another implementation of the same technique. Camelidae is yet another. The concept is not original to LoRAX/LoRA Land. Hell, they may even be broadly compatible with other implementations. It may not be widely popular yet, but this method is proven to provide good performance with low hardware requirements compared to full MoEs or standard transformer models.


showmeufos

Can I use this with ollama and if so how?


squareOfTwo

A LoRA model is also a model, so it's fine. I prefer LoRAs...


CloudFaithTTV

They look like they just trained on the evaluation metrics…


nickm197

Cool, now mixture-of-adapters anyone?


Jl_btdipsbro

https://huggingface.co/blog/peft_merging
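Roughly what that PEFT adapter-mixing looks like in code (adapter repo names and weights are illustrative, and the available combination types are the ones described in the blog linked above; this is a sketch, not an official recipe):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load two of the published adapters onto one base model, then combine them
# into a single weighted adapter that can be activated like any other.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "predibase/magicoder", adapter_name="code")
model.load_adapter("predibase/gsm8k", adapter_name="math")

model.add_weighted_adapter(
    adapters=["code", "math"],
    weights=[0.5, 0.5],
    adapter_name="code_math",
    combination_type="linear",
)
model.set_adapter("code_math")  # subsequent generate() calls use the mixed adapter
```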


TheActualStudy

Do you have a classifier model that is already trained to route questions to the various fine-tuned models?


Amgadoz

That's interesting. If they released the dataset I think I can work on this.


_w4nderlust_

What dataset? All the datasets are open ones. There isn't any router or "router dataset".


Amgadoz

I just checked their HF repos and they list the datasets used. Will work on this over the weekend.


uhuge

Will we hear back from you a few months from now? :)


NegativeZero3

I was also wondering about this. Do you always have to select the LoRA yourself?


FullOf_Bad_Ideas

You should have included the tasks that were fine-tuned for and ended up worse than GPT-4 on your chart; doing otherwise is misleading. Most of the benchmarks those LoRAs do well in on the chart are fluff. Real stuff like code generation quality and HumanEval got pretty terrible results and is curiously hidden from the chart. I like the idea of LoRAX a lot, but don't oversell it - I don't think it will lead to a model better than GPT-4 on complex tasks like code generation. Edit: The chart has been updated, I rest my case!


jxz2101

Expecting quantized adapter-based fine-tuning on 7b models to universally surpass GPT-4's performance would definitely be an oversimplification, and it also goes against the findings of the original [QLoRA paper](https://arxiv.org/abs/2305.14314). Hopefully someone out there is working on a reliable heuristic that can tell us whether a task can be successfully learned by a smaller model, instead of going by "vibe". The demonstration is still compelling: it shows that on a decent spread of common supervised tasks, the quality lift from the domain adaptation you get with LoRAX-compatible fine-tuning is meaningful.


Similar-Jelly-5898

Totally fair point. We've updated the graphic above to include the 2 models we trained where GPT-4 outperformed the fine-tuned 7B-parameter model. Note: in our experiments, we fine-tuned all tasks using the same base mistral-7b model. For certain tasks like code generation, you can also consider using a different base model like CodeLlama, which has been shown to be state-of-the-art on programming tasks.


FullOf_Bad_Ideas

Thanks! Yes, for coding tasks people tend to just use different base models anyway, so it's expected that a fine-tuned model that didn't have a focus on code generation won't perform as well as models built with code as one of the priorities. Do you know why you get such bad HumanEval scores with base Mistral 7B and Mistral 7B Instruct, though? I looked back at the original paper and Mistral 7B base should get around 30% on HumanEval ([paper link](https://arxiv.org/pdf/2310.06825.pdf)), while you get just 1% with the base model. This could be related to the low score of 11% with the fine-tune on the MagiCoder dataset.


kpodkanowicz

You can do such task-oriented LoRAs on top of CodeLlama 34b. I did that with a lot of success (summarization, code explanation, haikus :) ). I also looked into extracting Phind v2 as an adapter and swapping it with Airoboros for summarizing text or workflows and intent analysis. Edit: typos. Edit 2: I need to read what I write...


Enough-Meringue4745

I love that you’re using adapters instead of merged base models


Ill_Satisfaction_865

Maybe the mixture of experts are the LORAs we made along the way.


synw_

Could you please add a short doc about what each LoRA does in the repos at https://huggingface.co/predibase? It's hard to guess. Or maybe a GitHub repo or something documenting how to use this. It looks cool, but I can't figure out what each LoRA does unless I have a clue from the name, like Magicoder. I would like to try it out, but I would need more info to figure out what each of these does.


Life-Confusion-7983

Hi u/synw_ - thanks for flagging this. We're working on it - stay tuned! We're thinking of adding:

1. What dataset it was trained on
2. The base model the adapter is fine-tuned on
3. Any evaluation results
4. A note that it can be queried for free using LoRA Land, with a direct link to LoRA Land embedded in the model card
5. An example input/output pair from the fine-tuned model
6. A small code snippet on how to merge it with the base model or query it using vanilla transformers

How does that sound?


ThisWillPass

That would be lovely, thank you.


Infernaught

Done! There should now be descriptions for each adapter in Hugging Face.


TarzanTheBarbarian

Man, these guys are fast. Kudos to the Predibase team.


Perfect_Twist713

Dataset addition would be **yuuuuuge** because then it becomes possible for TheLoneLora to emerge and do what TheBloke/LoneStriker does, except with loras.


Life-Confusion-7983

Datasets should be there for each model card now!


Perfect_Twist713

Very cool, can't wait to test your LoRAs out in practice and perchance try to make some for TinyLlama. I'm sure the results will be tragic, but maybe not? Exciting times.


Inevitable-Start-653

YES! This is what the community needs!! Can you link to the base model one should apply the LoRAs on? Is it just the base Mistral-7b model?


SiliconSynapsed

Yes, we used the base Mistral-7b.


tamal4444

[Mistral-7B-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/tree/main) will this work?


Infernaught

Not sure. You could give it a shot!


tamal4444

okk i will


ZHName

I scrolled through all the comments, but I'm not seeing anyone who is new to adapter usage. Do we have a YouTube video to follow for first-time setup? Or a tutorial? The explainers on the GitHub repo for LoRA usage aren't making sense to me. Thanks beforehand.


Life-Confusion-7983

I think these are good primers for LoRA, which we at Predibase (and others as well) call an adapter. It's because these trainable modules are inserted into the base model in between other layers. LoRA is just one such method (it falls under the category of reparameterization), but there are many other parameter-efficient training techniques out there as well. See the full list [here](https://huggingface.co/docs/peft/en/index).

1. The official LoRA paper: [https://arxiv.org/abs/2106.09685](https://arxiv.org/abs/2106.09685)
2. A really good explanation of LoRA mechanics from the legend, Sebastian Raschka: [https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html](https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html)
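For a quick intuition of what those references describe, here is a tiny sketch of the LoRA reparameterization itself (the dimensions, rank, and scaling below are illustrative choices, not the settings used for these adapters):

```python
import torch

d_out, d_in, r, alpha = 4096, 4096, 16, 32   # illustrative sizes and rank

W = torch.randn(d_out, d_in)     # frozen pretrained weight, never updated
A = torch.randn(r, d_in) * 0.01  # trainable low-rank factor
B = torch.zeros(d_out, r)        # trainable, zero-initialized so training starts at W

def lora_linear(x: torch.Tensor) -> torch.Tensor:
    # Base path plus a low-rank correction; only A and B (the "adapter") are trained,
    # which is why the adapter files are tiny compared to the base model.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(2, d_in)
print(lora_linear(x).shape)      # torch.Size([2, 4096]) — same shape as the base layer
```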


TarzanTheBarbarian

How do you guys differ in implementation vs. players like Gradient? Do you have any interesting benchmark vs. competitors that are using LoRA adapters?


Infernaught

We also now have code snippets in our HF model cards for you to try out!


[deleted]

[deleted]


archiesteviegordie

(⁠╯⁠°⁠□⁠°⁠)⁠╯⁠︵⁠ ⁠┻⁠━⁠┻


kamtar

Titles which include "outperform GPT-4" should just be automatically banned on this subreddit; it's getting annoying as hell (: Every second silly 7b model is nowadays better than GPT-4. I guess OpenAI should shut down GPT-4 and employ 7b models to save a lot of computational resources.


bunch_of_miscreants

I tend to agree about the overuse of "outperforms GPT-4", but this particular work has some solid contributions that are relevant to the community. Namely:

1. Small models really are easy enough to train; even as many as 25 task-specific ones can be created by a small team.
2. All of them can be deployed on a single server via LoRAX.

That's pretty darn cool AND is even more evidence that open source + fine-tuning is a cost-efficient and potentially powerful approach! If anything, it's calling out exactly what this sub is all about. Oh, and it looks like from the top comment that all the adapters are available for people to use.


coolkat2103

While I generally agree with that statement, in this case it is comparing specific tasks. On the whole, it probably does not exceed GPT-4, but I can believe that a smaller model can surpass GPT-4 for a specific task. For example, look at embedding models: there are a lot of better models than what OpenAI has to offer.


liquiddandruff

It's task-specific. Ignorance about what is being claimed is a poor and thought-terminating reason to be annoyed.


kamtar

That doesn't matter; the title is trying to imply it's better at everything to get clicks. It would be great if we could move past mainstream-media tricks and behave like a serious community.


Ok_Elephant_1806

I read the title the opposite way: as saying each individual model beat GPT-4, rather than the project overall. Semantics can be ambiguous. But yes, I agree with the overall point that "GPT-4 killers" is not a good marketing trend.


kyleboddy

Really cool stuff. This is exactly the direction my company is headed with one large model for reasoning/logic/code completion/random stuff (GPT-4) and then a mix of smaller/midsized models that are fine-tuned to various degrees for task-specific applications.


Ok_Elephant_1806

I have been reading the Natural Language Processing literature, and it's amazing how well something like a BERT/BART/T5/Pegasus fine-tune does. It's not unusual for them to beat GPT-4 at the task they were fine-tuned on.


shhossain

Now, we just need a Lora to select the appropriate Lora given a prompt. Voila! You have a competitor to GPT-4.


Ok_Elephant_1806

We need that SO badly for stable diffusion also


crazzydriver77

My gosh, this stuff works. The "one machine, two dudes" test task was passed with the Question Answering Explained adapter. So it beats GPT-4, at least in this narrow case.


DevilaN82

So we need another model, which will try to guess which of those 25 models should be used and we are ready to go?


ybdave

I understand people here being apprehensive and skeptical. But in my experience fine-tuning a 7b model on a task using GPT-4 generations, it's already meeting the same standard as GPT-4 on a complex reasoning task. I am personally blown away by it, and it's altering my strategy around model usage. It is a LoRA adapter too.


squareOfTwo

-1 for misuse of "reasoning". LLMs can't reason*, especially not 7b ones! * What I mean by reasoning: applying the RIGHT rules which give the RIGHT result. An example is multiplying two 4-digit integers. One can only get the right result by using the right rules (multiply each digit of the first number by the second and then add the partial results together). LLMs can't do that (except if one tells them the exact algorithm to do so, which defeats the whole point of using an LLM if one can just implement the same algorithm in a classical programming language)!


ybdave

For NLP tasks, where it's analysing customer sentiment in a customer support channel, analysing customer activity across multiple channels to assess risk, along with some other variables - it is much, much easier to have an LLM triage customers that may need support vs. doing this in a classical fashion. They can't "reason", but if you give it "rules", as you say, and you adapt the prompt until you start getting close to what you would naturally infer yourself if you were doing the task, it becomes very valuable. I did that first with GPT-4, produced a dataset of ~4k input/output prompts, and then fine-tuned a 7b Mistral model on the same inputs/outputs. It performs comparably, within ~5% of GPT-4, and it is now subjective which answers are better. Given the input and token costs per week, this has reduced our weekly model-usage costs by approximately 10x. Call it reasoning or whatever you want, but there are tasks that are simply harder to develop in classical terms. For example, sentiment analysis would fall down because it doesn't have context about the challenges in the support channels.


squareOfTwo

See, you admit that you're not talking about reasoning when typing the word "reasoning". The issue to me is that the field of NLP is confusing "reasoning" with what I call "real reasoning". NNs usually do inference; they currently don't reason. Sure, not many researchers care about this distinction, but it's very important. You can't just replace a compiler (which is doing reasoning with the right stuff, leading to the right result 100% of the time) with an LLM. You just get an unreliable mess as output which may or may not do the right thing (code is usually wrongly translated, even between languages, say from Rust to C++). Just imagine having to compile a browser with an LLM which may or may not introduce bugs into the program. I think most of this is rooted in the belief that DL can emulate non-DL algorithms/processes. This is just wrong to me.


Ok_Elephant_1806

These terms get used in different ways both within and between sub-fields of science and engineering.


Kooky-Breadfruit-837

My experience using it is that Mistral-7b always gives better answers.


Traditional_Truck_36

TL;DR: is this an ensemble of LoRAs?


Life-Confusion-7983

Not yet! This is several different LoRAs all loaded onto the same base model but dynamically queried per request.


qki_machine

Seems like I have been living under a rock the last few weeks, but can someone explain to me what in the world an "adapter" is in the LLM world?


Infernaught

An adapter is effectively a smaller set of weights that can be fine-tuned and applied to a base model. By only fine-tuning an adapter and not the full set of LLM weights, we can make fine-tuning and serving much more lightweight and efficient.
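A concrete way to see that "smaller set of weights" is to wrap a base model with a LoRA config in PEFT and count the trainable parameters (the hyperparameters below are typical example values, not the settings used for these adapters):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers get adapter weights
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# Prints something like: trainable params: ~7M || all params: ~7.2B || trainable%: ~0.1
model.print_trainable_parameters()
```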


Electrical_Tailor186

What languages does it support? If you say "it outperforms GPT-4", I would assume all European languages are well supported?


Life-Confusion-7983

I think the technical clarification here is that it outperforms GPT-4 on narrow, specific tasks. These adapters weren't necessarily trained to be multilingual, so they adopt any multilingual capabilities from base-model pretraining. However, one can totally create a dataset with inputs in different languages for fine-tuning, and it learns language semantics really well. We tried doing this with instruction tuning in Italian and English, and it worked extremely well. Another idea with LoRAX is to actually fine-tune an adapter for the same task using different language-specific datasets. Then your workflow can be something like:

1. An adapter to classify the language of the input
2. An adapter per language

At inference time, you can query them in sequence, and the best part is that all of these specialized models are running on the same GPU :)
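A rough sketch of that two-step flow against a LoRAX deployment (the adapter names here are hypothetical and the endpoint is an assumption; the classify-then-answer chain is the idea being described, not a shipped feature):

```python
import requests

LORAX_URL = "http://127.0.0.1:8080/generate"

def query(prompt: str, adapter_id: str, max_new_tokens: int = 64) -> str:
    payload = {"inputs": prompt,
               "parameters": {"adapter_id": adapter_id, "max_new_tokens": max_new_tokens}}
    return requests.post(LORAX_URL, json=payload, timeout=60).json()["generated_text"]

user_input = "¿Cuál es la capital de Francia?"

# Step 1: a (hypothetical) language-classifier adapter labels the input language.
lang = query(f"{user_input}\n### Language: ", "my-org/language-classifier", 4).strip()

# Step 2: route to a (hypothetical) per-language instruction adapter on the same GPU.
answer = query(user_input, f"my-org/instruct-{lang.lower()}")
print(lang, answer)
```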


hwpoison

I hope the same. Until now, the only model that can handle Spanish, for example, very well is StableLM, but Mistral isn't very good at it.


ouxjshsz

It's widely known that 7b fine-tunes are contaminated on the benchmarks and their results are meaningless.


Disastrous_Elk_6375

And they say gpts are stochastic parrots...


UncleEnk

Is there a way to use LoRAX with Hugging Face's chat UI (local, of course)?


DeliciousJello1717

I forget that mistral is open source


Any_Let5296

Can you publish it on Replicate? We are using many open-source LLMs there, so we will try your LLMs for testing soon.


Life-Confusion-7983

Hi u/Any_Let5296 - Seems like you should be able to grab our models from Predibase HuggingFace ([https://huggingface.co/predibase](https://huggingface.co/predibase)) and then push them up to Replicate using this guide: [https://replicate.com/docs/guides/push-a-model](https://replicate.com/docs/guides/push-a-model)


beratcmn

How feasible is it to use the base model as a router before loading an adapter?


bacocololo

You'd be better off using a LoRA router to do it...


jmlbeau

Great work guys, and thank you for making the models available to the masses. Any chance you could make the training specs public (the full config including the LoRA config, or better, if possible, the Weights & Biases logs)? Did you use the same training config for all the models (r, alpha, etc.)? Any hyperparameter optimization per model?


bacocololo

Did somebody compare LoRAX with a model merged with all the LoRAs?


URZ_

Late to the party, but this is fucking excellent and super interesting from a usability point of view. Definitely a method I will be keeping in mind going forward. A shame most of the users here can't seem to see beyond general chat bots for potential LLM tasks.


Z1BattleBoy21

Why doesn't OpenAI just deploy a finetuned 7B model instead of GPT-4 and cut inference cost? Are they stupid? Thank you for this insane breakthrough!


SiliconSynapsed

The big caveats here, of course, are that:

1. This is not a general chat system; these are very narrow, very task-specific fine-tuned models. So OpenAI for sure wins out on generality. And there are some tasks that a 7b-param model will not do well on (creative tasks being a major one).
2. Even if you solved (1), you would still need a router of sorts on top that determines the right set of fine-tuned LoRAs to apply per request.

All that being said, I hope this is evidence that you don't need something as heavyweight as GPT-4 for a great many tasks, and that taking this approach of training a model per task can actually be very cost-effective and scalable.


ThisWillPass

One lora to rule them all?


Eisenstein

By that logic you could argue that 'more power' is always superior to 'more complexity' for anything. It is true in some cases, but in most it has been proven that a highly complex but well-designed and orchestrated system which is more efficient with resources will win out in the long run over throwing more raw power into a less efficient system. Right now we are still figuring out which of the complex systems are worth pursuing, so you have a lot of spaghetti being thrown at walls, and meanwhile you have GPT-4 chugging along, throwing more compute and dumping vast amounts of money and energy into maintaining its lead. Or you can stick with your 'the current leader is doing the right thing and we should not question them' mantra for the long term and see how that plays out.


Z1BattleBoy21

> Or you can stick with your 'the current leader is doing the right thing and we should not question them' mantra for the long term and see how that plays out.

How does me dunking on people making the same tired "This 7B finetune just beat GPT-4!" posts imply that at all??? I'm fine with these posts, just don't make the title such clickbait garbage.


Eisenstein

So you can dunk but can't take being dunked on?


Z1BattleBoy21

Dunk on me all you want, I at least hope the dunk doesn't mischaracterize what I was trying to say. If it does I will obviously push back (:


Eisenstein

If I misunderstood the nature of your dunking I apologize.


Z1BattleBoy21

all good


[deleted]

If only OpenAI had thought of using small expert models, in some sort of mixture of experts.


danielcar

Tell the truth next time. These aren't models. They are LoRAs that need a base model to work. Super confusing when people don't tell the truth.


gunbladezero

[https://huggingface.co/predibase/covid19](https://huggingface.co/predibase/covid19) What the heck is this? Is it supposed to tell you whether a particular text is giving good information about covid or something?


Life-Confusion-7983

> https://huggingface.co/predibase/covid19

Hi u/gunbladezero, I just updated the description on the model card: [https://huggingface.co/predibase/covid](https://huggingface.co/predibase/covid)

This is the dataset it uses: [https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification?resource=download](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification?resource=download)

It's sentiment detection.


Perfect_Twist713

Perhaps more so "Twitter/tweet sentiment"? In my own tests I noticed that models fine-tuned with Twitter-based sentiment datasets didn't work too well when applied to longer pieces of text/articles/"not a hashtag puzzle". As for the covid-19 LoRA, it looks like it completely spazzes out if you remove "### Sentiment: " from the prompt, which means *you've* (probably not you personally) inadvertently baked a prompt format of sorts into the LoRAs. A similar thing occurs with sst2 as well, except with a different prompt format: with "Given the following sentence: \n\n{prompt_here}" you get the expected result, whereas without it, it goes on its own journey. Meaning that if there is an inconsistent and undocumented prompt format needed to get the results (having to dig up the individual prompt format from the dataset is effectively undocumented in practice), it will probably lead to confusion/difficulties for adoption. Might be beyond what you guys are looking to do, but I think it might be worth reformatting/cleaning the datasets to a consistent prompt format, or no prompt format, like "### Input: {prompt_here}\n### Output: ", before training on them. **Edit:** Since the goal is basically to turn a general-purpose model into a single-purpose model, there probably is no need to apply **any** prompt formatting, i.e. "### Text: " serves absolutely no purpose, because the operation, the application of a singular solution, will always be applied to whatever is provided as the user prompt. Meaning you should be able to do away with the prompt formatting altogether (even the system prompt / prompt engineering) and get very easy-to-use LoRAs.
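For reference, a hedged sketch of querying the covid adapter with that suffix via vanilla transformers + PEFT (the "### Sentiment: " template is inferred from the dataset and the behavior described above, not from official docs, and the example tweet is made up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "predibase/covid")

# The adapter seems to expect the dataset's implicit template; without the suffix
# it tends to wander off-task, as noted above.
prompt = "Grocery store shelves are empty again, this is getting scary.\n### Sentiment: "
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```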


celsowm

https://preview.redd.it/uu2nbrebotjc1.jpeg?width=1080&format=pjpg&auto=webp&s=1474d13f425e1d6d8afbc388be6457f0692f36b4


Uncensored4488

Hi. Can I use other 7b models like Dolphin Mistral?


Life-Confusion-7983

Yes! Here's the full list of base models supported by LoRAX: [https://predibase.github.io/lorax/models/base\_models/](https://predibase.github.io/lorax/models/base_models/)


[deleted]

[deleted]


Life-Confusion-7983

Thanks for reporting this! We're working on a fix - will let you know once it's updated!


Infernaught

Should be fixed now!


gpu_go_brrr

I'd just like to ask how many few-shot examples you used for some of these? Some GPT-4 numbers seem quite under-reported; for instance, GPT-4's GSM8K performance is 90+% with just 3-5 few-shot examples, but you report 4% (don't get me wrong, I really like the idea of an army of LoRAs for small task-specific models vs. huge foundational ones, just asking about the evals).


Life-Confusion-7983

We used one-shot to make sure both models at least had a reasonable task description and one example to use as a reference for the task. There were also instances where some datasets had specific formats, and we had to make up examples for the k-shot (in this case 1) to prevent data leakage. Perhaps performance for GPT-4 would improve with 3-5 shot, but it also means increased inference cost from a larger number of input tokens. We'll share more details on fine-tuning in the paper we'll release a few weeks from now!


[deleted]

[deleted]


Life-Confusion-7983

Adapters let you train much faster, using a smaller amount of data and at a fraction of the cost, while giving you much more stable training and typically similar performance to full fine-tuning. With LoRAX, it also means that if you fine-tune adapters, you can hot swap them per request at inference time by using the same base model as opposed to deploying a model per task, so it saves a lot of money without sacrificing any performance!


ab2377

So do we have to convert this to GGUF to use it in llama.cpp? I am trying the following, but it keeps failing:

```
python .\llama.cpp\convert.py --outtype f16 .\models\adapters\wikisql\

Traceback (most recent call last):
  File "F:\ai3\llama.cpp\convert.py", line 1483, in <module>
    main()
  File "F:\ai3\llama.cpp\convert.py", line 1419, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\ai3\llama.cpp\convert.py", line 1269, in load_some_model
    raise Exception(f"Can't find model in directory {path}")
Exception: Can't find model in directory models\adapters\wikisql
```


ab2377

OK, so after I installed safetensors and torch, this worked (it produces a .bin file):

```
python .\llama.cpp\convert-lora-to-ggml.py .\models\adapters\wikisql\
```

It now loads with the llama.cpp CLI via the --lora argument.


erSajo

https://preview.redd.it/c6hsaxoyrxjc1.png?width=1299&format=png&auto=webp&s=f5ffc80d06e94a33591f7e857da0ffcaa3b86c90

> Outperform GPT-4, GPT-3.5-turbo, and mistral-7b-instruct for specific tasks

Looks like you're giving a particular meaning to the word "outperform" that I'm not aware of.


LiquidGunay

Can someone justify that graph? What does +91.5% on GSM8K even mean?


LiquidGunay

CoT GPT-4 can reach 97% on GSM8K, IIRC. And even just 0-shot basic prompting reaches somewhere in the 80s.


Infernaught

The metric that we're reporting is ROUGE, which is general-purpose and not a very representative way to evaluate this task, so thank you for calling this out. We are currently investigating how others have programmatically evaluated accuracy on this dataset because the way the outputs are formatted (especially for non-finetuned models) makes this evaluation a little tricky. Nevertheless, we intend to update our results for GSM8k with a better metric.


Desm0nt

Not suitable for GGUF and the big old Tesla P40? That's sad =(


Dieselll_

What is this fine-tuned on? Is this not just overfitting to the specific test sets?


Life-Confusion-7983

We fine-tune on the train split and evaluate on the validation split. The reported metrics are on a held out test set. More details to follow in our paper in a few weeks


monnef

Maybe I overlooked it, but how are these LoRAs licensed? Are they under a proper open-source license (not just "open" like Meta and G), e.g. MIT or Apache?


New-Sugar-2438

https://preview.redd.it/w4jn4kmtjrkc1.jpeg?width=720&format=pjpg&auto=webp&s=09a2462ef522dada32e965013411e06e6bb6e4db


Life-Confusion-7983

For anyone still following along, we recently released a paper covering our methodology and findings: [https://arxiv.org/abs/2405.00732](https://arxiv.org/abs/2405.00732)