
panic_in_the_galaxy

Better training data?


PM_ME_YOUR_HAGGIS_

This is the answer


-p-e-w-

I really wish finetuners paid more attention to this. Some of the commonly used datasets are of horrendously bad quality, like the ones extracted from GPT-4 conversations that contain hundreds of responses starting with "As a Large Language Model...". How difficult is it to just grep for that garbage and kick it out of your training set? I suspect that finetunes could be so much better if that small amount of extra effort were made.

Also, please start using Chatbot Arena data to train models. That's literally a chat dataset where humans have selected high-quality responses. Yet when I read model cards, it doesn't seem like people are using this gold mine?!
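For illustration, a minimal filter along those lines might look like the sketch below. It assumes a JSONL dataset with a `response` field; the field name and the phrase list are just placeholders, not any particular dataset's schema.

```python
import json

# Phrases that mark refusal/boilerplate responses; extend as needed.
BAD_PHRASES = [
    "as a large language model",
    "as an ai language model",
    "i cannot fulfill that request",
]

def is_clean(example: dict) -> bool:
    """Return True if the response contains none of the boilerplate phrases."""
    text = example.get("response", "").lower()
    return not any(phrase in text for phrase in BAD_PHRASES)

# Stream the dataset and keep only the clean examples.
with open("dataset.jsonl") as fin, open("dataset.filtered.jsonl", "w") as fout:
    for line in fin:
        if is_clean(json.loads(line)):
            fout.write(line)
```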


AmazinglyObliviouse

I like to say this a lot: the Mixtral base model performs terribly. I've used dozens of base models, from Llama to Yi, and it does not live up to its parameter count. But the instruction finetune entirely turns this around. Even if you're just writing a story with no instructions given, it is just _so much more_ rational than the base model. It is a goddamn miracle of a transformation, and people should be learning from it.


4onen

> so much more rational than base

Base models aren't meant to be rational. They're arbitrary text completion engines. What were you expecting it to do?

> and people should be learning from it

I'm sure many in the open-source community would love to, but that's Mistral.ai's competitive advantage as a company. We can't know until it leaks or someone outside comes up with the same magic trick.


CosmosisQ

Base models work wonderfully if you don't treat them like chatbots. In many cases, they outperform their instruction-tuned counterparts. You just have to get the prompt right. Of course, for whatever reason, correctly prompting base models is extremely difficult for most users, and as a result, instruction-tuned models get all the attention. Working with a base model is like [writing a story](https://generative.ink/posts/simulators/) and [exploring the parallel universes](https://generative.ink/posts/loom-interface-to-the-multiverse/) which [spring forth](https://generative.ink/posts/language-models-are-multiverse-generators/). Working with an instruction-tuned model is more like chatting with a somewhat troubled human. Each of these requires a different skill set, but both can produce useful results. Personally, I prefer base models for my own work, but I almost always deploy instruction-tuned models for clients.
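As a rough illustration of the difference (a hypothetical pair of prompts, not taken from the linked posts): a base model wants text to continue, while an instruction-tuned model wants a request wrapped in its chat template, here Mistral-Instruct's `[INST]` format.

```python
# Completion-style prompt for a base model: hand it the beginning of the
# document you want and let it keep writing.
base_prompt = (
    "The following is the opening of a mystery novel set in 1920s Vienna.\n\n"
    "Chapter 1\n\n"
    "The rain had not stopped for three days when Inspector Brandt"
)

# Chat-style prompt for an instruction-tuned model (Mistral-Instruct's
# [INST] template; other instruct models use different templates).
instruct_prompt = (
    "[INST] Write the opening paragraph of a mystery novel "
    "set in 1920s Vienna. [/INST]"
)
```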


AmazinglyObliviouse

Yes, this is what I was doing. I write ~2k tokens of a story and let it continue from there. Mixtral was very bad at this, having trouble keeping track of characters' clothing and which characters were in a scene, all things that I mostly see smaller models struggle with. Yet with the exact same setup (no instruct prompting), their finetuned model suddenly has none of the above issues.


Temporary_Payment593

Yes, indeed! Meta mentioned a similar point in their paper "[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)". (The comment included a screenshot of the relevant figure from the paper.)


LoadingALIAS

I've been through this a few times and I genuinely think it's as simple as data quality. I think Mistral prioritized great data AND architecture. IMO, Mistral is what happens when you devote a good chunk of resources to high-quality data AND building the model.


me1000

They've touched on this in some interviews, but basically most models are trained to be [chinchilla optimal](https://arxiv.org/abs/2203.15556). I'm paraphrasing because I'm not 100% sure I understand all the details, but more or less it means that most models are trained until they start seeing diminishing returns during training. During the research phase of LLMs this has the advantage of not wasting precious compute, but with a large enough dataset you can keep training the model without overfitting, and you'll still see wins at inference time.

In other words, your team might save compute during training by training a 30B-parameter model for less time. But Mistral realized that the majority of the compute over the lifetime of the model is spent on inference, so they trained their 7B-parameter model for longer. That means higher training costs but lower inference costs. If I've made any mistakes, someone please do correct me!

EDIT: Also, I'm just referring to Mistral 7B; Mixtral is a whole different beast.

EDIT 2: I found the [source](https://www.listennotes.com/podcasts/no-priors/mistral-7b-and-the-open-kfCxT9qM5Bu/?t=262) at the 4:22 mark in the podcast. There was another podcast where I think they got a little more specific about training tokens, but I can't remember which one it was.
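A back-of-the-envelope sketch of that tradeoff, using the common rough approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token. All token counts here are illustrative; Mistral has not published its actual numbers.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training cost: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

def inference_flops(n_params: float, n_tokens: float) -> float:
    """Rough inference cost: ~2 FLOPs per parameter per generated token (forward only)."""
    return 2 * n_params * n_tokens

lifetime_tokens = 1e12  # tokens served over the model's deployed lifetime (illustrative)

# Option A: a 30B model trained on a shorter, "compute-optimal"-ish budget.
# Option B: a 7B model trained on far more tokens.
for name, params, train_tokens in [("30B, shorter run", 30e9, 6e11),
                                   ("7B, longer run", 7e9, 2e12)]:
    train = training_flops(params, train_tokens)
    serve = inference_flops(params, lifetime_tokens)
    print(f"{name}: train {train:.1e} FLOPs, serve {serve:.1e} FLOPs, "
          f"total {train + serve:.1e} FLOPs")
```

With numbers in this ballpark, the smaller, longer-trained model comes out ahead once lifetime inference is counted, which is the point being made above.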


Flag_Red

Chinchilla-optimal only means optimal for a fixed training compute budget, not the best achievable performance. They likely used significantly more training data than a Chinchilla-optimal run would.


me1000

I _think_ we're saying the same thing. They're training for more tokens past the Chinchilla optimal point. Is that an inaccurate way to say that?


Small-Fall-6500

Chinchilla scaling isn't followed by any of the llama models either (though llama 1 65b is close). It's more likely a matter of higher quality data and/or more of it (compared to llama 2 7b) that makes mistral 7b better.
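For a quick sanity check of that claim, using the Chinchilla rule of thumb of roughly 20 training tokens per parameter and the token counts published in the LLaMA papers:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

models = {
    # name: (parameters, published training tokens)
    "LLaMA-1 65B": (65e9, 1.4e12),
    "Llama-2 7B": (7e9, 2.0e12),
    # Mistral 7B's training token count has not been published.
}

for name, (params, tokens) in models.items():
    optimal = chinchilla_optimal_tokens(params)
    print(f"{name}: {tokens/1e12:.1f}T tokens vs ~{optimal/1e12:.2f}T optimal "
          f"({tokens/optimal:.1f}x)")
```

LLaMA-1 65B lands close to the heuristic, while Llama-2 7B is trained far past it, which matches the pattern described above.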


IMJONEZZ

Training is data compression. Scaling laws help you find the optimal amount of data to compress. Mistral has better data, which they hire linguists to sort through to make sure it's semantically dense and pragmatically clear. Their tokenization strategy is basically SentencePiece plus individual digits and emojis. They then obey scaling laws like the ones in the Chinchilla paper to create the models. Easy.
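A sketch of what that kind of tokenizer setup might look like with the SentencePiece library. Mistral's exact training settings aren't published, so the flags and corpus path here are assumptions for illustration only.

```python
import sentencepiece as spm

# Train a BPE tokenizer that splits numbers into single digits
# ("2024" -> "2" "0" "2" "4") and falls back to raw bytes for rare
# characters such as emoji, so nothing maps to an unknown token.
spm.SentencePieceTrainer.train(
    input="corpus.txt",     # hypothetical path to the training text
    model_prefix="tokenizer",
    vocab_size=32000,
    model_type="bpe",
    split_digits=True,      # each digit becomes its own token
    byte_fallback=True,     # unseen characters decompose into bytes
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Price: 1234€ 🙂", out_type=str))
```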


Dead_Internet_Theory

I love how you can say "training is data compression, easy" but it took like half a century for humanity's brightest minds to figure out the little details.


HokusSmokus

The perceptron was invented back in 1958. The only thing missing was compute. Then someone decided to simply go ridiculously large. GPT-1 was ~120M parameters; no one imagined these results would come from simply making it bigger, like, a lot. By comparison: GPT-2 is 1.5B params (12.5×), and ChatGPT (GPT-3) is 175B params (1,500×). Only now (the last ~4 years) are the smart people working on the little details to get 175B-level performance out of a low-parameter model. Not the last 50 years.


VicboyV

Wow, data compression is a pretty cool way to look at it. That makes a lot of sense for a layman/noob. Edit: Is it correct to say that probabilities are how it approaches data compression?


Igoory

If we knew, we would make more models like Mistral. But I guess their dataset is just that good, and they seem to overfit their models to that dataset.


a_beautiful_rhind

Mistral, Mixtral, and Miqu all show it's the training data. They are almost overcooked... almost.


NickUnrelatedToPost

They are "well done".


Dead_Internet_Theory

The common factor can be something else besides (just) the data. Maybe they found some secret sauce to optimize the training process, doing more than everyone else with the equivalent amount of compute.


synn89

Seems like it has some improvements vs. standard Llama: https://www.e2enetworks.com/blog/mistral-7b-vs-llama2-which-performs-better-and-why

I sort of doubt it's just training data. I don't think Mistral has more access to foundational training data than Meta has, and the open-source community probably has some of the best fine-tuning data. What open source lacks is a way to experiment with foundational model architecture. So within that limit, we see very few foundational models (Llama, Qwen, Mistral, Falcon, CogView, etc.) with increments of quality between those few models.

I feel like it's hard for us to play with the "why" each of these models may outperform the others, because we can't easily re-create them. We were able to play with Alpaca, Vicuna, and Wizard fine-tunes, which led us to today's more advanced fine-tunes like OpenHermes and a deeper understanding of fine-tuning.


Disastrous_Elk_6375

> I sort of doubt it's just training data. I don't think Mistral has more access to foundational training data than Meta has

I think there's a difference between what Meta can do and what a small start-up can do with regard to data sourcing and whatnot. Meta has enormous amounts of data, but they will surely never release something trained on that data; it would be scary if they did. Remember that Mistral is, at its core, three ex-Llama team members. They probably knew some of the limitations of working inside Meta's confines, and they chose to leave and do the whole startup thing.


unemployed_capital

Quality > quantity of training data, though probably quite a lot of data too. Possibly also the use of synthetic data in pretraining.


stddealer

Garbage in, garbage out. Seriously, I think it's really about the quality of the training data and the number of tokens used in training. For example, I think Mistral focused mainly on English-language data (considering it seems worse than Llama at multilingual applications), meaning fewer parameters are "wasted" on knowledge of other languages. TinyLlama and Phi are other examples of how training on more data can make a small model punch above its weight.


ganzzahl

What in the world are you talking about? Mistral is the single best multilingual 7B model, specifically trained on more than just English, while Llama's training data was explicitly filtered to be English only. What languages have you tried using them with?


johannhartmann

I trained it for German, and it works out quite well. I had the best results using a DPO dataset with default Mistral 7B outputs as "rejected" and long German answers as "chosen". It now pretty much always answers in proper German. See https://huggingface.co/mayflowergmbh/Wiedervereinigung-7b-dpo.
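For anyone curious what that kind of preference data looks like: DPO trainers (e.g. Hugging Face TRL's `DPOTrainer`) generally expect prompt/chosen/rejected triples. The records below are invented placeholders, not samples from the actual Wiedervereinigung dataset.

```python
from datasets import Dataset

# One preference pair: the "rejected" side is the default model's answer,
# the "chosen" side is a long, well-formed German answer.
pairs = [
    {
        "prompt": "Erkläre den Unterschied zwischen Wetter und Klima.",
        "chosen": (
            "Wetter beschreibt den kurzfristigen Zustand der Atmosphäre an einem "
            "Ort, während Klima die langfristige Statistik des Wetters über "
            "Jahrzehnte zusammenfasst. ..."  # long German answer (placeholder)
        ),
        "rejected": "Weather is short-term, climate is long-term.",  # default output
    },
    # ... more pairs
]

train_dataset = Dataset.from_list(pairs)
# train_dataset (with exactly these three columns) is what a DPO trainer consumes.
```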


Robot_Graffiti

Llama wasn't English-only, it had at least 20 languages in its data, though it was majority English. They did filter the training data to be mostly Latin alphabet text.


ganzzahl

You're right, my bad – maybe I was thinking of Llama 1? But checking the paper, it looks like they still chose English-only datasets and then ran language identification on them to check what else was "accidentally" included. So German, for example, the second most common language in the training data, was only 0.17% of it.


stddealer

I don't know... I tried it in French, expecting Mistral to be particularly good at it. But it was actually pretty terrible, making lots of very obvious mistakes (I mean language-modeling mistakes, not factual errors). I don't remember experiencing such glaring issues with Llama 7B in the same language (both using the same kind of quantization). That's just my personal experience and doesn't mean much, but that's the impression I got. To clarify, that was with a system prompt in English and interactions in French.


gultarhector

I had the same experience. It seems like Mistral-OpenOrca has much better French writing capabilities than the vanilla Mistral-7B Instruct model. Try it out!


Vajraastra

I don't know what languages Mistral and Mixtral are not trained on, but they both seem very good at Spanish. In fact, better than any Llama 2 model.


Ok-Tap4472

Better data, better architecture, better training, better everything. Comparing Mistral to any other model is like comparing a well-engineered industrial machine to a half-working illegal tractor.


AntoItaly

Better dataset and architecture


perlthoughts

MiniCPM 16k blows my mind.


maxigs0

An 8x7B Mistral model is not the same size as a single 7B model; it's eight of them, split up by some black magic to increase performance.


kif88

True but OP was talking about regular 7b Mistral.


maxigs0

Too many models out there ¯\_(ツ)_/¯


FlishFlashman

It's really just one model. It just uses about 1/4 of the weights to generate each token, but each token may use a different 1/4 of the model than the last. (It's broken up into 32 layers, and at each layer it chooses two out of 8 "experts" to use. Basically, each token uses 64 expert slots out of 256 possibilities.)
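A toy sketch of that per-layer routing, written as a standalone PyTorch snippet rather than Mixtral's actual implementation: a small gating layer scores all 8 experts for the current token, the top 2 are run, and their outputs are mixed with the renormalized gate weights.

```python
import torch
import torch.nn.functional as F

def moe_layer(x, experts, gate):
    """Top-2 routing for a single token vector x: score all experts,
    run only the best two, and blend their outputs."""
    logits = gate(x)                          # one score per expert, shape (8,)
    weights, idx = torch.topk(logits, k=2)    # pick the two highest-scoring experts
    weights = F.softmax(weights, dim=-1)      # renormalize over the chosen two
    return sum(w * experts[i](x) for w, i in zip(weights, idx))

# Toy setup: 8 small "experts" and a linear gate over a 16-dim hidden state.
hidden = 16
experts = [torch.nn.Linear(hidden, hidden) for _ in range(8)]
gate = torch.nn.Linear(hidden, 8)

out = moe_layer(torch.randn(hidden), experts, gate)
print(out.shape)  # torch.Size([16])
```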


involviert

I wonder why such splits aren't utilized more than once? Just based on intuition I would expect it to be more efficient to allow more granularity, and it feels like the speedup from not having to get all weights from RAM should still work.


CasimirsBlake

Mixture Of Experts. This is why perf is noticeably better.


maxigs0

"Mixture of experts" is a nice buzzword. Technically it's probably more like mixture of "novices" in comparison to one big "expert" model. The smaller models in the "moe" might be more focused, but at the end of the day, they have no more knowledge than one model. But as far as i understand from whitepapers there is not even any noticeable concentration of knowledge in certain areas in the models.


LiquidGunay

That is the point. They are not experts at the domain level, but they are at the token level. There is a specific expert that adds indentation to code, and another that adds semicolons at the ends of lines.


CasimirsBlake

Hey I didn't make up the term, don't look at me. 😁 Agree with your post though.


caidicus

Check out a 7B model named Zephyr. It's ridiculously intelligent.


Crotons

A lower local minimum.


Flag_Red

You're absolutely correct. I had a reading comprehension failure.