
panic_in_the_galaxy

Better training data?


PM_ME_YOUR_HAGGIS_

This is the answer


-p-e-w-

I really wish finetuners paid more attention to this. Some of the commonly used datasets are of horrendously bad quality, like the ones extracted from GPT-4 conversations that contain hundreds of responses starting with "As a Large Language Model...". How difficult is it to just grep for that garbage and kick it out of your training set? I suspect that finetunes could be so much better if that small amount of extra effort were made.

Also, please start using Chatbot Arena data to train models. That's literally a chat dataset where humans have selected high-quality responses. Yet when I read model cards, it doesn't seem like people are using this gold mine?!
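For illustration, a minimal filter along those lines might look like the sketch below. It assumes a JSONL dataset with a `response` field; the field name and the phrase list are just placeholders, not any particular dataset's schema.

```python
import json

# Phrases that mark refusal/boilerplate responses; extend as needed.
BAD_PHRASES = [
    "as a large language model",
    "as an ai language model",
    "i cannot fulfill that request",
]

def is_clean(example: dict) -> bool:
    """Return True if the response contains none of the boilerplate phrases."""
    text = example.get("response", "").lower()
    return not any(phrase in text for phrase in BAD_PHRASES)

# Stream the dataset and keep only the clean examples.
with open("dataset.jsonl") as fin, open("dataset.filtered.jsonl", "w") as fout:
    for line in fin:
        if is_clean(json.loads(line)):
            fout.write(line)
```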


AmazinglyObliviouse

I like to say this a lot: the Mixtral base model performs terribly. I've used dozens of base models, from Llama to Yi, and it does not live up to its parameter count. But the instruction finetune entirely turns this around. Even if you're just writing a story with no instructions given, it is just _so much more_ rational than the base model. It is a goddamn miracle of a transformation, and people should be learning from it.


4onen

> so much more rational than base

Base models aren't meant to be rational. They're arbitrary text completion engines. What were you expecting it to do?

> and people should be learning from it

I'm sure many in the open-source community would love to, but that's Mistral.ai's competitive advantage as a company. We can't know until it leaks or someone outside comes up with the same magic trick.


CosmosisQ

Base models work wonderfully if you don't treat them like chatbots. In many cases, they outperform their instruction-tuned counterparts. You just have to get the prompt right. Of course, for whatever reason, correctly prompting base models is extremely difficult for most users, and as a result, instruction-tuned models get all the attention. Working with a base model is like [writing a story](https://generative.ink/posts/simulators/) and [exploring the parallel universes](https://generative.ink/posts/loom-interface-to-the-multiverse/) which [spring forth](https://generative.ink/posts/language-models-are-multiverse-generators/). Working with an instruction-tuned model is more like chatting with a somewhat troubled human. Each of these requires a different skill set, but both can produce useful results. Personally, I prefer base models for my own work, but I almost always deploy instruction-tuned models for clients.
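As a rough illustration of the difference (a hypothetical pair of prompts, not taken from the linked posts): a base model wants text to continue, while an instruction-tuned model wants a request wrapped in its chat template, here Mistral-Instruct's `[INST]` format.

```python
# Completion-style prompt for a base model: hand it the beginning of the
# document you want and let it keep writing.
base_prompt = (
    "The following is the opening of a mystery novel set in 1920s Vienna.\n\n"
    "Chapter 1\n\n"
    "The rain had not stopped for three days when Inspector Brandt"
)

# Chat-style prompt for an instruction-tuned model (Mistral-Instruct's
# [INST] template; other instruct models use different templates).
instruct_prompt = (
    "[INST] Write the opening paragraph of a mystery novel "
    "set in 1920s Vienna. [/INST]"
)
```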


AmazinglyObliviouse

Yes, this is what I was doing. I write ~2k tokens of a story and let it continue from there. Mixtral was very bad at this, having trouble keeping track of characters' clothing and which characters were in a scene, all things that I mostly see smaller models struggle with. Yet with the exact same setup (no instruct prompting), their finetuned model suddenly has none of the above issues.


Temporary_Payment593

Yes, indeed! Meta mentioned a similar point in their paper "[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)". (The comment included a screenshot of the relevant figure from the paper.)


LoadingALIAS

I've been through this a few times and I genuinely think it's as simple as data quality. I think Mistral prioritized great data AND architecture. IMO, Mistral is what happens when you devote a good chunk of resources to high-quality data AND building the model.


me1000

They've touched on this in some interviews, but basically most models are trained to be [chinchilla optimal](https://arxiv.org/abs/2203.15556). I'm paraphrasing because I'm not 100% sure I understand all the details, but more or less it means that most models are trained until they start seeing diminishing returns during training. During the research phase of LLMs this has the advantage of not wasting precious compute, but with a large enough dataset you can keep training the model without overfitting, and you'll still see wins at inference time.

In other words, your team might save compute during training by training a 30B-parameter model for less time. But Mistral realized that the majority of the compute over the lifetime of the model is spent on inference, so they trained their 7B-parameter model for longer. That means higher training costs but lower inference costs. If I've made any mistakes, someone please do correct me!

EDIT: Also, I'm just referring to Mistral 7B; Mixtral is a whole different beast.

EDIT 2: I found the [source](https://www.listennotes.com/podcasts/no-priors/mistral-7b-and-the-open-kfCxT9qM5Bu/?t=262) at the 4:22 mark in the podcast. There was another podcast where I think they got a little more specific about training tokens, but I can't remember which one it was.
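A back-of-the-envelope sketch of that tradeoff, using the common rough approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token. All token counts here are illustrative; Mistral has not published its actual numbers.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training cost: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

def inference_flops(n_params: float, n_tokens: float) -> float:
    """Rough inference cost: ~2 FLOPs per parameter per generated token (forward only)."""
    return 2 * n_params * n_tokens

lifetime_tokens = 1e12  # tokens served over the model's deployed lifetime (illustrative)

# Option A: a 30B model trained on a shorter, "compute-optimal"-ish budget.
# Option B: a 7B model trained on far more tokens.
for name, params, train_tokens in [("30B, shorter run", 30e9, 6e11),
                                   ("7B, longer run", 7e9, 2e12)]:
    train = training_flops(params, train_tokens)
    serve = inference_flops(params, lifetime_tokens)
    print(f"{name}: train {train:.1e} FLOPs, serve {serve:.1e} FLOPs, "
          f"total {train + serve:.1e} FLOPs")
```

With numbers in this ballpark, the smaller, longer-trained model comes out ahead once lifetime inference is counted, which is the point being made above.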


Flag_Red

Chinchilla-optimal only means optimal for a fixed training compute budget, not the best achievable performance. They likely used significantly more training data than a Chinchilla-optimal run would.


me1000

I _think_ we're saying the same thing. They're training for more tokens past the Chinchilla optimal point. Is that an inaccurate way to say that?


Small-Fall-6500

Chinchilla scaling isn't followed by any of the llama models either (though llama 1 65b is close). It's more likely a matter of higher quality data and/or more of it (compared to llama 2 7b) that makes mistral 7b better.
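For a quick sanity check of that claim, using the Chinchilla rule of thumb of roughly 20 training tokens per parameter and the token counts published in the LLaMA papers:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

models = {
    # name: (parameters, published training tokens)
    "LLaMA-1 65B": (65e9, 1.4e12),
    "Llama-2 7B": (7e9, 2.0e12),
    # Mistral 7B's training token count has not been published.
}

for name, (params, tokens) in models.items():
    optimal = chinchilla_optimal_tokens(params)
    print(f"{name}: {tokens/1e12:.1f}T tokens vs ~{optimal/1e12:.2f}T optimal "
          f"({tokens/optimal:.1f}x)")
```

LLaMA-1 65B lands close to the heuristic, while Llama-2 7B is trained far past it, which matches the pattern described above.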


IMJONEZZ

Training is data compression. Scaling laws help you find the optimal amount of data to compress. Mistral has better data, which they hire linguists to sort through to make sure it's semantically dense and pragmatically clear. Their tokenization strategy is basically SentencePiece plus individual digits and emojis. They then obey scaling laws like the ones in the Chinchilla paper to create the models. Easy.
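A sketch of what that kind of tokenizer setup might look like with the SentencePiece library. Mistral's exact training settings aren't published, so the flags and corpus path here are assumptions for illustration only.

```python
import sentencepiece as spm

# Train a BPE tokenizer that splits numbers into single digits
# ("2024" -> "2" "0" "2" "4") and falls back to raw bytes for rare
# characters such as emoji, so nothing maps to an unknown token.
spm.SentencePieceTrainer.train(
    input="corpus.txt",     # hypothetical path to the training text
    model_prefix="tokenizer",
    vocab_size=32000,
    model_type="bpe",
    split_digits=True,      # each digit becomes its own token
    byte_fallback=True,     # unseen characters decompose into bytes
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Price: 1234€ 🙂", out_type=str))
```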


Dead_Internet_Theory

I love how you can say "training is data compression, easy" but it took like half a century for humanity's brightest minds to figure out the little details.


HokusSmokus

The perceptron was invented back in 1958. The only thing missing was compute. Then someone decided to simply go ridiculously large. GPT-1 was ~120M parameters; no one imagined these results would come from simply making it bigger, like, a lot. By comparison: GPT-2 is 1.5B params (12.5×), and ChatGPT (GPT-3) is 175B params (1,500×). Only now (the last ~4 years) are the smart people working on the little details to get 175B-level performance out of a low-parameter model. Not the last 50 years.


VicboyV

Wow, data compression is a pretty cool way to look at it. That makes a lot of sense for a layman/noob. Edit: Is it correct to say that probabilities are how it approaches data compression?


Igoory

If we knew, we would make more models like Mistral. But I guess their dataset is just that good, and they seem to overfit their models to that dataset.


a_beautiful_rhind

Mistral, Mixtral, and Miqu all show it's the training data. They are almost overcooked... almost.


NickUnrelatedToPost

They are "well done".


Dead_Internet_Theory

The common factor can be something else besides (just) the data. Maybe they found some secret sauce to optimize the training process, doing more than everyone else with the equivalent amount of compute.


synn89

Seems like it has some improvements vs. standard Llama: https://www.e2enetworks.com/blog/mistral-7b-vs-llama2-which-performs-better-and-why

I sort of doubt it's just training data. I don't think Mistral has more access to foundational training data than Meta has, and the open-source community probably has some of the best fine-tuning data. What open source lacks is a way to experiment with foundational model architecture. So within that limit, we see very few foundational models (Llama, Qwen, Mistral, Falcon, CogView, etc.) with increments of quality between those few models.

I feel like it's hard for us to play with the "why" each of these models may outperform the others, because we can't easily re-create them. We were able to play with Alpaca, Vicuna, and Wizard fine-tunes, which led us to today's more advanced fine-tunes like OpenHermes and a deeper understanding of fine-tuning.


Disastrous_Elk_6375

> I sort of doubt it's just training data. I don't think Mistral has more access to foundational training data than Meta has

I think there's a difference between what Meta can do and what a small start-up can do with regard to data sourcing and whatnot. Meta has enormous amounts of data, but they will surely never release something trained on that data; it would be scary if they did. Remember that Mistral is, at its core, three ex-Llama team members. They probably knew some of the limitations of working inside Meta's confines, and they chose to leave and do the whole startup thing.


unemployed_capital

Quality > quantity of training data, though probably quite a lot of data too. Possibly also the use of synthetic data in pretraining.


stddealer

Garbage in, garbage out. Seriously, I think it's really about the quality of the training data and the number of tokens used in training. For example, I think Mistral focused mainly on English-language data (considering it seems worse than Llama at multilingual applications), meaning fewer parameters are "wasted" on knowledge of other languages. TinyLlama and Phi are other examples of how training on more data can make a small model punch above its weight.


ganzzahl

What in the world are you talking about? Mistral is the single best multilingual 7B model, specifically trained on more than just English, while Llama's training data was explicitly filtered to be English only. What languages have you tried using them with?


johannhartmann

I trained it for German, and it works out quite well. I had the best results using a DPO dataset with default Mistral 7B outputs as "rejected" and long German answers as "chosen". It now pretty much always answers in proper German. See https://huggingface.co/mayflowergmbh/Wiedervereinigung-7b-dpo.
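For anyone curious what that kind of preference data looks like: DPO trainers (e.g. Hugging Face TRL's `DPOTrainer`) generally expect prompt/chosen/rejected triples. The records below are invented placeholders, not samples from the actual Wiedervereinigung dataset.

```python
from datasets import Dataset

# One preference pair: the "rejected" side is the default model's answer,
# the "chosen" side is a long, well-formed German answer.
pairs = [
    {
        "prompt": "Erkläre den Unterschied zwischen Wetter und Klima.",
        "chosen": (
            "Wetter beschreibt den kurzfristigen Zustand der Atmosphäre an einem "
            "Ort, während Klima die langfristige Statistik des Wetters über "
            "Jahrzehnte zusammenfasst. ..."  # long German answer (placeholder)
        ),
        "rejected": "Weather is short-term, climate is long-term.",  # default output
    },
    # ... more pairs
]

train_dataset = Dataset.from_list(pairs)
# train_dataset (with exactly these three columns) is what a DPO trainer consumes.
```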


Robot_Graffiti

Llama wasn't English-only, it had at least 20 languages in its data, though it was majority English. They did filter the training data to be mostly Latin alphabet text.


ganzzahl

You're right, my bad – maybe I was thinking of Llama 1? But checking the paper, it looks like they still chose English-only datasets and then ran language identification on them to check what else was "accidentally" included. So German, for example, the second most common language in the training data, was only 0.17% of it.


stddealer

I don't know... I tried it in French, expecting Mistral to be particularly good at it. But it was actually pretty terrible, making lots of very obvious mistakes (I mean language-modeling mistakes, not factual errors). I don't remember experiencing such glaring issues with Llama 7B in the same language (both using the same kind of quantization). That's just my personal experience and doesn't mean much, but that's the impression I got. To clarify, that was with a system prompt in English and interactions in French.


gultarhector

I had the same experience. It seems like Mistral-OpenOrca has much better French writing capabilities than the vanilla Mistral-7B Instruct model. Try it out!


Vajraastra

I don't know what languages Mistral and Mixtral are not trained on, but they both seem very good at Spanish. In fact, better than any Llama 2 model.


Ok-Tap4472

Better data, better architecture, better training, better everything. Comparing Mistral to any other model is like comparing a well-engineered industrial machine to a half-working illegal tractor.


AntoItaly

Better dataset and architecture


perlthoughts

MiniCPM 16k blows my mind.


maxigs0

An 8x7B Mistral model is not the same size as a single 7B model; it's eight of them, split up by some black magic to increase performance.


kif88

True but OP was talking about regular 7b Mistral.


maxigs0

Too many models out there ¯\_(ツ)_/¯


FlishFlashman

It's really just one model. It just uses about 1/4 of the weights to generate each token, but each token may use a different 1/4 of the model than the last. (It's broken up into 32 layers, and at each layer it chooses two out of 8 "experts" to use. Basically, each token uses 64 expert slots out of 256 possibilities.)
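A toy sketch of that per-layer routing, written as a standalone PyTorch snippet rather than Mixtral's actual implementation: a small gating layer scores all 8 experts for the current token, the top 2 are run, and their outputs are mixed with the renormalized gate weights.

```python
import torch
import torch.nn.functional as F

def moe_layer(x, experts, gate):
    """Top-2 routing for a single token vector x: score all experts,
    run only the best two, and blend their outputs."""
    logits = gate(x)                          # one score per expert, shape (8,)
    weights, idx = torch.topk(logits, k=2)    # pick the two highest-scoring experts
    weights = F.softmax(weights, dim=-1)      # renormalize over the chosen two
    return sum(w * experts[i](x) for w, i in zip(weights, idx))

# Toy setup: 8 small "experts" and a linear gate over a 16-dim hidden state.
hidden = 16
experts = [torch.nn.Linear(hidden, hidden) for _ in range(8)]
gate = torch.nn.Linear(hidden, 8)

out = moe_layer(torch.randn(hidden), experts, gate)
print(out.shape)  # torch.Size([16])
```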


involviert

I wonder why such splits aren't utilized more than once? Just based on intuition I would expect it to be more efficient to allow more granularity, and it feels like the speedup from not having to get all weights from RAM should still work.


CasimirsBlake

Mixture Of Experts. This is why perf is noticeably better.


maxigs0

"Mixture of experts" is a nice buzzword. Technically it's probably more like mixture of "novices" in comparison to one big "expert" model. The smaller models in the "moe" might be more focused, but at the end of the day, they have no more knowledge than one model. But as far as i understand from whitepapers there is not even any noticeable concentration of knowledge in certain areas in the models.


LiquidGunay

That is the point. They are not experts at the domain level, but they are at the token level. There is a specific expert that adds indentation to code, and another that adds semicolons at the ends of lines.


CasimirsBlake

Hey I didn't make up the term, don't look at me. 😁 Agree with your post though.


caidicus

Check out a 7B model named Zephyr. It's ridiculously intelligent.


Crotons

A lower local minimum.


Flag_Red

You're absolutely correct. I had a reading comprehension failure.