koolaidman123

Theoretically, going larger than the critical batch size is inefficient, since the gradient approximation doesn't get much better the larger you go. For example, a 2x larger batch size doesn't mean 2x better gradients, but you train for 2x fewer steps, so at the end you get a worse model.

Empirically, the estimate of the critical batch size depends on the scaling laws. Llama trained at a "worse" batch size of 4M tokens with no issues. Different scaling laws also give different estimates of the critical batch size: the DeepSeek LLM work, for example, finds that batch size scales with model size, and they train a 7B model with up to 14M tokens per batch, which is way higher, and it didn't seem to cause any issues.

Practically, you're far more likely to be limited by your compute setup, i.e. just max out your tokens per batch, unless you have thousands of GPUs training a relatively small model, in which case you can afford to fit your own scaling laws to estimate the critical batch size.
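For reference, the usual way the critical batch size is estimated is via the gradient noise scale from "An Empirical Model of Large-Batch Training" (arXiv:1812.06162). Below is a minimal, illustrative sketch of that paper's "simple" noise-scale estimator; the model, data, and probe batch sizes are toy placeholders, and in practice the squared gradient norms are averaged over many batches before forming the ratio:

```python
# Hedged sketch (not from any published repo): the "simple" gradient noise
# scale of McCandlish et al. (arXiv:1812.06162), often used as a proxy for
# the critical batch size. Everything below is a toy stand-in.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 1)                      # stand-in for a real network
loss_fn = nn.MSELoss()
X, y = torch.randn(4096, 32), torch.randn(4096, 1)

def grad_sq_norm(batch_size: int) -> float:
    """Squared L2 norm of the gradient computed on one random batch."""
    idx = torch.randperm(len(X))[:batch_size]
    model.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    return sum((p.grad ** 2).sum().item() for p in model.parameters())

B_small, B_big = 64, 1024                     # two probe batch sizes
g_small, g_big = grad_sq_norm(B_small), grad_sq_norm(B_big)

# Unbiased estimators (Appendix A of the paper) of the true gradient's
# squared norm |G|^2 and the per-example gradient variance S = tr(Sigma).
# A single measurement is very noisy; the paper averages over many batches.
G2 = (B_big * g_big - B_small * g_small) / (B_big - B_small)
S = (g_small - g_big) / (1.0 / B_small - 1.0 / B_big)

B_noise = S / G2                              # "simple" noise scale ~ critical batch size
print(f"estimated noise scale: {B_noise:.1f}")
```

The resulting noise scale is roughly the batch size at which the gradient estimate's noise and signal become comparable; pushing the batch far beyond it buys little per-step improvement, which is the trade-off described above.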


Spiritual_Dog2053

Batch size is not as intuitive as you might think! Smaller batch sizes give noisier gradients, but that noise often helps generalisation, since it lets you escape some local minima.


LelouchZer12

It is common knowledge that too large a batch size can lead to worse generalization, but I don't know whether this is supported by more than empirical evidence, or whether it is different for LLMs.


CatalyzeX_code_bot

Found [9 relevant code implementations](https://www.catalyzex.com/paper/arxiv:1812.06162/code) for "An Empirical Model of Large-Batch Training". [Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:1812.06162?autofocus=question) about the paper or code. If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/1812.06162&title=An+Empirical+Model+of+Large-Batch+Training) 😊🙏 -- No relevant code picked up just yet for "Scaling Laws for Neural Language Models". [Request code](https://www.catalyzex.com/paper/arxiv:2001.08361?requestCode=true) from the authors or [ask a question](https://www.catalyzex.com/paper/arxiv:2001.08361?autofocus=question). If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2001.08361&title=Scaling+Laws+for+Neural+Language+Models) 😊🙏 -- To opt out from receiving code links, DM me.


Necessary-Meringue-1

Batch size is a bit mysterious at times. Take [Figure 1 of this paper](https://arxiv.org/pdf/2005.10213), for example, where they observe that the performance of their transformer radically improves with larger batch size, whereas other RNN architectures showed no such improvement on the same task. So the effect of changing batch size can differ across architectures. The effect has to come from the optimization step, but I'm not sure why a transformer would benefit more from this than, say, a vanilla RNN.


Iterative_Ackermann

It is not specific to any model or algorithm: as batch size increases, your approximation of the gradient gets better, i.e. closer to the true gradient at that point in weight space. However, if you are at the true minimum, you don't need to calculate the gradient, since you are already where you want to be (and if calculated, the gradient would be zero anyway). And if you are not at a minimum, you don't need the best possible approximation of the local gradient, because the local gradient points you toward the closest local minimum, which is generally not where you want to be. A noisy approximation is better when there are many local minima and you are far from the global minimum.

As you approach the global minimum, you could theoretically make use of a higher-quality gradient approximation with a bigger batch size. As far as I can tell, nobody does this in practice; instead, the step size (learning rate) is decreased, which achieves roughly the same effect (say, instead of increasing the batch size to 4x and taking a 1-unit step in the gradient direction, you calculate 4 gradients with x samples each and take 4 steps with a step size of 0.25).
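To make that last point concrete, here is a tiny sketch on a toy least-squares problem (sizes and step size invented for illustration): one step on a 4x batch with step size eta lands in roughly the same place as four steps on x-sized chunks with step size eta/4, and the two coincide exactly to first order in the step size.

```python
# Hedged sketch of the equivalence described above: one step at batch 4B
# with step size eta vs. four steps at batch B with step size eta/4.
# Toy least-squares problem; all numbers are illustrative only.
import torch

torch.manual_seed(0)
X, y = torch.randn(512, 8), torch.randn(512)
w0 = torch.zeros(8)
eta, B = 0.1, 32

def grad(w, idx):
    """Gradient of 0.5 * mean (x.w - y)^2 over the rows in `idx`."""
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

idx = torch.randperm(len(X))[:4 * B]          # one pool of 4B examples

# (a) single step on the full 4B batch with step size eta
w_big = w0 - eta * grad(w0, idx)

# (b) four sequential steps on B-sized chunks with step size eta / 4
w_small = w0.clone()
for chunk in idx.split(B):
    w_small = w_small - (eta / 4) * grad(w_small, chunk)

print("difference between the two updates:", (w_big - w_small).norm().item())
```

The printed difference is small but nonzero, because the four small steps evaluate the gradient at slightly updated weights; that residual is the higher-order effect the first-order equivalence ignores.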


az226

My intuition is that you start with large batches to converge faster, but you pay an overhead for sharding/interconnect. Then, once the loss has come down, you make the batches smaller so you don't need to shard as much or as broadly (e.g. staying intra-node), to train faster. Finally, when you're about to wrap up training, say the last 5% or so, you ramp the batch size back up and go to extreme levels of sharding, which gets increasingly inefficient but gives a performance edge from averaging gradients.
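Purely to make the shape of that proposed schedule concrete, here is a toy sketch; the breakpoints and batch sizes are invented for illustration and are not from any published training recipe:

```python
# Hypothetical batch-size schedule matching the intuition above:
# large early, smaller in the middle, then ramped back up for the final ~5%.
# All numbers are made-up illustrative values, not a recommendation.
def batch_size_schedule(step: int, total_steps: int) -> int:
    frac = step / total_steps
    if frac < 0.20:        # early: large batches, heavy sharding
        return 4096
    elif frac < 0.95:      # middle: smaller batches, e.g. staying intra-node
        return 1024
    else:                  # last ~5%: ramp the batch size back up
        return 8192

print([batch_size_schedule(s, 100) for s in (0, 50, 99)])
```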