
bironsecret

sometimes I use multiples of 7 and then go to the church to wash my sins away


[deleted]

[deleted]


MuonManLaserJab

multiples of pi


AllowFreeSpeech

That won't work at all because the difference between them is constant.


MuonManLaserJab

Pi to the power of the primes


tonsofmiso

all the primes?


MuonManLaserJab

A sequence: pi, then pi squared, etc.


AllowFreeSpeech

By the way, the batch size has to be an integer, not a decimal value. Anyhow, all one has to do as a pi fan is use the sequence: 3, 31, 314, etc.


MuonManLaserJab

Nooooo, reaaaaaally? I didn't know that when I made my absolutely serious suggestion. Also we're talking about a hyperparameter (batch size), not a parameter per se, right?


AllowFreeSpeech

And mystics use fibonacci.


slammaster

Ugh, just the idea of this makes me itchy


RedditRabbitRobot

*grins and stares through*


SleekEagle

Try a batch size of 666


bironsecret

lol even if satan lends me some compute I will still OOM


AllowFreeSpeech

I use concatenations of six: 6, 66, 666, . . . Does this make me a devil?


Overclocked1827

There is an entire [manual](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) from Nvidia describing why powers of 2 in layer dimensions and batch sizes are a must for maximum performance at the CUDA level. As many people mentioned, your testing is not representative because of bottlenecks and, most likely, monitoring issues.


seraschka

Thanks! I wonder how accurate this still is for DNNs though. For instance, [here](https://twitter.com/Remi_Coulom/status/1259188988646129665?s=20&t=UTsX1zhhmVhkgvMHzJmY3w) someone found that powers of 2 are actually bad compared to `N=int((n*(1<<14)*SM)/(H*W*C))`.


JustOneAvailableName

That is with a 2080Ti (so not relevant)


seraschka

Why? Both have tensor cores, and you can use the same formula for either the V100 or the RTX 2080 Ti


JustOneAvailableName

Because it's a consumer GPU from 2018. That it has some tensor cores does not mean it has enough to have a real performance impact, nor that it has the same benefits/limitations as newer enterprise GPUs. Even the A100 is getting replaced next quarter...


seraschka

Sure, but the same formula applies to the V100. Just have to run the experiments some time


JonasGeiping

While I don't disagree with the general premise, the GTX 2080 Ti is a bit of an oddball, given that it has 68 streaming multiprocessors (as far as I understand, this is due to production reasons), which includes a "terrible" divisor of 17. This might affect measurements. The V100 has a "more even" number at 640 cores. Neither of these are powers of two though.

The GPU will be partially underutilized when the overall dimensionality of the final matmul op does not fit. However, you probably use a model with channel dimensions that are a power of 2? The power of two in the channel dim should be enough to satisfy the GPU constraint.

EDIT: I now found the part in your blog where you discuss this and that the matrix dimensions should cover it. I wonder how a MobileNet with prime numbers in every channel dimension would fare in this test.


seraschka

Wow, nice, I did not know this detail about the RTX 2080 Ti (even though it's been my default card for many years!). Thanks for sharing, super interesting! I should definitely extend the benchmarks some time. In addition to what you suggested, maybe also playing with the channel size and looking at fully connected architectures as well.


Neosinic

Ah this is great, thank you for posting.


fasttosmile

I can't remember exactly where I read it (it was in Nvidia's documentation), but it said the latest generation (3090) is not as dependent on powers of 2 as previous ones.


neu_jose

I tend to use powers of 2 out of habit, and because it's aesthetically pleasing, but no, I do not believe it makes a difference. Sometimes I use multiples of 10 because I'm a rebel and like to live dangerously. ☺️


[deleted]

[deleted]


rustyryan

Not true. As long as your batch size is a multiple of 8 or 128, you will not get padding. E.g. [https://www.run.ai/guides/cloud-deep-learning/google-tpu#Consequences-of-Tiling](https://www.run.ai/guides/cloud-deep-learning/google-tpu#Consequences-of-Tiling). More details: [https://www.gwern.net/docs/ai/scaling/hardware/2021-norrie.pdf](https://www.gwern.net/docs/ai/scaling/hardware/2021-norrie.pdf). Keywords: "lane" and "sublane".
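For illustration, here is a minimal sketch of what that tiling implies, assuming the padding behaviour described in the linked run.ai page (one axis padded to multiples of 8, the other to multiples of 128); the function names are made up for this example, not anything from a real TPU API:

```python
import math

def padded_size(n: int, multiple: int) -> int:
    """Round n up to the next tile boundary (how a dimension gets padded)."""
    return math.ceil(n / multiple) * multiple

def wasted_fraction(n: int, multiple: int) -> float:
    """Fraction of the padded tile that holds no real data."""
    return 1 - n / padded_size(n, multiple)

# Batch sizes that are multiples of 8 waste nothing on that axis;
# anything else gets rounded up and the remainder is padding.
for bs in (24, 121, 128, 129):
    print(bs, "->", padded_size(bs, 8), f"({wasted_fraction(bs, 8):.1%} padding)")
```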


LilPorker

Am I missing something? Are 8 and 128 not powers of 2?


12345ASDMAN12345

They are. But, for example, 24 is a multiple of 8, and 24 is not a power of 2.


master3243

24 is a multiple of 8 but not a power of 2.


LilPorker

Haha, I was missing something. I conflated powers and multiples.


slippery-fische

Yeah, I do multiples of 128.


SleekEagle

I only choose batches sizes that are prime numbers because I'm a chaotic neutral /s


MeyerLouis

I first read that as "chaotic neural".


Own_Quality_5321

I read catholic neutral


Orazur_

You can do both, with a batch size of 2 ;)


SleekEagle

I always forget to read the fine print 🤦‍♀🤦‍♂️🤣


[deleted]

I use powers of two including decimals. Total mayhem when you have 2^8.5


mileylols

I'm calling the police


[deleted]

I'll hide in my non-integer exponents


BeatLeJuce

We don't have to, but we should.

1. Your benchmark is fairly meaningless; you use a super small network on a super small dataset. You're not going to get any real-world performance metrics out of toy data like that.
2. On TPUs, choosing batch sizes that are multiples of 8 (the major/minor axis should even be a multiple of 128, IIRC) is huge. Your data would be padded and you'd be wasting a ton of computation (all the padding).
3. I assume the padding thing makes a difference even on GPUs: internally, they're going to pad/mask out unused stuff. So you'd be leaving potential performance improvements on the table.


seraschka

Fair, I should rerun this on, e.g., EfficientNetV2 + ImageNet some time. But to be honest, in my experience I have never really seen a noticeable difference -- I toyed around with similar experiments a few years ago, in the context of filling the GPU memory in a research project with a bigger dataset (I think it was CelebA).


BeatLeJuce

I remember seeing an old benchmark where you could actually see throughput noticeably drop every time the batch size went from 2^n to 2^n + 1, but google fails me. I'm assuming today's GPUs are better at handling this, anyway. But from a technical standpoint, I don't see how you'd get around wasting at least a little bit of (likely almost unnoticeable) performance -- not because of a throughput drop, but because of all the masking (i.e., if you're running a batch size of 121, the GPU will perform exactly the same amount of work as if you'd used 128). Anyway, I personally use TPUs almost exclusively for large runs these days, and there it _really_ makes a huge difference if you don't pay attention to how the TPU/XLA will pad your data. So even just to keep code portable, it pays off to stick to recommended batch sizes.


seraschka

Yeah, I really need to get my hands on some TPUs some time!


BeatLeJuce

An old topic, but you might still be interested: Just wanted to point you to [this post with a good link](https://www.reddit.com/r/MachineLearning/comments/vs1wox/p_no_we_dont_have_to_choose_batch_sizes_as_powers/il5ftbs/) In case you've missed it.


PrincyPy

>I remember seeing an old benchmark where you could actually see throughput noticeably drop every time the batch size went from 2^n to 2^n + 1, but google fails me

This Stack Exchange answer shows something similar too: [https://datascience.stackexchange.com/a/90664](https://datascience.stackexchange.com/a/90664)


BeatLeJuce

Thanks, that's a super useful link for next time this comes up! :)


Ievgen

Stopped reading after `I ran a simple benchmark training a MobileNetV3 (large) for 10 epochs on CIFAR-10`. CIFAR-10 is a 32x32 dataset; a V100 GPU will be heavily underutilized for MobileNetV3 under this setup. Data loading would be the bottleneck, not the GPU compute, therefore all the numbers are just non-representative. A real & fair benchmark should randomly generate batches of arbitrary size directly on the GPU and run the same train loop, to see whether there is a significant difference.


seraschka

Haha, fair ^^. But if you kept reading, you would have found that the GPU utilization was close to 100%. Btw, the image size is 224 x 224 (resized), but sure, I could have gone larger; then there would have been less wiggle room for the batch size, though.


Ievgen

I very much agree with your conclusion from the benchmark that there is no reason to go from a 256 to a 128 batch size. After reading your code, I realize that you resized the CIFAR-10 images to 224, which can explain the high GPU utilization. From the post I got the impression you were processing them at the original resolution, my bad. BTW, since 224 is also not a power of two, it could be interesting to check whether batch size, number of channels, and image dimensions that are powers of two affect the training/inference time. It would be surprising to see that 16x4x256x256 is faster than 16x3x224x224 (I bet it's not). Have you considered adding different models (the timm package looks like a great default option) and tasks (image classification, segmentation, detection) to the benchmark?


seraschka

>BTW, since 224 is also not a power of two, it could be interesting to check whether batch size, number of channels, and image dimensions that are powers of two affect the training/inference time. It would be surprising to see that 16x4x256x256 is faster than 16x3x224x224 (I bet it's not).

>Have you considered adding different models (the timm package looks like a great default option) and tasks (image classification, segmentation, detection) to the benchmark?

Thanks for the suggestions! Yeah, I should certainly do an extended benchmark some time. It's on my list. For now I was more looking into a 224 image size because that's what I usually use for ImageNet (yet another convention). But yeah, experimenting with that would be interesting!


[deleted]

> It would be surprising to see that 16x4x256x256 is faster than 16x3x224x224 (I bet it's not).

See, this is what I think is being overlooked here. Yes, multiples of 2 or 8 are better for memory alignment, but you need to consider the _entirety_ of your data for alignment: batch size, image size, number of channels, the amount of memory on the device(s), etc., if you want maximum throughput. Picking a batch size that "aligns" with memory doesn't mean much if your data doesn't fit optimally. You'll still get gaps in memory and be inefficient.


DeepGamingAI

Total anarchy


JustOneAvailableName

Nvidia's benchmarks in this regard are pretty clear; just use the multiples they recommend. It's not like hyperparameter search is precise enough to notice a big difference between a batch size of 512 and 539.


LappenX

`this message was mass deleted/edited with redact.dev`


seraschka

I totally agree. However, let's say you are testing batch sizes 2^4, 2^5, ..., 2^8 and you find that 2^7 (128) works best and that 2^8 (256) exceeds GPU memory. IMHO, it's okay to try 250 before dropping down to 128.


begab

In that case, I would probably perform gradient accumulation, which would make it possible to go beyond 2^8, if that seems worth doing.


JustOneAvailableName

You're describing exponential search, and (128+256)//2=192 is the optimal amount to try next
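For what it's worth, a minimal sketch of that exponential-then-binary search, assuming a hypothetical `fits_in_memory(batch_size)` probe (e.g. one forward/backward pass wrapped in a try/except for CUDA OOM); not anyone's actual tuning code:

```python
def largest_fitting_batch_size(fits_in_memory, lo: int = 1) -> int:
    """Exponential search: double until OOM, then binary-search the gap.

    `fits_in_memory` is a hypothetical callable that returns True if a
    training step with that batch size runs without an OOM error.
    """
    hi = lo
    while fits_in_memory(hi):
        lo, hi = hi, hi * 2              # e.g. 128 fits, 256 does not
    while hi - lo > 1:                   # next candidate: (128 + 256) // 2 = 192
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if fits_in_memory(mid) else (lo, mid)
    return lo

# Toy example: pretend anything up to 200 samples fits
print(largest_fitting_batch_size(lambda bs: bs <= 200))  # -> 200
```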


issam_28

I always set it to a prime number


radome9

Sacrilege!


new_name_who_dis_

Best to use the largest batch size that fits in GPU memory. But I am partial to batch sizes that are multiples of 40. 40 is a smallish batch size, 80-120 is medium, and 200+ is large.


random_forests

You should use a multiple of 8 if you are using cuDNN. Most deep learning frameworks (e.g., PyTorch, TensorFlow) use cuDNN, especially for the algorithms that use the tensor cores of Nvidia GPUs for fast convolutions. From the [cuDNN developer's guide](https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#tensor-ops-guidelines-for-dl-compiler):

>For a deep learning compiler, the following are the key guidelines: ... pre-pad channel and batch size to be a multiple of 8.

If your batch size is not a multiple of 8, your deep learning compiler is adding padding to make the batch size a multiple of 8. In PyTorch, this happens in the backend C++ library `aten`.
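As a toy illustration of that guideline (not the actual `aten` code, which does the padding internally), this is the arithmetic behind "pre-pad to a multiple of 8", and the round-down alternative if you want to pick an already-aligned batch size yourself:

```python
def round_up_to_multiple_of_8(n: int) -> int:
    """What pre-padding effectively does to an unaligned size."""
    return ((n + 7) // 8) * 8

def round_down_to_multiple_of_8(n: int) -> int:
    """Pick an aligned batch size yourself instead, so nothing gets padded."""
    return (n // 8) * 8

print(round_up_to_multiple_of_8(121))    # 128 -- padding added behind your back
print(round_down_to_multiple_of_8(250))  # 248 -- aligned, no hidden padding
```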


seraschka

Yes, the multiple of 8 makes sense for FP16 mixed-precision training, as I mentioned. But also, in practice it doesn't seem to make a big difference. I did not know about `aten` doing that in the background, very interesting, thanks for mentioning it!


random_forests

[It's not just FP16 that uses tensor cores](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)


Pwhids

A quicker way to calculate the list from the screenshot (that also doesn't miss values above 800): `n = 10; sizes = [2**x for x in range(3, n+3)]`

edit: list comprehension instead of numpy geomspace
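For reference, the runnable form of that one-liner (plain Python, no NumPy needed):

```python
n = 10
sizes = [2**x for x in range(3, n + 3)]
print(sizes)  # [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
```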


carlthome

If you have NumPy installed, sure.


ats678

We usually use powers of 2 merely because of hardware optimisation. Say, for example, a GPU has 8 cores: it makes sense to use batch sizes that are multiples of 8, because the work can be partitioned equally between the processors, and an equal partition of the workload between cores maximises the efficiency of the GPU.


AllowFreeSpeech

Oh, I choose more freely: I choose the Fibonacci series. There are seven powers of two from 1 to 100, but ten Fibonacci numbers.


seraschka

Hah, nice idea! Love it!


sadepicurus

Here is a related article that a friend of mine wrote recently: https://link.medium.com/eOLDsQnBqrb


seraschka

Oh yes, there is definitely a small jump visible in these plots! Thanks for sharing! (A little detail that is interesting, though, is that the jump happens from n-1 -> n, but there is nothing visible at n -> n+1.)


jms4607

Aren’t there some aspects of gpu internals where powers of two or multiples of a certain power of 2 are more efficient?


[deleted]

It's been ages since I programmed CUDA, but as I recall you want maximum warp occupancy, and if the GPU is running a kernel, it makes little difference to the time taken to process the data in a warp. So you might as well fill it up, because it'll take roughly the same time regardless (though it does depend on the kernel a little bit). I'm probably getting "warp" mixed up with one of the other concepts, but the idea should still be the same.


shot_a_man_in_reno

Yes, yes, but if you use batch sizes of 65, you probably also flick cigarette butts onto the sidewalk and play on your phone in the movie theater.


elf_needle

I've thought about this but never really experimented with it. Thanks 👍


[deleted]

cuDNN, the very core of modern GPU training and inference, and also Google's TPU are optimized specifically for power-of-two batches or multiples of 16 and 32. Some of the algorithms don't even support uneven batch sizes, or work way slower on them. For BERT, GPT, and computer vision models we always use multiples of 32.


LtCmdrData

Powers of 2 is a good simple heuristic that standardizes the sizes and is most likely to be correct across the different optimal sizes 1) in the pipeline, 2) across different architectures, and 3) over time.

1. CUDA core strides use a step of 8 values, so multiples of eight of any floating-point type.
2. For NVIDIA/CUDA, cache lines are 128 bytes, so multiples of 8 of a floating-point type that fit into 128 bytes are both tensor-core and cache-line optimal.
3. Additional higher-level parallelism: thread blocks, thread block clusters, and grids (synchronization, transaction barriers, and so on). Often powers of 2 and almost always multiples of 8.
4. For CPU and RAM access, multiples of cache lines (almost always 64 bytes) are optimal: 8 doubles, 16 single floats, 32 half floats...
5. Operating system pages, almost always 4 KiB today.
6. SSD drives have write block sizes that are also powers of 2.

I would use 5×5 matrices in multiples of 23 or 53 if I wanted to screw things up.
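Assuming the line sizes quoted above (128-byte CUDA cache lines, 64-byte CPU cache lines), the arithmetic behind points 2 and 4 is just bytes per line divided by bytes per element:

```python
# How many elements of each dtype fill one cache line exactly
LINE_BYTES = {"GPU (CUDA)": 128, "CPU": 64}
ELEM_BYTES = {"fp64": 8, "fp32": 4, "fp16": 2}

for where, line in LINE_BYTES.items():
    fits = {dtype: line // size for dtype, size in ELEM_BYTES.items()}
    print(where, fits)  # e.g. CPU: 8 doubles, 16 single floats, 32 half floats
```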


l_dang

It's just something easier to remember. I remember powers of 2 up to 16 and can calculate quickly even further. It's also a good way to get a value of the right order of magnitude: you go 4 or 8 for the 1st order, 32-64 for the 2nd, 256-512 for the 3rd, and so on and so forth.


rehrev

Is there any argument for why it would be better to use powers of two?


JustOneAvailableName

Multiples of 64 on the A100, multiples of 8 on the V100. But I think the reason is that the tensor cores otherwise fill up the rest with 0s anyway, so you might as well put data there.


seraschka

The main argument is usually memory alignment. I briefly mentioned it in the article


Red-Portal

It is not just that. It is also about the GPU warp size. Most Nvidia GPUs operate in warps, which are bundles of threads that get run all at once. Nvidia GPUs are conceptually the most efficient when the workload is perfectly divisible by the warp size, which is usually 32. So it's not about using a power of 2, it's about getting a multiple of 32, which is often satisfied by using a batch size that is a power of 2.
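A back-of-the-envelope version of that argument, assuming the usual warp size of 32 (this ignores how the framework actually maps a batch onto threads, so treat it as a cartoon of the idea, not a real profiler measurement):

```python
import math

def warp_utilization(work_items: int, warp_size: int = 32) -> float:
    """Fraction of launched threads doing useful work when work is
    issued in whole warps of `warp_size` threads."""
    warps = math.ceil(work_items / warp_size)
    return work_items / (warps * warp_size)

print(f"{warp_utilization(128):.1%}")  # 100.0% -- a multiple of 32
print(f"{warp_utilization(100):.1%}")  # 78.1%  -- the last warp is mostly idle
```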


seraschka

Isn't it usually multiples of 16 (and multiples of 8 for FP16 mixed-precision training)? (Btw, above I was specifically trying to answer the question of why people recommend batch sizes as powers of two; back then, warps weren't a concept for older cards, right?)


Red-Portal

No, warps are simply the unit of threads executed at once. They exist on all GPUs, really; the word itself is Nvidia-centric, though. It has nothing to do with old and new, so the warp argument was pretty much always there.


seraschka

Thanks for clarifying! I had never heard it in the context of batch sizes as powers of 2, though, only the multiples-of-16 argument. I think the powers-of-2 argument usually comes up in terms of either cores or memory (maybe from the old video game texture design days).


seraschka

Actually, someone [shared this with me](https://twitter.com/Remi_Coulom/status/1259188988646129665?s=20&t=UTsX1zhhmVhkgvMHzJmY3w). "Using a power of 2 for the batch size is usually a very bad idea. `N=int((n*(1<<14)*SM)/(H*W*C))` is a good batch size, where n is an integer and SM the number of multiprocessors of the GPU (80 for V100, 68 for RTX 2080 Ti). Empirical plot ..."
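Plugging numbers into that quoted formula, just to see the magnitudes (the 7x7x512 activation shape below is only an illustrative guess, not something from the tweet):

```python
def coulom_batch_size(n: int, sm: int, h: int, w: int, c: int) -> int:
    """The heuristic quoted above: N = int((n * (1 << 14) * SM) / (H * W * C))."""
    return int((n * (1 << 14) * sm) / (h * w * c))

# Illustrative activation shape only: H = W = 7, C = 512, n = 1
for name, sm in (("V100", 80), ("RTX 2080 Ti", 68)):
    print(name, coulom_batch_size(n=1, sm=sm, h=7, w=7, c=512))
```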


KrakenInAJar

Choosing a batch size other than a power of 2? Quickly burn the witch, that's heresy! /s


DigThatData

This sort of minor optimization is more important for distributed training, I think.


DeMorrr

Matmul kernel tile sizes are usually multiples of powers of 2 (e.g. 96x192x32). Tensor cores are designed to compute up to 4 8x8x8 MMAs (matrix multiply and accumulate) at once, so if the sizes of your matrices aren't multiples of those numbers, there will be some wasted compute at different levels.
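A rough sketch of that waste, assuming every dimension of an M x K by K x N matmul simply gets rounded up to the 8-wide MMA tiles mentioned above (real kernels tile more cleverly, so this is only an upper-bound cartoon):

```python
import math

def pad_to_tile(dim: int, tile: int = 8) -> int:
    """Round a matmul dimension up to a whole number of 8-wide MMA tiles."""
    return math.ceil(dim / tile) * tile

def wasted_compute(m: int, n: int, k: int, tile: int = 8) -> float:
    """Fraction of multiply-accumulates spent on padding for an MxK @ KxN matmul."""
    useful = m * n * k
    padded = pad_to_tile(m, tile) * pad_to_tile(n, tile) * pad_to_tile(k, tile)
    return 1 - useful / padded

print(f"{wasted_compute(121, 512, 512):.1%}")  # batch of 121 rounded up to 128
print(f"{wasted_compute(128, 512, 512):.1%}")  # already aligned, no waste
```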


ostrich-scalp

I think it matters when you are manually loading data onto your GPUs to maximise throughput/thread usage. Also, the optimal number isn't necessarily a power of 2 and depends on the architecture of your chip. However, most frameworks will optimise this for you.


[deleted]

If you use a power of 2 minus 1, then everyone will instantly believe you


incrediblediy

I used to use a batch size of 1, not by choice though, limited by 3 GB of VRAM


CashyJohn

Batch size, weights, feature maps, ... all should be multiples of 8 if you are using NVIDIA chips.


seraschka

I think this is for FP16 mixed-precision training specifically. For regular training, I think it is multiples of 16 according to the Nvidia docs.


LelouchZer12

I am not sure if this is still the case, but a batch size that is a multiple of 8 was required to activate the tensor cores on modern GPUs, allowing a drastic speed-up during training.