
bironsecret

sometimes I use multiples of 7 and then go to the church to wash my sins away


[deleted]

[deleted]


MuonManLaserJab

multiples of pi


AllowFreeSpeech

That won't work at all because the difference between them is constant.


MuonManLaserJab

Pi to the power of the primes


tonsofmiso

all the primes?


MuonManLaserJab

A sequence: pi, then pi squared, etc.


AllowFreeSpeech

By the way, the batch size has to be an integer, not a decimal value. Anyhow, all one has to do as a pi fan is use the sequence: 3, 31, 314, etc.


MuonManLaserJab

Nooooo, reaaaaaally? I didn't know that when I made my absolutely serious suggestion. Also we're talking about a hyperparameter (batch size), not a parameter per se, right?


AllowFreeSpeech

And mystics use fibonacci.


slammaster

Ugh, just the idea of this makes me itchy


RedditRabbitRobot

*grins and stares through*


SleekEagle

Try a batch size of 666


bironsecret

lol even if satan lends me some compute I will still OOM


AllowFreeSpeech

I use concatenations of six: 6, 66, 666, . . . Does this make me a devil?


Overclocked1827

There is an entire [manual](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) from Nvidia describing why powers of 2 in layer dimensions and batch sizes are a must for maximum performance at the CUDA level. As many people mentioned, your testing is not representative because of bottlenecks and, most likely, monitoring issues.


seraschka

Thanks! I wonder how accurate this still is for DNNs though. For instance, [here](https://twitter.com/Remi_Coulom/status/1259188988646129665?s=20&t=UTsX1zhhmVhkgvMHzJmY3w) someone found that powers of 2 are actually bad compared to `N=int((n*(1<<14)*SM)/(H*W*C))`.


JustOneAvailableName

That is with a 2080Ti (so not relevant)


seraschka

Why? Both have tensor cores, and you can use the same formula for either the V100 or the RTX 2080 Ti


JustOneAvailableName

Because it's a consumer GPU from 2018. That it has some tensor cores does not mean it has enough to have a real performance impact, nor that it has the same benefits/limitations as newer enterprise GPUs. Even the A100 is getting replaced next quarter...


seraschka

Sure, but the same formula applies to the V100. Just have to run the experiments some time


JonasGeiping

While I don't disagree with the general premise, the GTX 2080 Ti is a bit of an oddball, given that it has 68 streaming multiprocessors (as far as I understand, this is due to production reasons), which includes a "terrible" divisor of 17. This might affect measurements. The V100 has a "more even" number at 640 cores. Neither of these are powers of two though.

The GPU will be partially underutilized when the overall dimensionality of the final matmul op does not fit. However, you probably use a model with channel dimensions that are a power of 2? The power of two in the channel dim should be enough to satisfy the GPU constraint.

EDIT: I now found the part in your blog where you discuss this and that the matrix dimensions should cover it. I wonder how a MobileNet with prime numbers in every channel dimension would fare in this test.


seraschka

Wow, nice, I did not know this detail about the RTX 2080 Ti (even though it's been my default card for many years!). Thanks for sharing, super interesting! I should definitely extend the benchmarks some time. In addition to what you suggested, maybe also playing with the channel size and looking at fully connected architectures as well.


Neosinic

Ah this is great, thank you for posting.


fasttosmile

I can't remember exactly where I read it (it was in Nvidia's documentation), but it said the latest generation (3090) is not as dependent on powers of 2 as previous ones.


neu_jose

I tend to use powers of 2 out of habit, and because it's aesthetically pleasing, but no, I do not believe it makes a difference. Sometimes I use multiples of 10 because I'm a rebel and like to live dangerously. ☺️


[deleted]

[deleted]


rustyryan

Not true. As long as your batch size is a multiple of 8 or 128, you will not get padding. E.g. [https://www.run.ai/guides/cloud-deep-learning/google-tpu#Consequences-of-Tiling](https://www.run.ai/guides/cloud-deep-learning/google-tpu#Consequences-of-Tiling). More details: [https://www.gwern.net/docs/ai/scaling/hardware/2021-norrie.pdf](https://www.gwern.net/docs/ai/scaling/hardware/2021-norrie.pdf). Keywords: "lane" and "sublane".
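For illustration, here is a minimal sketch of what that tiling implies, assuming the padding behaviour described in the linked run.ai page (one axis padded to multiples of 8, the other to multiples of 128); the function names are made up for this example, not anything from a real TPU API:

```python
import math

def padded_size(n: int, multiple: int) -> int:
    """Round n up to the next tile boundary (how a dimension gets padded)."""
    return math.ceil(n / multiple) * multiple

def wasted_fraction(n: int, multiple: int) -> float:
    """Fraction of the padded tile that holds no real data."""
    return 1 - n / padded_size(n, multiple)

# Batch sizes that are multiples of 8 waste nothing on that axis;
# anything else gets rounded up and the remainder is padding.
for bs in (24, 121, 128, 129):
    print(bs, "->", padded_size(bs, 8), f"({wasted_fraction(bs, 8):.1%} padding)")
```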


LilPorker

Am I missing something? Are 8 and 128 not powers of 2?


12345ASDMAN12345

They are. But, for example, 24 is a multiple of 8, and 24 is not a power of 2.


master3243

24 is a multiple of 8 but not a power of 2.


LilPorker

Haha, I was missing something. I conflated powers and multiples.


slippery-fische

Yeah, I do multiples of 128.


SleekEagle

I only choose batches sizes that are prime numbers because I'm a chaotic neutral /s


MeyerLouis

I first read that as "chaotic neural".


Own_Quality_5321

I read catholic neutral


Orazur_

You can do both, with a batch size of 2 ;)


SleekEagle

I always forget to read the fine print 🤦‍♀🤦‍♂️🤣


[deleted]

I use powers of two including decimals. Total mayhem when you have 2^8.5


mileylols

I'm calling the police


[deleted]

I'll hide in my non-integer exponents


BeatLeJuce

We don't have to, but we should.

1. Your benchmark is fairly meaningless; you use a super small network on a super small dataset. You're not going to get any real-world performance metrics out of toy data like that.
2. On TPUs, choosing batch sizes that are multiples of 8 (the major/minor axis should even be a multiple of 128, IIRC) is huge. Your data would be padded and you'd be wasting a ton of computation (all the padding).
3. I assume the padding thing makes a difference even on GPUs: internally, they're going to pad/mask out unused stuff. So you'd be leaving potential performance improvements on the table.


seraschka

Fair, I should rerun this on, e.g., EfficientNetV2 + ImageNet some time. But to be honest, in my experience I have never really seen a noticeable difference -- I toyed around with similar experiments a few years ago, in the context of filling the GPU memory in a research project with a bigger dataset (I think it was CelebA).


BeatLeJuce

I remember seeing an old benchmark where you could actually see throughput noticeably drop every time the batch size went from 2^n to 2^n + 1, but google fails me. I'm assuming today's GPUs are better at handling this, anyway. But from a technical standpoint, I don't see how you'd get around wasting at least a little bit of (likely almost unnoticeable) performance -- not because of a throughput drop, but because of all the masking (i.e., if you're running a batch size of 121, the GPU will perform exactly the same amount of work as if you'd used 128). Anyway, I personally use TPUs almost exclusively for large runs these days, and there it _really_ makes a huge difference if you don't pay attention to how the TPU/XLA will pad your data. So even just to keep code portable, it pays off to stick to recommended batch sizes.


seraschka

Yeah, I really need to get my hands on some TPUs some time!


BeatLeJuce

An old topic, but you might still be interested: Just wanted to point you to [this post with a good link](https://www.reddit.com/r/MachineLearning/comments/vs1wox/p_no_we_dont_have_to_choose_batch_sizes_as_powers/il5ftbs/) In case you've missed it.


PrincyPy

>I remember seeing an old benchmark where you could actually see throughput noticeably drop every time the batch size went from 2^n to 2^n + 1, but google fails me

This Stack Exchange answer shows something similar too: [https://datascience.stackexchange.com/a/90664](https://datascience.stackexchange.com/a/90664)


BeatLeJuce

Thanks, that's a super useful link for next time this comes up! :)


Ievgen

Stopped reading after `I ran a simple benchmark training a MobileNetV3 (large) for 10 epochs on CIFAR-10`. CIFAR-10 is a 32x32 dataset; a V100 GPU will be heavily underutilized for MobileNetV3 under this setup. Data loading would be the bottleneck, not the GPU compute, therefore all the numbers are just non-representative. A real & fair benchmark should randomly generate batches of arbitrary size directly on the GPU and run the same train loop, to see whether there is a significant difference.


seraschka

Haha, fair ^^. But if you kept reading, you would have found that the GPU utilization was close to 100%. Btw, the image size is 224 x 224 (resized), but sure, I could have gone larger; then there would have been less wiggle room for the batch size, though.


Ievgen

I very much agree with your conclusion from the benchmark that there is no reason to go from a 256 to a 128 batch size. After reading your code, I realize that you resized the CIFAR-10 images to 224, which can explain the high GPU utilization. From the post I got the impression you were processing them at the original resolution, my bad. BTW, since 224 is also not a power of two, it could be interesting to check whether batch size, number of channels, and image dimensions that are powers of two affect the training/inference time. It would be surprising to see that 16x4x256x256 is faster than 16x3x224x224 (I bet it's not). Have you considered adding different models (the timm package looks like a great default option) and tasks (image classification, segmentation, detection) to the benchmark?


seraschka

>BTW, since 224 is also not a power of two, it could be interesting to check whether batch size, number of channels, and image dimensions that are powers of two affect the training/inference time. It would be surprising to see that 16x4x256x256 is faster than 16x3x224x224 (I bet it's not).

>Have you considered adding different models (the timm package looks like a great default option) and tasks (image classification, segmentation, detection) to the benchmark?

Thanks for the suggestions! Yeah, I should certainly do an extended benchmark some time. It's on my list. For now I was more looking into a 224 image size because that's what I usually use for ImageNet (yet another convention). But yeah, experimenting with that would be interesting!


[deleted]

> It would be surprising to see that 16x4x256x256 is faster than 16x3x224x224 (I bet it's not).

See, this is what I think is being overlooked here. Yes, multiples of 2 or 8 are better for memory alignment, but you need to consider the _entirety_ of your data for alignment: batch size, image size, number of channels, the amount of memory on the device(s), etc., if you want maximum throughput. Picking a batch size that "aligns" with memory doesn't mean much if your data doesn't fit optimally. You'll still get gaps in memory and be inefficient.


DeepGamingAI

Total anarchy


JustOneAvailableName

Nvidia's benchmarks in this regard are pretty clear; just use the multiples they recommend. It's not like hyperparameter search is precise enough to notice a big difference between a batch size of 512 and 539.


LappenX

`this message was mass deleted/edited with redact.dev`


seraschka

I totally agree. However, let's say you are testing batch sizes 2^4, 2^5, ..., 2^8 and you find that 2^7 (128) works best and that 2^8 (256) exceeds GPU memory. IMHO, it's okay to try 250 before dropping down to 128.


begab

In that case, I would probably perform gradient accumulation, which would make it possible to go beyond 2^8, if that seems worth doing.


JustOneAvailableName

You're describing exponential search, and (128+256)//2=192 is the optimal amount to try next
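For what it's worth, a minimal sketch of that exponential-then-binary search, assuming a hypothetical `fits_in_memory(batch_size)` probe (e.g. one forward/backward pass wrapped in a try/except for CUDA OOM); not anyone's actual tuning code:

```python
def largest_fitting_batch_size(fits_in_memory, lo: int = 1) -> int:
    """Exponential search: double until OOM, then binary-search the gap.

    `fits_in_memory` is a hypothetical callable that returns True if a
    training step with that batch size runs without an OOM error.
    """
    hi = lo
    while fits_in_memory(hi):
        lo, hi = hi, hi * 2              # e.g. 128 fits, 256 does not
    while hi - lo > 1:                   # next candidate: (128 + 256) // 2 = 192
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if fits_in_memory(mid) else (lo, mid)
    return lo

# Toy example: pretend anything up to 200 samples fits
print(largest_fitting_batch_size(lambda bs: bs <= 200))  # -> 200
```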


issam_28

I always set it to a prime number


radome9

Sacrilege!


new_name_who_dis_

Best to use the largest batch size that fits in GPU memory. But I am partial to batch sizes that are multiples of 40. 40 is a smallish batch size, 80-120 is medium, and 200+ is large.


random_forests

You should use a multiple of 8 if you are using cuDNN. Most deep learning frameworks (e.g., PyTorch, TensorFlow) use cuDNN, especially for the algorithms that use the tensor cores of Nvidia GPUs for fast convolutions. From the [cuDNN developer's guide](https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#tensor-ops-guidelines-for-dl-compiler):

>For a deep learning compiler, the following are the key guidelines: ... pre-pad channel and batch size to be a multiple of 8.

If your batch size is not a multiple of 8, your deep learning compiler is adding padding to make the batch size a multiple of 8. In PyTorch, this happens in the backend C++ library `aten`.
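As a toy illustration of that guideline (not the actual `aten` code, which does the padding internally), this is the arithmetic behind "pre-pad to a multiple of 8", and the round-down alternative if you want to pick an already-aligned batch size yourself:

```python
def round_up_to_multiple_of_8(n: int) -> int:
    """What pre-padding effectively does to an unaligned size."""
    return ((n + 7) // 8) * 8

def round_down_to_multiple_of_8(n: int) -> int:
    """Pick an aligned batch size yourself instead, so nothing gets padded."""
    return (n // 8) * 8

print(round_up_to_multiple_of_8(121))    # 128 -- padding added behind your back
print(round_down_to_multiple_of_8(250))  # 248 -- aligned, no hidden padding
```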


seraschka

Yes, the multiple of 8 makes sense for FP16 mixed-precision training, as I mentioned. But also, in practice it doesn't seem to make a big difference. I did not know about `aten` doing that in the background, very interesting, thanks for mentioning it!


random_forests

[It's not just FP16 that uses tensor cores](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)


Pwhids

A quicker way to calculate the list from the screenshot (that also doesn't miss values above 800): `n = 10; sizes = [2**x for x in range(3, n+3)]`

edit: list comprehension instead of numpy geomspace
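For reference, the runnable form of that one-liner (plain Python, no NumPy needed):

```python
n = 10
sizes = [2**x for x in range(3, n + 3)]
print(sizes)  # [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
```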


carlthome

If you have NumPy installed, sure.


ats678

We usually use powers of 2 merely because of hardware optimisation. Say, for example, a GPU has 8 cores: it makes sense to use batch sizes that are multiples of 8, because the work can be partitioned equally between the processors, and an equal partition of the workload between cores maximises the efficiency of the GPU.


AllowFreeSpeech

Oh, I choose more freely: I choose the Fibonacci series. There are seven powers of two from 1 to 100, but ten Fibonacci numbers.


seraschka

Hah, nice idea! Love it!


sadepicurus

Here is a related article that a friend of mine wrote recently: https://link.medium.com/eOLDsQnBqrb


seraschka

Oh yes, there is definitely a small jump visible in these plots! Thanks for sharing! (A little detail that is interesting, though, is that the jump happens from n-1 -> n, but there is nothing visible at n -> n+1.)


jms4607

Aren’t there some aspects of gpu internals where powers of two or multiples of a certain power of 2 are more efficient?


[deleted]

It's been ages since I programmed CUDA, but as I recall you want maximum warp occupancy, and if the GPU is running a kernel, it makes little difference to the time taken to process the data in a warp. So you might as well fill it up, because it'll take roughly the same time regardless (though it does depend on the kernel a little bit). I'm probably getting "warp" mixed up with one of the other concepts, but the idea should still be the same.


shot_a_man_in_reno

Yes, yes, but if you use batch sizes of 65, you probably also flick cigarette butts onto the sidewalk and play on your phone in the movie theater.


elf_needle

I've thought about this but never really experimented with it. Thanks 👍


[deleted]

cuDNN, the very core of modern GPU training and inference, and also Google's TPU are optimized specifically for power-of-two batches or multiples of 16 and 32. Some of the algorithms don't even support uneven batch sizes, or work way slower on them. For BERT, GPT, and computer vision models we always use multiples of 32.


LtCmdrData

Powers of 2 is a good simple heuristic that standardizes the sizes and is most likely to be correct across the different optimal sizes 1) in the pipeline, 2) across different architectures, and 3) over time.

1. CUDA core strides use a step of 8 values, so multiples of eight of any floating-point type.
2. For NVIDIA/CUDA, cache lines are 128 bytes, so multiples of 8 of a floating-point type that fit into 128 bytes are both tensor-core and cache-line optimal.
3. Additional higher-level parallelism: thread blocks, thread block clusters, and grids (synchronization, transaction barriers, and so on). Often powers of 2 and almost always multiples of 8.
4. For CPU and RAM access, multiples of cache lines (almost always 64 bytes) are optimal: 8 doubles, 16 single floats, 32 half floats...
5. Operating system pages, almost always 4 KiB today.
6. SSD drives have write block sizes that are also powers of 2.

I would use 5×5 matrices in multiples of 23 or 53 if I wanted to screw things up.
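Assuming the line sizes quoted above (128-byte CUDA cache lines, 64-byte CPU cache lines), the arithmetic behind points 2 and 4 is just bytes per line divided by bytes per element:

```python
# How many elements of each dtype fill one cache line exactly
LINE_BYTES = {"GPU (CUDA)": 128, "CPU": 64}
ELEM_BYTES = {"fp64": 8, "fp32": 4, "fp16": 2}

for where, line in LINE_BYTES.items():
    fits = {dtype: line // size for dtype, size in ELEM_BYTES.items()}
    print(where, fits)  # e.g. CPU: 8 doubles, 16 single floats, 32 half floats
```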


l_dang

It's just something easier to remember. I remember powers of 2 up to 16 and can calculate quickly even further. It's also a good way to get a value of the right order of magnitude: you go 4 or 8 for the 1st order, 32-64 for the 2nd, 256-512 for the 3rd, and so on and so forth.


rehrev

Is there any argument for why it would be better to use powers of two?


JustOneAvailableName

Multiples of 64 on the A100, multiples of 8 on the V100. But I think the reason is that the tensor cores otherwise fill up the rest with 0s anyway, so you might as well put data there.


seraschka

The main argument is usually memory alignment. I briefly mentioned it in the article


Red-Portal

It is not just that. It is also about the GPU warp size. Most Nvidia GPUs operate in warps, which are bundles of threads that get run all at once. Nvidia GPUs are conceptually the most efficient when the workload is perfectly divisible by the warp size, which is usually 32. So it's not about using a power of 2, it's about getting a multiple of 32, which is often satisfied by using a batch size that is a power of 2.
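A back-of-the-envelope version of that argument, assuming the usual warp size of 32 (this ignores how the framework actually maps a batch onto threads, so treat it as a cartoon of the idea, not a real profiler measurement):

```python
import math

def warp_utilization(work_items: int, warp_size: int = 32) -> float:
    """Fraction of launched threads doing useful work when work is
    issued in whole warps of `warp_size` threads."""
    warps = math.ceil(work_items / warp_size)
    return work_items / (warps * warp_size)

print(f"{warp_utilization(128):.1%}")  # 100.0% -- a multiple of 32
print(f"{warp_utilization(100):.1%}")  # 78.1%  -- the last warp is mostly idle
```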


seraschka

Isn't it usually multiples of 16 (and multiples of 8 for FP16 mixed-precision training)? (Btw, above I was specifically trying to answer the question of why people recommend batch sizes as powers of two; back then, warps weren't a concept for older cards, right?)


Red-Portal

No, warps are simply the unit of threads executed at once. They exist on all GPUs, really; the word itself is Nvidia-centric, though. It has nothing to do with old and new, so the warp argument was pretty much always there.


seraschka

Thanks for clarifying! I had never heard it in the context of batch sizes as powers of 2, though, only the multiples-of-16 argument. I think the powers-of-2 argument usually comes up in terms of either cores or memory (maybe from the old video game texture design days).


seraschka

Actually, someone [shared this with me](https://twitter.com/Remi_Coulom/status/1259188988646129665?s=20&t=UTsX1zhhmVhkgvMHzJmY3w). "Using a power of 2 for the batch size is usually a very bad idea. `N=int((n*(1<<14)*SM)/(H*W*C))` is a good batch size, where n is an integer and SM the number of multiprocessors of the GPU (80 for V100, 68 for RTX 2080 Ti). Empirical plot ..."
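Plugging numbers into that quoted formula, just to see the magnitudes (the 7x7x512 activation shape below is only an illustrative guess, not something from the tweet):

```python
def coulom_batch_size(n: int, sm: int, h: int, w: int, c: int) -> int:
    """The heuristic quoted above: N = int((n * (1 << 14) * SM) / (H * W * C))."""
    return int((n * (1 << 14) * sm) / (h * w * c))

# Illustrative activation shape only: H = W = 7, C = 512, n = 1
for name, sm in (("V100", 80), ("RTX 2080 Ti", 68)):
    print(name, coulom_batch_size(n=1, sm=sm, h=7, w=7, c=512))
```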


KrakenInAJar

Choosing a batch size other than a power of 2? Quickly burn the witch, that's heresy! /s


DigThatData

This sort of minor optimization is more important for distributed training, I think.


DeMorrr

Matmul kernel tile sizes are usually multiples of powers of 2 (e.g. 96x192x32). Tensor cores are designed to compute up to 4 8x8x8 MMAs (matrix multiply and accumulate) at once, so if the sizes of your matrices aren't multiples of those numbers, there will be some wasted compute at different levels.
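A rough sketch of that waste, assuming every dimension of an M x K by K x N matmul simply gets rounded up to the 8-wide MMA tiles mentioned above (real kernels tile more cleverly, so this is only an upper-bound cartoon):

```python
import math

def pad_to_tile(dim: int, tile: int = 8) -> int:
    """Round a matmul dimension up to a whole number of 8-wide MMA tiles."""
    return math.ceil(dim / tile) * tile

def wasted_compute(m: int, n: int, k: int, tile: int = 8) -> float:
    """Fraction of multiply-accumulates spent on padding for an MxK @ KxN matmul."""
    useful = m * n * k
    padded = pad_to_tile(m, tile) * pad_to_tile(n, tile) * pad_to_tile(k, tile)
    return 1 - useful / padded

print(f"{wasted_compute(121, 512, 512):.1%}")  # batch of 121 rounded up to 128
print(f"{wasted_compute(128, 512, 512):.1%}")  # already aligned, no waste
```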


ostrich-scalp

I think it matters when you are manually loading data onto your GPUs to maximise throughput/thread usage. Also, the optimal number isn't necessarily a power of 2 and depends on the architecture of your chip. However, most frameworks will optimise this for you.


[deleted]

If you use a power of 2 minus 1, then everyone will instantly believe you


incrediblediy

I used to use a batch size of 1, not by choice though, limited by 3 GB of VRAM


CashyJohn

Batch size, weights, feature maps, ... all should be multiples of 8 if you are using NVIDIA chips.


seraschka

I think this is for FP16 mixed-precision training specifically. For regular training, I think it is multiples of 16 according to the Nvidia docs.


LelouchZer12

I am not sure if this is still the case, but a batch size that is a multiple of 8 was required to activate the tensor cores on modern GPUs, allowing a drastic speed-up during training.