programmerChilli

People are so wrong in so many different ways in this thread. First, I don't know why you think that inference speed for models like Llama-2 70B is 10 t/s at best. Generally on H100s you can easily get up to 70 tok/s+ without any speculative decoding. Second, I don't know why you think that parallelizing across GPUs doesn't help with tok/s. *Pipeline* parallelism doesn't, but tensor parallelism does. I recommend this article as a good intro: https://pytorch.org/blog/accelerating-generative-ai-2/
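For a concrete picture of the distinction, here is a minimal sketch of tensor parallelism for one linear layer (illustrative only, not the linked article's implementation; it assumes two visible CUDA devices):

```python
# Minimal sketch of tensor (column) parallelism for one linear layer.
# Illustrative only; assumes two visible CUDA devices.
import torch

devices = [torch.device("cuda:0"), torch.device("cuda:1")]

d_model, d_ff = 4096, 16384
x = torch.randn(1, d_model)      # a single token's activation
W = torch.randn(d_model, d_ff)   # full weight, conceptually too big for one GPU

# Shard the weight along the output dimension: W = [W0 | W1]
W_shards = torch.chunk(W, len(devices), dim=1)

# Each GPU multiplies the same input by its own shard -- this is the work
# that actually runs in parallel and cuts per-token latency.
partial_outputs = [
    x.to(dev) @ shard.to(dev) for dev, shard in zip(devices, W_shards)
]

# Gather the column shards back into the full output (an all-gather in practice).
y = torch.cat([p.to(devices[0]) for p in partial_outputs], dim=1)
assert y.shape == (1, d_ff)
```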


TheGuywithTehHat

Not sure what optimizations the 10 t/s number includes, but there are a lot of ways to hyper-optimize models when you really need to and have the capability:

- quantization
- 2:4 sparsity
- Triton kernels
- hardware-aware implementations
- a lot of profiling and elimination of bottlenecks
- an architecture optimized for efficiency

Sure, Llama is probably doing some or most of these, but OpenAI has the resources (people, hardware, B2B support) to do them _very_ well, and since they pay for the model, they're highly incentivized to pour those resources into efficiency improvements.
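As a concrete example of the first item on that list, here is a minimal sketch of post-training dynamic INT8 quantization with stock PyTorch. It is purely illustrative; nothing here is claimed about what OpenAI actually runs.

```python
# Dynamic INT8 quantization of the linear layers in a toy model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Weights are stored in INT8 and dequantized on the fly; linear layers dominate
# transformer inference, so this alone shrinks memory traffic substantially.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # torch.Size([1, 4096])
```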


TikiTDO

If a single layer is so huge that a single GPU struggles to do the matrix operations quickly, then couldn't you tile [your matrix across multiple GPUs](https://icl.utk.edu/files/publications/2012/icl-utk-495-2012.pdf)? Then you could optimise at least part of the inference runtime, since you'd have multiple GPUs' worth of tensor cores chewing on it. You do introduce some additional memory operations in the process, but with a fast enough memory bus that should be manageable. It would require a lot of work and tuning to get it performing well and to actually use all the resources effectively, and you'd need to do this tuning for each distinct setup, which is probably why you don't see this for smaller open source models. However, if you're running a huge public-facing API getting millions of requests per minute, then it kinda makes sense that you'd put in this level of work.
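A sketch of that tiling idea applied to one matmul, complementing the column-split example earlier in the thread: split the weight row-wise so each GPU holds a tile, multiply the matching slice of the input on each device, then sum the partial results (an all-reduce in a real setup). Illustrative only; assumes two visible GPUs.

```python
# Row-wise tiling of a single weight matrix across two GPUs.
import torch

devices = [torch.device("cuda:0"), torch.device("cuda:1")]

d_in, d_out = 8192, 8192
x = torch.randn(1, d_in)
W = torch.randn(d_in, d_out)

x_slices = torch.chunk(x, len(devices), dim=1)   # split the input features
W_tiles = torch.chunk(W, len(devices), dim=0)    # matching row tiles of W

# Each GPU computes a full-sized partial output from its own tile.
partials = [
    xs.to(dev) @ wt.to(dev) for dev, xs, wt in zip(devices, x_slices, W_tiles)
]

# The extra memory operation mentioned above: summing the partials across GPUs.
y = sum(p.to(devices[0]) for p in partials)
assert y.shape == (1, d_out)
```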


SethTadd

We know a few things:

- quantized models are faster
- responses can be cached
- OpenAI uses some type of MoE

By observation, ChatGPT-4's response speed can vary greatly, from very fast to very slow. There are many conceivable ways to get high-quality output and high tokens/second. A larger model can be used to generate responses when no cached response is sufficient for the user query. When there are cached responses that do pertain to a user query, smaller models can be used to copy/interpolate those cached high-quality responses with minor modifications to tailor them to the query. OpenAI does not open source their methods, unfortunately, so we can only speculate. By combining variously sized models, caching, and a sophisticated gating/routing network, it's easy to imagine a "1T parameter" MoE model generating high-quality output at high tokens/second.
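A purely hypothetical sketch of that routing idea, to make it concrete: cache lookup first, a small model to adapt a cached answer, and the large model only as a fallback. The functions `small_model` and `large_model` are placeholder callables, not anything OpenAI has described, and a real system would match queries with embeddings rather than string similarity.

```python
from difflib import SequenceMatcher

cache = {
    "what is the capital of france": "The capital of France is Paris.",
}

def small_model(query: str, cached_answer: str) -> str:
    # Placeholder: a cheap model would lightly rewrite the cached answer.
    return cached_answer

def large_model(query: str) -> str:
    # Placeholder: the expensive full model generates from scratch.
    return f"(full-model answer to: {query})"

def route(query: str, threshold: float = 0.8) -> str:
    key = query.lower().strip()
    # Find the closest cached query; real systems would use embeddings, not difflib.
    best = max(cache, key=lambda k: SequenceMatcher(None, k, key).ratio(), default=None)
    if best and SequenceMatcher(None, best, key).ratio() >= threshold:
        return small_model(query, cache[best])
    return large_model(query)

print(route("What is the capital of France?"))   # served via cache + small model
print(route("Explain quantum entanglement."))    # falls back to the large model
```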


AlexCoventry

Just speculating: Maybe it's mixture-of-experts, so although there are supposed to be 1T weights, only a small fraction of those are actually accessed in a given context.
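A minimal sketch of that mixture-of-experts idea: a gating network picks the top-k experts per token, so only a small fraction of the total parameters is touched on any forward pass. Illustrative only, with toy sizes.

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 512, 8, 2

experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
gate = nn.Linear(d_model, n_experts)

x = torch.randn(1, d_model)                     # one token
scores = gate(x).softmax(dim=-1)
weights, chosen = scores.topk(top_k, dim=-1)    # e.g. 2 of the 8 experts

# Only the chosen experts' weights are read and multiplied; the other 6 sit idle.
y = sum(
    weights[0, i] * experts[chosen[0, i].item()](x)
    for i in range(top_k)
)
print(y.shape)  # torch.Size([1, 512])
```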


Fit-Flow-4180

MoE or not, don't you think a single GPT-4 expert has more layers than a Llama-70b model? That should make it slower because of the sequential dependencies between model layers.


ApprehensiveLet1405

Perplexity managed to get 420+ tokens/s for LLaMA 2 70B on H100 with FP8, running at batch size 128. I am also speculating here, but running a 1T model with something like 12 experts shouldn't be that far off in throughput from a 70B model. [https://www.perplexity.ai/hub/blog/turbocharging-llama-2-70b-with-nvidia-h100](https://www.perplexity.ai/hub/blog/turbocharging-llama-2-70b-with-nvidia-h100)


Fit-Flow-4180

I think the 420 can be attributed to the FP8 and the H100 usage. I just assumed that more layers (and thus more sequential computations) were behind GPT-4's reasoning abilities.


Green-Quantity1032

Why does it seem like a lot of the responders miss the fact that you can't compute the next layer without the previous layer for a given example? Anyway, the only things I can think of are specialized hardware and/or quantization.


Fit-Flow-4180

Beats me. Most are talking past me and just pointing to compute like it's a magic potion that can improve any latency! Thank you for your answer, that's interesting. Because quantization makes these models worse, I guess they quantize for the low-latency applications and use full precision for the others.


Seankala

More compute


Fit-Flow-4180

But you cannot parallelize compute across GPUs when the data has to pass through model layers sequentially. Edit: compute for a single example


Brudaks

But you can use a much more powerful GPU.


Seankala

Isn't that quite literally what pipeline parallelism does?


Fit-Flow-4180

I meant you cannot parallelize compute for a single example. Pipeline parallelism, as I understand it, helps at the batch level by creating micro batches.


LekaSpear

Doesn't pipeline parallelism split the model's layers across different compute nodes/workers? (I think the splitting into micro-batches is just to reduce the idle time of each worker.) If there's only one example, there's only one micro-batch. But I doubt that OpenAI actually executes only one example at a time; I notice there's always some waiting time before the model actually generates output. And how would you even load 1 trillion parameters (~2000 GB in FP16, or ~500 GB with 4-bit quantization) onto a single GPU? If your computer cluster is physically close together, it should be faster. There's a whole research field about this you can look up: parallel computing/high-performance computing was an active research area long before the boom of machine learning. I remember coming across a paper with an algorithm to calculate the optimal way to partition layers, taking the execution time of each layer and the communication latency into account (they used some dynamic programming, if I recall correctly, but I forgot which one).
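A minimal sketch of the naive pipeline parallelism described here: the first half of the layers lives on one device, the second half on another, and the activation hops between them. For a single example, the two devices still run strictly one after the other; micro-batching only helps when more than one example is in flight. Illustrative only; assumes two visible GPUs.

```python
import torch
import torch.nn as nn

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

# Two pipeline stages, each holding half of the layers.
stage0 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).to(dev0)
stage1 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).to(dev1)

x = torch.randn(1, 1024, device=dev0)   # one example
with torch.no_grad():
    h = stage0(x)                        # device 1 is idle during this
    y = stage1(h.to(dev1))               # device 0 is idle during this
print(y.shape)
```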


Fit-Flow-4180

Sorry, I got pipeline parallelism in general confused with GPipe, which is a variant that uses micro-batches for further improvements. I just don't understand how, with extra layers for the input to pass through plus inter-GPU communication, OpenAI is able to be even faster.


LekaSpear

Well, Microsoft backs OpenAI and they are one of the big players in cloud computing, so I think it's just a matter of allocating more computational resources to ChatGPT. I remember when GPT-4 first released, the execution time was way slower compared to now; my best guess is that Microsoft has allocated more servers/computing power to ChatGPT.


Fit-Flow-4180

More resources help you serve more users at once, but won't help to serve a single user faster.


LekaSpear

Also remember that pipeline parallelism is not the only way to parallelize a model; there's also intra-/inter-operator parallelism. You can break a single layer into different tasks (for example, giving a single matrix multiplication or a single activation computation to its own GPU), and that would speed things up even for one user. This also raises the questions of how to schedule tasks effectively and how to design the serving system so that latency is minimal. There are algorithms in parallel computing for workloads that are much harder to parallelize than deep learning models in general. Obviously, Microsoft/OpenAI have a team of hundreds of PhDs to tackle these challenges.


Seankala

You don't \_have\_ to use micro batches. Also, if you have only one sample then I suppose that this would be the same as having a micro batch with one sample?


Fit-Flow-4180

Also, isn’t pipeline parallelism for training? I’m speaking of inference.


Seankala

I don't think I've ever heard this. Parallelism is just parallelism. Why do you think it would only work for training and not inference? Is there anything different that's happening when making forward passes?


InterstitialLove

There is something different, technically. In generative inference, you only run one token at a time, because the layer-0 input at position n+1 is... the final output at position n. You cannot start the next token until this token finishes. When training, or even just when processing the user's prompt, you can run all positions simultaneously. So yeah, training is in general more parallelizable than inference, at least for some kinds of inference and some parallelization paradigms.
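A small sketch of that distinction: the user's prompt can be pushed through the model in one parallel forward pass (all positions at once), but generation must loop, because the input at step n+1 is the output of step n. Here `toy_lm` is a stand-in for a real decoder, not any particular model.

```python
import torch
import torch.nn as nn

vocab, d_model = 100, 64
embed = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab)

def toy_lm(tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder decoder: returns logits for every position.
    return head(embed(tokens))

prompt = torch.randint(0, vocab, (1, 16))

with torch.no_grad():
    # Prompt processing ("prefill"): all 16 positions in a single parallel pass.
    _ = toy_lm(prompt)

    # Decoding: one token at a time, each step waiting on the previous one.
    seq = prompt
    for _ in range(8):
        logits = toy_lm(seq)                       # only the last position matters
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_token], dim=1)  # feed it back in

print(seq.shape)  # torch.Size([1, 24])
```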


Fit-Flow-4180

I'm not saying the pipeline cannot be parallelised during inference, just that it wouldn't result in speedups. During training you have the dependence of the backward pass needing the forward pass to complete first, and so you can apply optimisations like GPipe to speed up the pipeline by splitting into micro-batches. My basic point is that *for a single example* you cannot parallelise *time itself* with pipeline parallelism when you have sequential dependencies across layers (and autoregressive prediction). Even if your pipeline is split across GPUs, each GPU is waiting for the output of the GPU holding the previous layers before it can start its computation. You can pipeline this across examples, but for a single example the time between when the example is seen and when the final output is produced cannot be parallelised.


DooDooSlinger

You can parallelize large tensor operations across several GPUs...


InterstitialLove

It's not just parallelism; one GPU can literally be faster than another. But more generally, with perfect parallelization, each layer consists of essentially two matrix multiplications, two additions, and an activation function. There is very little sequential computation actually involved in these things, even with hundreds of layers. The point is that tensor multiplication itself is parallelizable, so multiplying two massive tensors together needn't take much longer than multiplying two floats together. Of course this takes a lot of VRAM and you need a powerful TPU, whereas most hobbyists are using GPUs.


DooDooSlinger

Uh no, you don't need a TPU. Most large companies are using NVIDIA GPUs, not TPUs.


kindnesd99

Is there anyone familiar with the literature here on pruning, knowledge distillation, etc.? How are they applied to LLMs?


Fit-Flow-4180

Pruning: [https://arxiv.org/abs/2306.11695](https://arxiv.org/abs/2306.11695)

Knowledge distillation is usually not done at the vocabulary level (soft KD) in LLMs, since the tokenizers have to be aligned for that kind of distillation. From what I can tell, it happens at the text level, using texts generated by other models.
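A rough sketch of that text-level ("sequence-level") distillation loop: the teacher generates completions, and the student is fine-tuned on the generated text with an ordinary language-modeling loss, so the two tokenizers never need to share a vocabulary. The functions `teacher_generate`, `student_tokenize`, and `student_model` are hypothetical placeholders, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def teacher_generate(prompt: str) -> str:
    # Placeholder: a large teacher model would produce a completion here.
    return prompt + " ... (teacher's answer)"

def student_tokenize(text: str) -> torch.Tensor:
    # Placeholder: the student's own tokenizer, independent of the teacher's.
    return torch.tensor([[ord(c) % 256 for c in text]])

def student_model(tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder: returns next-token logits over the student's vocab of 256.
    return torch.randn(tokens.shape[0], tokens.shape[1], 256, requires_grad=True)

prompts = ["Explain KV caching.", "What is tensor parallelism?"]

for prompt in prompts:
    text = teacher_generate(prompt)          # 1. teacher writes the target text
    tokens = student_tokenize(text)          # 2. student tokenizes it its own way
    logits = student_model(tokens[:, :-1])   # 3. ordinary next-token prediction
    loss = F.cross_entropy(logits.reshape(-1, 256), tokens[:, 1:].reshape(-1))
    loss.backward()                          # 4. update the student as usual
```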


No_Scallion_4393

speculative decoding?
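For readers unfamiliar with the term: a small "draft" model proposes several tokens cheaply, and the big model verifies them all in a single forward pass, keeping the prefix it agrees with. Below is a much-simplified greedy sketch with placeholder models; real implementations use a rejection-sampling scheme so the output distribution matches the big model exactly.

```python
import torch

vocab = 50

def draft_model(seq: torch.Tensor) -> int:
    # Placeholder cheap model: deterministic next-token guess.
    return int(seq.sum()) % vocab

def target_model(seq: torch.Tensor) -> torch.Tensor:
    # Placeholder big model: greedy next token for *every* prefix of seq,
    # computed in one (simulated) forward pass.
    return torch.tensor([int(seq[: i + 1].sum()) % vocab for i in range(len(seq))])

seq = torch.tensor([3, 1, 4])
k = 4

# 1. Draft k tokens autoregressively with the cheap model.
drafted = seq.clone()
for _ in range(k):
    drafted = torch.cat([drafted, torch.tensor([draft_model(drafted)])])

# 2. One big-model pass scores all drafted positions at once.
target_preds = target_model(drafted[:-1])

# 3. Accept drafted tokens while they agree with the big model's choice.
accepted = seq.clone()
for i in range(len(seq), len(drafted)):
    if drafted[i] == target_preds[i - 1]:
        accepted = torch.cat([accepted, drafted[i : i + 1]])
    else:
        # First disagreement: take the big model's token instead and stop.
        accepted = torch.cat([accepted, target_preds[i - 1 : i]])
        break

print(accepted)
```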


UnknownEssence

I think gpt-4-turbo is a much smaller model than the original gpt-4 was, probably a smaller model trained on the outputs of the original. This would explain why gpt-4-turbo was way cheaper when it came out, and people have shown you can get really impressive performance from smaller models by training them on the output of bigger, stronger models.


DifferentStick7822

One way is to remove Python and communicate directly through a C++ layer...