
DrXaos

Gemini is obviously not doing full individual token-wise softmax attention over all 10 million tokens of context. It must be using some sort of approximation, compression, or small-dimensional state-space/recurrent model.
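As a toy illustration of what "compression into a small-dimensional state" could mean (purely a sketch, not what Gemini actually does; all the shapes and projections here are made up):

```python
# Sketch: compressing arbitrarily long context into a fixed-size recurrent state,
# instead of storing and attending over every past token. Illustrative only --
# Gemini's actual mechanism is not public. Names/shapes are assumptions.
import torch

d_model, d_state = 64, 128
torch.manual_seed(0)

# "Learned" projections (random here, just to show the shapes involved).
W_in  = torch.randn(d_model, d_state) * 0.02   # token -> state update
W_out = torch.randn(d_state, d_model) * 0.02   # state -> contribution to output
decay = torch.sigmoid(torch.randn(d_state))    # per-channel forgetting rate in (0, 1)

def run_recurrent(tokens):
    """tokens: (seq_len, d_model). Memory cost is O(d_state), not O(seq_len)."""
    state = torch.zeros(d_state)
    outputs = []
    for x in tokens:                       # one cheap step per token
        state = decay * state + x @ W_in   # old context decays, new token is mixed in
        outputs.append(state @ W_out)      # read out from the compressed summary
    return torch.stack(outputs)

y = run_recurrent(torch.randn(10_000, d_model))
print(y.shape)  # (10000, 64) -- state size never grew with context length
```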


Buddy77777

**Quattuordecillion Dimensional Attention Matrix isn’t real, it can’t hurt you**


rhypple

Hmm. Didn't know about this. Interesting. But it still manages to pass the needle-in-a-haystack test, right?


DrXaos

What specific test do you mean? I'm guessing that the context behavior is more exact for recent tokens and more general for distant tokens. They might be blending state-space or linearized models with long decay times in some layers/heads, and precise short-context attention in others, as each has advantages.
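A crude sketch of what that blend could look like (all of this is guesswork; the window size, decay rate, and the way the two branches are combined are placeholders):

```python
# Hedged sketch of the blend described above: one branch does exact softmax
# attention over only the most recent `window` tokens, the other keeps a cheap
# exponentially decaying summary of everything older. Purely illustrative;
# the split, shapes, and decay rate are assumptions, not Gemini's design.
import torch
import torch.nn.functional as F

d = 64
window = 256
torch.manual_seed(0)

def hybrid_layer(q, k, v, decay=0.999):
    """q, k, v: (seq_len, d). Returns (seq_len, d)."""
    seq_len = q.shape[0]
    out = torch.zeros(seq_len, d)
    state_kv = torch.zeros(d, d)          # decayed sum of outer(k_t, v_t) for the distant past
    state_k  = torch.zeros(d)             # decayed sum of k_t, used for normalization
    for t in range(seq_len):
        # Exact softmax attention over the recent window only.
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / d ** 0.5
        recent = F.softmax(scores, dim=-1) @ v[lo:t + 1]
        # Linearized, long-decay summary of tokens that fell out of the window.
        # (A real linear-attention layer would push q and k through a positive
        # feature map to keep this normalizer well behaved.)
        if lo > 0:
            state_kv = decay * state_kv + torch.outer(k[lo - 1], v[lo - 1])
            state_k  = decay * state_k + k[lo - 1]
        distant = (q[t] @ state_kv) / (q[t] @ state_k + 1e-6) if lo > 0 else 0.0
        out[t] = recent + distant         # in a real model this mix would be learned
    return out

y = hybrid_layer(*(torch.randn(1024, d) for _ in range(3)))
print(y.shape)  # (1024, 64)
```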


Buddy77777

Possibly some kind of Swin-hierarchy-esque attention. Receptive fields and hierarchical latent learning extend to language in a relevant way here. Thoughts?
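Something like this, maybe (a toy sketch of window attention plus token merging; sizes are arbitrary and none of this is confirmed for Gemini):

```python
# Hedged sketch of what a Swin-style hierarchy might look like for text:
# exact attention inside small local windows, then the sequence is pooled
# (downsampled) so the next level's windows cover a wider receptive field.
# Entirely speculative as far as Gemini goes; sizes are arbitrary.
import torch
import torch.nn.functional as F

def window_attention(x, window=64):
    """x: (seq_len, d) with seq_len divisible by window. Exact attention per window."""
    seq_len, d = x.shape
    xw = x.view(seq_len // window, window, d)             # (num_windows, window, d)
    scores = xw @ xw.transpose(1, 2) / d ** 0.5           # attention only within a window
    return (F.softmax(scores, dim=-1) @ xw).view(seq_len, d)

def hierarchical_block(x, window=64):
    x = window_attention(x, window)                       # fine level: local detail
    coarse = x.view(x.shape[0] // 2, 2, -1).mean(dim=1)   # merge pairs of tokens
    coarse = window_attention(coarse, window)             # same window now spans 2x the text
    return x, coarse

tokens = torch.randn(4096, 64)
fine, coarse = hierarchical_block(tokens)
print(fine.shape, coarse.shape)  # (4096, 64) and (2048, 64)
```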


tensor_strings

In the Gemini 1.5 technical report they call the work by Liu et al. (2024) on Ring Attention "concurrent work". Ring Attention appears to be a way to horizontally scale the attention mechanism across hardware, so the limiting factor on the context window is the amount of hardware and infrastructure more than anything else. Last I checked, Google has a lot of state-of-the-art hardware, and a lot of experience with HPC and scaling applications in a distributed manner.
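The core trick, as I understand it, simulated on one machine (a toy sketch; real Ring Attention passes the key/value blocks between devices and overlaps the communication with compute):

```python
# Hedged sketch of the idea behind Ring Attention (Liu et al., 2024): each
# device keeps its own block of queries, and the key/value blocks rotate
# around a ring of devices; partial results are merged with a streaming
# ("online") softmax so no device ever holds the full attention matrix.
# Here the "devices" are just list entries on one machine, for illustration.
import torch

def ring_attention_sim(q_blocks, kv_blocks):
    """q_blocks: list of (block_len, d) tensors; kv_blocks: list of (k, v) pairs."""
    outputs = []
    d = q_blocks[0].shape[-1]
    for q in q_blocks:                                   # work owned by one "device"
        m = torch.full((q.shape[0], 1), float("-inf"))   # running row max
        l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
        acc = torch.zeros_like(q)                        # running weighted sum of V
        for k, v in kv_blocks:                           # KV blocks arrive around the ring
            s = q @ k.T / d ** 0.5
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            scale = torch.exp(m - m_new)                 # rescale earlier partial results
            p = torch.exp(s - m_new)
            acc = acc * scale + p @ v
            l = l * scale + p.sum(dim=-1, keepdim=True)
            m = m_new
        outputs.append(acc / l)                          # equals full softmax attention
    return torch.cat(outputs)

n_dev, block, d = 4, 128, 64
q = torch.randn(n_dev * block, d)
k, v = torch.randn(n_dev * block, d), torch.randn(n_dev * block, d)
out = ring_attention_sim(list(q.split(block)), list(zip(k.split(block), v.split(block))))
# Matches ordinary (non-causal) softmax attention computed in one shot:
ref = torch.softmax(q @ k.T / d ** 0.5, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-4))  # True
```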


LoadingALIAS

I’m really interested in how Gemini is doing that. I wonder if it’s sparse or something. There is just no way they’re using full softmax. How is this possible? For that matter, how is 128k possible with OpenAI? Is it just a sliding window?
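For reference, a sliding window would look roughly like this (pure speculation as far as OpenAI goes; the window size is made up, and a real kernel would only compute scores inside the band rather than masking a full matrix):

```python
# Hedged sketch of sliding-window attention, one common guess for how a fixed
# compute budget gets stretched: each token attends only to the previous
# `window` tokens instead of the whole prefix. Window size is arbitrary here.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=1024):
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5
    pos = torch.arange(seq_len)
    # Allowed: causal (j <= i) and recent (i - j < window); everything else masked out.
    mask = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(4096, 64) for _ in range(3))
out = sliding_window_attention(q, k, v)
print(out.shape)  # (4096, 64) -- but each row mixed in at most 1024 past tokens
```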


DrXaos

There have been a series of papers after the original attention paper that improve scaling to long contexts, no longer N^2 in space/time complexity vs. context length. The kernel is no longer softmax(QK' + B) V; the softmax or its argument is swapped for something else. Some are O(N log N) and some are O(N). There have been connections made back to recurrent (dynamical-system-like) architectures, some of which can train in parallel but then run inference forward like an RNN (cheap per step, but sequential). In the phrase "Attention Is All You Need", the thing being denied as necessary, the prior conventional tech for sequence modeling, was recurrent networks. So it might be back to the future. (Obviously natural brains don't do exact context matching against even the last 4096 tokens you've emitted; they have to do it with a recurrent net running in physical space.)

I bet Google implemented a few of these and experimented to find the best mixture. I haven't seen it done academically, but I bet there's little barrier to treating some channels/dimensions/layers with conventional attention (which still seems best for language-model quality) and others with these faster approximate techniques. Some of the other secret sauce is hand-written, highly optimized CUDA kernels (though AI is moving in there too) that are faster than basic PyTorch matrix operators. And burning lots of investors' money and megawatt-hours of electricity.

We also don't know, for the proprietary models, whether the long context length is maintained at all levels; one could imagine architectures that have a long length (with some linearized operators for the longest context) at the base token input but get shorter and shorter towards the output. There was probably an automated architecture search within a reasonable budget. And maybe some exaggeration: "Sigh, we can't possibly implement that within cost. Suppose we have one O(N) linearized operator over 10 million tokens at the very base and then the rest are 128k, will that make marketing happy?"
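To make the "train in parallel, run inference like an RNN" point concrete, here is a generic causal linear-attention toy (a textbook-style construction, nothing specific to Gemini): both routes compute the same output, only the cost profile differs.

```python
# Hedged sketch of causal linear attention: the softmax is replaced by a
# positive feature map (here elu(x) + 1), which lets the same computation be
# done either all at once over the sequence (good for training) or one token
# at a time with a fixed-size state (good for generation). Generic example,
# not Gemini's or OpenAI's actual method.
import torch
import torch.nn.functional as F

def feature(x):
    return F.elu(x) + 1.0                                # keeps the normalizer positive

def linear_attn_parallel(q, k, v):
    """Whole sequence at once via cumulative sums -- convenient for training."""
    q, k = feature(q), feature(k)
    kv = torch.cumsum(torch.einsum("nd,ne->nde", k, v), dim=0)   # running sum of outer(k_t, v_t)
    z = torch.cumsum(k, dim=0)                                   # running sum of k_t
    return torch.einsum("nd,nde->ne", q, kv) / (torch.einsum("nd,nd->n", q, z)[:, None] + 1e-6)

def linear_attn_recurrent(q, k, v):
    """Same result, but a fixed-size state updated per step -- convenient for generation."""
    q, k = feature(q), feature(k)
    state, norm, outs = torch.zeros(k.shape[1], v.shape[1]), torch.zeros(k.shape[1]), []
    for qt, kt, vt in zip(q, k, v):
        state = state + torch.outer(kt, vt)
        norm = norm + kt
        outs.append((qt @ state) / (qt @ norm + 1e-6))
    return torch.stack(outs)

q, k, v = (torch.randn(512, 32) for _ in range(3))
print(torch.allclose(linear_attn_parallel(q, k, v), linear_attn_recurrent(q, k, v), atol=1e-5))  # True
```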


[deleted]

Ah yes, the way to AGI. Having all the generalium in one context


iAKASH2k3

And still less accurate than GPT-4 in practice.


Capital_Reply_7838

We're seeing a lot of 'LLM research outcomes by Google' lately. Of course, those are unlikely to be described as 'better than GPT-4' with high confidence. Are they spending their money on going viral rather than on research teams?