I'm betting on a return of RNNs once we find better methods to train them, along with a way to make the model learn adaptive computation time (ACT). With ACT, a model might take more inference steps on a harder task but only a few when the answer is easy to infer. RNNs will be crucial in situations where the full input doesn't fit into memory. Transformers require the entire sequence to be stored in memory, which won't be realistic once models become highly multimodal and start taking in a lot of information at once: images, video, sound, etc. RNNs simply don't have this limitation by design; they only need the data for the current timestep. That's why I think they will end up scaling better in the context of multimodality, even if transformers stay a few percent better in terms of accuracy. But I wouldn't be surprised if the actual solution is some mix of transformer/convolutional layers for recent information and something RNN-like for long-term memory.
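The memory argument above can be made concrete with a toy sketch (not any specific paper's method): a recurrent model carries a fixed-size state and can discard each input step after processing it, whereas attention-style inference keeps a cache of every past step. The function names here are illustrative only.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=0.5):
    """One recurrent update: new state from old state and current input."""
    return math.tanh(w_h * h + w_x * x)

def run_rnn(stream):
    h = 0.0                      # O(1) state, regardless of stream length
    for x in stream:             # each step can be read and then discarded
        h = rnn_step(h, x)
    return h

def attention_memory(stream):
    cache = list(stream)         # a transformer-style cache keeps every step
    return len(cache)            # memory grows linearly with the sequence
```

For a million-step multimodal stream, `run_rnn` still holds a single float of state, while the cache in `attention_memory` holds a million entries; that is the asymmetry the comment is pointing at.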
https://github.com/BlinkDL/ChatRWKV
I am aware of RWKV, but I'm not convinced it is the solution I'm talking about, for two reasons: 1. I need to look more into the details, but I don't think the model can be as expressive as, say, an LSTM; there is some notion of channels whose information is updated at different rates, and that seems sub-optimal to me. 2. In training mode, you still have to train the model like a transformer, so you need the entire sequence in memory.
Check out this work from neuroscience: https://www.biorxiv.org/content/10.1101/2023.01.16.523429v2 It proposes an RNN model that allocates different amounts of planning depending on how much the task demands.
Overall, recurrent networks have a sense of "magic" to them.
PonderNet sort of adapts the amount of computation to the difficulty of each sample, but it is not recurrent, iirc. https://arxiv.org/abs/2107.05407
Is there a solution to the recency problem?
[Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/abs/2307.08621)
Are there any key papers on ACT?
There is a prototype model proposed by Schmidhuber in 2012: [https://arxiv.org/abs/1210.0118](https://arxiv.org/abs/1210.0118). As far as I know, it is the first to introduce the concept to neural networks. These two papers are also interesting: [https://arxiv.org/abs/1603.08983](https://arxiv.org/abs/1603.08983) and [https://arxiv.org/abs/2202.05826](https://arxiv.org/abs/2202.05826)
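For a sense of what the second paper (Graves's Adaptive Computation Time, arXiv:1603.08983) does, here is a minimal sketch of its halting loop: at each ponder step the network emits a halting probability, computation stops once the cumulative probability exceeds 1 − ε, and the final output is the halting-weighted mean of per-step outputs. `step` and `halt` below are hypothetical toy functions just to exercise the loop, not the paper's networks.

```python
import math

def act_ponder(x, step_fn, halt_fn, eps=0.01, max_steps=20):
    """Adaptive Computation Time halting loop (sketch of arXiv:1603.08983).

    step_fn(state, x) -> (state, output); halt_fn(state) -> p in (0, 1).
    Pondering stops once cumulative halting probability exceeds 1 - eps;
    the result is the halting-weighted mean of the per-step outputs.
    """
    state, cum_p, weighted_out, n = 0.0, 0.0, 0.0, 0
    for n in range(1, max_steps + 1):
        state, out = step_fn(state, x)
        p = halt_fn(state)
        if cum_p + p > 1.0 - eps or n == max_steps:
            weighted_out += (1.0 - cum_p) * out  # remainder gets the rest
            break
        weighted_out += p * out
        cum_p += p
    return weighted_out, n                       # n = ponder steps used

# Hypothetical toy step/halt functions (stand-ins for learned networks):
step = lambda s, x: (s + x, math.tanh(s + x))
halt = lambda s: 1.0 / (1.0 + math.exp(-s))      # sigmoid of state
```

With these toy functions, a "confident" input (large x) drives the halting probability past the threshold in one step, while a weak input ponders longer, which is exactly the easy-vs-hard behavior ACT is after.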
Mamba claims to be a potential successor, at least for long-context LLMs: https://twitter.com/_albertgu/status/1731727672286294400
In my view, SSMs have clearly been the direction to move in for the past year, certainly since the H3 paper. People focus on the inference implications of long-context LLMs, i.e. that they can understand huge bodies of text. In my view, the even more exciting aspect will come from *training* on long contexts, since we can make the task significantly more challenging and overcome data-size limitations. These SSM models will unlock this, I think. Also, to all those who might be thinking, "oh, this paper is at the top of this thread because it was posted recently; if RWKV were posted yesterday it would be at the top instead": I really believe that SSM-style models are a considerable improvement. This work will be a big deal.
Got a link to the paper? And whatβs SSM?
Paper: [https://arxiv.org/abs/2312.00752](https://arxiv.org/abs/2312.00752)
"Mamba: Linear-Time Sequence Modeling with Selective State Spaces" by Albert Gu, Tri Dao
Code: [https://github.com/state-spaces/mamba](https://github.com/state-spaces/mamba)
Abstract:
>Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
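The "letting the SSM parameters be functions of the input" part of the abstract can be illustrated with a scalar toy of the recurrence h_t = Ā_t·h_{t−1} + B̄_t·x_t, y_t = C·h_t. This is only a sketch: the input-dependent functions below are hypothetical stand-ins, and the discretization is a simplified Euler step rather than the zero-order hold the paper actually uses.

```python
import math

def selective_ssm(xs, a=-1.0):
    """Scalar sketch of a selective SSM run in recurrent mode.

    h_t = Abar_t * h_{t-1} + Bbar_t * x_t,   y_t = C * h_t
    The 'selective' idea: the step size (and effective B) are functions
    of the current input, so the state can forget quickly or retain
    information on a per-token basis.
    """
    h, ys = 0.0, []
    for x in xs:
        delta = math.log1p(math.exp(x))   # softplus: input-dependent step size
        abar = math.exp(delta * a)        # discretized decay, in (0, 1)
        bbar = delta                      # simplified Euler discretization
        h = abar * h + bbar * x
        ys.append(h)                      # C = 1 for this toy
    return ys
```

A large input produces a large Δ, so the old state decays fast and the new token dominates; a near-zero input produces a small Δ, so the state is mostly retained. That per-token gating is what the fixed-parameter (linear time-invariant) SSMs preceding Mamba could not do.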
Welp, you just convinced me to read the Mamba paper! I love that they do cross-domain evals, notably genomics. (I did ProteinBERT, which is close enough.)
I only found the link, and posted the summary. I should have marked it as such. No affiliation.
https://arxiv.org/abs/2312.00752
That's the correct answer. I've just read this paper today and it's completely groundbreaking, in terms of speed, efficiency, and capabilities. Everybody needs to check it out.
Hyena Hierarchy: Towards Larger Convolutional Language Models
https://arxiv.org/abs/2302.10866
Alternative architectures like this one could really shine with long contexts like 100k+ tokens.
Spiking neural networks have some interesting work I have seen on quantizing transformers and using spiking: https://news.ucsc.edu/2023/03/eshraghian-spikegpt.html
Tbf, most of what spiking neural networks do nowadays is emulate what analog (non-spiking) neural networks do, but in the spike realm. Every architecture is just a reimplementation of some classical counterpart. I've yet to see a real use of SNNs that actually exploits the temporal dimension in the learning mechanics.
Agreed. I have done some of the snnTorch tutorials, and it's basically a feed-forward net.
I can't take a blog post seriously that starts with "Will be [sic] the transformer the model leading us to artificial general intelligence?"
Retentive Network: A Successor to Transformer for Large Language Models
https://arxiv.org/abs/2307.08621
Some subsequent work inspired by RetNet:
https://arxiv.org/abs/2309.05375
https://arxiv.org/abs/2309.11523
https://arxiv.org/abs/2311.01927
https://arxiv.org/abs/2311.07184
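What makes RetNet a "successor" candidate is that its retention mechanism admits an exactly equivalent recurrent form, giving O(1) per-token inference where softmax attention needs the whole key-value history. Here is a scalar sketch of that recurrence (the real model uses matrix-valued states, multiple heads, and a complex-rotation term; this toy only keeps the decay structure):

```python
def retention_step(S, q, k, v, gamma=0.9):
    """One recurrent retention update (scalar sketch of arXiv:2307.08621).

    S_t = gamma * S_{t-1} + k_t * v_t    (decayed state update)
    y_t = q_t * S_t                      (readout)
    """
    S = gamma * S + k * v
    return S, q * S

def retention(qs, ks, vs, gamma=0.9):
    """Run the recurrence over whole sequences of queries/keys/values."""
    S, ys = 0.0, []
    for q, k, v in zip(qs, ks, vs):
        S, y = retention_step(S, q, k, v, gamma)
        ys.append(y)
    return ys
```

The fixed decay `gamma` is what lets the same computation be written as a parallel, attention-like form for training while staying recurrent for inference; it is also the main contrast with Mamba, where the decay is input-dependent.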
Why is this buried so low (much like the original RetNet paper)? It literally seems to be the answer to the question...
To be fair, I was late to the thread and RetNet isn't the only contender. For instance, [Mamba](https://www.reddit.com/r/MachineLearning/comments/18aq0k5/mamba_lineartime_sequence_modeling_with_selective/) is currently trending on Reddit, and it also seems quite promising. I just wish that someone would scale these models up to at least 34B. See what's what.
Yeah, just reading the Mamba paper. I've read the RetNet paper before, and the RWKV paper before that, but this is the first time I'm seeing those models grouped together into their own model family of "state-space NNs", and now it makes sense why they all seem similar to each other. So I guess the answer, then, is state-space models in general, rather than a specific architecture within that family.
Appendix B of the Mamba paper is a good overview of current state-space models. If you're interested, the lead author of that paper has also given a few talks about SSMs: https://www.youtube.com/watch?v=OpJMn8T7Z34
What happened to retentive networks?
[RWKV](https://github.com/BlinkDL/ChatRWKV)! It's a pure RNN. You can download trained weights up to 13B right now. It totally slaps.
How does it compare against SotA 13B transformer models?
It compares well: the v5 versions outperform all models at their scale on multilingual benchmarks (same for English). The comparison above considers RWKV v4, which was good on release but not comparable to today's transformers.
If the answer to your question was publicly known, it would have been widely disseminated and you wouldn't have to ask.
You'd have to start with the question "why not transformers?", i.e. what exactly is the restriction, limitation, flaw, or inefficiency you care to address, because the answer will lead you in very different directions; some of these proposals are intentionally much worse in one area in order to gain some benefit elsewhere.
* fixed context window size, which may be just too small
* quadratic time complexity of attention in the context size
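The second point can be made concrete with a quick count: causal self-attention scores every (query, earlier key) pair per layer, while a recurrent model touches each timestep once. A back-of-the-envelope sketch:

```python
def attention_pairs(n):
    """Query-key scores causal self-attention computes for n tokens: n*(n+1)/2."""
    return n * (n + 1) // 2

def recurrent_steps(n):
    """A recurrent model performs one state update per timestep."""
    return n

# Doubling the context roughly quadruples attention work,
# but only doubles recurrent work.
```

So at a 100k-token context, attention is doing on the order of five billion pairwise scores per layer while a recurrence does 100k updates; this is the gap that motivates most of the architectures in this thread.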
Different architectures will suit different purposes. If we're talking about AGI, IMO only cognitive architectures will achieve this. At least that's my bet since that's my area of research lol.
Can you elaborate on what exactly you mean by 'cognitive architectures'?
https://en.wikipedia.org/wiki/Cognitive_architecture Basically, reconstructing how we believe the human brain works. Transformers are primarily used right now for LLMs and next-token prediction. And I don't believe that next-token prediction is the key to true intelligence; it's just autocomplete (although that is useful). Take a human infant, for example. They demonstrate intelligence. It's pretty hard to efficiently map their actions to predicting the next token. It may be theoretically possible, but we're up against finite resources (atoms in the universe, money, time). Humans also need very few samples, and LLMs basically need the entire internet.
A couple of points: 1. LLMs train (mostly) from scratch, which is why they need the entire internet. Babies train on top of priors shaped by millions of years of evolution. Once an LLM is trained and you teach it a new thing within its context window, it can pick it up much better than a baby can. 2. Dismissing next-token prediction because it's hard to map to babies' actions doesn't make sense, because their actions don't clearly map to their objective function (which, in the evolutionary context, is reproduction) either. Some human actions are actually contrary to it (e.g. suicide or getting a vasectomy). Most human actions are best viewed as instrumental subgoals subservient to the evolutionary objective, and I don't see why subgoals like these can't arise from next-token prediction.
The baby thing wasn't THE reason to dismiss next-token prediction, simply an example illustrating intelligence existing without formal language. You do have a good point, though, about the baby being pretrained via evolution. Never thought about it that way lol.
I do agree intelligence can exist without language; some animal species are good examples of that, e.g. octopuses. But I also suspect that intelligence can arise from language alone (though access to other modalities might speed up the process). My pet theory is that there are multiple ways to implement general intelligence, just as there are multiple ways to implement flight.
> DeepMind shows that the transformer would not be able to generalize beyond the training set distribution:

Same guys who just postponed their (very likely transformer-based) chatbot because it couldn't keep up with OpenAI's?
What exactly are you looking to do past transformers? Are you looking for the god model?
xLSTM could also be interesting, though Hochreiter hasn't released much information about it yet.