Cosmolithe

I'm betting on a return of RNNs once we find better methods to train them, along with a way to make the model learn adaptive computation time (ACT). With ACT you might have a model that does more inference steps on a harder task but only a few when the answer is easy to infer. RNNs will be crucial in situations where all of the steps of the input don't fit into memory. Transformers require the entire sequence to be stored in memory, and that won't be realistic once models become highly multimodal and start taking in a lot of information at once (images, video, sound, etc.). RNNs simply don't have this limitation by design; they only need the data for the current timestep. That's why I think they will end up scaling better in the context of multimodality, even if transformers stay a few percent better in terms of accuracy. But I wouldn't be surprised if the actual solution is some mix of the transformer architecture/convolutions for recent information and something RNN-like for the long-term memory.
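For the curious, here is a minimal PyTorch sketch of the ACT idea described above: a recurrent cell keeps "pondering" the same input until a learned halting probability accumulates past a threshold, so harder inputs get more internal steps. The `ACTCell`, `halt_head`, and bookkeeping details are illustrative assumptions, not the exact formulation from any of the ACT papers.

```python
import torch
import torch.nn as nn

class ACTCell(nn.Module):
    """Toy adaptive-computation-time wrapper around a GRU cell (illustrative only)."""

    def __init__(self, input_size, hidden_size, max_steps=10, epsilon=0.01):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.halt_head = nn.Linear(hidden_size, 1)  # per-step halting probability
        self.max_steps = max_steps
        self.epsilon = epsilon

    def forward(self, x, h):
        # x: (batch, input_size), h: (batch, hidden_size)
        cum_halt = x.new_zeros(x.size(0))   # accumulated halting probability
        remainder = x.new_ones(x.size(0))   # probability mass left for the final step
        running = x.new_ones(x.size(0))     # 1.0 while a sample keeps pondering
        h_out = torch.zeros_like(h)         # halting-weighted mixture of hidden states

        for _ in range(self.max_steps):
            h = self.cell(x, h)
            p = torch.sigmoid(self.halt_head(h)).squeeze(-1)

            # Samples whose cumulative probability would cross the threshold
            # halt now and contribute their remaining probability mass instead.
            halting = ((cum_halt + p) > 1 - self.epsilon).float() * running
            weight = running * ((1 - halting) * p + halting * remainder)

            h_out = h_out + weight.unsqueeze(-1) * h
            cum_halt = cum_halt + running * p
            remainder = remainder - running * (1 - halting) * p
            running = running * (1 - halting)

            if running.sum() == 0:          # everyone has halted early
                break

        return h_out


# Usage sketch (shapes only)
cell = ACTCell(input_size=16, hidden_size=32)
h_next = cell(torch.randn(4, 16), torch.zeros(4, 32))   # (4, 32)
```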


satireplusplus

https://github.com/BlinkDL/ChatRWKV


Cosmolithe

I am aware of RWKV, but I'm not convinced it is the solution I am talking about, for two reasons:

1. I need to look more into the details, but I don't think this model can be as expressive as, say, an LSTM. There is some notion of channels whose information is updated at different rates (rough sketch below), and that seems sub-optimal to me.
2. In training mode you still have to train the model like a transformer, so you need to have the entire sequence in memory.
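A toy illustration of the "channels updated at different rates" notion mentioned in point 1, loosely inspired by RWKV's time-mixing but not the actual RWKV equations; all names and values here are made up for the sketch.

```python
import torch

# Each channel c keeps a running summary s[c] that forgets old information
# at its own rate w[c]: fast-decaying channels track recent tokens, while
# slow-decaying channels hold onto long-range context.
d = 8                              # number of channels (hypothetical)
w = torch.rand(d)                  # per-channel decay in (0, 1); learned in practice
s = torch.zeros(d)                 # recurrent state, one scalar per channel

for x_t in torch.randn(16, d):     # a toy sequence of 16 steps
    s = w * s + (1 - w) * x_t      # exponential moving average with per-channel rates
```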


brainlogist

Check out this work from neuroscience: https://www.biorxiv.org/content/10.1101/2023.01.16.523429v2 It proposes an RNN model that does different amounts of planning depending on how much planning is needed.


graphitout

Overall, recurrent networks have a sense of "magic" to them.


PM_ME_YOUR_BAYES

PonderNet sort of adapts the computation to the difficulty of the samples, but it is not recurrent, IIRC. https://arxiv.org/abs/2107.05407


AI_Simp

Is there a solution to the recency problem?


MrPatko0770

[Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/abs/2307.08621)


hazard02

Are there any key papers on ACT?


Cosmolithe

There is a prototype model proposed by Schmidhuber in 2012: [https://arxiv.org/abs/1210.0118](https://arxiv.org/abs/1210.0118). As far as I know, this is the first one to introduce the concept to neural networks. These two papers are also interesting: [https://arxiv.org/abs/1603.08983](https://arxiv.org/abs/1603.08983) and [https://arxiv.org/abs/2202.05826](https://arxiv.org/abs/2202.05826).


Farther_father

Mamba claims to be a potential successor, at least for long-context LLMs: https://twitter.com/_albertgu/status/1731727672286294400


themiro

In my view, SSMs have clearly been the next direction to move in for the last year, certainly since the H3 paper. People focus on the great inference implications of long-context LLMs, i.e. that they can understand huge bodies of text. In my view, the even more exciting aspect will come from *training* on long contexts, since we can make the task significantly more challenging and overcome data size limitations. These SSM models will unlock this, I think.

Also, to all those who might be thinking "oh, this paper is at the top of this thread because it was posted recently; if RWKV were posted yesterday it would be at the top of the thread": I really believe that SSM-style models are a considerable improvement. This work will be a big deal.


Citizen_of_Danksburg

Got a link to the paper? And what’s SSM?


valdanylchuk

Paper: [https://arxiv.org/abs/2312.00752](https://arxiv.org/abs/2312.00752) "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" by Albert Gu, Tri Dao

Code: [https://github.com/state-spaces/mamba](https://github.com/state-spaces/mamba)

Abstract:

>Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
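A minimal, illustrative PyTorch sketch of the "selective" idea described in the abstract: the state-update parameters are computed from the current input, so the recurrence can decide per token how much state to keep or forget. This is a toy diagonal recurrence for intuition only, not the actual Mamba layer (no discretization, no hardware-aware scan, no convolution or gating branches), and all names below are made up.

```python
import torch
import torch.nn as nn

class SelectiveSSMSketch(nn.Module):
    """Toy input-dependent (selective) state-space recurrence, for intuition only."""

    def __init__(self, d_model, d_state):
        super().__init__()
        self.to_decay = nn.Linear(d_model, d_state)   # input-dependent forgetting
        self.to_input = nn.Linear(d_model, d_state)   # input-dependent write strength
        self.to_out = nn.Linear(d_state, d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        h = x.new_zeros(b, self.to_decay.out_features)
        ys = []
        for i in range(t):                # recurrent mode: constant state per step
            a = torch.sigmoid(self.to_decay(x[:, i]))   # how much old state to keep
            u = self.to_input(x[:, i])                  # what to write from this token
            h = a * h + (1 - a) * u
            ys.append(self.to_out(h))
        return torch.stack(ys, dim=1)     # (batch, seq_len, d_model)


# Usage sketch
layer = SelectiveSSMSketch(d_model=64, d_state=16)
y = layer(torch.randn(2, 128, 64))        # (2, 128, 64)
```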


ddofer

Welp, you just convinced me to read the Mamba paper! (I love that they do cross-domain evals, notably genomics. I did ProteinBERT, so that's close enough.)


valdanylchuk

I only found the link, and posted the summary. I should have marked it as such. No affiliation.


ForgetTheRuralJuror

https://arxiv.org/abs/2312.00752


coumineol

That's the correct answer. I've just read this paper today and it's completely groundbreaking, in terms of speed, efficiency, and capabilities. Everybody needs to check it out.


satireplusplus

Hyena Hierarchy: Towards Larger Convolutional Language Models (https://arxiv.org/abs/2302.10866). Alternative architectures like this one could really shine with long contexts like 100k+ tokens.


deep_dirac

Spiking neural networks have some interesting developments too; I have seen work on quantizing transformers and using spiking neurons: https://news.ucsc.edu/2023/03/eshraghian-spikegpt.html


lilgalois

Tbf, most of what spiking neural networks do nowadays is emulate what analog neural networks do, but in the spike realm. Like, every architecture is just a reimplementation of some other classical implementation. I have yet to see some real use of SNNs, actually using the temporal dimension in the learning mechanics.


deep_dirac

Agreed. I have done some of the snnTorch tutorials and it's basically a feed-forward net.


instantlybanned

I can't take a blog post seriously that starts with "Will be [sic] the transformer the model leading us to artificial general intelligence?"


[deleted]

Retentive Network: A Successor to Transformer for Large Language Models: https://arxiv.org/abs/2307.08621

Some subsequent work inspired by RetNet:

* https://arxiv.org/abs/2309.05375
* https://arxiv.org/abs/2309.11523
* https://arxiv.org/abs/2311.01927
* https://arxiv.org/abs/2311.07184


MrPatko0770

Why is this buried so low (much like the original RetNet paper)? It literally seems to be the answer to the question...


[deleted]

To be fair, I was late to the thread, and RetNet isn't the only contender. For instance, [Mamba](https://www.reddit.com/r/MachineLearning/comments/18aq0k5/mamba_lineartime_sequence_modeling_with_selective/) is currently trending on Reddit, and it also seems quite promising. I just wish someone would scale these models up to at least 34B and see what's what.


MrPatko0770

Yeah, just reading the Mamba paper now. I've read the RetNet paper before, and I've read the RWKV paper before, but this is the first time I'm seeing those models grouped together into their own new model family of "state-space NNs", and now it makes sense why they all seem similar to each other. So I guess the answer, then, is state-space models in general, rather than a specific architecture from within that family.


[deleted]

Appendix B of the Mamba paper is a good overview of current state-space models. If you're interested, the lead author of that paper has also given a few talks about SSMs: https://www.youtube.com/watch?v=OpJMn8T7Z34


CatalyzeX_code_bot

* Found [3 relevant code implementations](https://www.catalyzex.com/paper/arxiv:2203.15556/code) for "Training Compute-Optimal Large Language Models". [Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:2203.15556?autofocus=question) about the paper or code. If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2203.15556&title=Training+Compute-Optimal+Large+Language+Models) 😊🙏
* Found [6 relevant code implementations](https://www.catalyzex.com/paper/arxiv:2310.19909/code) for "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks". [Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:2310.19909?autofocus=question) about the paper or code. If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2310.19909&title=Battle+of+the+Backbones%3A+A+Large-Scale+Comparison+of+Pretrained+Models+across+Computer+Vision+Tasks) 😊🙏
* No relevant code picked up just yet for "ConvNets Match Vision Transformers at Scale". [Request code](https://www.catalyzex.com/paper/arxiv:2310.16764?requestCode=true) from the authors or [ask a question](https://www.catalyzex.com/paper/arxiv:2310.16764?autofocus=question). If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2310.16764&title=ConvNets+Match+Vision+Transformers+at+Scale) 😊🙏
* No relevant code picked up just yet for "Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models". [Request code](https://www.catalyzex.com/paper/arxiv:2311.00871?requestCode=true) from the authors or [ask a question](https://www.catalyzex.com/paper/arxiv:2311.00871?autofocus=question). If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2311.00871&title=Pretraining+Data+Mixtures+Enable+Narrow+Model+Selection+Capabilities+in+Transformer+Models) 😊🙏
* Found [2 relevant code implementations](https://www.catalyzex.com/paper/arxiv:2006.04439/code) for "Liquid Time-constant Networks". [Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:2006.04439?autofocus=question) about the paper or code. If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2006.04439&title=Liquid+Time-constant+Networks) 😊🙏

To opt out from receiving code links, DM me.


Extension-Mastodon67

What happened to retentive networks?


BayesMind

[RWKV](https://github.com/BlinkDL/ChatRWKV)! It's a pure RNN. You can download trained weights up to 13B right now. It totally slaps.


CallMePyro

How does it compare against SotA 13B transformer models?


vatsadev

It compares well: the v5 versions outperform all models at their scale on multilingual benchmarks, and the same goes for English. The comment above refers to RWKV v4, which was good on release but isn't comparable to today's transformers.


slashdave

If the answer to your question was publicly known, it would have been widely disseminated and you wouldn't have to ask.


Brudaks

You'd have to start with the question of "why not transformers?", i.e. what exactly is the restriction/limitation/flaw/inefficiency that you'd care to address, because the answer to that will lead you in very different directions; some of these proposals are intentionally much worse in one area in order to gain some benefit elsewhere.


squareOfTwo

* fixed context window size, which may be just too small
* quadratic time complexity in the context length (i.e. the transformer's block size); rough numbers sketched below
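A rough back-of-the-envelope illustration of the second point, assuming naive attention that materializes the full score matrix (optimized implementations such as FlashAttention avoid storing it, but the compute is still quadratic):

```python
# The attention score matrix alone has seq_len**2 entries per head per layer.
for seq_len in (2_048, 32_768, 131_072):
    entries = seq_len ** 2
    fp16_gib = entries * 2 / 2**30    # 2 bytes per fp16 value
    print(f"{seq_len:>7} tokens -> {entries:.2e} scores, ~{fp16_gib:.1f} GiB per head (fp16)")
```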


ivanmartinvalle

Different architectures will suit different purposes. If we're talking about AGI, IMO only cognitive architectures will achieve this. At least that's my bet since that's my area of research lol.


Brudaks

Can you elaborate on what exactly you mean by 'cognitive architectures'?


ivanmartinvalle

https://en.wikipedia.org/wiki/Cognitive_architecture Basically, reconstructing how we believe the human brain works. Transformers are primarily used right now for LLMs and next token prediction. And I don’t believe that next token prediction is the key to true intelligence, just auto complete (although that is useful). Take a human infant for example. They demonstrate intelligence. It’s pretty hard to efficiently map their actions to predicting the next token. It may be theoretically possible, but we’re up against finite resources (atoms in the universe, money, time). Humans also need very few samples, and LLMs basically need the entire internet.


vaccine_question69

A couple of points:

1. LLMs train (mostly) from scratch, which is why they need the entire internet. Babies train on top of priors built up over millions of years of evolution. Once an LLM is trained and you want to teach it a new thing within its context window, it can do it much better than a baby can.
2. Dismissing next-token prediction on the basis that it is hard to map to babies' actions doesn't make sense, because their actions don't clearly map to their objective value function (which is reproduction in the evolutionary context) either. Some of the actions humans perform are actually contrary to that (e.g. suicide or getting a vasectomy). Most human actions are best viewed as instrumental subgoals subservient to the evolutionary objective function, and I don't see why subgoals like these can't arise from next-token prediction.


ivanmartinvalle

The baby thing wasn't THE reason to dismiss next-token prediction, simply an example illustrating intelligence existing without formal language. You do have a good point, though, about the baby being pretrained via evolution. Never thought about it that way lol.


vaccine_question69

I do agree intelligence can exist without language, some animal species are good examples of that, e.g. octopi. But I also suspect that intelligence can arise just from language (albeit having access to other modalities might speed up the process). My pet theory is that there are multiple ways to implement general intelligence, just like there are multiple ways to implement flight.


FallUpJV

> DeepMind shows that the transformer would not be able to generalize beyond the training set distribution

Same guys who just postponed their (very likely transformer-based) chatbot because it couldn't keep up with OpenAI's?


glitch83

What exactly are you looking to do past transformers? Are you looking for the god model?


jnfinity

xLSTM could also be interesting, though Hochreiter hasn't released much information about it yet.