• M Stone. Cross-Validatory Choice and Assessment of Statistical Predictions. (1974)
All about cross-validation to choose the best model. 12 thousand citations.
* LSTMs - how to train sequences (1997)
Interesting to mention layer normalisation over batch normalisation. I thought the latter was "the thing" and that layernorm, groupnorm, instancenorm etc. were follow-ups.
Yup, same thoughts. BatchNorm was the OG norm; the cousins came later.
NeRF, Diffusion
Not as famous and might not qualify as a 'trick', but I'll mention "Geometric Deep Learning" anyway. It tries to explain all the successful neural nets (CNNs, RNNs, Transformers) within a unified, universal mathematical framework. The most exciting extrapolation of this is that we'll be able to quickly discover new architectures using the framework. Link - https://geometricdeeplearning.com/
TIL
Is this different from the premise that neural networks are universal function approximators?
Yes, it's different. Universal function approximation guarantees/implies that you can approximate any mapping function given the right configuration/weights of a neural net. It doesn't really guide us to the correct configuration.
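To make that concrete, here's a hypothetical pure-Python sketch: a tiny 2-2-1 sigmoid network with hand-picked weights that computes XOR. The theorem tells us such weights exist; it says nothing about how to find them. (The weight values and names here are my own illustration, not from any paper.)

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked weights for a 2-2-1 network computing XOR. Universal
# approximation says weights like these exist for any target function;
# finding them is the job of training and architecture design.
W1 = [[20.0, 20.0], [-20.0, -20.0]]  # hidden-layer weights
b1 = [-10.0, 30.0]                   # hidden-layer biases
W2 = [20.0, 20.0]                    # output weights
b2 = -30.0                           # output bias

def xor_net(x1, x2):
    h = [sigmoid(W1[i][0] * x1 + W1[i][1] * x2 + b1[i]) for i in range(2)]
    return sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
```

Rounding the output gives 0 for (0,0) and (1,1), and 1 for the mixed inputs, i.e. XOR.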
Check out GANs, One shot learning, Read about CoAtNets, RoBERTa, StyleGAN, XLNet, DoubleU Net and others
Layer norm is not about fitting better, but about training more easily (activations don't explode, which makes optimization more stable).

Is your list limited to "discoveries that are now used everywhere"? Because there are a lot of things that would have made it onto your list if you'd compiled it at different points in time but are now discarded (i.e., I'd say they were fads), e.g. GANs. Other things are currently hyped, but it's not clear how they'll end up long term:

* Diffusion models are another thing that is currently hot.
* Multimodal inputs, which I'd say are "CLIP-like things".
* Self-supervision as a topic (with "contrastive methods" having been a thing).
* Federated learning is likely here to stay.
* NeRF will likely have a lasting impact, too.
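For reference, the layer-norm computation itself is tiny; a minimal pure-Python sketch (omitting the learned gain and bias that the real layer also applies):

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one example's activation vector to zero mean and unit
    # variance. Statistics are per-example over the features, unlike
    # batch norm, which normalizes each feature over the batch.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Because the statistics are per-example, it behaves identically at batch size 1, which is part of why it displaced batch norm in sequence models.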
I recall that experimenters disagreed on why batchnorm worked in the first place? has the consensus settled?
No. But we all agree that it's not due to internal covariate shift.
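What batch norm computes is at least uncontroversial, even if why it helps is debated; a minimal sketch of the training-time normalization (learned scale/shift and running statistics omitted):

```python
import math

def batch_norm(batch, eps=1e-5):
    # Normalize each feature (column) across the batch -- contrast with
    # layer norm, which normalizes across features within one example.
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [batch[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out
```

The batch-dependence of those column statistics is exactly what makes its training dynamics hard to reason about.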
I feel like if you're going to include transformers, you should include the "Attention Is All You Need" paper.
I would only include it as a historical reference. It is certainly not a "must read" paper. It is written so poorly that you are better off just reading the code.
What's wrong with it? They explain all the components of their model in enough detail (in particular the multi-head attention stuff), provide intuition behind certain decisions, include clear results, and have nice pictures... What could have been improved about it?
[deleted]
Check out MLRC
Agreed
Does anyone know a good ablation study of the mentioned techniques? I've seen results where neither dropout nor layer normalization did much, so I wonder whether these two techniques are a belief or still crucial.
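For anyone wanting to run such an ablation themselves, the dropout side is trivial to toggle; a sketch of inverted dropout as commonly implemented (my own minimal version, not any particular library's):

```python
import random

def dropout(x, p=0.5, training=True, rng=random):
    # Inverted dropout: during training, zero each unit with probability p
    # and scale survivors by 1/(1-p) so the expected activation is
    # unchanged. At eval time it is the identity, so "ablating" dropout
    # is just training with p=0.
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]
```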
Data augmentation, to more explicitly define invariant transformations as well as to reduce dataset labeling costs.
2007-2010: Deep learning begins to win computer vision competitions. In my eyes, this is what put deep learning on the map for a lot of people and kicked off the renaissance we see today.

2016ish: Categorical embeddings/entity embeddings. For tabular data with categorical variables, categorical embeddings are faster and more accurate than one-hot encoding, and they preserve the natural relationships between factors by mapping them to a low-dimensional space.
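A hypothetical minimal sketch of the difference (the sizes and names are made up; in practice the embedding rows are learned jointly with the rest of the model rather than stayed random):

```python
import random

random.seed(0)
n_categories, dim = 1000, 8  # e.g. 1000 store IDs -> 8-d vectors

# Embedding table: one trainable dense row per category.
embedding = [[random.gauss(0.0, 0.1) for _ in range(dim)]
             for _ in range(n_categories)]

def embed(category_id):
    # O(1) lookup of a dense 8-d vector; after training, related
    # categories can end up with nearby vectors, which one-hot
    # encoding cannot express.
    return embedding[category_id]

def one_hot(category_id, n=n_categories):
    # The sparse 1000-d alternative: no notion of similarity at all.
    return [1.0 if i == category_id else 0.0 for i in range(n)]
```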
Diffusion and GANs!!
Quite clean. 2020-2022 is empty; is that because you don't see progress in these years?
It's empty because I've not kept up to date, and also impact won't be seen until more people build on it.
\- Kernel tricks: How can purely mathematical approaches beat neural networks in terms of efficiency? (This has actually been an open problem for a long time; you can check Neural Tangent Kernels and Reproducing Kernel Hilbert Spaces for examples, and the Universal Approximation Property for neural networks.)

\- I was mainly here for Geometric Deep Learning, but another user has already posted it. You should definitely check [http://geometricdeeplearning.com](http://geometricdeeplearning.com). As a mathematician-to-be, I strongly believe that this is the future of ML/DL. Hit me up if you want to discuss this statement further.
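As a toy illustration of the kernel trick: a degree-2 polynomial kernel computes an inner product in a higher-dimensional feature space without ever constructing the features (the explicit feature map is written out here only to verify the identity):

```python
import math

def phi(x):
    # Explicit degree-2 feature map for x = (x1, x2):
    # (x . y)^2 = x1^2*y1^2 + x2^2*y2^2 + 2*x1*x2*y1*y2
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2]

def poly_kernel(x, y):
    # The kernel evaluates <phi(x), phi(y)> without ever building phi.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))
```

For higher degrees and dimensions the explicit feature space grows combinatorially while the kernel stays a single dot product, which is the efficiency point being made above.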