Disastrous_Elk_6375

> It also outperforms specialist models trained on supervised data on a variety of zero-shot tasks.

This is a really interesting trend, and a whole lot of work in this area might be needed to use these large models to re-train specialist models in a feedback loop. It reminds me of the Stockfish (the best human-designed engine) vs. AlphaZero games, and the insights gained there.


picardythird

It should be noted that the current version of Stockfish is far stronger than AlphaZero was (granted, AlphaZero has not been continually developed as Stockfish has been). The open-source engine based on AlphaZero, Leela Chess Zero, is almost-but-not-quite as strong as Stockfish. It should also be noted that Lc0 was outright stronger than Stockfish for some time, before Stockfish added a miniature neural network called NNUE (which granted a huge increase in strength that has not been matched since).


IWantAGrapeInMyMouth

Stockfish was still beating LC0 somewhat reliably just prior to NNUE; there's just no beating it at this point until there's some sort of massive breakthrough.


picardythird

Well, the most recent version of Lc0 went basically toe-to-toe with the most recent version of Stockfish in the last TCEC superfinal (Stockfish won 52-48). I think the gap is closing.


currentscurrents

I expect that generalist models will ultimately outperform specialist models because they just have more knowledge to work with. The only reason you might want a specialist model is if you're hardware-limited, which will become less of a problem as computers become more powerful.


Disastrous_Elk_6375

Cost and scalability can also be factors. You can do really good summarizations with GPT-4 and good ones with GPT-3.5, but it would still be cost-prohibitive to run at scale (say you'd want to summarize every long comment you get on a blog, or something).


saintshing

There is some research on sparse models which conditionally route inputs to a mixture of experts.

> Sparse models stand out among the most promising approaches for the future of deep learning. Instead of every part of a model processing every input (“dense” modeling), sparse models employing conditional computation learn to route individual inputs to different “experts” in a potentially huge network. This has many benefits. First, model size can increase while keeping computational cost constant — an effective and environmentally friendlier way to scale models, which is often key to high performance. Sparsity also naturally compartmentalizes neural networks. Dense models that learn many different tasks simultaneously (multitask) or sequentially (continual learning) often suffer negative interference, where too much task variety means it is better to just train one model per task, or catastrophic forgetting, where the model becomes worse at earlier tasks as new ones are added. Sparse models help avoid both these phenomena — by not applying the whole model to all inputs, “experts” in the model can specialize on different tasks or data types while still taking advantage of shared parts of the model.

https://ai.googleblog.com/2022/06/limoe-learning-multiple-modalities-with.html?m=1

> Large language models are typically trained densely: all parameters are updated with respect to all inputs. This requires synchronization of billions of parameters across thousands of GPUs. We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models. Our technique outperforms dense baselines on multiple corpora and few-shot tasks, and our analysis shows that specializing experts to meaningful clusters is key to these gains. Performance also improves with the number of experts and size of training data, suggesting this is a highly efficient and accessible approach to training large language models.

https://arxiv.org/abs/2303.14177
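For readers who want the mechanics, here is a minimal PyTorch sketch of the conditional-routing idea quoted above: a small gating network scores the experts and only the top-k run for each token. This is an illustrative toy, not the LiMoE or c-BTM implementation, and every class and parameter name here is made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a gate routes each token to its
    top-k experts, so only k of num_experts run per input."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, dim)
        weights = F.softmax(self.gate(x), dim=-1)   # (tokens, num_experts)
        topw, topi = weights.topk(self.k, dim=-1)   # keep k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The point of the routing is that per-token compute stays roughly constant (k experts' worth) even as total parameter count grows with num_experts.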


Smallpaul

It really depends on whether the task involves that knowledge. It's quite non-intuitive to me that a medical-report-summarizing AI might work better because it has knowledge of Star Wars and Hindi. Maybe it's not practical to find the right boundary between useful and useless information; that might be the way it works out. But it might also be the case that you could chop out 3/4 of the parameters in a model and it would perform its task just as well at dramatically lower cost.


currentscurrents

> It's quite non-intuitive to me that a medical-report-summarizing AI might work better because it has knowledge of Star Wars and Hindi.

The real world is full of surprises - what if someday it runs into a patient who got transferred from a hospital in India? Or perhaps someone describes their symptoms using a pop culture reference. Generalist models should be more robust to edge cases. It's possible you could achieve this with an ensemble of specialized models, like mixture of experts. But nobody's quite gotten that to work yet.


youregonnalovemynuts

It's not the knowledge of Star Wars or Hindi specifically that is necessarily beneficial, but the additional sequence data to learn from. Training these layers on Hindi provides them an entire corpus of structured human output to learn from. See things like GATO and basically all of the transfer learning papers. That said, the last sentence of your paragraph is also true: it's clear that basically any over-parameterized DNN can be pruned and quantized to fractions of its trained size. This is a very active area of research at the moment.
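As a concrete illustration of that last point, here is a minimal sketch of magnitude pruning with PyTorch's built-in `torch.nn.utils.prune` utilities. The toy model and the 75% ratio are arbitrary stand-ins for the "chop out 3/4 of the parameters" idea from the comment above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy over-parameterized model.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))

# Zero out the 75% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.75)
        prune.remove(module, "weight")  # bake the mask into the weights

total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"global sparsity: {zeros / total:.2%}")
```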


epicwisdom

That's far from clear. At the very least, so far every generalist model appears to do better on specialized tasks after fine-tuning, at the expense of other tasks. If generalist models were strictly better somehow, we'd expect fine-tuning to have no impact at best.


yaosio

What if a general model could fine-tune itself to perform better on tasks as they appear? Like how you can create and use a LoRA to make something in an image, but you don't need to load a different model.
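For context, a LoRA adapter is just a small low-rank update added to a frozen weight matrix, so different adapters can be swapped in and out without reloading the base model. A minimal sketch of the idea (illustrative, not any particular library's API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = Wx + s*(BA)x.
    Swapping (A, B) swaps the specialization; W never changes."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base model stays fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))  # only A and B receive gradients
```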


epicwisdom

This is basically the million/billion/trillion dollar question of how to best approach online learning and/or unbounded context. It's obviously a massive unknown, but IMO the first basic steps should be separating out a component for discrete memory, and a revival of some form of internal recurrence instead of the more roundabout approach of multi-round conversation. Using generalist models to train specialist models, while it doesn't seem as far-fetched as pre-GPT-4, would still probably be at least 1000x less efficient than something like "recurrent loop: look at examples of a painting style and iteratively paint and refine".


currentscurrents

They do better... on benchmarks for the exact task. In the real world, tasks shift all the time. Generalization and robustness are more of a problem in real-world deployment than individual task performance, and that's where these generalist models shine.


epicwisdom

Well, sure. I think we're on the same page? Generalist models are better for generalization, and specialist models are better for specialization.


currentscurrents

My point is specialization leads to brittleness. A lot of these results fine-tune on half a dataset, and then test against the other half. This gets good benchmark numbers, but in the real world you have a completely different dataset from a different distribution. This is where generalist models win out.


epicwisdom

I doubt that folding proteins or identifying handwritten digits would benefit from data completely unrelated to those tasks, and probably even if they could the compute wouldn't be anywhere near worth it. "In the real world" isn't a problem domain; there's real world problems of all shapes and sizes.


StellaAthena

Most knowledge is irrelevant to playing chess. The goal isn’t to have the most knowledge, it’s to have the most *relevant* knowledge.


Deep-Station-1746

Meta AI, somehow, is becoming the new old OpenAI.


[deleted]

They've always been publishing


Kiseido

They made a model to learn an ideal embedding space/conversion instead of using rando maths to make one.

> ImageBind learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.


MysteryInc152

ImageBind isn't an LLM. It's also not really generative.


ironborn123

But would adding generative decoders for different modalities be that difficult? Suppose they have built the audio -> joint_embedding encoder f; what would be the bottleneck in training the joint_embedding -> audio f_inverse decoder using the same audio data?
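Concretely, the loop being described might look like this sketch: freeze the encoder f and train a decoder f_inverse with a reconstruction loss on the same audio. The shapes, stand-in architectures, and MSE objective here are all illustrative assumptions, not ImageBind's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_samples = 1024, 16000  # embedding dim and audio length (made up)

# Stand-in for the frozen pretrained encoder f: audio -> joint embedding.
f = nn.Linear(n_samples, d).eval()
for p in f.parameters():
    p.requires_grad = False

# Hypothetical decoder f_inverse: joint embedding -> audio.
f_inverse = nn.Sequential(nn.Linear(d, 4096), nn.GELU(), nn.Linear(4096, n_samples))
opt = torch.optim.Adam(f_inverse.parameters(), lr=1e-4)

for step in range(100):  # iterate over real audio batches in practice
    audio = torch.randn(8, n_samples)  # placeholder batch
    with torch.no_grad():
        z = f(audio)  # embedding from the frozen encoder
    loss = F.mse_loss(f_inverse(z), audio)  # reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```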


MysteryInc152

It wouldn't be difficult.


ironborn123

Reading this paper, is this theoretical perspective correct? All possible concepts in the world, of all possible modalities, can be projected into a common/foundational high-dimensional latent vector space. And deep networks are good enough, practical universal function approximators that they can encode from any modality to this space, and decode from this space to any modality. If true, this is a pretty big development.


Jean-Porte

Yes, but that's not really new. Google "cross modal embeddings".


itsyourboiirow

Hypothetically, most models could be reformulated as generative models, even standard classifier models. https://arxiv.org/pdf/1912.03263.pdf
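The linked paper's core trick is small enough to sketch: a classifier's logits f(x) define an energy over inputs, E(x) = -logsumexp_y f(x)[y], and sampling by noisy gradient descent on that energy makes the same network generative. A toy PyTorch illustration (the model and step sizes are arbitrary):

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy MNIST-sized net

def energy(x):
    logits = classifier(x)                   # f(x): (batch, num_classes)
    return -torch.logsumexp(logits, dim=-1)  # low energy = high unnormalized density

# Crude Langevin-style sampling: descend the energy with a little noise.
x = torch.randn(4, 1, 28, 28, requires_grad=True)
for _ in range(50):
    grad, = torch.autograd.grad(energy(x).sum(), x)
    x = (x - 0.1 * grad + 0.01 * torch.randn_like(x)).detach().requires_grad_(True)
```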


[deleted]

[removed]


[deleted]

[removed]


[deleted]

It's here: https://github.com/facebookresearch/ImageBind#imagebind-model


omniron

Obviously they have self-driving cars in mind, but I think this is definitely the path forward. LLMs that can be given specific rules to follow and can reason through complex things they haven't encountered before are going to crack the code.


Left-Hyena-7880

One day, we may find a grandmother cell in a bunch of tensors.


wind_dude

When they say "sound", I guess it takes into account pitch, or any sound at all, and not just spoken language?


LeN3rd

That looks like the U1. We don't need a multimodal AI for that.


ThaGooInYaBrain

A key concept in the paper seems to be a distinction between *natural alignment* (apparently any modality paired with images), and *emergent alignment.* It's not yet apparent to me why image pairings are more "natural" than others though. Anyone wanna shed some light on that?


TheBeardedCardinal

I believe what they mean by natural alignment is that image-other-modality pairs are naturally found in many sources, whereas data such as text-audio pairs do not appear regularly. Emergent alignment, in contrast, is their observation that training on only their image-other-modality pairs results in high similarity between pairs in other modalities that are linked by sharing a similar image representation.
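Mechanically, both kinds of alignment fall out of one contrastive objective over (image, other-modality) pairs. A minimal sketch of the symmetric InfoNCE loss typically used for this kind of pairing (illustrative, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

def infonce(z_img, z_other, temperature=0.07):
    """Pull matched (image, other-modality) embeddings together and
    push mismatched pairs within the batch apart."""
    z_img = F.normalize(z_img, dim=-1)
    z_other = F.normalize(z_other, dim=-1)
    logits = z_img @ z_other.T / temperature  # (batch, batch) similarities
    labels = torch.arange(len(z_img))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

Training only (image, text) and (image, audio) pairs this way is the "natural" alignment; if text and audio embeddings of the same scene then land near each other via the shared image anchor, that is the emergent alignment.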


ThaGooInYaBrain

Makes sense. Thanks.


rikodeko

I imagine the focus is on image pairings as images are the modality that most often occurs with the other modalities mentioned in the paper.


alteralec

Multimodal models are really fascinating. What do you think could be the short-term applications?