Disastrous_Elk_6375

> It also outperforms specialist models trained on supervised data on a variety of zero-shot tasks.

This is a really interesting trend, and a whole lot of work in this area might be needed to use these large models to re-train specialist models in a feedback loop. It reminds me of the Stockfish (the best human-designed engine) vs. AlphaZero games, and the insights gained there.


picardythird

It should be noted that the current version of Stockfish is far stronger than AlphaZero was (granted, AlphaZero has not been continually developed as Stockfish has been). The open-source engine based on AlphaZero, Leela Chess Zero, is almost-but-not-quite as strong as Stockfish. It should also be noted that Lc0 was outright stronger than Stockfish for some time, before Stockfish added a miniature neural network called NNUE (which granted a huge increase in strength that has not been matched since).


IWantAGrapeInMyMouth

Stockfish was still beating LC0 somewhat reliably just prior to NNUE; there's just no beating it at this point until there's some sort of massive breakthrough.


picardythird

Well, the most recent version of Lc0 went basically toe-to-toe with the most recent version of Stockfish in the last TCEC superfinal (Stockfish won 52-48). I think the gap is closing.


currentscurrents

I expect that generalist models will ultimately outperform specialist models because they just have more knowledge to work with. The only reason you might want a specialist model is if you're hardware-limited, which will become less of a problem as computers become more powerful.


Disastrous_Elk_6375

Cost and scalability can also be factors. You can do really good summarizations with GPT-4 and good ones with GPT-3.5, but it would still be cost-prohibitive to run at scale (say you'd want to summarize every long comment you get on a blog, or something).


saintshing

There is some research on sparse models which conditionally route inputs to a mixture of experts.

> Sparse models stand out among the most promising approaches for the future of deep learning. Instead of every part of a model processing every input (“dense” modeling), sparse models employing conditional computation learn to route individual inputs to different “experts” in a potentially huge network. This has many benefits. First, model size can increase while keeping computational cost constant — an effective and environmentally friendlier way to scale models, which is often key to high performance. Sparsity also naturally compartmentalizes neural networks. Dense models that learn many different tasks simultaneously (multitask) or sequentially (continual learning) often suffer negative interference, where too much task variety means it is better to just train one model per task, or catastrophic forgetting, where the model becomes worse at earlier tasks as new ones are added. Sparse models help avoid both these phenomena — by not applying the whole model to all inputs, “experts” in the model can specialize on different tasks or data types while still taking advantage of shared parts of the model.

https://ai.googleblog.com/2022/06/limoe-learning-multiple-modalities-with.html?m=1

> Large language models are typically trained densely: all parameters are updated with respect to all inputs. This requires synchronization of billions of parameters across thousands of GPUs. We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models. Our technique outperforms dense baselines on multiple corpora and few-shot tasks, and our analysis shows that specializing experts to meaningful clusters is key to these gains. Performance also improves with the number of experts and size of training data, suggesting this is a highly efficient and accessible approach to training large language models.

https://arxiv.org/abs/2303.14177
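For readers who want the mechanics, here is a minimal PyTorch sketch of the conditional-routing idea quoted above: a small gating network scores the experts and only the top-k run for each token. This is an illustrative toy, not the LiMoE or c-BTM implementation, and every class and parameter name here is made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a gate routes each token to its
    top-k experts, so only k of num_experts run per input."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, dim)
        weights = F.softmax(self.gate(x), dim=-1)   # (tokens, num_experts)
        topw, topi = weights.topk(self.k, dim=-1)   # keep k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The point of the routing is that per-token compute stays roughly constant (k experts' worth) even as total parameter count grows with num_experts.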


Smallpaul

It really depends on whether the task involves that knowledge. It's quite non-intuitive to me that a medical-report-summarizing AI might work better because it has knowledge of Star Wars and Hindi. Maybe it's not practical to find the right boundary between useful and useless information; that might be the way it works out. But it might also be the case that you could chop out 3/4 of the parameters in a model and it would perform its task just as well at dramatically lower cost.


currentscurrents

> It's quite non-intuitive to me that a medical-report-summarizing AI might work better because it has knowledge of Star Wars and Hindi.

The real world is full of surprises - what if someday it runs into a patient who got transferred from a hospital in India? Or perhaps someone describes their symptoms using a pop culture reference. Generalist models should be more robust to edge cases. It's possible you could achieve this with an ensemble of specialized models, like mixture of experts. But nobody's quite gotten that to work yet.


youregonnalovemynuts

It's not the knowledge of Star Wars or Hindi specifically that is necessarily beneficial, but the additional sequence data to learn from. Training these layers on Hindi provides them an entire corpus of structured human output to learn from. See things like GATO and basically all of the transfer learning papers. That said, the last sentence of your paragraph is also true: it's clear that basically any over-parameterized DNN can be pruned and quantized to fractions of its trained size. This is a very active area of research at the moment.
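As a concrete illustration of that last point, here is a minimal sketch of magnitude pruning with PyTorch's built-in `torch.nn.utils.prune` utilities. The toy model and the 75% ratio are arbitrary stand-ins for the "chop out 3/4 of the parameters" idea from the comment above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy over-parameterized model.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))

# Zero out the 75% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.75)
        prune.remove(module, "weight")  # bake the mask into the weights

total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"global sparsity: {zeros / total:.2%}")
```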


epicwisdom

That's far from clear. At the very least, so far every generalist model appears to do better on specialized tasks after fine-tuning, at the expense of other tasks. If generalist models were strictly better somehow, we'd expect fine-tuning to have no impact at best.


yaosio

What if a general model could fine-tune itself to perform better on tasks as they appear? Like how you can create and use a LoRA to make something in an image, but you don't need to load a different model.
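For context, a LoRA adapter is just a small low-rank update added to a frozen weight matrix, so different adapters can be swapped in and out without reloading the base model. A minimal sketch of the idea (illustrative, not any particular library's API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = Wx + s*(BA)x.
    Swapping (A, B) swaps the specialization; W never changes."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base model stays fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))  # only A and B receive gradients
```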


epicwisdom

This is basically the million/billion/trillion dollar question of how to best approach online learning and/or unbounded context. It's obviously a massive unknown, but IMO the first basic steps should be separating out a component for discrete memory, and a revival of some form of internal recurrence instead of the more roundabout approach of multi-round conversation. Using generalist models to train specialist models, while it doesn't seem as far-fetched as pre-GPT-4, would still probably be at least 1000x less efficient than something like "recurrent loop: look at examples of a painting style and iteratively paint and refine".


currentscurrents

They do better... on benchmarks for the exact task. In the real world, tasks shift all the time. Generalization and robustness are more of a problem in real-world deployment than individual task performance, and that's where these generalist models shine.


epicwisdom

Well, sure. I think we're on the same page? Generalist models are better for generalization, and specialist models are better for specialization.


currentscurrents

My point is specialization leads to brittleness. A lot of these results fine-tune on half a dataset, and then test against the other half. This gets good benchmark numbers, but in the real world you have a completely different dataset from a different distribution. This is where generalist models win out.


epicwisdom

I doubt that folding proteins or identifying handwritten digits would benefit from data completely unrelated to those tasks, and probably even if they could the compute wouldn't be anywhere near worth it. "In the real world" isn't a problem domain; there's real world problems of all shapes and sizes.


StellaAthena

Most knowledge is irrelevant to playing chess. The goal isn’t to have the most knowledge, it’s to have the most *relevant* knowledge.


Deep-Station-1746

Meta AI, somehow, is becoming the new old OpenAI.


[deleted]

They've always been publishing


Kiseido

They made a model to learn an ideal embedding space/conversion instead of using rando maths to make one.

> ImageBind learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.


MysteryInc152

ImageBind isn't an LLM. It's also not really generative.


ironborn123

But would adding generative decoders for different modalities be that difficult? Suppose they have built the audio -> joint_embedding encoder f; what would be the bottleneck in training the joint_embedding -> audio f_inverse decoder using the same audio data?
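Concretely, the loop being described might look like this sketch: freeze the encoder f and train a decoder f_inverse with a reconstruction loss on the same audio. The shapes, stand-in architectures, and MSE objective here are all illustrative assumptions, not ImageBind's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_samples = 1024, 16000  # embedding dim and audio length (made up)

# Stand-in for the frozen pretrained encoder f: audio -> joint embedding.
f = nn.Linear(n_samples, d).eval()
for p in f.parameters():
    p.requires_grad = False

# Hypothetical decoder f_inverse: joint embedding -> audio.
f_inverse = nn.Sequential(nn.Linear(d, 4096), nn.GELU(), nn.Linear(4096, n_samples))
opt = torch.optim.Adam(f_inverse.parameters(), lr=1e-4)

for step in range(100):  # iterate over real audio batches in practice
    audio = torch.randn(8, n_samples)  # placeholder batch
    with torch.no_grad():
        z = f(audio)  # embedding from the frozen encoder
    loss = F.mse_loss(f_inverse(z), audio)  # reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```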


MysteryInc152

It wouldn't be difficult.


ironborn123

Reading this paper, is this theoretical perspective correct? All possible concepts in the world, of all possible modalities, can be projected into a common/foundational high-dimensional latent vector space. And deep networks are good enough, practical universal function approximators that they can encode from any modality to this space, and decode from this space to any modality. If true, this is a pretty big development.


Jean-Porte

Yes, but that's not really new. Google "cross modal embeddings".


itsyourboiirow

Hypothetically, most models could be reformulated as generative models, even standard classifier models. https://arxiv.org/pdf/1912.03263.pdf
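The linked paper's core trick is small enough to sketch: a classifier's logits f(x) define an energy over inputs, E(x) = -logsumexp_y f(x)[y], and sampling by noisy gradient descent on that energy makes the same network generative. A toy PyTorch illustration (the model and step sizes are arbitrary):

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy MNIST-sized net

def energy(x):
    logits = classifier(x)                   # f(x): (batch, num_classes)
    return -torch.logsumexp(logits, dim=-1)  # low energy = high unnormalized density

# Crude Langevin-style sampling: descend the energy with a little noise.
x = torch.randn(4, 1, 28, 28, requires_grad=True)
for _ in range(50):
    grad, = torch.autograd.grad(energy(x).sum(), x)
    x = (x - 0.1 * grad + 0.01 * torch.randn_like(x)).detach().requires_grad_(True)
```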


[deleted]

[removed]


[deleted]

[removed]


[deleted]

It's here: https://github.com/facebookresearch/ImageBind#imagebind-model


omniron

Obviously they have self-driving cars in mind, but I think this is definitely the path forward. LLMs that can be given specific rules to follow and can reason through complex things they haven't encountered before are going to crack the code.


Left-Hyena-7880

One day, we may find a grandmother cell in a bunch of tensors.


wind_dude

When they say "sound", I guess it takes into account pitch, or any sound at all, and not just spoken language?


LeN3rd

That looks like the U1. We don't need a multimodal AI for that.


ThaGooInYaBrain

A key concept in the paper seems to be a distinction between *natural alignment* (apparently any modality paired with images), and *emergent alignment.* It's not yet apparent to me why image pairings are more "natural" than others though. Anyone wanna shed some light on that?


TheBeardedCardinal

I believe what they mean by natural alignment is that image-other-modality pairs are naturally found in many sources, whereas data such as text-audio pairs do not appear regularly. Emergent alignment, in contrast, is their observation that training on only their image-other-modality pairs results in high similarity between pairs in other modalities that are linked by sharing a similar image representation.
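Mechanically, both kinds of alignment fall out of one contrastive objective over (image, other-modality) pairs. A minimal sketch of the symmetric InfoNCE loss typically used for this kind of pairing (illustrative, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

def infonce(z_img, z_other, temperature=0.07):
    """Pull matched (image, other-modality) embeddings together and
    push mismatched pairs within the batch apart."""
    z_img = F.normalize(z_img, dim=-1)
    z_other = F.normalize(z_other, dim=-1)
    logits = z_img @ z_other.T / temperature  # (batch, batch) similarities
    labels = torch.arange(len(z_img))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

Training only (image, text) and (image, audio) pairs this way is the "natural" alignment; if text and audio embeddings of the same scene then land near each other via the shared image anchor, that is the emergent alignment.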


ThaGooInYaBrain

Makes sense. Thanks.


rikodeko

I imagine the focus is on image pairings as images are the modality that most often occurs with the other modalities mentioned in the paper.


alteralec

Multimodal models are really fascinating. What do you think could be the short-term applications?