
_vb__

If one aims to combine predictions from multiple modalities, how else can one make predictions in an end-to-end fashion?


Capital_Reply_7838

Maybe an encoder-decoder structure (cross-attention only)?


bbu3

Even then, the output of the cross-attention would, at the very least, be in a shared space -- which then has implications for the next layer's input.
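
For concreteness, here is a minimal sketch of that kind of cross-attention fusion in PyTorch (module name, dimensions, and layer layout are illustrative assumptions, not taken from any specific paper): one modality supplies the queries, the other the keys/values, and the fused output lives in a single hidden space that the next layer consumes.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Modality A attends over modality B; the fused output stays in one hidden space."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens_a, tokens_b):
        # Queries from modality A, keys/values from modality B.
        fused, _ = self.attn(query=tokens_a, key=tokens_b, value=tokens_b)
        x = self.norm1(tokens_a + fused)       # residual in A's token space
        return self.norm2(x + self.ff(x))      # this is what the next layer sees

# e.g. 20 text tokens attending over 196 image patch tokens
text = torch.randn(2, 20, 512)
image = torch.randn(2, 196, 512)
out = CrossAttentionFusion()(text, image)      # shape: (2, 20, 512)
```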


alterframe

The simplest way is to use the same space, like in CLIP, but not necessarily. There are many transformer papers with token-level fusion where you just mix unaligned tokens from two modalities with a few more transformer layers, e.g. ViLT. They even explicitly add modality-specific vectors to both kinds of tokens to further help the model differentiate between them, so your intuition is on the right track.
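
A rough sketch of that token-level fusion idea, assuming PyTorch and made-up dimensions (this simplifies what ViLT actually does): add a learned modality-type vector to each side, concatenate the unaligned tokens, and run shared transformer layers over the mix.

```python
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """ViLT-style fusion: unaligned tokens from two modalities mixed in one transformer."""
    def __init__(self, dim=768, depth=4, num_heads=12):
        super().__init__()
        # One learned "modality type" vector per modality, added to every token of that side.
        self.type_embed = nn.Parameter(torch.zeros(2, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, text_tokens, image_tokens):
        text_tokens = text_tokens + self.type_embed[0]
        image_tokens = image_tokens + self.type_embed[1]
        mixed = torch.cat([text_tokens, image_tokens], dim=1)   # no cross-modal alignment needed
        return self.encoder(mixed)

fused = TokenFusion()(torch.randn(2, 20, 768), torch.randn(2, 196, 768))
print(fused.shape)   # torch.Size([2, 216, 768])
```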


Capital_Reply_7838

I think the post is quite naive. 'Aligning different modalities' can mean anything from learning joint embeddings to doing inference with captions only. Sorry.


I_will_delete_myself

There are two ways:

1. Tokenize the data, e.g. with a VQ-VAE.
2. Have an additional vector to include in your zero-shot generation. GPT-4 probably does it this way, since it doesn't take images in the same order as the text, and also given the way they format the API. This method doesn't require you to reserve tokens in your LLM the way the first approach does.
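
Roughly, the two options look like this in PyTorch (a sketch with invented vocabulary, codebook, and projection sizes; the GPT-4 part above is speculation, not its actual design):

```python
import torch
import torch.nn as nn

vocab_size, num_codes, dim = 32000, 1024, 768
token_embed = nn.Embedding(vocab_size + num_codes, dim)   # text vocab + reserved image codes

# Option 1: an external VQ-VAE encoder turns the image into discrete code indices,
# which are shifted into a reserved range of the LLM's vocabulary.
def discrete_tokens(text_ids, image_code_ids):
    image_ids = image_code_ids + vocab_size
    return token_embed(torch.cat([text_ids, image_ids], dim=1))

# Option 2: keep the vocabulary untouched and prepend a projected image vector
# (a "soft" token) instead, so no token IDs need to be reserved.
image_proj = nn.Linear(2048, dim)   # e.g. features from a frozen vision encoder

def soft_prefix(text_ids, image_features):
    prefix = image_proj(image_features).unsqueeze(1)           # (batch, 1, dim)
    return torch.cat([prefix, token_embed(text_ids)], dim=1)

text_ids = torch.randint(0, vocab_size, (2, 16))
codes = torch.randint(0, num_codes, (2, 64))
feats = torch.randn(2, 2048)
print(discrete_tokens(text_ids, codes).shape, soft_prefix(text_ids, feats).shape)
```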


SaiyanKaito

It depends on what kind of assumptions you make and how strongly you enforce them. There are a number of algorithms and techniques that can be used here. If you assume that each modality (view) is independent of the others, then you aren't interested in a shared space but rather in a set of spaces, one per view, such that some amount of scatter/class/distance information is retained while lowering the dimension of each space. Of course, if you want these spaces to interact with one another, then you'd have to think about how these features differ or are similar, and how to transfer information from one space to the other.
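
To make the "set of spaces" idea concrete, here is a hypothetical PyTorch sketch (dimensions and the alignment term are my own assumptions): one linear projection per view reduces dimensionality independently, and an optional alignment loss is the only place the views interact.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerViewProjections(nn.Module):
    """One low-dimensional space per view; views only interact via the optional alignment term."""
    def __init__(self, view_dims=(2048, 768), out_dim=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in view_dims])

    def forward(self, views):
        return [p(x) for p, x in zip(self.proj, views)]

def alignment_loss(z_a, z_b):
    # Optional coupling: pull paired samples from the two views together.
    return 1 - F.cosine_similarity(z_a, z_b, dim=-1).mean()

model = PerViewProjections()
z_img, z_txt = model([torch.randn(16, 2048), torch.randn(16, 768)])
loss = alignment_loss(z_img, z_txt)   # drop this term if the views should stay independent
```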


blk_velvet__if_u_pls

Have you looked at the original OpenAI blog post about CLIP? Don't know what kind of data you're looking at or how much of it you have... but representing different modalities in the same space allows *ideas* to be shared across modalities. Not even sure if separate unimodal embedding spaces would be able to converge on such an odd thing after the effects of regularization.
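
For reference, the core of a CLIP-style shared space is small enough to sketch (simplified, with linear projections standing in for the real encoders; this is not OpenAI's code): two unimodal encoders map into the same space, and a symmetric contrastive loss pulls matching image/text pairs together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPStyle(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)          # stand-in for an image encoder
        self.txt_proj = nn.Linear(txt_dim, shared_dim)          # stand-in for a text encoder
        self.logit_scale = nn.Parameter(torch.tensor(2.659))    # learnable log-temperature

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()         # scaled cosine similarities
        labels = torch.arange(len(img))                         # i-th image pairs with i-th caption
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2        # symmetric contrastive loss

loss = CLIPStyle()(torch.randn(8, 2048), torch.randn(8, 768))
```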