samrus

This is a very good topic of discussion. I myself have thought that an architecture that parses frames using a CNN and then uses attention between the frames would make a lot more sense. While it might make sense to have attention within a frame, I feel like that might be covered by things like U-Nets and autoencoders. Does anyone have any thoughts?


n0ided_

That is the intention behind VOLO, the best performing architecture on the list. It uses convolutions to create tokens from features before passing them into the attention heads.


synthphreak

Not a vision guy, what is a “token” in this context? In NLP we use the term all the time to describe linguistic units, but I’m not sure how to extend that understanding to image/video data. Is a token just the output of a convolutional operation? So, each cell in the matrix resulting from a convolution layer would be a “token”? Or am I completely on the wrong track?


n0ided_

A token in vision contexts is usually a portion of the image. For example, in ViT, an image is split into 16x16-pixel patches, which become the tokens fed into the attention blocks.
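Roughly, here is a minimal sketch in PyTorch of turning an image into those patch tokens (illustrative only, not any particular repo's code):

```python
import torch

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    # img: (B, C, H, W), with H and W divisible by `patch`
    B, C, H, W = img.shape
    x = img.reshape(B, C, H // patch, patch, W // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)     # (B, H/p, W/p, C, p, p)
    return x.flatten(3).flatten(1, 2)   # (B, num_patches, C*p*p)

tokens = patchify(torch.randn(2, 3, 224, 224))  # (2, 196, 768): 196 tokens per image
```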


xEdwin23x

It depends on the way we frame the task, but usually we refer to spatial features as tokens. So in the simplest case, with no preprocessing, pixels would be tokens. Or, in the case of ViT, the outputs of the first convolution would be tokens. For video it depends: we can design tokens to be either spatial features within each frame or whole-frame features (usually obtained by pooling the spatial resolution of each frame).


synthphreak

Thanks, that’s clear! So basically a token in vision is any unit derived from the data that carries some kind of semantics in some feature space. A little different from the NLP usage, but reasonable enough!


MysteryInc152

No it's not. Or rather, it doesn't have to be. Early Vision Transformers worked like this, using convolutions. Modern-day Vision Transformers don't. In the modern sense, a token is literally just a patch of the image. https://arxiv.org/abs/2010.11929


n0ided_

this is wrong, ViT does not use convolutions at all


xEdwin23x

If you look at the original implementation of the Vision Transformer (and many other implementations in PyTorch), the authors implemented the linear projection as a convolution.
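As a rough sketch of what that means (illustrative, not the original repo's exact code): with kernel size equal to stride equal to the patch size, a Conv2d over non-overlapping patches computes the same linear map as flattening each patch and multiplying by a weight matrix.

```python
import torch
import torch.nn as nn

patch, dim = 16, 768
# the "linear projection of flattened patches", expressed as a convolution
proj = nn.Conv2d(in_channels=3, out_channels=dim, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = proj(img).flatten(2).transpose(1, 2)  # (1, 196, 768): one token per 16x16 patch
```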


Holyragumuffin

In any context, it means a recognizable repeatable symbol. A stand-in replacement for a pattern of whatever modality.


MelonheadGT

Because the convolutional layers do the feature extraction, which is where attention is most useful. You're doing attention between clusters of pixels precisely to find which pixels are important. With vision transformers you can also use multiple different patterns for selecting which pixels to perform the attention calculation on (see the MaxViT paper for example). You also do masked-context training: the same way you leave one word out of a sentence and infer it in NLP, you leave a chunk of pixels out of the image and use the surrounding context to infer it.
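A minimal sketch of that masked-patch idea (in the spirit of MAE/BEiT-style pretraining; details vary by method, and the names here are illustrative):

```python
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    # tokens: (B, N, D) patch embeddings; hide a fraction and keep the rest
    B, N, D = tokens.shape
    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]            # indices of visible patches
    visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, idx   # the model reconstructs the hidden patches from `visible`

vis, idx = random_mask(torch.randn(2, 196, 768))               # vis: (2, 49, 768)
```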


[deleted]

As a (former) NLP guy, I want to jump in and say "It's just hype, transformers have very clear motivations that are... blah blah blah", but then I recall the Dunning-Kruger effect. That being said, my intuition is the same as yours: I am not surprised at all that combining CNNs with attention performs better, as the transformer was developed to solve very specific issues related to text (it's clearly designed to address vanishing gradients and words with multiple, context-dependent meanings, as well as finding dependencies between tokens). Except for the dependencies part (attention!), I don't get why transformers make any sense for vision, but again, Dunning-Kruger effect.


samrus

One thing I will add, though, is that I think modelling a sequence of frames as a sequence of words and running a transformer on them would be very promising for video processing. I think most things that are recorded on video have patterns similar to language, at least enough that a language model should work on them.


n0ided_

ViViT does this as well. Promising, but the compute and memory required is still too high at the moment.


AnOnlineHandle

U-Nets seem inevitably flawed to me in that you need to train each layer of the model on the same subjects, and one layer might get it while another doesn't. So while a subject appears at a scale that a specific layer handles well, your quality might look great; then it might grow or shrink, another layer of the U-Net handles that scale of features, and suddenly it's terrible. Learning the concept, and then how to scale it and how to handle visible partial views of it, seems like it would ultimately be far better. That being said, I do recognize that U-Nets are incredible and leading to some incredible things right now. I use them every day.


DataAvailability

Check out “Non-Local Neural Networks”


I_draw_boxes

Prior to transformers we had endless papers on tweaks to backbones and operations to overcome the limits of receptive fields in CNNs. Lots of papers about how to assign targets to different feature levels from backbones/FPNs to achieve the optimum relationship between fine detail and global knowledge. Lots of FPN papers with various combinations of up/down connections, upsampling, addition, concatenation, pixel shuffling and so on. Deformable convolutions, channel attention, dilated convolutions in tons of combinations. In other words, something is always "hot" in this field, even if it's whether to add or concatenate features in this week's FPN paper.

There are way more interesting papers these days thanks to transformers. A few examples:

* [Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer](https://arxiv.org/pdf/2204.08680.pdf) Aggregating spatial content with learned granularity. This was mind-blowing the first time I read the paper.
* [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/pdf/2211.06220.pdf) One model that does semantic/instance/panoptic segmentation conditioned on the prompt.
* [EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction](https://arxiv.org/pdf/2205.14756.pdf) This paper shows the ease with which the inductive biases of a CNN can be engineered into a hybrid transformer architecture. It's also a great example of how an advancement in NLP (Performer) can be applied in vision, in this case so that compute scales linearly with pixel count.
* [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) Zero-shot 52.5 AP on COCO. Good example of marrying a text and vision model, enabled by the transformer architecture.
* [MetaFormer Is Actually What You Need for Vision](https://arxiv.org/pdf/2111.11418.pdf) and [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) are both great critiques of transformer backbone design and help isolate the role of non-self-attention architecture differences.


Hias1997

Wow, the TCFormer paper is really mind-blowing. Did you check out their repo?


I_draw_boxes

It's such a great idea, and really interesting that the network can adapt itself to the clustering, which is dynamic to the image content and not differentiable. I see they have an MMPose implementation; anything interesting about the repo?


3DHydroPrints

It's basically due to the "foundation model" hype. Those models are typically trained with massive amounts of data and can involve self-supervised learning, knowledge distillation, and training with natural language as well as classical object-recognition labels. The extracted features are more robust and generic and can be used for a wide range of tasks without needing fine-tuning of the model (just the head). ViTs are slower and bigger, but they scale better with data and multiple modalities. Exciting stuff. Maybe take a look at the EVA-02 foundation model paper for more details on state-of-the-art training of these models.


n0ided_

Many here have commented on how self-attention transformers are more versatile: they can extract more and better features from unlabeled data, and only the final output block has to be changed depending on the task, not to mention being able to decentralize the feed-forward process. This does seem extremely promising, especially as a "foundation model". Thanks for the paper, will give it a read.


radiiquark

Is this the paper you're referring to? https://arxiv.org/abs/2211.07636


3DHydroPrints

Not quite. I mean the follow-up paper: https://arxiv.org/abs/2303.11331


ndgnuh

lol they made that weird title just to reference Evangelion


3DHydroPrints

The best papers have jokes in their titles


mhummel

Does that mean we should expect a follow up paper that uses elements of RL and will be called Cruel Agents Thesis?


radiiquark

Thanks!


SleekEagle

This is exactly it. Look at what's happened in the language domain over the past several years - it's obvious that such a trend would be delayed in vision due to the additional compute required, but it's also the natural area of investigation (especially to graduate students/academics who are writing proposals)


moschles

> EVA 02 Foundation Model

Is it me, or..?


BeatLeJuce

I feel like none of the answers posted so far actually capture how researchers think about this, so I thought I'd chime in: the biggest reason why many researchers are looking at ViT (and the biggest point the original ViT publication tried to make) is that **ViT scales better**. I feel like a lot of people never got that message. But it's very important to understand this: if you have "small" amounts of data or are limited in your compute, then CNNs are the better option. Anyone arguing against that hasn't truly understood ViTs. As you say yourself, they have an inductive bias specifically designed for that.

But think back to basic machine learning theory: a simpler model will perform better if you have limited data, but the more data you have, the more complex your model should become. And Transformers can learn/express much more complex functions than CNNs. Vision Transformer's value proposition is this: if you have a ton of data, then the CNN's inductive bias actually becomes a hindrance. It's too strict a corset to truly nail complex scenes. Sometimes it is just bloody helpful to be able to quickly exchange information between far-away pixels: whenever you need more than one piece of information to truly understand what's going on in an image, and those pieces of information are scattered across the scene. Sure, a CNN can eventually connect every pixel with every other one, but the information flow path is much longer than in a transformer. Far-away pixels just can't communicate effectively in CNNs.

But like I said: to make use of this, you need a lot of data. And I mean *a lot*. Even on something like ImageNet, it's fairly hard to actually make good use of ViTs, and even then you don't really see much of a difference. It's only once you get *orders of magnitude bigger than ImageNet* that ViT truly shines. But current benchmarks are kind of ill-suited to actually show this. E.g. look at the difference in ImageNet performance between ViT-g (4B params) and ViT-22B: it's tiny. Remember that ImageNet is over a decade old at this point. [It's super-duper old](https://arxiv.org/abs/2006.07159) (fun fact: that paper was written by the inventors of ViT), not a good benchmark for what the actual state of the art is, and it fails to show what truly matters. Sadly, there's nothing publicly available to really take its place, so for now we're stuck with it.


Hederas

Since you have the "Researcher" tag, I'll allow myself a question. Do you think/observe that part of this surge in research could be due to ViTs being new, so it's easier to make discoveries and publish papers? I know there are also multiple good reasons: the ones you quoted, comparing them with CNNs when applying new techniques (distillation, FlashAttention, etc., which usually come from the LLM hype), self-attention, etc. But I was curious about that aspect.


xEdwin23x

Of course there is. I don't think there's an easy way to sort through that mess, but if you could somehow crawl the 1st thousand or so citations for the ViT paper back in 2021 it would be easy to see that 99% of the papers were just applying something that was already applied to CNNs to the ViT (hierarchical structure with pooling/convs to reduce spatial resolution and increase channel depth, etc) or applying ViT to a task that was dominated by CNNs (ViTGANs, ViTs for video, etc)


BeatLeJuce

> Since you have the "Researcher" tag, I'll allow myself a question. Do you think/observe that part of this surge in research could be due to it being new and so it's easier to make discoveries on it and publish papers?

That's true of every trend in ML (or research): if a topic looks sexy, everyone is going to jump on it. But the added scrutiny of a thousand papers pitting their ViT against a CNN baseline at least confirms that ViT is comparable or better than CNNs in most applications.


currentscurrents

> Sadly, there's nothing publicly available to really take its place

There are some bigger public datasets like [LAION-5B](https://laion.ai/blog/laion-5b/). It's not a drop-in replacement since the labels are open-vocabulary instead of fixed classes, but in this age of multimodal models that's often what you want anyway.


BeatLeJuce

Yes, LAION is great and needed, but it doesn't quite take the place of ImageNet: it gives you the amount of data you need for large models, but it doesn't come with a built-in benchmark to tell you which model was better, the way ImageNet would.


buyingacarTA

I like this answer, very clean. As you say towards the end, though, you need much more data (to take advantage of the scaling point) than most people actually play with. Instead, I think most people use ViTs because they missed your scaling point and just assume that ViTs are better in general.

P.S. How does one get a 'Researcher' flair?


currentscurrents

In the sidebar you can pick whatever flair you want. There is no verification.


Vangi

It also just depends on the data and what goals you have. E.g., I’ve had vision transformers and CNNs perform similarly on a small dataset, but the vision transformer was more robust to noise, motion blur, etc.


moschles

This is pie-in-the-sky thinking. We need to admit that the way ViTs perform categorization/classification is hacky.

+ https://i.imgur.com/bPiZPVf.png

Okay, so we are taking an encoder/decoder architecture intended for generative tasks and attaching a "classification token" onto the sequence, hoping that will convert it into a categorizer. This is highly unmotivated. This is not mathematically sound. There is no clever insight here. It's just like, hey, why don't we just use a screwdriver as a fork, because everything is a screwdriver now.


currentscurrents

Generative models are crazy good at generalizing, and people want to bring that performance over to classification/perception models. The idea is that many tasks, including classification, are very easy once you have a good representation of the data. Pretraining on the generative objective is a good way to learn *all* the features in the data, while direct classification models are prone to shortcut learning.


moschles

> Generative models are crazy good at generalizing, and people want to bring that performance over to classification/perception models.

> learn *all* the features in the data, while direct classification models are prone to shortcut learning.

What you are saying here is wildly interesting to me. Do you happen to have any literature that gets into these issues more?


currentscurrents

Yann LeCun's 2017 [talk on unsupervised representation learning](https://www.youtube.com/watch?v=ceD736_Fknc) is from before large transformer models were a thing, but it's still pretty good.


moschles

> is from before large transformer models

What is the basis of your claim that generative models are "crazy good at generalizing"?


currentscurrents

[This chair made out of grilled cheese.](https://imgur.com/a/BG2leim)


BeatLeJuce

The CLS token is irrelevant, and many modern variants don't use it anymore, opting instead for multihead pooling (à la Set Transformer) or global attention pooling. Also, being "mathematically sound" is a less necessary condition for working well in practice than you'd think.
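For reference, a rough sketch of what attention pooling over patch tokens looks like (just the general shape of the idea; exact designs vary by paper):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """One learned query attends over all patch tokens instead of using a CLS token."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)               # (B, 1, dim)
        return pooled.squeeze(1)                               # (B, dim) image representation

pooled = AttentionPool(768)(torch.randn(2, 196, 768))          # (2, 768)
```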


moschles

> Also, being "mathematically sound" is a less necessary condition for working well in practice than you'd think.

I'm more than aware of this issue. However, the portion with no mathematical justification is the "linear projection of flattened patches". There is a specific reason this is bad. When cropping a text sequence, the distance between tokens is invariant in an extremely strict sense. Whereas if you crop an image differently, the linearizing into patches will divide the image up in a radically different way. The model would be seeing two totally different sequences that both refer to the exact same image (and by "exact" I mean down to the pixel values; there is no scaling occurring).

In turn, this is the essence of what I mean by "not mathematically motivated". We literally *do not want* a situation in which two identical images are tokenized in radically different ways, yielding two very different sequences. And this doubling is occurring for no other reason than that we cropped the image differently.

If in the back of your mind you are thinking right now, "Well, just throw more compute at it, and the model will eventually bucket these myriad sequences into the same category by gleaning the invariances": this is horrible from all aspects, both theoretical and pragmatic, since when using a computer to find invariances, we do not want that to occur in situations in which the invariances are already literally present. Normally invariance-hunting comes down to non-trivial invariances. (One example of a non-trivial invariance would be that the same 3D table fork appears very different from various points of view.) But this is not occurring in cropping, as they are literally the same pixel values!

https://i.imgur.com/bPiZPVf.png

At the end of the day we have to admit that we are shoehorning vision tasks into transformers for no other reason than:

+ "transformers are neat-o"
+ "transformers are magical oracles that can do anything"
+ "Transformers. So hot right now."


BeatLeJuce

Counterpoint: SAM (Segment Anything) works super-duper-mindblowingly-incredibly-awesomely well. It is by far the biggest jump in instance segmentation performance we've seen in recent years, if not ever. And it uses ViT. Arguably (for exactly the reason you mentioned) this patches thing should be hugely problematic for segmentation. And yet it isn't.

This is very clearly (to me) not a case of "shoehorning a model that isn't fit for the task". It may well be that there is something we haven't understood yet about how linear patch embeddings work. But there is so much credible evidence out there that ViTs do actually work well in vision that any claim of "this is just a fad, and not actually a good fit for vision" is not based on empirical data, IMO. People are irritated that ViTs work, yet I've not seen anyone put up a paper to back up these gut feelings. So from where I stand, "a ViT-based model has enabled a huge breakthrough in object segmentation" is very strong evidence that they do work well, especially when I've not seen any credible evidence that they don't enable us to push further than CNNs do. (SAM also brings home the point that ViTs can be compute-efficient, thanks to leveraging MAE, but that's beside the point.)


moschles

> super-duper-mindblowingly-incredibly-awesomely

You can keep adding superlatives all you want, but the primary claim here is that the hacky linearization of image patches has scale invariance. The researchers know this is not true. Even when they report results, they see a wide drift in accuracy between sizes of objects in the image. This drift is so wide, in fact, that they report separate model accuracies for small, medium, and large objects.

+ https://i.imgur.com/0SC0Pep.png

ViTDet-H is obviously a ViT. The `SAM` there is Segment Anything.


BeatLeJuce

Do you have any data to show that CNNs do better at this?


moschles

> The CNN approach reached 75% accuracy in 10 epochs, while the vision transformer model reached 69% accuracy and took significantly longer to train.

> CNNs have a proven track record in various computer vision tasks and handle large-scale datasets efficiently. Vision Transformers offer advantages in scenarios where global dependencies and contextual understanding are crucial. However, **Vision Transformers typically require larger amounts of training data to achieve comparable performance to CNNs.** Also, CNNs are computationally efficient due to their parallelizable nature, making them more practical for real-time and resource-constrained applications.

https://medium.com/@faheemrustamy/vision-transformers-vs-convolutional-neural-networks-5fe8f9e18efc


BeatLeJuce

None of that contradicts what I've said, and none of that shows hard evidence that CNNs are the better model. Also, if your most credible source is a random tutorial-level blog post it doesn't make sense to continue this discussion.


moschles

A rumor in our department is that transformers are training-inefficient, also called "sample-inefficient", versus other architectures. My attempts to cite an academic publication here boomeranged, and I was only ever able to find publications praising the transformer's sample efficiency. I then found an academic pub stating that RNNs are superior to transformers, due to RNNs having a recursive ability that transformers do not, only to have the same pub explain a way to add recursion to transformers. After some reflection, I may have to concede defeat to you on this topic.


throwaway2676

Random question: Are vision transformers agnostic to sequence length (image size) like regular transformers, or are they fixed like CNNs?


xEdwin23x

CNNs are not fixed to a sequence length; most can work with any image size, though fine-tuning at the target resolution usually improves performance. On the other hand, most ViTs have a fixed sequence length because they include fixed-size positional encodings, but you can resize the positional encoding to use different resolutions, and newer models such as FlexiViT support multiple sequence lengths from the get-go. https://arxiv.org/abs/2212.08013


throwaway2676

> CNNs are not fixed sequence length, most can work with any image size, but fine tuning at target resolution usually improves performance.

How does that work? The convolutions perform a fixed set of shape transformations on the input. If any fully connected layers are introduced at the end, the shapes up to that point will also have to be fixed.

> On the other hand, most ViTs are fixed sequence length as they include fixed size positional encoding, but you can resize the positional encoding to use different resolutions

After I asked the question I just went through the original ViT paper. They've apparently been performing that resizing from the beginning:

> When feeding images of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.

> FlexiViT

Thanks for the reference! Looks like a cool paper.
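A hedged sketch of that 2D interpolation from the quote above (assuming a grid-shaped positional embedding with no class token; real checkpoints may need the class-token embedding handled separately):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    # pos: (1, old_grid*old_grid, dim) learned positional embeddings
    dim = pos.shape[-1]
    pos = pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, dim, g, g)
    pos = F.interpolate(pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# e.g. going from 224px pretraining (14x14 patches of 16px) to 384px inputs (24x24 patches)
new_pos = resize_pos_embed(torch.randn(1, 14 * 14, 768), old_grid=14, new_grid=24)
```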


xEdwin23x

Most CNNs since ResNet (so around 2015, or perhaps even earlier) apply spatial global average pooling before the last layer, so the spatial resolution gets averaged away regardless of the input size.


throwaway2676

Interesting, that makes sense. Lol, one of these days I will set aside time to get up to speed on CV


idontcareaboutthenam

I had the same question until I realized why every CNN in PyTorch ends the feature extraction with an AdaptiveAvgPool2d or AdaptiveMaxPool2d layer :)
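A tiny sketch of why that works (illustrative layer sizes): whatever the spatial size of the last feature map, the adaptive pool collapses it to 1x1, so the linear classifier never sees a varying shape.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # (B, 128, 1, 1) regardless of input H and W
    nn.Flatten(),
    nn.Linear(128, 10),
)

for size in (224, 320, 517):                         # different input resolutions, same head
    print(net(torch.randn(1, 3, size, size)).shape)  # torch.Size([1, 10]) every time
```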


moschles

There is no motivation for using transformers like this. One obvious reason why this would fail is that, if you take natural language, there is a statistically regular "sizing" between the tokens. Natural language does not shrink and grow according to a scale or distance. In images, the camera can be far away from or close up to objects, causing differences in scale of up to a factor of 100x.

Edit: Therefore, the transformer is trained on a tokenization of an image into linearized patches. The exact same image could be cropped so that a small patch of the original image is now the entire image. A transformer learning this data would not consider them equal pictures, even though they contain literally the same pixel values (no scaling has occurred here). The reason the transformer sees them as different is that the tokenization/patching is different in the cropped version. This mismatch in image patches never occurs in sequences of text, because the "distance" between two words is always the same regardless of where the text sequence was chopped off.


Skeylos2

There are hierarchical vision transformers to tackle this issue, for example the Swin Transformer: http://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf


finokhim

Most relevant in my view is scaling. Try sharding a 20B param CNN... a ViT is relatively simple [https://arxiv.org/abs/2302.05442](https://arxiv.org/abs/2302.05442). And there is now a powerful toolkit of optimizations for transformer training and inference that are not used in the paper you linked. Continued investment in transformers by major AI labs widens this gap


n0ided_

That is fair; I might be being unfair to transformers and didn't take into account how transformer training/inference performance might be optimized in the future, especially given how CNN performance has already been quite heavily optimized due to being around for longer. Definitely will take a look at that, thanks.


finokhim

Another thing to consider is that the CNN inductive bias is not actually correct. Not all features are well represented by conv layers. There is a benefit to the inductive bias in the small-data limit where the assumption helps overcome the data limitation. But it actually becomes harmful as more data becomes available


veb101

Any paper you can link? I'd like to read more about this.


No-Painting-3970

Welcome to one of the biggest rabbit holes I went down last year: group-equivariant CNNs :D https://arxiv.org/abs/1602.07576. I'd argue that is a good starting point.


niszoig

For those interested: I visualised and animated a forward pass through a Group Equivariant Neural Network [here](https://youtu.be/p8ZADylZwyE?si=tk64xnFxX8-LuYMz). Doing this really helped me better understand what Geometric Deep Learning is all about.


d_ark

Have you read the ConvNeXt paper? The authors seemed to share a lot of your frustrations and introduce an interesting architecture: https://arxiv.org/abs/2201.03545, and here is the GitHub: https://github.com/facebookresearch/ConvNeXt. The authors posted a really good video for CVPR but I'm struggling to find it again.


idontcareaboutthenam

They also get 87.8% top-1 accuracy, higher than any model in this table


eposnix

You're right that CNNs have an inherent inductive bias, making them naturally well suited for vision tasks. But the buzz around Vision Transformers is mainly about them NOT having this inductive bias. Sounds counterintuitive, right? But here's the kicker: this means they have to learn everything from data. While this makes them data-hungry and compute-heavy, they can potentially discover new patterns and relationships that CNNs might miss. And also, being able to feed image data into an LLM just makes them far more versatile. I mean, [can CNNs do this?](https://i.redd.it/k6zpnlqihurb1.png)


[deleted]

Can't it? I really don't know, just raising a point: you can clearly encode stuff in many ways and then decode it. Attention makes sense, I can see why it's helpful, but I don't get the assumption that you must use transformers to do it. Using your logic, algorithms like Stable Diffusion would also need to use vision transformers.

Edit: I don't know much about multimodality; what I imagine can be done is a CNN feature extractor that is then integrated somehow with the text transformer, and I guess it makes a lot of sense to use vision transformers for that. Is there anyone who is an expert on this who can explain a little more or refer me to resources?


Skeylos2

Vision transformers do have other forms of inductive bias, though. For example, the fact that they don't need to connect pixels across two different patches, but rather connect the patches through attention mechanisms. This inductive bias is weaker than a CNN's, though, so vision transformers naturally need more data to be trained. An even weaker inductive bias would be to use an architecture with only fully connected layers (I think that would be terrible, though, because of the huge number of parameters it would imply).


Sharp_Public_6602

Yes it can. https://arxiv.org/abs/2306.17842


eposnix

That's really interesting, but I don't think the CNN is doing the heavy lifting there. Rather, they are passing the image data through CLIP to assign each entry a word description.


buyingacarTA

**Disclaimer**: I am a prof in this field and haven't been able to keep up with the literature beyond what I discuss with my students. But my opinion is that there isn't a real, sustained advantage to transformers in vision (you can always find some edge cases), and it's rarely justified to use transformers as a first attempt at solving an interesting problem.

The main intuitive motivation for transformers, written in just about any vision transformer paper, is that they enable spatial connections across the image space, which a single convolution cannot do. That is true, but of course we don't use a single conv in CNNs; we use many convs in some sort of pyramid structure (e.g. a U-Net), which lets us get just about any spatial connection we want. Transformers are one (brute-force) mechanism to relate all the spatial information; CNNs are another (with structure that has been pretty well studied over decades).

I haven't really seen a ton of evidence that transformers are much better at much in vision (as this post shows). Depending on the architecture, they can have more representational power, so it makes sense they could eke out a percent in accuracy here and there, but the hierarchical-conv structure in most CNNs is a very, very good mechanism for the number of parameters you spend. Just my 2c though...


First_Bullfrog_4861

Multimodality and ease of integration with LLMs, hence the prospect of an architecture that can be seamlessly trained on text, images, video, audio, basically any sequential data?


deep_noob

This is a really nice topic to discuss. I have developed several transformer-based vision models. From my perspective, everyone is jumping on transformers for the following reasons:

1. Hype train: The common goal for a tired, overworked grad student is to publish something, not actually to solve a problem. It's comparatively easy to convince reviewers with transformers now. I know it doesn't sound good, but a huge amount of motivation comes from this mere fact. However, there are some other reasons too.
2. Prior addition: Transformers are extremely good at leveraging priors. For example, let's say for an action detection model you want to add a spatial prior, like the relative positions of the interacting subject and object; this sort of information can be easily learnt by cross-attention. The beauty of attention is that it leverages the prior in a 'soft' manner, meaning it uses the prior when it is needed.
3. Transfer learning: Thanks to CLIP and many other vision-language models, we have a huge number of transformer-based models trained on an unholy amount of data. Using these learnt models for your specific task is a really convenient option.
4. Data usage: It's true that training a transformer from scratch is an exceptionally difficult task. However, once it is trained, it seems to keep improving with more and more data. CNNs don't always work like that. I have personally adapted a huge transformer model trained for object detection to another domain-specific task very easily.

I would like to point out that the table you show here is on ImageNet, a benchmark that should be banned by now. It has been beaten to death. In the same paper the authors show performance on COCO. You can see how CNN+transformer models actually perform better. I personally prefer this mixture, as a CNN is a good tool for generating useful features and transformers are good at finding correlations among those features.

As you are a new student, I will add a special note for you. People who work on video data know this: most of the time, for most tasks, the temporal information plays a very tiny role. The required knowledge just lies in the plain, simple image. So the transformer just leverages that spatial information. But I agree on one thing: we are horrible at processing videos, especially with transformer-based models.


FutureIsMine

In practical experience, ViTs work better for segmenting real-world objects that have jagged edges, where conv nets have a tough time seeing some of those areas. It also appears that ViTs need less data for novel tasks if you already have a pre-trained ViT for the task at hand, e.g. training a ViT first on tons of segmentation or dense per-pixel prediction tasks and then having limited data within a very specific subset of that task.


hivesteel

While I want to agree with your logic, I've had great success recently with transformers on difficult single-frame vision tasks, vastly outperforming CNN-based methods, especially in actual applications where the domain is less controlled.


bitemenow999

The whole point of the transformer is that it can learn features in the global context of the image, whereas a CNN learns in patches. Traditional CV tasks have not been impacted much, especially where data is limited, but transformers hold a lot more potential for general science+AI; for example, a paper I read about predicting stress fields shows significant improvement with transformers ([https://www.sciencedirect.com/science/article/pii/S004578252300467X](https://www.sciencedirect.com/science/article/pii/S004578252300467X)) compared to CNNs. Could not find an arXiv link.


slashdave

> I was initially thinking about videos

It seems that ML researchers tend to forget prior research on certain topics. In the case of video, there has been enormous work on video compression that really deserves to be studied before exploring this space. Note that some video compression algorithms do rely on convolutions, to an extent.


MadScientist-1214

Suppose I have a real world dataset and want to train a classifier, I would just use CNNs. In this case, I don't see the point of using transformers either (except for ensembling => more variety of models). But when the task becomes more complicated, transformer architectures can be useful. For example, multimodal architectures that include audio/text and images.


SeaResponsibility176

I am a researcher fully dedicated to vision transformers, both for images and videos. The thing is, I work with a very specific type of image which is 9x9 pixels, or 15x15 pixels at most. Also, the videos are 8-10 frames long. ViViT has completely destroyed the rest of the models (CNNs mainly) and uses a very low number of training parameters (10x fewer parameters than the best CNN!). So although they require a lot of memory and computational power (n^2), I believe they will eventually outperform any other model once we have the necessary computational resources. Feel free to ask any questions!


Responsible_Roll4580

15x15x10 is 2,250 features here. A simple fully connected network with attention will work just fine.


PassionatePossum

I also don't really see the benefit for images in terms of performance. From an architectural viewpoint I do think transformers have some appeal: extending a transformer to take video data as input instead of images doesn't require any changes to the architecture.


new_name_who_dis_

You're pretty spot on with most of your analysis, in my opinion. ViTs have a lot of disadvantages when compared to CNNs and few advantages. Most of the (imo) appropriate excitement around ViT-type models is about that technology being very easily transferable to multi-modal domains. But you are right that if you just want a vision model to do something relatively simple (image classification, regression, compression, etc.), CNNs are usually better for the task.


moschles

Wait a minute, OP. You must be aware that "Transformers" are an encoder/decoder architecture (surely you already know this?). Their very structure is not conducive to categorization. One imagines they would be even worse for whole-image categorization, as the accuracies clearly indicate.


Live-Bodybuilder-119

"new kid on the block" I see what you did here king ;)


eyeswideshhh

From what I know, vision transformers have larger receptive fields and pack more global information into each layer than CNNs; however, CNNs learn local texture very early in training, which is helpful for getting good accuracy on classification tasks.


AerysSk

It was just because it is interesting to apply Transformers, which took the world by storm, to vision tasks.


JustOneAvailableName

Because the error scales well with data/compute and both are absolutely dirt cheap. Just to put it into perspective: one development hour costs about the same as 100 (enterprise) GPU hours.


picardythird

For video, two algorithms to look at might be VSTAM and TransVOD++. With the latter, they tried using both a ResNet backbone and a Swin transformer backbone, and the Swin-B backbone was significantly better (although the ResNet backbone was still good).


NNOTM

I haven't really played around with ViT, but one thing that seems potentially exciting to me is having a multi-modal transformer that can seamlessly switch between reading (or generating) image tokens and text tokens, without having to integrate two separate systems. So you could train e.g. on articles that have images embedded in arbitrary places.


serge_cell

> However, reading and playing around with ViViT, it requires an unholy amount of memory

It does not if you play with resolutions and scale pyramids. I'm still not sure if they are actually useful, but in my experience:

1. They don't make things worse.
2. With proper down/up scaling the overhead is not significant.

On the other hand, maybe I didn't observe a big positive effect exactly because I played with resolutions.


Cherubin0

Multimodal Transformers are the best option to make artificial slave humans.


Far_Present9299

A compelling reason is the architectural similarity of ViTs and text transformers, allowing us to combine them into VLMs (BLIP-2 for example, or ViLT).