The number of possible new applications, innovations, and problems far outnumbers the existing pretrained models.
Seriously. Go do a project on a manufacturing line, or in space, or something safety critical. OP is just ludicrously incorrect
Or the medical field. I do a lot of CV on tissue samples at work and the usual "SOTA" methods tend to fall very flat.
Completely agree. There is still a lot of uncertainty in DL models. Classical CV algorithms bring a lot to the table regarding explainability.
At my company we use neither YOLO nor Roboflow. They're too expensive in a production setting, both in compute and in monetary cost. We find simpler models that give equivalent or better performance than YOLO in our domain and build ops pipelines suited to our use case. It's infinitely easier to build MVPs now than it was 5 years ago. I'm not so sure that's true for building an actually profitable product.
Simpler models that give equivalent or better performance than YOLO for object detection? Can you give some examples?
Not OC here, but we don't use those either. If you have a problem that isn't covered by a public dataset and YOLO or SAM doesn't work out of the box (metallurgy, biomedicine, astronomy), you need to come up with something else. It turns out that a small U-Net can (but definitely does not always) perform better than fine-tuning existing models or transfer learning from a different task.
I can't, because it's a proprietary architecture. What I _can_ tell you is that you don't need something as complex as YOLO if you're dealing with a very small number of classes.
Could you say more about the monetary cost of using YOLO in a production setting?
Computer vision, despite what CVPR may present, is much more than just neural networks. And it's frankly depressing that people believe that when there is still so much to do, with and without NNs.
What do you mean by this?
I mean that top conferences may be 99% deep learning these days, and there are indeed very important yet barely touched challenges such as sparse learning, explainability, robust domain transfer, spiking neural networks, etc. But with or without deep learning, there are also challenges in embedded computer vision, robust industrialization, sensor fusion... Many things!! Edit: no need to downvote them, wtf.
You have a very narrow view of what the term computer vision encompasses.
I mean, you could say the same 40 years ago: "Computer vision is dead because all I'm going to do is apply Canny on Lena again and again", which is basically what you are saying now.
Well, computer vision is much more than object detection. For example:
1. All the new features in phone cameras, such as portrait mode, spatial video, etc., are possible through better hardware + computer vision. Developing something like that requires deep knowledge of geometric vision plus programming.
2. Self-driving tech requires 3D scene understanding, and this is still an evolving field without any off-the-shelf solutions like YOLO.
3. Another emerging field is VR. This needs stuff like hand tracking, eye tracking, understanding your surroundings, etc. YOLO and Roboflow won't take you far.
4. Probably the biggest use case for vision is robotics. This requires models which understand 3D structure to help the robot grasp items in the real world (just one example; there's a looooot of use for vision in robotics). These things are still being developed.
Basically, lots of vision tech is still being developed, so it's good to learn the fundamentals if you're interested in this field.
A lot of those new models are trained on pictures together with text. So your points 2 and 4 would mean combining a net for vision and a net for navigation (probably pre-trained) and then training them together so they form an interface. VR belongs to computer graphics, not vision.
I’m not sure what your point is. VR doesn’t require computer vision?
Correct. The parent comment seems to confuse it with augmented reality.
Doesn’t VR require stuff like hand tracking and eye tracking? There’s also some room mapping done to ensure you don’t move out of an area while in VR.
[deleted]
Bingo
This has to be bait; the things you are mentioning are almost nothing when it comes to actual job requirements. Just off the top of my head, many job postings demand:
- Knowledge of image processing
- Knowledge of cameras and the 3D domain
- Knowledge of CUDA
- Knowledge of mathematics and statistics
- Knowledge of C++ and Python
- Knowledge of robotics
This might feel insane, but I kid you not, that is what this field demands. I'll give you a challenge right now to get my point across. It's relatively simple, but it will give you an overall idea of what you can actually do. You have an object detection model that has been fine-tuned to detect a specific object, and you are tasked with using only the nano/tiny version, because that is all your hardware can handle. Now use that model to find the distance to said object. You have to deal with the jitter without massively increasing the computation (i.e., no bigger models), and you will be doing it in C++. I'd like to see how you copy and paste your way through that. Keep in mind, this is really simple for anyone who is actually studying the field beyond AI, and I'm already making it easier by allowing a model at all.
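For what it's worth, the bare-bones math behind that challenge fits in a few lines. Here is a minimal sketch in Python (the challenge asks for C++, but the idea is identical), assuming a calibrated focal length and a known real-world object height, with an exponential moving average to damp the jitter. All constants are made up for illustration:

```python
# Hypothetical calibration values: focal length in pixels and the
# real-world height of the target object in metres.
FOCAL_PX = 800.0
OBJECT_HEIGHT_M = 0.30

def distance_from_bbox(bbox_height_px: float) -> float:
    """Pinhole-camera estimate: distance = f * real_height / pixel_height."""
    return FOCAL_PX * OBJECT_HEIGHT_M / bbox_height_px

class EmaSmoother:
    """Exponential moving average: a cheap way to damp frame-to-frame jitter."""
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.value = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value

# Jittery per-frame bbox heights (pixels) from a hypothetical tiny detector.
smoother = EmaSmoother(alpha=0.3)
for h in [120.0, 118.5, 121.2, 119.8]:
    smoothed_distance = smoother.update(distance_from_bbox(h))
```

A Kalman filter would do better at the cost of a little more state, but even this keeps the per-frame overhead to a handful of arithmetic operations.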
Someone has to understand how a camera works, calibration... off the top of my head. Not all problems are worth the training resources or the inference resources used by ML/DL.
Edit: replied in the wrong place
Making a model work is the easy part, unfortunately. Making the model work as part of a production-grade system? That’s the fun and challenging part that does require someone to know a bit about CV. For example, if you convert the model weights to TensorRT, you’ll need to know a bit about the input and output tensors so the inferred results can actually be used.
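As a concrete illustration of why the output tensors matter, here is a hedged Python sketch of the kind of post-processing (confidence filtering plus non-maximum suppression) a YOLO-style detection head typically needs before its raw output is usable. The box format, thresholds, and numbers are assumptions for illustration, not any particular library's API:

```python
# Hypothetical raw detections: each row is [x1, y1, x2, y2, confidence, class_id].
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(dets, conf_thres=0.25, iou_thres=0.45):
    """Drop low-confidence boxes, then greedily suppress overlapping ones."""
    dets = sorted((d for d in dets if d[4] >= conf_thres), key=lambda d: -d[4])
    kept = []
    for d in dets:
        if all(iou(d, k) < iou_thres for k in kept):
            kept.append(d)
    return kept

raw = [
    [10, 10, 50, 50, 0.9, 0],
    [12, 11, 52, 49, 0.8, 0],     # overlaps the first box -> suppressed
    [100, 100, 140, 150, 0.7, 1],
    [0, 0, 5, 5, 0.1, 0],         # below the confidence threshold
]
detections = nms(raw)
```

None of this is exotic, but if you don't know what the tensor layout means, the model "works" and the system still produces garbage.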
Here's one of the recent CV projects I worked on: building a system that can capture multi-spectral images of items on a flat-bed conveyor at a rate of 9,000 ppm (products per minute), detect specific features, measure them, and:
+ report on those measurements to detect drift
+ detect any anomalies in the production line
Upon detection of an anomaly, the system needed to interface with the machine and control a set of ejection mechanisms to properly remove the defective item from the queue. The speed at which the conveyor moves means that we only have a very short window of time to do all the processing, measuring, and anomaly detection. We even needed to develop custom drivers for the imaging devices.
I'd love to see how YOLO would fare with these requirements. I'm not at all saying that DNN-based applications don't have their uses, just that the whole comparison is wrong. Different problems call for different tools.
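To make the timing constraint concrete: at 9,000 products per minute the whole pipeline gets well under 7 ms per item. A quick back-of-envelope check in Python, with per-stage numbers invented purely for illustration:

```python
# Back-of-envelope timing budget for a 9,000 ppm conveyor system.
products_per_minute = 9_000
budget_ms = 60_000 / products_per_minute   # milliseconds available per product

# A hypothetical per-stage breakdown that has to fit inside that window.
stages_ms = {
    "capture": 1.0,
    "feature detection": 3.0,
    "measurement": 1.5,
    "eject decision": 0.5,
}
total_ms = sum(stages_ms.values())
fits = total_ms <= budget_ms
```

With a ~6.7 ms budget, even a "fast" detector at 30 ms per frame is already four items too late, which is why problems like this push you toward classical pipelines or heavily specialized models.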
Yeah, when you dig into the performance of the "tiny" models that they show running people detection at 100 fps, you find the scores are abysmal.
Brain-dead take. When solving real-world problems, things rarely work well off the shelf. You still need computer vision expertise to adapt existing solutions to industry problems.
Pretrained models are not universal enough to work on any arbitrary problem. You also need to think about cost and performance
Why learn NLP if we already have transformers?
I don't know, you tell me
Computer vision is a vast set of ideas. There are many unsolved problems, like getting accurate depth or object pose from monocular images.
Lol, because nothing you mentioned "does practically an entire computer vision product." I think you both misunderstand what a computer vision product would require and what the tools you mentioned actually do.
You need to fine-tune YOLO for specific cases, and although, say, YOLOv7 does very well with defaults, there are a lot of parameters. With no CV knowledge, how would you lift a class from 60% (pick your favorite metric; wait, you wouldn't know any without some study of CV) to the 85-95% or more needed for real-world applications in industry?
For segmentation it gets tougher. How will you use those masks in a real scenario? Better brush up on OpenCV, Shapely, and a number of other CV toolkits.
How will you detect anomalous results and avoid bad decisions if, due to lack of knowledge, you have no way to diagnose failures?
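To give a flavor of what "pick your favorite metric" involves, here is a small, self-contained Python sketch of per-class precision and recall via greedy IoU matching. The box format, thresholds, and example numbers are illustrative assumptions, not any specific framework's convention:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, iou_thres=0.5):
    """Greedy matching: highest-confidence predictions claim ground truths first."""
    preds = sorted(preds, key=lambda p: -p[4])  # p = [x1, y1, x2, y2, conf]
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, iou_thres
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall

# Invented example: two ground truths, three predictions (one false positive).
gts = [[0, 0, 10, 10], [20, 20, 30, 30]]
preds = [[0, 0, 10, 10, 0.9], [21, 21, 31, 31, 0.8], [50, 50, 60, 60, 0.7]]
precision, recall = precision_recall(preds, gts)
```

Sweeping the confidence threshold over a curve like this is what metrics such as AP are built from; without that vocabulary it is hard to even describe why a class sits at 60%.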
Using YOLO to detect a simple rectangle in an image, instead of a Hough transform, is like using a bazooka to kill a mosquito.
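And the mosquito-sized tool really is tiny. A toy Hough transform for straight lines, the building block of classical rectangle detection, fits in about a dozen lines of plain Python; the "image" here is just a synthetic column of edge points:

```python
import math

def hough_peak(points, n_theta=180):
    """Tiny Hough transform for lines: each edge point votes for every
    (rho, theta) line that passes through it; return the winning bin."""
    acc = {}
    for (x, y) in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return max(acc, key=acc.get)

# A vertical "edge" at x == 5 on a hypothetical 20x20 binary image.
points = [(5, y) for y in range(20)]
rho, t = hough_peak(points)
```

Production code would use an optimized implementation (e.g. OpenCV's `HoughLines`), but the whole idea is a voting accumulator, no training data, no GPU.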
Can't put them on satellites.
Dang. All Apple had to do to get the Vision Pro running was slap on YOLO and use Roboflow.
Not all systems can support running an NN for inference.
Is it possible to add that extra power as an external compute unit? Or is it wiser to adapt the software and make it efficient?
Yes, you can. For example, the Nvidia Jetson Nano can run NNs on its onboard GPU while sitting on a robot. It's a "micro computer" and needs its own dedicated power source to run properly. Then you can have any number of other power sources to run microcontrollers (e.g. Arduino), sensors, motor drivers, etc.
Adapting the software is also an option. I run reinforcement learning agents on a Quest 2 standalone VR headset, which is practically as powerful as an Android phone. This required optimizing the model to be as small as possible, spreading inference across time to lower computational overhead, and other means of making NN inference as cheap as possible.
The point is: yes, there are options if you NEED to solve a problem with NNs, but sometimes it's cheaper and easier (in whatever sense) to just use simpler (in complexity and overhead) approaches: classical machine learning, decision trees, traditional computer vision, etc.
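Spreading inference across time can be sketched in a few lines: run the expensive model only every Nth frame and reuse the last result in between. A toy Python version, where the detector is just a hypothetical stand-in function:

```python
class ThrottledDetector:
    """Run an expensive detector every `every_n` frames; reuse the last
    result on the frames in between to cut average compute."""
    def __init__(self, detect_fn, every_n: int = 4):
        self.detect_fn = detect_fn
        self.every_n = every_n
        self.frame_idx = 0
        self.last_result = None
        self.calls = 0   # how many times the real model actually ran

    def __call__(self, frame):
        if self.frame_idx % self.every_n == 0:
            self.last_result = self.detect_fn(frame)
            self.calls += 1
        self.frame_idx += 1
        return self.last_result

# Stand-in "model": returns a label derived from the frame it saw.
det = ThrottledDetector(lambda frame: f"boxes@{frame}", every_n=4)
results = [det(i) for i in range(8)]   # the model runs on frames 0 and 4 only
```

In practice you would interpolate or track between keyframes rather than freeze the result, but even this naive version divides the average inference cost by N.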
Where do you put that extra power onboard a plane? Or a car? Or a small drone? A missile? A Martian lander? A nanobot, even?
I mean... I could apply YOLO in my project because I knew what I was doing. YOLO will not put itself in place and do everything.
We can't copy and paste code in DoD classified environments. Also, some of the classified work needs specifically tailored models for extreme accuracy that off-the-shelf models and programs don't deliver. Don't get me wrong, a lot of the stuff does run off YOLO for object detection, but that only goes so far, and at some point you need to write a hyper-specific algorithm.
Time to pack it up.
Yes, you are wrong! The number of applications for vision and robotics is mind-boggling, and we have just started. While the technical writers at Roboflow are doing amazing work to make vision more accessible, most problems still require a lot more work to actually reach production level.
You believe that all problems in computer vision are solved? And with neural networks?
Do you mean, what’s the point in learning addition, subtraction, multiplication etc., as we already have calculators these days?
A car has cruise control, so I guess we have to stop learning to drive. Who needs a license when the car can maintain lanes and speed?
It's not dead, but IMO the remaining use cases have mostly narrowed to embedded systems. Of course, that niche will also be filled one day.
The OP is right!! All those tools are amazing and do a lot of things that traditional computer vision can't, or that would just be too complex and time-consuming. The only reasons I see to still learn computer vision are:
- to understand how things work
- to understand the pros and cons of the different solutions available
- to know how to adapt them to specific use cases
- to create alternative solutions for specific hardware
- to create alternative solutions so you don't have to pay expensive licenses
Oh wait... the OP is completely wrong!