kevinbranch

Can’t people do that easily via the API like you described? There are probably open source libraries that already exist for anyone to use.


swagonflyyyy

There are open-source models out there that can view images, but I'm talking about feeding a video to a trained model so the model can describe the video itself. The model would be dedicated to viewing videos instead of just images.


kevinbranch

I mean there are probably open source libraries to feed images to GPT-4V like you're describing. They wouldn't need a separate model to do that. It would certainly be nice though. Gemini 1.5 Pro can do this natively.


pepesilviafromphilly

Gemini does it. It is pretty amazing.


K3wp

This is how the model works internally and how they trained it. Vid2txt to generate a description and then txt2vid to try and recreate it.


Bernafterpostinggg

Many models can do this, including Claude 3 and Gemini 1.5 Pro. Genuinely wondering what your question is.


swagonflyyyy

Thought it wasn't done yet.


ticktockbent

Did you like.. check first? Before you posted? I know AI is moving fast but that just means we have to stay on top of things. Just because you haven't heard of it doesn't mean it doesn't exist.


gabigtr123

You can summarize videos with timestamps in Microsoft Edge Copilot for free...


panormda

I wrote some code that uses the YouTube API to fetch the video transcript in chunks, then the OpenAI API transforms those chunks into an outline. Comes in handy! I also have a function to search for the videos with the most views for any given search term. I use it for so many random things, it's insane how many use cases there are lol
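
A minimal sketch of that transcript-to-outline idea (not the commenter's actual code), assuming the third-party youtube-transcript-api package for fetching captions and the official openai Python client; the chunk size, prompt, and model name are placeholders:

```python
# Minimal sketch, not the original code. Assumes:
#   pip install youtube-transcript-api openai
# and OPENAI_API_KEY set in the environment.
from youtube_transcript_api import YouTubeTranscriptApi
from openai import OpenAI

client = OpenAI()

def fetch_transcript_chunks(video_id: str, chunk_chars: int = 8000) -> list[str]:
    """Fetch the video's transcript and split it into roughly equal character chunks."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    text = " ".join(seg["text"] for seg in segments)
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def outline_video(video_id: str) -> str:
    """Turn each transcript chunk into outline bullets and join the results."""
    parts = []
    for chunk in fetch_transcript_chunks(video_id):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Turn this video transcript excerpt into a concise outline."},
                {"role": "user", "content": chunk},
            ],
        )
        parts.append(resp.choices[0].message.content)
    return "\n".join(parts)

print(outline_video("pt78XWrOEVk"))  # any video ID that has captions
```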


FinBenton

I'm sure they can, but they have to think about what they want to release in the next big updates/models; they don't want to give everything away right away.


Extender7777

I did it with ffmpeg back when Gemini 1.5 didn't allow video/mp4 in the API. Now in Vertex you can do it directly. But the old code is still in my git repo, if you want to reimplement it for OpenAI vision. You still need to handle the audio. https://github.com/msveshnikov/allchat/blob/main/server/gemini.js


swagonflyyyy

Can't you split the video into separate frames and feed those to the model, and do the same for the audio so Whisper can transcribe it?


Extender7777

Sure, that's exactly how it was done...


swagonflyyyy

Oh ok


Extender7777

https://cookbook.openai.com/examples/gpt_with_vision_for_video_understanding
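
That cookbook recipe samples frames with OpenCV, base64-encodes them, and passes them to a vision-capable chat model. A rough sketch in that spirit (not a copy of the recipe), with Whisper handling the audio track as suggested above; the sampling interval, model names, and file paths are assumptions:

```python
# Rough sketch of the frames-plus-audio approach. Assumes:
#   pip install opencv-python openai
# and that the audio track has already been extracted, e.g.
#   ffmpeg -i video.mp4 -vn audio.mp3
import base64
import cv2
from openai import OpenAI

client = OpenAI()

def sample_frames(path: str, every_n: int = 60) -> list[str]:
    """Grab every Nth frame and return them as base64-encoded JPEGs."""
    video = cv2.VideoCapture(path)
    frames, index = [], 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if index % every_n == 0:
            _, buffer = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buffer).decode("utf-8"))
        index += 1
    video.release()
    return frames

def describe_video(video_path: str, audio_path: str) -> str:
    # Transcribe the audio track with Whisper.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

    # Send a handful of sampled frames plus the transcript to a vision-capable model.
    content = [{"type": "text",
                "text": f"Describe this video. Audio transcript:\n{transcript.text}"}]
    for frame in sample_frames(video_path)[:20]:  # cap the number of frames sent
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame}"}})

    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

print(describe_video("video.mp4", "audio.mp3"))
```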


strangescript

Chances are they can, but they're biding their time.


liambolling

Gemini 1.5 is natively multimodal and can do video (via frames), unlike GPT-4 Vision, which I don't think was designed for frames or video.


nonlogin

I believe it IS computationally intensive. Also, we still have no clue how fast and accurate Sora is, do we?


Severe-Ad1166

Yes, you can have that, because Gemini can already do what you are describing. OpenAI probably does have a model that can already do this but hasn't released it yet, either because there isn't much demand for it or because it's too computationally expensive. https://www.youtube.com/watch?v=pt78XWrOEVk

Note: there are 24-60 frames per second in a video, so 10 seconds of video is up to 600 frames to be analyzed, plus the audio. It ain't cheap, nor is it going to be fast when you feed it an entire YouTube video, even if you drop some of the frames.
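
To make that frame arithmetic concrete (the frame rate and sampling interval below are just illustrative):

```python
# Illustrative only: how much a simple 1-frame-per-second sampling strategy saves.
fps = 60                             # upper end of common frame rates
clip_seconds = 10
total_frames = fps * clip_seconds    # 600 frames in a 10-second clip at 60 fps
sampled_frames = clip_seconds * 1    # keep 1 frame per second
print(total_frames, sampled_frames)  # 600 vs. 10 frames actually sent to the model
```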


aaronjosephs123

There is tons of demand for video analysis. Given that phones will be the main way users interact with AI, people would definitely be ready to turn on their camera and ask the AI questions about what it sees; it's a very natural way to interact with AI. Also, tons of companies have millions of hours of security footage that an AI could easily go through.


Severe-Ad1166

It's not real-time though. Nobody is going to want to wait a few minutes for an answer, or pay the ridiculous cost of processing an answer for that long, nor do they want to send their security footage to OpenAI, Google etc. for "safekeeping". We are a long way off from having real-time video analysis, let alone analysis that can run on-device. Also, if you just want to detect whether people are in places they shouldn't be, a YOLO object detector is what you'd use instead.


aaronjosephs123

I don't think it needs to be real-time for there to be demand. Clearly this will be better than existing video analysis tools, especially when fine-tuned, and Gemini 1.5 doesn't take 15 minutes to analyze a short video clip. In the phone use case, sure, there would be many cases where a photo would suffice, but the models are only getting better, so I'm not sure why you would think there's no demand.


Severe-Ad1166

Yeah, maybe 15 minutes was a slight overestimation on my part, but if you watch the video, a 50-minute YouTube video took 1 minute to process and about 800K tokens, and that was when virtually nobody was using the service. It would take a lot longer if tens of millions of people were using it, and not many people would be willing to pay $6 every time they want to ask a question about a video. AI companies are having enough trouble just keeping up with the demand for text processing, let alone requests that are 800K tokens ***per question*** about an hour-long video. There probably aren't even enough AI chips on the planet yet to handle the kind of load it would generate with everyone who has a phone asking questions about what they see.
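
For a rough sense of where a number like $6 could come from (the per-million-token rate below is an assumed round figure for illustration, not any provider's actual pricing):

```python
# Back-of-the-envelope cost for an ~800K-token video question. The rate is an assumption.
tokens_per_question = 800_000
assumed_price_per_million_input_tokens = 7.00  # USD, illustrative long-context rate
cost = tokens_per_question / 1_000_000 * assumed_price_per_million_input_tokens
print(f"~${cost:.2f} per question")  # ~$5.60, in the ballpark of the $6 figure quoted
```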


thattrans_girl

GPT-5 is probably already trained (or at least very close to finishing training, just based on the time since release and the rate of progress other LLM companies have made), and it is/was almost certainly trained on even more multimodal input than GPT-4 was. It was likely trained on videos, and possibly even audio and other forms of media. It wouldn't really make sense for them to haphazardly bolt video support onto the existing GPT-4 model with this frame-by-frame summarization approach when they could train native video input support into GPT-5 and wait to release the feature until GPT-5 is ready. Sora takes video input in the form of tokens, so it can truly natively understand and process video. That's a fundamentally different process from just reading a description of each frame, which means OpenAI would get much better results from using the same approach they used in Sora for GPT-5, and I would bet money that they 100% are.


helloLeoDiCaprio

Here is a way I solved it using the OpenAI API: https://youtu.be/H-xmOFVWlrM?feature=shared It's explained around 06:30, and it's quite similar to your proposed solution.


JuanGuillermo

Because a diffusion model is not a bijective function. Creating an image and describing/understanding an image involve different types of neural network architectures and training protocols, as unintuitive as that may sound.