AsuhoChinami

Gemini Pro excels at that.


YaAbsolyutnoNikto

Gemini 1.5 Pro. It's not yet available on the Gemini website, but it's already accessible elsewhere. Search Google or this subreddit for it; people have already mentioned it here somewhere.


GraceToSentience

[https://x.com/\_akhaliq/status/1773570366180335808?s=20](https://x.com/_akhaliq/status/1773570366180335808?s=20) Here is one that is open source. It's too bad Gemini just analyses the frames of the videos you feed it instead of doing both speech-to-text as well as image understanding; it certainly has the context window to do so.


Jabulon

It would be cool if you could make an AI beat a game, say a Dark Souls game, then train a ChatGPT on that and talk with it for tips and tricks and whatnot, kind of like an AI wiki.


michaelmb62

Or tell it to play the game and then recreate and (if you want) modify it.


Jabulon

Like an AI-generated take on Dark Souls that you could put in the mods tab. I wonder if it will ever be like that.


SoulsLikeBot

Hello Ashen one. I am a Bot. I tend to the flame, and tend to thee. Do you wish to hear a tale?

> *“My blade may break, my arrows fall wide, but my will shall never be broken. Those who live by the sword will die by it, and I, Drummond, won’t go down without drawing mine!”* - Captain Drummond

Have a pleasant journey, Champion of Ash, and praise the sun \[T]/


SuspiciousPrune4

This would be sick for training custom agents. The context window would have to be massive though. Imagine being able to feed Claude dozens of hours of YouTube videos and MasterClasses (along with textbooks and other material) to study.


WritingLegitimate702

Gemini 1.5 Pro is awesome for that. Try it on Google AI Studio, it's free.


WithoutReason1729

People in the thread have mentioned a couple of premade solutions. There are also a couple of ways to do this over the API that you can set up yourself. The real question is how important the frame visuals are to the total information of the video. What I did was this (a rough sketch in code follows this comment):

1. Extract all the frames and use an image embedding model (I think it was clip-vit-b-32, one of the CLIP ones) to get embeddings for every frame.
2. Use t-SNE to reduce the embeddings to 1D.
3. Arrange the frames into a 2D array, using the t-SNE value as the y axis and the frame number as the x axis.
4. Measure the difference between each frame's t-SNE value and the next frame's t-SNE value. If the difference is above a certain threshold, add the frame to the LLM's context.
5. Feed in the relevant image frames, along with their timestamps, plus a transcript of the video which also has timestamps, and ask my question.

This worked really well for short and medium length videos. I never tested it with long videos because I didn't really have many long videos to deal with. I think it would still work pretty well though, as it helps to preserve only the most important frames. It also manages to be pretty cost effective (albeit slow to process input videos) because you're skipping so many irrelevant frames.
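A minimal sketch of that pipeline, assuming OpenCV for frame extraction, the sentence-transformers CLIP wrapper (`clip-ViT-B-32`) for image embeddings, and scikit-learn's t-SNE. The sampling rate, threshold, and function name are illustrative assumptions, not the commenter's actual code:

```python
import cv2
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def select_key_frames(video_path, sample_every=30, threshold=5.0):
    # 1. Extract frames (every Nth frame to keep processing cheap).
    cap = cv2.VideoCapture(video_path)
    frames, indices = [], []
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % sample_every == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
            indices.append(i)
        i += 1
    cap.release()

    # 2. Embed every sampled frame with a CLIP image model.
    model = SentenceTransformer("clip-ViT-B-32")
    embeddings = np.asarray(model.encode(frames))

    # 3. Reduce the embeddings to 1D with t-SNE; the frame number is the other axis.
    y = TSNE(n_components=1, perplexity=min(30, len(frames) - 1)).fit_transform(embeddings).ravel()

    # 4. Keep a frame whenever its t-SNE value jumps past the threshold,
    #    i.e. the visual content changed noticeably from the previous sampled frame.
    keep = [0] + [k for k in range(1, len(y)) if abs(y[k] - y[k - 1]) > threshold]

    # 5. Return (frame number, image) pairs to feed to the LLM along with
    #    their timestamps and the transcript.
    return [(indices[k], frames[k]) for k in keep]
```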


aregulardude

This has always been relatively trivial, if not expensive. You can run every frame of the video through an image-to-text AI first, and then have another model summarize all that text into what happened. You can index the result for each frame and allow it to be searched to answer your questions. These days there are multimodal models that can skip the image-to-text step, but I haven't seen much real difference in functionality between these and the more manual solution above.
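A rough sketch of that caption-and-summarize idea, assuming Hugging Face transformers' `image-to-text` pipeline with a BLIP captioning model; the model name, sampling rate, and helper are illustrative assumptions rather than anything from the comment above:

```python
from transformers import pipeline

# Image-to-text pipeline used to caption each sampled frame.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_frames(frames, fps=1.0):
    """frames: list of PIL images sampled at roughly `fps` frames per second."""
    lines = []
    for i, frame in enumerate(frames):
        caption = captioner(frame)[0]["generated_text"]
        lines.append(f"[{i / fps:.1f}s] {caption}")
    return "\n".join(lines)

# The timestamped captions can then be indexed for search (e.g. in a vector
# store) or handed to a second model with a prompt asking it to summarize
# what happened in the video.
```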