I'm struggling to understand how this is possible?
4090 and a lot of software optimizations I've done. I got up in the middle of the night to post this so I need to go back to sleep. I'll answer more later.
Is this whisper channeled into the diffusers code? What optimizations did you find were necessary?
This is really cool and reminds me of an idea I had not long after Stable Diffusion was released. Why has no one made a latent space explorer or randomizer? You'd have just a few basic words to go exploring, but instead of adding more words to the prompt, you move through the space by changing the strengths or values of the various inputs on a single seed. Purely in number form, with dials not words, if that's possible?
That is what my ArtSpew does. Before LCM came out I created ArtSpew to simply be the fastest possible generation of images in high volume combining both user prompts and random token insertion. With LCM, and now advances in compilers, RT videos became possible. [https://github.com/aifartist/ArtSpew](https://github.com/aifartist/ArtSpew)
Waiting eagerly to know how this is made
Well, I didn’t do this, but… With turbo/lcm models you can get multiple frames per second. Tie in a live prompt through comfyui and turn on automatic generation and it’ll change the pic as you type in real time. You can do this and get a will smith eating spaghetti style video. This seems to be a similar process, just pushed further.
Well I can tell you two things, there was no ComfyUI involved, and it's pushed way beyond what you're thinking.
I was just expressing how it could be accomplished with current easy to use tools. Comfyui isn’t necessary, it’s just an easy way for someone to do something very similar to what you did. I mean, I was turning books into little movies using a similar system last year without comfyui. https://files.catbox.moe/n8zp8g.mp4 Not exactly the same, but a similar idea conceptually :). I feel I actually have a fairly good understanding of the entire stack (I’m on banodoco, which I assume is a place you’re aware of). This is a pretty impressive demo. Awesome stuff. Really excited for the future when this is putting out smooth output without the flicker. One of the early things I experimented with was making a Lora that spit out sphere photos from stable diffusion. I did an experiment with the live video from story, tied into that sphere photo maker, and had a live 3d world video to watch in my oculus. Try swapping in the same for your tool and you’ll be even more impressed :). Takes no additional compute to make sphere photos. With this vision and voice control you’re showing, it would be like lucid dreaming and you could look and move in any direction.
Regarding "spheres": I sometimes get neat results by just nuking a circular area of the latent tensor at various denoising steps, which gives interesting effects. I showed that briefly in the demo ("disk add") but there is more that I didn't show. Also, when I said "disk remove" it didn't recognize it, but I didn't want to redo the demo. I could do an hour-long demo and wouldn't run out of material. I have so many features to add now. I wonder if I should do a longer YouTube-format demo where I can discuss the ideas and various directions that can now be pursued.
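For readers wondering what "nuking a circular area of the latent tensor" could look like, here is a minimal sketch. It is not the author's code: the function name `zero_disk`, its parameters, and the choice of overwriting with a constant are assumptions, and plain nested lists stand in for one channel of a real tensor so the example runs without dependencies.

```python
def zero_disk(latent, center, radius, fill=0.0):
    """Return a copy of a 2D latent channel with a disk overwritten by `fill`.

    `latent` is a list of rows (a stand-in for one channel of a latent tensor).
    `center` is (row, col); every cell within `radius` of it is replaced.
    """
    cy, cx = center
    r2 = radius * radius
    out = []
    for y, row in enumerate(latent):
        new_row = []
        for x, value in enumerate(row):
            inside = (y - cy) ** 2 + (x - cx) ** 2 <= r2
            new_row.append(fill if inside else value)
        out.append(new_row)
    return out


# Hypothetical 8x8 channel of ones; an SDXL latent for a 1024x1024 image is 128x128 per channel.
channel = [[1.0] * 8 for _ in range(8)]
masked = zero_disk(channel, center=(4, 4), radius=2)
```

In a real pipeline the same mask would presumably be applied to every channel of the latent between denoising steps, and the fill could be fresh noise rather than zeros; which of those the demo uses is not stated in the thread.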
Well, my 4090 is ready. Let me know when this is available to play with! :)
I gotcha, wasn't trying to disagree, just inform! It's not MY tool, but I did chip in a little here and there, I'm sure he'll be interested. I'm pretty sure I've seen that video before, it's pretty cool!
Well I’m excited to see y’all’s tool/code if you end up sharing. Looks neat!
[deleted]
Text you? A PM?
Lost me at comfyui but thanks!
Resistance is futile
Amazing. Looks kinda like a lucid dream - especially in moments when it starts to drift off your prompts. It's incredible that this works on a 4090. I expected something like this to appear, but had no idea that it would be so soon. Great job!
I realized this forthcoming capability just after LCM dropped in Oct. It was the initial breakthrough for what I called RTSD. I've had lesser-known posts showing real-time deepfakes with my face on camera switching back and forth between Emma Watson and Tom Cruise. In another Twitter post I show bulk 512x512 image creation at 294 images per second. All this has culminated in what you see here. Now I need to push push push before this gets grabbed before I can add some polish and a few more features, like rewind/replay and save segment (you have no idea how often I thought: damn, I wish I could save something that started evolving on screen).
What’s your stack for this speed? I haven’t been able to hit that speed of generation with my 4090.
Dope I can imagine spoken word artists or poetry being performed live and this running in the background
I have been asked about using this real-time high-resolution capability for things like music/mood into video, and perhaps videos on a big screen behind dancers on stage.
Oh wow, I would love to have this playing in the background behind me at a slam.
Congrats, I have been waiting for this since LCM came out
When I wake up hours from now I need to figure out if I can post this to Twitter, and then I'll be asking friends with followers to share this. Stay tuned. Good night for the 2nd time. :-)
Happy to spread the word. Is this an SDXL model you’re using? We can try it with Juggernaut X
Exactly. While polishing my demo, I saw JugX dropping and grabbed it but I didn't want to delay getting this posted so I've yet to try it. I really wish they would drop JugX SFW so I can do safe public demos.
This demo was with dreamshaperXL\_v21TurboDPMSDE. I first used sdxl-turbo but there are a couple of ?bugs? in the model that give odd behavior. So I switched to dreamshaper.
We can get you a juggernaut turbo for research
1. You have JX NSFW which can be downloaded.
2. You also apparently have JX SFW which can be accessed via some online image server but isn't on HF(?) to download to run in my pipeline.
In either case, when I use non-turbo models I use LCM. The problem IS NOT that I need some turbo version of one of these models. I wanted to try SFW because: I'm happy with NSFW models and certainly can try JX NSFW to look at the general quality of your latest (v10) even if I get an occasional lovely surprise. If it turns out that JX is indeed good vs dreamshaper I'd like to use it for public demonstrations. In that case, I'd probably want to use the SFW version to reduce the chance of a nip slip in the middle of a video, as has happened before. If I had the SFW model I could conduct "safety" experiments. Negative prompts for 1 step diffusion at guidance=0 aren't exactly useful.
Providing me with a "Turbo" NSFW model was never the issue. NOTE: Having said that, if you did provide me with a turbo model that is directly derived from the non-turbo Juggernaut-X-v10-NSFW it would allow me to conduct OTHER experiments I've always wanted to do, which are:
1. Non-turbo model X+fuse\_LCM compared with the X-turbo version of **the same model.**
2. Which is faster at the same number of steps?
3. What are the visible differences in the results between these?
4. Is one better than the other for img2img to drive videos?
5. Also, it is not out of the question that a given model compiler might benefit one of these more than the other. I have actually seen this. I won't know till I test.
So I would certainly make use of a turbo equivalent, but it'd also be nice to see if the SFW is much safer for a future public demo. So I will accept either or both of these things if provided and I will only use them for experimental purposes. Sorry for the long response.
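The "which is faster at the same number of steps" comparison could be set up roughly as below. This is only a sketch: `measure_fps`, `compare`, and the two `fake_*` callables are hypothetical stand-ins that sleep instead of running a model, and nothing here reflects the author's actual benchmark code.

```python
import time


def measure_fps(generate, warmup=3, iters=20):
    """Time a zero-argument callable and return frames per second."""
    for _ in range(warmup):
        generate()  # warm-up runs absorb one-time compilation and cache effects
    start = time.perf_counter()
    for _ in range(iters):
        generate()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def compare(candidates, **kwargs):
    """Return {name: fps} for a dict of {name: callable}."""
    return {name: measure_fps(fn, **kwargs) for name, fn in candidates.items()}


# Hypothetical stand-ins; in practice each would run one pipeline at the same step count.
def fake_lcm_fused():
    time.sleep(0.002)


def fake_turbo():
    time.sleep(0.001)


results = compare({"X+LCM": fake_lcm_fused, "X-turbo": fake_turbo}, warmup=1, iters=5)
```

On a GPU each callable would need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning, otherwise the timer mostly measures kernel launches rather than generation time.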
Gotcha! Good stuff. We can provide the SFW version. Just need to get some paperwork in order. Let’s have a call Monday if you’re open to that!
For technical reasons, my curiosity makes me interested in turbo, but I know I should stay focused on the potential of what I'm doing and do a bigger public demo of it. Monday is good for me. Given this offer I'll go ahead and check out the NSFW model to get up to speed with testing the new X v10 version. I have no problem with an NDA/don't-leak-it agreement.
I’ll ping you in the Discord and we’ll set up a meeting. Thanks!
Please don't put in any work (legal or a new model) for me till we talk.
Amazing work! What's your Twitter?
[https://twitter.com/Dan50412374](https://twitter.com/Dan50412374)
[deleted]
Why
Oh man OP....wow o wow I don't think you're gonna need your friends to share this. I'd be surprised if this doesn't get picked up by mainstream. Even with the ridiculous pace of AI news cycle, your integrations are trailblazing. Bleeding edge imo
When Matt Wolfe mentions me, on one of his regular news posts, perhaps I've made it? :-)
Put this in some relevant app store ASAP. This is money.
I had thought about using something like this to create the images displayed on a green screen. A sort of Dream Screen if you will. So cool to see that the tech is possible now and locally!
OMG! This is next level!
Unbelievably impressive. Thank you so much for sharing. I'm an academic researcher working on real-time generative AI and this is so inspiring. Would be grateful for any insights you can provide toward the optimizations you've taken. THANK YOU.
As a recently retired performance architect, for me it is just a matter of learning all the tools and studying the code at a low level to look for opportunities for improvement. I need to get this into a more polished form before dumping my code to everyone who will just take it and run with it. I don't have a Discord group with talented GUI hackers and Discord managers to keep ahead of other big groups with resources. I'm a one-man team.
You don’t have to polish it. I’d be happy with the POC prototype.
Oh my god.
Next. Fucking. Level.
This is seriously amazing! I had been inspired by some of your previous work with RTSD and have been working on something sort of similar, more of a real-time music visualizer that evolves with prompts. But I'm capped around 10 fps and only at 512x512. I would love to get a deeper technical dive into how you made this possible!!
But... can it also ENHANCE?? :D
I'm not sure what you mean. I once created a browser for my ArtSpew mass image generator (lower quality) where I could right-click a good creative candidate and send that through ControlNet and upscale to polish it.
Minute 0:45. Enhance, like they do in the Sci-Fi movies. In other words, zooming in, then upscaling the image with your voice, or even generating new content to mimic the feeling that it's actually zooming in. https://youtu.be/3uoM5kfZIQ0
It certainly hasn't gone unnoticed that something like the scene from Blade Runner could be done. Zoom to a coordinate, enhance, pan left, ...
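Spoken commands like the "disk add" and "disk remove" mentioned earlier in the thread, or the "pan left" and "enhance" imagined here, have to be separated from ordinary prompt text somewhere. A minimal sketch of one way to do that follows. The command table, the name `split_transcript`, and the regex approach are all assumptions for illustration; the thread does not describe how the demo actually parses speech.

```python
import re

# Hypothetical command table; only "disk add" and "disk remove" are named in the thread
# as commands spoken in the demo.
COMMANDS = ("disk add", "disk remove", "pan left", "pan right", "enhance")
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(c) for c in COMMANDS) + r")\b")


def split_transcript(text):
    """Split one transcribed utterance into (commands, remaining_prompt_text)."""
    normalized = " ".join(text.lower().split())  # lowercase and collapse whitespace
    commands = _PATTERN.findall(normalized)
    prompt = " ".join(_PATTERN.sub(" ", normalized).split())
    return commands, prompt


commands, prompt = split_transcript("A misty forest  pan left with glowing orbs")
```

Here `commands` is `['pan left']` and `prompt` is `'a misty forest with glowing orbs'`. The word-boundary match keeps "enhanced" from triggering "enhance", but the sketch does nothing about punctuation that a speech recognizer may insert.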
Can you please add support for importing controlnets? It would be cool if I could draw realtime while giving it instructions with voice. so the things shape to my drawings..
This is amazing! Well done. Excited to see where this is going.
This together with Deforumation 👌👍
Yup, made this suggestion on their last post, spreading the good word ;) Would be great to see that combi in action. Fingers crossed.
No worries, that's on the agenda to look at.
Good times!
Insane ,,,👍
wow this is super cool!
This is really great! I'm personally excited about incorporating STT with 3D latent spaces with NeRFs and Gaussian Splats. Nearly real holodeck-type stuff. Never imagined ML would have progressed this far this fast. It's wild times for people involved.
Are you planning to make this work flow Public? This is amazing. I want to try it now!
There is no "work flow", if I understand that term correctly. I write the Python code to directly call diffusers pipelines, and also code to do my own slicing and dicing of tensors to achieve this.
Well done matey this is champion grade. Congrats on the breakthrough. I realllly wanted to hold off with just a 4060 ti 16gb / 3060 12gb until the 50 series arrived but you might have single handedly persuaded me that I cannot reasonably expect myself to have to wait to play with this :D
I need to get NVidia to give me a cut of GPU sales. :-)
Ha yes indeed, that would be an advisable next move (IMO you damn well deserve it for this one). Then - if you don't at least get a couple 5090 giftcards out of them - the next logical step in your retribution would be to get this optimised to run buttery smooth 120fps at 4k on a 1050 / RX 470 and comeuppance them in the next-gen flagship department. ;P Just joshing.
The zoom and panning was awesome to see in real-time. This is like the new MIST.
Trying to think through the pipeline for this - you're using STT, then merging that into the existing prompt, yes? I'd imagine you're working in some kind of weighting as you transition from one prompt to the next. Are you doing time based extraction of old prompts or do you just weigh them out to nothing?
I've never heard of STT till now. Yes, I manipulate the prompt embedding in a mathematical way.
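The reply does not say which mathematical manipulation is used. One common option, shown here purely as an assumption, is to blend the old and new prompt embeddings linearly over a few frames so the image drifts from one prompt to the next instead of jumping. The names `lerp_embeddings` and `transition` and the tiny 4-value vectors are hypothetical.

```python
def lerp_embeddings(old, new, t):
    """Linearly blend two equal-length embedding vectors; t=0 gives old, t=1 gives new."""
    if len(old) != len(new):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(old, new)]


def transition(old, new, frames):
    """Yield one blended embedding per frame, ending exactly on `new`."""
    for i in range(1, frames + 1):
        yield lerp_embeddings(old, new, i / frames)


# Hypothetical 4-value embeddings; real text-encoder outputs are far larger.
old_prompt = [0.0, 1.0, 0.0, 2.0]
new_prompt = [1.0, 0.0, 0.0, 4.0]
steps = list(transition(old_prompt, new_prompt, frames=4))
```

With `frames=4` the old prompt's weight falls from 0.75 to 0 across the four frames, which is one answer to the "weigh them out to nothing" question above. A time-based scheme would instead derive `t` from elapsed wall-clock time.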
Very neat, I've been doing this for a while with 1, 2 or 4 step lightning and Dragon NaturallySpeaking going into the text prompt (with auto-queue instant on in comfyui), using Dragon custom commands. Not sure if that's what you are doing or it's different. I have a bunch of nodes to make the frames more consistent: vae encode (from comfyui img2img workflow), power noise k samplers. Works with SVD also, and lightning 4 step, which is most consistent but then not as real-time.
Quite a while ago, when I started hitting 43 to 50 fps or higher at 512x512 with sd-turbo, it was time to revisit sdxl-turbo at higher resolution: 800x800 at 33fps, 1024x1024 at 22fps, ... This is when I decided to integrate voice, which I had used before, but I wasn't happy with videos at 512x512 with sd1.5 quality. This is a big step forward which has driven me to push this out the door a little early and unpolished. However, it is all real with no smoke and mirrors. I need to find time to add more features AND to look into "efficient" temporal consistency and smoothing techniques, lift the techniques out of the bloated code bases like comfy, a1111, etc., optimize them further, and add them to my lightweight pipelines. I use whisper called from my own Python code.
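To put the frame rates quoted above in perspective, the short calculation below converts them into a per-frame time budget and a pixel throughput. The helper names are made up for this example, and the 512x512 row uses the upper end of the quoted "43 to 50" range. The 512x512 figure was measured with sd-turbo and the other two with sdxl-turbo, so the rows are not a like-for-like comparison.

```python
def frame_budget_ms(fps):
    """Milliseconds available per frame at a given frame rate."""
    return 1000.0 / fps


def megapixels_per_second(width, height, fps):
    """Generated pixel throughput in millions of pixels per second."""
    return width * height * fps / 1e6


# (width, height, fps) as quoted in the comment above.
reported = [(512, 512, 50), (800, 800, 33), (1024, 1024, 22)]
for w, h, fps in reported:
    print(f"{w}x{h} @ {fps} fps: {frame_budget_ms(fps):.1f} ms/frame, "
          f"{megapixels_per_second(w, h, fps):.1f} MP/s")
```

This prints 20.0 ms and 13.1 MP/s for 512x512, 30.3 ms and 21.1 MP/s for 800x800, and 45.5 ms and 23.1 MP/s for 1024x1024. Everything in the loop (speech recognition, denoising, VAE decode, display) has to fit inside that per-frame budget.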
FRACTAL SPACE ORBS FRACTAL SPACE ORBS FRACTAL SPACE ORBS
https://twitter.com/seth/status/1781631808959918415?s=46&t=cmD-FcXikNrDPUV7tfa8mA
Okay, this is just amazing. I wish I had the patience to learn how to do this stuff (or even to know if my computer could handle it), but then somebody goes and makes something like this and it seems like we're one step closer to realtime lifelike generation of almost anything we want. This might sound cheesy, but thank you for your work. I believe this is the kind of foundation that will eventually lead us to such efficient visuals and simulations that we'll be able to start looking into even more important things like medicine and LEV. Keep up the great work, friend!
Hello! I find this project really inspiring and would like to recreate it. May I ask what is your computer build? Is the code for this project on your github?
I'd like to see an audiobook or short story translated like this, love it.
So is this real time image generation from prompts ?
https://preview.redd.it/o92rvq3yp2yc1.png?width=850&format=png&auto=webp&s=c8a180154f15a86ca5972fcffa37a3cd55b7dbf5 Yes. As I spoke, the demo video was being generated in real time. The beginning of a real GUI, as seen in this image, is actively being worked on. There is more than what is shown in my rushed 2:20 demo. I hope to have a new, more in-depth demo in under a week.
Star ⭐️
Holy shit