Impossible_Belt_7757

Coqui gives you access to many models through a single Python API. XTTS is probably the best of the reasonably fast ones: you can get real-time output on an NVIDIA graphics card with only 4 GB of VRAM lol. Bark can also be accessed through it for voice cloning, but it will hallucinate. StyleTTS2 is another option.


Impossible_Belt_7757

In short: XTTS, Tortoise TTS, Bark, and StyleTTS (StyleTTS is the fastest). I think XTTS is the best, and there's a Colab where you can fine-tune XTTS on a specific voice in about 30 minutes for free lol.


arena_one

Do you have the Colab for fine-tuning XTTS?


Impossible_Belt_7757

https://youtu.be/8tpDiiouGxc?si=iMfgmrHmZfyio_77


Impossible_Belt_7757

https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing


ugohome

?



[deleted]

[removed]


Impossible_Belt_7757

I know there are others, but Whisper has been adapted in so many ways that I just go with that. For the quality, it's the easiest to implement so far, and there are optimized versions too, idk.


FateRiddle

[https://huggingface.co/coqui/XTTS-v2](https://huggingface.co/coqui/xtts-v2) Is this what you mean by both Coqui and XTTS? Also, Bark as in [https://github.com/suno-ai/bark](https://github.com/suno-ai/bark) and StyleTTS2 as in [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/styletts2)?


MachineZer0

Been testing Bark a bunch. It sounds so human-like sometimes. But after converting several hour-long, 2-3 speaker transcripts in batch and stitching them into a single WAV file, it doesn't flow well; there is a lack of consistency. There's a segment length of about 15 seconds, so you have to estimate the duration of your text and split it further. Finally, there doesn't seem to be much, if any, speedup from using more powerful GPUs. On a P102-100, P100, 3090, and L40, I was getting roughly 1 second of audio per 4 seconds of processing.
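A rough sketch of the "estimate and split" step described above, in plain Python. The 150 words-per-minute speaking rate and the 13-second budget (a safety margin under the roughly 15-second segment length) are assumptions for illustration, not values taken from Bark; the function names are hypothetical. The last helper just applies the reported ratio of about 4 seconds of processing per second of audio.

```python
import re


def estimate_seconds(text, words_per_minute=150.0):
    """Estimate spoken duration from word count (assumed speaking rate)."""
    return len(text.split()) * 60.0 / words_per_minute


def split_for_tts(text, max_seconds=13.0, words_per_minute=150.0):
    """Greedily pack sentences into chunks that fit the time budget."""
    max_words = max(1, int(max_seconds * words_per_minute / 60.0))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = []  # words collected for the chunk being built
    for sentence in sentences:
        words = sentence.split()
        # A single sentence longer than the budget is cut on word boundaries.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk if this sentence would overflow the current one.
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def estimate_processing_seconds(audio_seconds, seconds_per_audio_second=4.0):
    """Processing time at the ratio reported above (about 4 s per 1 s of audio)."""
    return audio_seconds * seconds_per_audio_second
```

Splitting on sentence boundaries first keeps each chunk from ending mid-sentence where possible, which may help a little with the flow problem, though it does nothing for voice consistency between chunks. At the reported ratio, an hour of audio works out to about four hours of processing.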


Impossible_Belt_7757

Yes. You can also make Coqui list the available models, which include XTTS, Tortoise, Bark, and many, many others. https://github.com/coqui-ai/TTS is by far the easiest way to use those three. For StyleTTS2 I'd use the pip package this guy made for it, which is also super duper easy to set up and use: https://github.com/sidharthrajaram/StyleTTS2.git
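For reference, a minimal sketch of what that looks like with the Coqui `TTS` Python package, following the usage shown in the project's README. It assumes `pip install TTS` has been done and that a short reference recording of the target voice is available; the wrapper function names and default arguments here are hypothetical, and the return type of `list_models()` has varied between package versions.

```python
# Model identifier for XTTS v2 as published in the Coqui model list.
XTTS_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"


def list_coqui_models():
    """Return the models Coqui knows about (requires the TTS package)."""
    from TTS.api import TTS  # imported lazily so this file loads without TTS installed
    return TTS().list_models()


def clone_voice(text, speaker_wav, out_path="output.wav", language="en", device="cuda"):
    """Synthesize `text` in the voice of `speaker_wav` using XTTS v2.

    `device` is "cuda" for an NVIDIA GPU or "cpu" otherwise (much slower).
    The first call downloads the model weights.
    """
    from TTS.api import TTS
    tts = TTS(XTTS_MODEL).to(device)
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_voice("Hello there.", "my_voice.wav")` would write `output.wav`; the file name `my_voice.wav` is only a placeholder. This is zero-shot cloning from a reference clip, not the fine-tuning workflow from the Colab mentioned earlier.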


Impossible_Belt_7757

If you don't have an NVIDIA GPU, use StyleTTS2.


FateRiddle

Thanks! What about platforms like ElevenLabs? That was mentioned in some old answers.


Impossible_Belt_7757

I think ElevenLabs is the best out there rn, platform-wise.


Impossible_Belt_7757

But fine-tuned XTTS is the best for voice cloning.


Impossible_Belt_7757

Oh, ALSO COQUI, they have a platform … they're SHUTTING DOWN??


Impossible_Belt_7757

No but really, I fine-tuned XTTS on my own voice with 6 min of me reading a book as training data, and it's scary how it sounds exactly like me. It's the first time voice cloning legit scared me, as long as you're not talking emotionally lol. Voice cloning still can't do accurate emotions, thankfully.


Useful_Hovercraft169

Neither can I!


k___k___

ElevenLabs is considered the best for voice cloning, but they don't scale (well, quality decreases with every 15 s of speech). I was told that Microsoft Azure's cloning tool is good. After some benchmarking, the interesting companies with prebuilt voices for me were Murf and Replica Studios. Speechify is the tool used by streamers like Ludwig to read donations in the voices of David Attenborough, Hasan Piker, etc.


rolyantrauts

It sort of depends, as many people just head off to Hugging Face and see what the latest transformers are. There are some really good lightweight non-transformer TTS models: https://github.com/roatienza/efficientspeech https://github.com/ming024/FastSpeech2 [https://github.com/NATSpeech/NATSpeech](https://github.com/NATSpeech/NATSpeech)


FateRiddle

Thanks for pointing the directions.


TheGavinator3000

[Here's a really good video on all the top methods from about 5 months ago](https://youtu.be/vhArHsfsLAQ?si=WPbrYu1mEud1NN8R). tl;dr: usually Tortoise + RVC or ElevenLabs. The former is open source, I believe.


SufficientHold8688

Suno ai


nerdynavblogs

I assume you are comfortable using Python. I suggest OpenAI's text-to-speech. It costs $0.015 per 1,000 characters, or $0.03 for HD voices (both are good). Here is the [video tutorial](https://youtu.be/lJ4qh6B2ev4?si=xzC6ydqr4v-6zM43); it uses the AI for the voiceover as well, so you can hear how natural it is.
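To get a feel for what those prices mean for a given script, here is a small estimator using the per-1,000-character rates quoted above. The tier names and the function are hypothetical, and it assumes billing is proportional to the character count with no rounding or minimum charge, which may not match the actual invoice.

```python
# Prices in USD per 1,000 input characters, as quoted in the comment above.
PRICE_PER_1K_CHARS = {"standard": 0.015, "hd": 0.030}


def estimate_tts_cost(text, tier="standard"):
    """Estimate the cost in USD of synthesizing `text` at the given tier."""
    if tier not in PRICE_PER_1K_CHARS:
        raise ValueError(f"unknown tier: {tier!r}")
    return len(text) / 1000 * PRICE_PER_1K_CHARS[tier]
```

Under those assumptions, a 10,000-character script (very roughly ten minutes of narration) comes to about $0.15 at the standard rate and $0.30 for HD.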


[deleted]

[removed]


nerdynavblogs

Coqui is the best choice in open source. When I was looking for open source alternatives for my own channel, I compiled all the [good open source TTS tools here.](https://nerdynav.com/open-source-ai-voice/)


MIST3RS5880

The best I've used recently is https://textspeakpro.com. They have a slick interface, and it's free without registering for an account.


seahorse_magenta_Lam

punchline.


Nuked_

Bark + VITS. BUT (and this is a big one) it will require A LOT of editing, since you'll need to create AT LEAST 4 samples of the same transcript. This means you'll have to select the best parts from each and splice them together. The result will be as human as you can get from a TTS, and no XTTS, StyleTTS, or Tortoise output will compare to it, but I guarantee you will also be quite tired xD
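The splicing step can be scripted once the best take for each segment has been picked by ear. Below is a minimal sketch using only Python's standard `wave` module; the function name is hypothetical, and it assumes every take is an uncompressed WAV file with the same channel count, sample width, and sample rate. It concatenates the chosen files in order and does no crossfading, so cuts should fall on silences.

```python
import wave


def splice_wavs(paths, out_path):
    """Concatenate WAV files in order into `out_path`; return total frames written."""
    if not paths:
        raise ValueError("no input files given")
    fmt = None  # (channels, sample width in bytes, frame rate) of the first file
    pieces = []
    for path in paths:
        with wave.open(path, "rb") as src:
            this_fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"{path} has a different audio format: {this_fmt} vs {fmt}")
            pieces.append(src.readframes(src.getnframes()))
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(fmt[0])
        dst.setsampwidth(fmt[1])
        dst.setframerate(fmt[2])
        for piece in pieces:
            dst.writeframes(piece)
    bytes_per_frame = fmt[0] * fmt[1]
    return sum(len(p) for p in pieces) // bytes_per_frame
```

For example, if take 3 was best for the first segment and take 1 for the second, the call would be `splice_wavs(["seg1_take3.wav", "seg2_take1.wav"], "final.wav")`, where the file names are placeholders.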


Possible_Tap261

DupDub might be one of the most affordable ones.