I recently started playing with Coqui XTTS, and I have to say my results have been horrid. I am familiar with 11labs and have always had great results there. My background is originally in audio/video production, so I am very capable of giving it whatever exact formats it needs; however, my results so far sound NOTHING like the source material. Very robotic, very distorted. Given all the gushing I have seen about this tool, I assume it must be user error. Currently I am just using it as an extension on Oobabooga, as that was the easiest way to get it up with a UI. Please let me know any tips and tricks you guys have learned! Thank you!
Current workflow:
Record in Adobe Audition
24-bit, 22050 Hz sample rate
WAV Format
Check out PIPER TTS, pretty good results and it’s super fast:
https://github.com/rhasspy/piper
https://www.youtube.com/watch?v=GGvdq3giiTQ&ab_channel=Thorsten-Voice
Use 10-second clips of clean audio: no music, no background noise. I like to record samples from audiobooks. Free samples on Amazon, recorded with Audacity, work well for me.
One thing to note, my install (an implementation for SillyTavern) somehow got corrupted, no idea how. It still worked but sounded way worse. Reinstall fixed that so maybe that’s happening to you too.
The instructions on GitHub said to use a mono 24000 Hz WAV. Double-check the info though.
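If it helps, here is a rough sketch for sanity-checking a reference clip before handing it to XTTS. It only reads the WAV header with Python's standard `wave` module and compares it against the numbers mentioned in this thread (mono, 24000 Hz, roughly 10 seconds). The function name `check_reference_clip` and the 12-second upper bound are my own placeholders, not anything from the Coqui docs, so adjust them to whatever the instructions actually say.

```python
import wave


def check_reference_clip(path, want_rate=24000, want_channels=1, max_seconds=12.0):
    """Return a list of problems found in the WAV header; empty list means it matches.

    The defaults reflect the settings suggested in this thread (assumptions,
    not official requirements). wave.open raises wave.Error for files it
    cannot parse, for example non-PCM WAV variants.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / float(rate)

    if channels != want_channels:
        problems.append(f"channels: {channels}, expected {want_channels}")
    if rate != want_rate:
        problems.append(f"sample rate: {rate} Hz, expected {want_rate} Hz")
    if seconds > max_seconds:
        problems.append(f"duration: {seconds:.1f} s, expected at most {max_seconds:.1f} s")
    return problems
```

Calling `check_reference_clip("my_voice.wav")` on a 22050 Hz recording like the one in the original post would report the sample-rate mismatch; an empty list means the header matches. It does not judge audio quality or noise, only the format.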
I had to go back to the previous version that’s on Hugging Face to get good audio. Somehow the latest version sounds much worse.
Edit: see https://github.com/coqui-ai/TTS/issues/3309#issue-2010577124
Check which model you are using. The latest 2.0.3 XTTSv2 is really wonky. Manually revert it to 2.0.2.
Do you know an easy way to revert using the Oobabooga extension? Thanks!