Voice Inflection in AI Music

I’m not sure if this is the right place for this, but I wanted to share my experience. I added AI-generated music to a travel video from Utah, and I’m really impressed with the AI vocals. I’ve used SSML to add voice inflection before, but these AI vocals are on a whole new level. How is this achieved? I assume it’s a result of LLMs being trained on high-quality content. Any references to technical papers on this topic would be appreciated.

The AI Language Model song at 2:42 is a hoot!

So where exactly were the vocals generated? How do you know they are AI-generated? It sounds more like a tongue-in-cheek song about someone unfortunate enough to date Replika.

What I have seen is a workflow where a person sings nonsense lyrics to an existing song, and an AI voice is then generated that follows the tonal shifts of the original singer while substituting different words. I have no idea if that is what is happening here, though.

So sorry, I don’t know why I did not see your reply. The vocals were generated by https://www.udio.com/. I don’t know what tools they are using. I asked Grok what tools Udio is using but did not get any details.

Grok: In summary, Udio’s vocal generation involves a sophisticated interplay between language models for lyrics and music generation models for vocal performance, all tailored by user prompts. While the exact tools and models remain proprietary, the technology’s foundation lies in advanced AI learning from extensive datasets to produce realistic and customizable vocal tracks.