Facebook AI engineers Sean Vasquez and Mike Lewis have discovered a way to take robotic-sounding text-to-speech systems to the next level, producing lifelike audio clips generated entirely by machine. Called MelNet, this AI-powered system reproduces human intonation and can mimic the voices of real people, such as Bill Gates. Think of this as deepfakes, but for audio instead of video. Read more for a few examples and additional information.
Vasquez and Lewis don’t train their deep-learning network on raw audio waveforms, but on spectrograms. Why? A spectrogram records the entire spectrum of audio frequencies and how they change over time. A waveform, by comparison, captures the change over time of just one parameter, amplitude, while a spectrogram captures the changing energy across a huge range of frequencies at once.
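To make the distinction concrete, here is a minimal sketch of how a plain magnitude spectrogram is computed from a waveform with a windowed short-time FFT. (This is only an illustration of the general idea; MelNet itself works on mel-scaled spectrograms, and the `spectrogram` function, frame length, and hop size below are assumptions for the example, not the researchers' actual code.)

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a windowed short-time FFT.

    Each column holds the energy in every frequency bin for one
    time frame, so the result shows how the full frequency content
    evolves over time -- unlike the raw waveform, which only tracks
    amplitude sample by sample.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A 440 Hz tone sampled at 16 kHz for half a second.
sr = 16000
t = np.arange(sr // 2) / sr
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone)
print(spec.shape)  # (frequency bins, time frames)
```

For a pure 440 Hz tone, every column of `spec` peaks in the same frequency bin; for speech, the peaks shift from frame to frame, and it is exactly that time-frequency picture the MelNet network learns to generate.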
Among the phrases MelNet generated in Gates’s voice:

“A cramp is no small danger on a swim.”
“He said the same phrase thirty times.”
“Pluck the bright rose without leaves.”
“Two plus seven is less than ten.”
“Having trained the system using ordinary speech from TED talks, MelNet is then able to reproduce the TED speaker’s voice saying more or less anything over a few seconds. The Facebook researchers demonstrate its flexibility using Bill Gates’s TED talk to train MelNet and then use his voice to say a range of random phrases,” reports Technology Review.