H/t: PetaPixel
MIT’s Speech2Face technology can reconstruct an image of a person’s face from just a short audio recording of their voice. It is powered by a deep neural network trained on millions of natural videos of people speaking, collected from the internet. During training, the model learns audiovisual voice-face correlations that allow Speech2Face to produce images capturing physical attributes of the speaker such as age, gender, and ethnicity. Read more for a video and additional information.
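Conceptually, the pipeline pairs a voice encoder (audio in, face-feature vector out) with a face decoder (face-feature vector in, face image out). Below is a minimal PyTorch sketch of that idea; the module names (`VoiceEncoder`, `FaceDecoder`), layer sizes, and toy image resolution are illustrative assumptions, not the researchers’ actual architecture.

```python
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a speech spectrogram to a face-feature vector (hypothetical stand-in)."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool over frequency and time
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, spectrogram):    # (B, 1, freq, time)
        h = self.conv(spectrogram).flatten(1)
        return self.fc(h)              # (B, feat_dim)

class FaceDecoder(nn.Module):
    """Decodes a face-feature vector into a canonical frontal face image."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 64 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat):
        h = self.fc(feat).view(-1, 64, 7, 7)
        return self.deconv(h)          # (B, 3, 28, 28) toy-sized face image

# At inference time, audio alone is enough to produce a face reconstruction.
voice_enc, face_dec = VoiceEncoder(), FaceDecoder()
spec = torch.randn(1, 1, 257, 100)     # placeholder spectrogram
face = face_dec(voice_enc(spec))
print(face.shape)                      # torch.Size([1, 3, 28, 28])
```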
The researchers did not have to label or monitor Speech2Face during training: the model learned in a self-supervised manner by exploiting the natural co-occurrence of faces and speech in videos, without modeling attributes explicitly. Because the reconstructions are obtained directly from audio, they reveal the correlations between faces and voices, and the researchers could numerically quantify how closely Speech2Face reconstructions resemble the true face images of the speakers.
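To make that self-supervised setup concrete, here is a hedged sketch of one training step, reusing the hypothetical `voice_enc` from the sketch above and assuming a frozen, pretrained face-recognition encoder `face_enc` (also hypothetical). The only supervision signal is that the face frame and the audio clip come from the same video, so no manual labels are needed.

```python
import torch
import torch.nn.functional as F

def train_step(voice_enc, face_enc, optimizer, spectrogram, face_frame):
    """One self-supervised step: regress audio features toward face features."""
    with torch.no_grad():
        target = face_enc(face_frame)   # face feature from the co-occurring frame
    pred = voice_enc(spectrogram)       # face feature predicted from audio alone
    loss = F.l1_loss(pred, target)      # pull the two feature vectors together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def resemblance(pred_feat, true_feat):
    """One way to numerically quantify how a reconstruction resembles the
    true face: cosine similarity in the shared feature space."""
    return F.cosine_similarity(pred_feat, true_feat, dim=-1).mean().item()
```

In this sketch only the voice encoder is optimized, while the face encoder stays fixed, which is consistent with the self-supervised description above: the co-occurrence of faces and speech in the videos does all the supervisory work.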
“Our model is designed to reveal statistical correlations that exist between facial features and voices of speakers in the training data. The training data we use is a collection of educational videos from YouTube, and does not represent equally the entire world population. Therefore, the model—as is the case with any machine learning model—is affected by this uneven distribution of data,” said the researchers.