Photo credit: Mark Hamilton
Mark Hamilton, an MIT PhD student, created DenseAV, an AI system that learns human language from scratch simply by watching videos. Given a word such as “dog,” the algorithm searches for that object in a video stream.
To discover language, DenseAV uses two main components that process audio and visual data separately. This design forces the algorithm to recognize objects and to build detailed, meaningful features for both the audio and the visual signals. DenseAV then learns by comparing pairs of audio and visual signals to determine which match and which do not. Because this method requires no labeled examples, DenseAV can work out the important predictive patterns of language on its own.
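The matching idea described above can be sketched as a contrastive objective: matched audio/visual pairs should score higher than mismatched ones. The snippet below is a minimal illustration of that general idea in NumPy, not DenseAV's actual implementation; the function names, the InfoNCE-style loss, and the temperature parameter are assumptions for the sketch.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(audio_feats, visual_feats, temperature=0.1):
    """Hypothetical InfoNCE-style loss: audio and visual features at the
    same batch index are true pairs; all other combinations are negatives."""
    # L2-normalize so dot products become cosine similarities.
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    sim = (a @ v.T) / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (true pairs) as targets, both directions.
    audio_to_visual = -np.mean(np.diag(log_softmax(sim, axis=1)))
    visual_to_audio = -np.mean(np.diag(log_softmax(sim, axis=0)))
    return (audio_to_visual + visual_to_audio) / 2
```

Training to minimize a loss like this pushes matched signals together and mismatched signals apart, which is what lets a model discover cross-modal structure without any labels.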

“Recognizing and segmenting visual objects in images, as well as environmental sounds and spoken words in audio recordings, are each difficult problems in their own right. Historically, researchers have relied upon expensive, human-provided annotations in order to train machine learning models to accomplish these tasks,” said David Harwath, assistant professor in computer science at the University of Texas at Austin.