Multi-Modal AI: Algorithms That Use Audio, Visual and Tactile Data

By Dick Weisinger

Computers are mastering how to interpret images and audio. Increasingly, AI models draw not on a single type of sensory data, such as vision, sound, or touch, but on an aggregate of information from multiple sources. This approach is being called multi-modal AI, and the result is a smarter AI algorithm.

Multi-modal AI is the ability of an AI system to process and draw relationships between different types of data. One example is OpenAI's DALL-E model, which generates images from textual input. By combining the concepts expressed in the text prompt, the algorithm can choose visual objects and render them as an original image.

Ilya Sutskever, co-founder of OpenAI, said that “the world isn’t just text. Humans don’t just talk: we also see. A lot of important context comes from looking.”

Mark Riedl of the Georgia Institute of Technology said that “the more concepts that a system is able to sensibly blend together, the more likely the AI system both understands the semantics of the request and can demonstrate that understanding creatively.”

Jeff Dean, AI chief at Google, said that “that whole research thread, I think, has been quite fruitful in terms of actually yielding machine learning models that do more sophisticated NLP tasks than we used to be able to do. We’d still like to be able to do much more contextual kinds of models.”
