
Synthetic Data: Computer Generated Data Sets for AI Training

By Dick Weisinger

Current artificial intelligence techniques, such as machine learning and neural networks, depend on training with massive amounts of data. But large data sets are often difficult to access, and data privacy is an increasing concern.

Some AI researchers are beginning to use ‘synthetic data’ sets. Synthetic data starts from a real data set: new data points are generated by statistically sampling the original data. The result is data that is statistically similar to the original and preserves its structure, but which is fully anonymized.
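As a rough illustration of the idea, the sketch below fits a simple statistical model to a small "real" data set and then samples brand-new rows from that model. Everything here is an assumption made for the example: the two columns (age and income), the toy data, the function names `fit_gaussian` and `sample_synthetic`, and the choice of a two-variable Gaussian model. Commercial synthetic data tools use far richer models, and a model this simple does not by itself guarantee anonymity.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, count, rng):
    """Draw new (x, y) rows from the fitted two-variable Gaussian."""
    mx, my, sx, sy, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y reproduces the correlation seen in the original data.
        y = my + sy * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Hypothetical "real" data: (age, income) pairs invented for this example.
real = []
for _ in range(500):
    age = rng.gauss(40, 10)
    income = 1000 * age + rng.gauss(0, 8000)
    real.append((age, income))

real_params = fit_gaussian(real)
synthetic = sample_synthetic(real_params, 500, rng)
synthetic_params = fit_gaussian(synthetic)

print("real mean age / correlation:     ", real_params[0], real_params[4])
print("synthetic mean age / correlation:", synthetic_params[0], synthetic_params[4])
```

The synthetic rows are drawn from the fitted model rather than copied or perturbed from individual records, so no synthetic row corresponds to a specific original row, while summary statistics such as the mean and the age-income correlation stay close to those of the original.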

Rob May, CEO of Talla, described how synthetic data is used in autonomous vehicle algorithms, saying that “you could create an entire machine generated city, drive around that city obeying traffic laws, and feed that data into the autonomous vehicle model. This allows you to simulate things that may be harder to capture in real life (e.g. a car running a stop sign).”

Harry Keen, Hazy CEO, told Techworld that “privacy just isn’t good enough in anonymization. It’s very easy to infer characteristics about someone or even identify an exact individual in an anonymized dataset, because you may very well have access to an ancillary dataset which you can cross-reference … With synthetic data it’s not a record transformation into another record – it’s literally starting from scratch and creating new people based off a generalized statistical approach.”
