Machine Learning: Better Data is Always Better

By Dick Weisinger

How much data do you need to generate good Machine Learning results?

The answer is that it depends, mostly on the complexity of your problem and on the ML algorithm that you are using to solve it. For example, a linear algorithm will typically need far fewer data points than a non-linear algorithm applied to the same problem.

When sizing the data sample that you’ll need, the goal is to steer a middle course between the opposing dangers of overfitting and underfitting.
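One rough way to size a data sample is to plot a learning curve: train on progressively larger subsets and watch the error on held-out data. The sketch below is only an illustration under stated assumptions. The synthetic data (a noisy line, y = 2x + 1), the helper names, and the sample sizes are all made up for this example and do not come from any of the projects mentioned in this article.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted (slope, intercept) pair."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, noise, rng):
    """Hypothetical data source: y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def learning_curve(sizes, noise=0.3, seed=0):
    """Return (training size, held-out MSE) pairs; every size must be >= 2."""
    rng = random.Random(seed)
    test_x, test_y = make_data(1000, noise, rng)
    train_x, train_y = make_data(max(sizes), noise, rng)
    curve = []
    for n in sizes:
        # Train on the first n points only, always score on the same test set.
        model = fit_line(train_x[:n], train_y[:n])
        curve.append((n, mse(model, test_x, test_y)))
    return curve


if __name__ == "__main__":
    for n, err in learning_curve([5, 10, 20, 50, 100, 500]):
        print(f"{n:4d} training points -> held-out MSE {err:.4f}")
```

Once the held-out error stops improving as the training set grows, adding more data of the same kind is unlikely to help much. For a simple linear model that plateau tends to arrive early.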

Jason Brownlee, founder of the Machine Learning Mastery community, explained that “generally, it is common knowledge that too little training data results in a poor approximation. An over-constrained model will underfit the small training dataset, whereas an under-constrained model, in turn, will likely overfit the training data, both resulting in poor performance. Too little test data will result in an optimistic and high variance estimation of model performance.”
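The contrast between over-constrained and under-constrained models can be sketched on a small synthetic dataset. This is a toy illustration, not Brownlee's own example. The noisy sine-wave data and the two stand-in models are assumptions chosen for brevity: a constant "predict the mean" model plays the over-constrained role, and a 1-nearest-neighbour model that memorises the training set plays the under-constrained role.

```python
import math
import random


def make_data(n, noise, rng):
    """Hypothetical data: a noisy sine wave sampled at evenly spaced x values."""
    # Evenly spaced inputs keep every x distinct, so 1-NN can memorise them.
    xs = [i / n for i in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def mean_model(train_x, train_y):
    """Over-constrained: ignores x and always predicts the training mean."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def nearest_neighbour_model(train_x, train_y):
    """Under-constrained: returns the label of the closest training point."""
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare(n_train=15, n_test=200, noise=0.3, seed=0):
    """Return {model name: (training MSE, test MSE)} for both models."""
    rng = random.Random(seed)
    train_x, train_y = make_data(n_train, noise, rng)
    test_x, test_y = make_data(n_test, noise, rng)
    results = {}
    for name, build in [("mean", mean_model), ("1-NN", nearest_neighbour_model)]:
        predict = build(train_x, train_y)
        results[name] = (
            mse(predict, train_x, train_y),
            mse(predict, test_x, test_y),
        )
    return results


if __name__ == "__main__":
    for name, (train_err, test_err) in compare().items():
        print(f"{name:>5}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The 1-nearest-neighbour model scores a training error of exactly zero because it reproduces every training label, so the gap between its training and test error is the signature of overfitting. The mean model cannot follow the curve at all, so its error stays high on both sets, which is what underfitting looks like.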

Machine Learning for images is notorious for the huge sets of sample images needed to learn tasks like recognizing objects within photographs. But some Machine Learning projects are reporting that other types of problems require far less data.

A group from the University of Alberta reported that their machine learning algorithm for predicting molecular structure required far less data to achieve good results than they had originally expected. Many of the models that the researchers used performed well in predicting molecular types with only a small amount of training data.

One factor often not mentioned when reporting ML results is the quality of the data that was used. Higher-quality data may achieve good results with fewer data points.

Xavier Amatriain, CTO at Curai, said that “it is important to point out that better data is always better. There is no arguing against that. So any effort you can direct towards ‘improving’ your data is always well invested. The issue is that better data does not mean more data. As a matter of fact, sometimes it might mean less!”
