Artificial Intelligence: Accurate Modeling isn’t possible without Data Quality
Data quality is one of the most important factors in deriving useful answers from Big Data, Data Analytics, Business Intelligence, and AI algorithms, yet it is often treated as 'operational' and doesn't get the attention it rightfully deserves.
A PwC study estimates that high-quality 'clean' data can save businesses 33 percent on their data-intensive projects and help those companies boost revenues by one-third.
Tom Redman, president at Data Quality Solutions, said that “Data quality is everything. The first thing is that if you’re using existing data to train a model and you don’t do a really good job cleaning it up, you’re going to get a bad model. Even if the model [you construct] is good, if you put bad data into it, you’re just going to get a bad result. If you stack these things up, it’s like a cascade, and the problem will quickly get out of control.”
Data quality trumps data quantity. Researchers from MIT and Amazon found that “traditionally, ML practitioners choose which model to deploy based on test accuracy — our findings advise caution here… Small increases in the prevalence of originally mislabeled test data can destabilize ML benchmarks, indicating that low-capacity models may actually outperform high-capacity models in noisy real-world applications… This gap increases as the prevalence of originally mislabeled test data increases.”
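To make the mechanism behind that finding concrete, here is a minimal sketch (not from the MIT/Amazon study) simulating a binary-classification benchmark: a model that is correct on 95 percent of true labels appears steadily worse when a fraction of the test labels themselves are flipped, because the benchmark scores it against the wrong answers. The function name and parameters are illustrative assumptions.

```python
# Sketch: how mislabeled test data distorts a measured benchmark score.
# Assumes binary labels and independent label noise -- a simplification.
import random

random.seed(0)

def measured_accuracy(true_accuracy, noise_rate, n=100_000):
    """Accuracy as measured against a test set in which a fraction
    `noise_rate` of the labels have been flipped."""
    correct = 0
    for _ in range(n):
        model_right = random.random() < true_accuracy   # vs. the true label
        label_flipped = random.random() < noise_rate    # test label is wrong
        # The benchmark credits the model only when its answer matches
        # the (possibly wrong) recorded label:
        if model_right != label_flipped:
            correct += 1
    return correct / n

for noise in (0.0, 0.05, 0.10):
    print(f"noise={noise:.2f}  measured accuracy="
          f"{measured_accuracy(0.95, noise):.3f}")
```

With 10 percent label noise, the measured score drops to roughly 0.86 even though the model itself hasn't changed, which is why small increases in mislabeled test data can reshuffle model rankings.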
Tadhg Nagle, lecturer at Cork University Business School, wrote in Harvard Business Review that "even if you don't care about data per se, you still must do your work effectively and efficiently. Bad data is a lens into bad work, and [our research shows] that most data is bad. Unless you have strong evidence to the contrary, managers must conclude that bad data is adversely affecting their work." Nagle estimates that only 3 percent of the data sets his team examined met 'basic quality standards'.
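As a rough illustration of what checking against 'basic quality standards' can look like, the sketch below counts records that pass simple defect tests: missing fields, out-of-range values, duplicate identifiers. The field names, value ranges, and sample data are illustrative assumptions, not taken from Nagle's research.

```python
# Sketch: a basic record-level data-quality check.
# Fields, ranges, and sample records below are hypothetical.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},               # missing email
    {"id": 3, "email": "c@example.com", "age": -5},  # age out of range
    {"id": 1, "email": "a@example.com", "age": 34},  # duplicate id
]

def defects(record, seen_ids):
    """Return the list of basic-quality defects found in one record."""
    errs = []
    if not record["email"]:
        errs.append("missing email")
    if not (0 <= record["age"] <= 120):
        errs.append("age out of range")
    if record["id"] in seen_ids:
        errs.append("duplicate id")
    return errs

seen, clean = set(), 0
for r in records:
    if not defects(r, seen):
        clean += 1
    seen.add(r["id"])

print(f"{clean}/{len(records)} records pass basic quality checks")
```

Even checks this simple surface the kinds of defects that, at scale, keep most real-world data sets below a basic quality bar.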