Artificial Intelligence: Data Quality Can Make or Break an AI Project

By Dick Weisinger

Getting the data right for AI and data analysis projects may be the biggest factor in making or breaking the project. Without accurate, current, and diverse data sets, it isn’t possible to build models that make reliable predictions.

Data prep is tedious, and it is one of the most overlooked steps of an AI project. Even frequently used, publicly available training sets have been found to have a significant number of data quality problems. If widely used data sets can’t get it right, there is reason to be suspicious of projects with a lower profile.

A 2021 study by MIT found that the public ImageNet database has “systemic annotation issues,” with as many as 20 percent of the collection containing duplicates. Another investigation into a dataset created by Google found that as many as 30 percent of the entries were mislabeled. A project between IBM and MD Anderson was canceled because outdated data led to poor results.

Andrew Ng, a professor at Stanford University, commented that “AI has a proof-of-concept-to-production gap. The full cycle of a machine learning project is not just modeling. It is finding the right data, deploying it, monitoring it, feeding data back [into the model], showing safety—doing all the things that need to be done [for a model] to be deployed. [That goes] beyond doing well on the test set, which fortunately or unfortunately is what we in machine learning are great at.”

A Forbes article by Kathleen Walch suggests that analytics and AI practitioners need to get back to basics and pay more attention to data prep. In the mid-1990s, a set of best practices for data mining projects was developed called the Cross-Industry Standard Process for Data Mining (CRISP-DM). Steps two and three of this process are “Data Understanding” and “Data Preparation,” and both are crucial to building a project with high data quality.
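To make the “Data Understanding” step a little more concrete, here is a minimal Python sketch of the kind of checks it might include: counting duplicate rows, missing labels, conflicting labels, and stale records. It is only an illustration, not part of CRISP-DM or of any of the studies cited here. The function name `quality_report`, the record fields (`features`, `label`, `updated`), and the one-year staleness threshold are all assumptions made up for this example.

```python
from collections import Counter, defaultdict
from datetime import date


def quality_report(records, today, max_age_days=365):
    """Summarize basic quality problems in a list of dict records.

    Hypothetical schema: each record has 'features' (a tuple),
    'label' (a string or None), and 'updated' (a datetime.date).
    """
    # Rows beyond the first occurrence of a feature tuple count as duplicates.
    feature_counts = Counter(r["features"] for r in records)
    duplicate_rows = sum(n - 1 for n in feature_counts.values())

    missing_labels = sum(1 for r in records if r["label"] is None)

    # Identical features carrying different labels hint at annotation errors.
    labels_by_features = defaultdict(set)
    for r in records:
        if r["label"] is not None:
            labels_by_features[r["features"]].add(r["label"])
    conflicting_groups = sum(
        1 for labels in labels_by_features.values() if len(labels) > 1
    )

    # Records not updated within max_age_days are treated as stale.
    stale_rows = sum(
        1 for r in records if (today - r["updated"]).days > max_age_days
    )

    return {
        "rows": len(records),
        "duplicate_rows": duplicate_rows,
        "missing_labels": missing_labels,
        "conflicting_label_groups": conflicting_groups,
        "stale_rows": stale_rows,
    }


# Made-up sample data, for illustration only.
records = [
    {"features": (1.0, 2.0), "label": "cat", "updated": date(2023, 4, 1)},
    {"features": (1.0, 2.0), "label": "dog", "updated": date(2023, 4, 2)},
    {"features": (3.0, 4.0), "label": None, "updated": date(2020, 1, 1)},
    {"features": (5.0, 6.0), "label": "cat", "updated": date(2023, 5, 1)},
]
report = quality_report(records, today=date(2023, 5, 11))
print(report)
```

Checks like these do not fix anything on their own, but they give a rough count of how much “Data Preparation” work lies ahead before any modeling starts.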

Arvind Krishna, the CEO of IBM, noted that data prep is hard and that the main reason AI projects are canceled is the difficulty of preparing quality data. Krishna said that many companies “run out of patience along the way, because they spend their first year just collecting and cleansing the data. And they say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And they kind of bail on it.”

May 11th, 2023

Category: Artificial Intelligence
