Data Cleaning: Avoid Misguided Decisions by Practicing Data Cleansing

By Dick Weisinger

Data analytics are enabling businesses to make better decisions, but the presence of “dirty data” can adversely affect the results of the analysis. If data isn’t accurate, forecasts, outcomes, and decisions can be misguided. Data cleansing is the process of removing duplicate data, identifying and fixing incorrect data, and filling in data that is incomplete or missing.

Thomas C. Redman, president of Data Quality Solutions, wrote in Harvard Business Review that “poor data quality is enemy number one to the widespread, profitable use of machine learning. While the caustic observation, “garbage-in, garbage-out” has plagued analytics and decision-making for generations, it carries a special warning for machine learning. The quality demands of machine learning are steep, and bad data can rear its ugly head twice — first in the historical data used to train the predictive model and second in the new data used by that model to make future decisions.”

Here are steps you can follow to clean your data:

  • Know your data. Identify the data sources that have caused errors or problems in the past.
  • Standardize your data collection processes.
  • Automate a de-duplication process.
  • Use tools to validate the accuracy of data. New tools that use AI can uncover issues.
  • Repeat. Data needs to be cleaned on a regular basis; otherwise it can go stale quickly.
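
As a rough illustration of the standardize, de-duplicate, and validate steps, here is a minimal Python sketch that uses only the standard library. The sample records, the field names (name, email, age), and the helper functions are all hypothetical, invented for this example. The email pattern is a deliberately loose check, not a complete validator.

```python
import re

# Hypothetical records; the field names are assumptions for illustration only.
records = [
    {"name": "Ada Lovelace", "email": "Ada@Example.com ", "age": "36"},
    {"name": "ada lovelace", "email": "ada@example.com", "age": "36"},
    {"name": "Alan Turing", "email": "alan@example", "age": ""},
]

# Loose pattern: something@something.something, with no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def standardize(record):
    """Trim whitespace and normalize case so equivalent values compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "age": record["age"].strip(),
    }


def deduplicate(rows, key="email"):
    """Keep only the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def validate(row):
    """Return a list of problems found in a single row."""
    problems = []
    if not EMAIL_RE.match(row["email"]):
        problems.append("invalid email")
    if not row["age"].isdigit():
        problems.append("missing or non-numeric age")
    return problems


# Standardize first so that duplicates differing only in case or spacing match.
cleaned = deduplicate([standardize(r) for r in records])
report = {row["email"]: validate(row) for row in cleaned}
```

Standardizing before de-duplicating matters here: the first two records differ only in letter case and trailing whitespace, so they collapse into one row only after normalization. Running a script like this on a schedule is one simple way to approach the "repeat" step.
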