Data Cleansing: How You Clean Your Data Can Alter Your Results

By Dick Weisinger

Sophisticated use of data is allowing companies to derive insights, make predictions, and analyze customer behavior. In surveys, businesses and organizations consistently rate data as highly valuable: 81 percent say that data can help their company be successful, and 75 percent say that data is critical to achieving innovation.

But feeding raw data into analytics can lead to bad predictions. Garbage in, garbage out. Disciplines that rely heavily on processing data, like data analytics, business intelligence, and machine learning, all stress that the best results are obtained when data is first cleaned and scrubbed for accuracy.
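
To make that concrete, here is a minimal sketch of what a cleaning pass might look like in Python with pandas. The dataset, column names, and rules are all invented for illustration:

import pandas as pd

# Hypothetical raw records; the columns and values are illustrative only.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, None, 104],
    "amount": ["250.00", "250.00", "n/a", "99.50", "1,200.00"],
    "region": ["East", "East", "west", "West", "WEST"],
})

cleaned = (
    raw
    .drop_duplicates()                   # remove exact duplicate rows
    .dropna(subset=["customer_id"])      # drop records missing a key field
    .assign(
        # parse currency strings, coercing unparseable values to NaN
        amount=lambda df: pd.to_numeric(
            df["amount"].str.replace(",", ""), errors="coerce"),
        # standardize inconsistent casing in a categorical field
        region=lambda df: df["region"].str.title(),
    )
    .dropna(subset=["amount"])           # discard rows that failed parsing
)

print(cleaned)

Even in this toy example, every rule (which fields are required, whether unparseable rows are dropped or imputed) is a judgment call, and each choice changes what survives into the analysis.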

The European Commission’s statistical office, Eurostat, said that “not cleaning data can lead to a range of problems, including linking errors, model misspecification, errors in parameter estimation and incorrect analysis leading users to draw false conclusions.”

On the other hand, the notion of 'clean' data is to some extent subjective, and how data is cleaned can radically alter any analysis or machine learning built on that data. In statistics, the latitude researchers have in deciding how to clean and analyze data, and the effect of those decisions on the final results, is known as 'researcher degrees of freedom.'
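
A small simulation shows how real that subjectivity is. The synthetic dataset below is cleaned under four defensible outlier rules, and each rule yields a different sample size and a different mean; the data and thresholds are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic measurements: 200 well-behaved values plus a few extremes.
data = np.concatenate([rng.normal(50, 5, 200), [120.0, 130.0, 150.0]])

# Four plausible 'cleaning' rules, each a defensible choice.
rules = {
    "keep everything":        data,
    "drop values above 100":  data[data <= 100],
    "drop beyond 3 std devs": data[abs(data - data.mean()) <= 3 * data.std()],
    "drop beyond 2 std devs": data[abs(data - data.mean()) <= 2 * data.std()],
}

for name, kept in rules.items():
    print(f"{name:24s} n={len(kept):3d}  mean={kept.mean():6.2f}")

The analyst who drops values above 100 and the analyst who keeps everything are both 'cleaning' the same data, yet they will report different numbers.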

Joseph Simmons and his co-authors explored the idea of 'degrees of freedom' in cleaning and analyzing data. "In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both? It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields 'statistical significance,' and to then report only what 'worked.' The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%."
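
The arithmetic behind that last sentence is worth spelling out. If each of k analyses is run at a 5 percent significance level and every null hypothesis is true, the chance that at least one comes out 'significant' is 1 - 0.95^k. A few lines of Python make the inflation concrete (independence across analyses is an assumption made here for simplicity):

# P(at least one false positive among k independent tests at alpha = 0.05)
# = 1 - (1 - 0.05)^k
for k in (1, 3, 5, 10, 20):
    print(f"{k:2d} analyses -> {1 - 0.95 ** k:.1%} chance of a false positive")

Even five exploratory analyses push the false-positive rate past 20 percent, which is exactly the problem Simmons and his co-authors describe.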
