Data Wrangling: Refining Messy and Complex Data

By Dick Weisinger

Data Analytics. Machine Learning. Big Data. These are all in-demand technologies that are built on the heavy use of data. Clean and accurate data is key for all of them to achieve good results, but the process of cleaning data often isn’t given the full attention that it deserves.

Data wrangling is the term often used for the data cleaning process: transforming raw data into the format needed for processing and removing any duplicate, outdated, incomplete, or irrelevant data.
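As a rough illustration of those steps, here is a minimal Python sketch. The record fields (name, email, updated), the sample values, and the cutoff date are all hypothetical, chosen only for this example. Deciding what counts as irrelevant depends on the use case, so that step is left out.

```python
from datetime import date

# Hypothetical required fields; a real data set would define its own.
REQUIRED_FIELDS = ("name", "email", "updated")

def wrangle(records, cutoff):
    """Return records that are normalized, complete, recent, and de-duplicated."""
    cleaned = {}
    for raw in records:
        # Transform: trim stray whitespace from every string value.
        rec = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
        # Incomplete: skip records with a missing or empty required field.
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            continue
        # Transform: put values into the format needed for processing.
        rec["email"] = rec["email"].lower()
        rec["updated"] = date.fromisoformat(rec["updated"])
        # Outdated: skip records last updated before the cutoff.
        if rec["updated"] < cutoff:
            continue
        # Duplicate: keep only the most recent record per email address.
        previous = cleaned.get(rec["email"])
        if previous is None or rec["updated"] > previous["updated"]:
            cleaned[rec["email"]] = rec
    return list(cleaned.values())

raw_records = [
    {"name": " Ada Lovelace ", "email": "ADA@example.com", "updated": "2024-03-01"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "updated": "2024-05-10"},
    {"name": "Bob", "email": "", "updated": "2024-04-01"},
    {"name": "Cy", "email": "cy@example.com", "updated": "2019-01-01"},
]

result = wrangle(raw_records, cutoff=date(2023, 1, 1))
```

Of the four sample records, only the newer Ada Lovelace entry survives: the older one is a duplicate, Bob's record lacks an email address, and Cy's record predates the cutoff.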

Mark Conrad, a retired archive specialist at NARA, said that “the dramatic rise in our ability to collect data isn’t yet matched by our ability to support, analyze and manage it. We generate more data than we can possibly read or comprehend, and need a way to summarize and analyze the ‘right’ data in order to use this information effectively and efficiently.”

Eurostat, the European Commission’s statistical office, said that “all data sources potentially include errors and missing values – data cleaning addresses these anomalies. Not cleaning data can lead to a range of problems, including linking errors, model misspecification, errors in parameter estimation and incorrect analysis leading users to draw false conclusions.”

It’s not a one-and-done process either. Almost all data sets are subject to data decay: the gradual loss of relevance as the data ages and becomes less representative of the real world. Because of data decay, data wrangling and cleansing need to be repeated on a regular basis.
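One simple way to decide when another cleaning pass is due is to track how much of a data set has gone stale. The sketch below assumes each record carries a last-verified date. The one-year age limit and the 25% threshold are arbitrary values picked for illustration, not recommendations.

```python
from datetime import date

def stale_fraction(last_verified_dates, today, max_age_days=365):
    """Return the share of records last verified more than max_age_days ago."""
    if not last_verified_dates:
        return 0.0
    stale = sum(1 for d in last_verified_dates if (today - d).days > max_age_days)
    return stale / len(last_verified_dates)

# Hypothetical last-verified dates for four records.
dates = [date(2024, 1, 15), date(2022, 6, 1), date(2023, 11, 30), date(2021, 3, 20)]

fraction = stale_fraction(dates, today=date(2024, 6, 1))
# Assumed policy: re-clean once more than a quarter of the records are stale.
needs_recleaning = fraction > 0.25
```

Here two of the four records are more than a year old, so the fraction is 0.5 and the data set is flagged for another pass.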

Data wrangling and cleansing are key for data-based technologies. Andrew Ng, an early evangelist for AI, summed it up when he wrote for Harvard Business Review that “data is food for AI, and modern AI systems need not only calories, but also high-quality nutrition.”
