Big Data: Help Wanted: Data Janitor Positions

By Dick Weisinger

Garbage in, garbage out. As much as big data leads the technology hype headlines, data quality remains its Achilles’ heel.

Eva Ho, a partner with the early-stage venture capital firm Susa Ventures, said that “the little secret of big data is that a lot of it isn’t clean.”  According to Evans Data, 20 percent of big data developers say that managing the quality of their data is a major issue for them.

Jeffrey Heer, a professor of computer science at the University of Washington, said that “it’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”

‘Data cleaning’ is the task of taking raw data and making it usable.

Erhard Rahm and Hong Hai Do, researchers at the University of Leipzig, explained that “Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. Data quality problems are present in single data collections, such as files and databases, e.g., due to misspellings during data entry, missing information or other invalid data. When multiple data sources need to be integrated, for example, in data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly.”
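
To make the three problem types named in that definition concrete (misspellings, missing information, invalid data), here is a minimal Python sketch of a cleaning pass over a handful of records. It is an illustration under stated assumptions, not a method taken from Rahm and Do: the field names, the `CITY_FIXES` correction table, the age bounds, and the sample records are all hypothetical, and real pipelines usually need far more rules than this.

```python
from typing import Optional

# Hypothetical correction table for known misspellings (illustrative only).
CITY_FIXES = {"seatle": "Seattle", "leipzg": "Leipzig"}


def clean_record(rec: dict) -> Optional[dict]:
    """Return a cleaned copy of rec, or None if the record is unusable."""
    name = (rec.get("name") or "").strip()
    city = (rec.get("city") or "").strip()
    age_raw = str(rec.get("age") or "").strip()

    # Missing information: a record without a name cannot be used.
    if not name:
        return None

    # Misspellings: apply known corrections, otherwise just normalize case.
    city = CITY_FIXES.get(city.lower(), city.title())

    # Invalid data: keep the age only if it is a plausible whole number
    # (the 0-120 range is an assumption made for this sketch).
    age = int(age_raw) if age_raw.isdigit() and 0 < int(age_raw) < 120 else None

    # Collapse repeated whitespace and normalize capitalization in the name.
    return {"name": " ".join(name.split()).title(), "city": city, "age": age}


def clean_all(records: list) -> list:
    """Clean every record, dropping unusable ones and exact duplicates."""
    seen = set()
    cleaned_records = []
    for rec in records:
        cleaned = clean_record(rec)
        if cleaned is None:
            continue
        key = (cleaned["name"], cleaned["city"])
        if key in seen:  # inconsistency: same entity entered twice
            continue
        seen.add(key)
        cleaned_records.append(cleaned)
    return cleaned_records


# Made-up sample data showing each kind of problem.
raw = [
    {"name": "  ada   lovelace ", "city": "seatle", "age": "36"},
    {"name": "Ada Lovelace", "city": "Seattle", "age": "36"},
    {"name": "", "city": "Leipzig", "age": "41"},
    {"name": "Erik Maier", "city": "leipzg", "age": "-5"},
]
cleaned = clean_all(raw)
```

Even in this toy case, four raw records shrink to two usable ones, and one of those loses its age value, which hints at why the work consumes so much time once several real data sources are combined.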

Data scientists are estimated to spend 50 to 80 percent of their time collecting and cleaning up data sets.  Those tasked with data cleaning have been nicknamed ‘Data Janitors’ and ‘Data Wranglers’.

Monica Rogati, vice president for data science at Jawbone, told the New York Times that “data wrangling is a huge — and surprisingly so — part of the job.  It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
