Big Data may bring with it Big Changes in the data flow behind the data analysis process. Data analysis is often preceded by a process called ETL, which collects and prepares the data that is to be analyzed.
The ETL (Extract-Transform-Load) process consists of the following three steps:
- Extract data from the source
- Transform and possibly clean the data
- Load it into the target system for analysis
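The three steps above can be sketched in a few lines of Python. This is only an illustrative outline, not a production pipeline: the sample CSV, the `orders` table, and the function names are all made up for the example, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source application (made-up sample data).
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,
"""

def extract(text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names, parse amounts, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # cleaning step: skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target (SQLite as a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(conn):
    """Run the steps in ETL order: transformation happens before loading."""
    load(transform(extract(RAW_CSV)), conn)
```

Calling `run_etl(sqlite3.connect(":memory:"))` leaves only cleaned rows in the target. The incomplete record never reaches it, which is the defining trait of transforming before loading.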
ETL has traditionally been a long and tedious process, one that often resulted in lags of weeks or even months between when data was first collected and when it could be analyzed.
Now, newer systems are bringing improvements and efficiencies. Increasingly, data is being collected and moved in near real time to newer systems with Big Data processing capabilities, like Hadoop. These Big Data systems act as a central data repository and can provide near-real-time analysis. While data still needs to be extracted from applications, it is very often loaded first into a repository like Hadoop and transformed there. Rather than ETL, the new process is more like ELT, with transformation happening as the last step, inside the target system. Some users of Hadoop, like Sears Holdings, are declaring that the ELT approach means “death to ETL” and that it solves many of the problems and expenses that ETL has caused.
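To make the ELT ordering concrete, here is a minimal sketch in Python. It is an assumption-laden illustration, not how Hadoop itself is used: an in-memory SQLite database stands in for the central repository, and the sample rows, the table names `raw_orders` and `orders`, and the function names are invented for the example.

```python
import sqlite3

# Hypothetical raw records, loaded exactly as extracted (made-up sample data).
RAW_ROWS = [("1", " alice ", "19.99"), ("2", "BOB", "5.00"), ("3", "carol", "")]

def load_raw(conn, rows):
    """Load: land the untouched source rows in a staging table."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

def transform_in_target(conn):
    """Transform: clean the data inside the target, keeping the raw copy."""
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE TRIM(amount) <> ''
        """
    )
```

Unlike the ETL ordering, the raw rows stay available in the repository after transformation, so a different cleaning rule can be applied later without going back to the source application.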
The CTO of Sears Holdings and CEO of Metascale said:
“The Holy Grail in data warehousing has always been to have all your data in one place so you can do big models on large data sets, but that hasn’t been feasible either economically or in terms of technical capabilities. With Hadoop we can keep everything, which is crucial because we don’t want to archive or delete meaningful data… ETL is an antiquated technique, and for large companies it’s inefficient and wasteful because you create multiple copies of data. Everybody used ETL because they couldn’t put everything in one place, but that has changed with Hadoop, and now we copy data, as a matter of principle, only when we absolutely have to copy.”
Newer technologies like Hadoop have certainly created efficiencies and improved how data analysis is done. But the basic components of the ETL process live on, and they can’t be eliminated unless a business finds it possible to standardize on a single central repository for native use by all applications. Without that, data still needs to be extracted from applications in order to move it into an analysis repository like Hadoop. Once there, data still needs to be cleaned and appropriately transformed before it can be processed. And at some point, whether before or after transformation, the data needs to be loaded into the target system.
The CTO of Informatica pointed out that whether you use newer Big Data technologies or continue to use data warehousing techniques, some problems still have to be addressed and simply don’t go away. Those problems include profiling data, discovering relationships between data, handling metadata, explaining context, accessing data, transforming data, cleansing data, and governing data for compliance.
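As one small illustration of the first item on that list, data profiling can start with something as simple as counting missing and distinct values per column. The sketch below is a hypothetical helper written for this example (the `profile` function and its report layout are assumptions, not part of any named product), and real profiling tools go much further.

```python
from collections import Counter

def profile(rows):
    """Return per-column counts of total, missing, and distinct values.

    `rows` is a list of dicts, one dict per record.
    """
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(
                column, {"total": 0, "missing": 0, "values": Counter()}
            )
            stats["total"] += 1
            if value is None or str(value).strip() == "":
                stats["missing"] += 1  # treat None and blank strings as missing
            else:
                stats["values"][value] += 1
    return {
        column: {
            "total": s["total"],
            "missing": s["missing"],
            "distinct": len(s["values"]),
        }
        for column, s in report.items()
    }
```

A report like this is needed regardless of whether the data sits in a warehouse or in a Hadoop-style repository, which is the point being made above.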