“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples,” said Pentaho CTO James Dixon, creator of the term ‘Data Lake’.
Joe McKendrick explains that “Data lakes are part of the Apache Hadoop ecosystem, serving as low-cost repositories for data of all types and sizes. Since data can be quickly poured into them with little fuss or muss, they’re relatively low cost to operate, unlike data warehouses, which require ETL, cleaning and normalization of data.”
Cesar Rojas, Product Marketing Director at Teradata, said that “one of the main appeals of data lakes is that they incorporate data from any source, from social media to clickstream data, into a single location that empowers enterprises to capitalize on this information.”
Over the last year, vendors in the Big Data space have been building products based on the concept of the Data Lake. Both Teradata and Hortonworks, for example, now offer platforms that can store and process large amounts of data pushed into them at very high velocity.
Tasso Argyros, a member of a stealth data startup, said that “in many ways, this is not unlike the operational data store we’ve seen between transactional systems and the data warehouse, but the data lake is bigger and less structured. Any file can be ‘dumped’ in the lake with no attention to data integration or transformation. New technologies like Hadoop provide a file-based approach to capturing large amounts of data without requiring ETL in advance. This enables large-scale data processing for data refining, structuring, and exploring data prior to downstream analysis in workload-specific systems, which are used to discover new insights and then move those insights into business operations for use by hundreds of end-users and applications.”
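To make the "dump first, structure later" idea concrete, here is a minimal sketch in Python. It stands in for a real platform such as Hadoop by using a local directory as the lake. The function names (`land_raw`, `read_clicks`), the JSON-lines file format, and the `user` and `url` fields are all assumptions made for illustration, not part of any product mentioned here.

```python
import json
import shutil
from pathlib import Path


def land_raw(lake_dir: Path, source_file: Path) -> Path:
    """Copy a file into the lake unchanged: no ETL and no schema check."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    target = lake_dir / source_file.name
    shutil.copy2(source_file, target)
    return target


def read_clicks(lake_dir: Path):
    """Apply structure only at read time.

    Yields (user, url) pairs from every *.jsonl file in the lake and
    silently skips lines that are not JSON objects with both fields.
    """
    for path in sorted(lake_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # raw data may be messy; refine it later
                if isinstance(record, dict) and "user" in record and "url" in record:
                    yield record["user"], record["url"]
```

Landing a file costs nothing beyond the copy, which is the low-friction ingestion the quote describes. The cost of interpreting the data moves to `read_clicks`, where each consumer decides which fields matter and how to treat malformed records.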
Bill Schmarzo, chief technology officer at EMC, said that “we’re going to see company after company who have already jumped into the lake. The data lake is going to be a great enabler. It’s going through its overhyped status right now, but a year from now we’re going to see a lot of different organizations that have implemented the data lake and are running not only their analytics on top of that but, in some cases, have actually moved some of their data warehouse capabilities on top of that as well.”