Get a Data Lake – ELT not ETL
Data lakes are one of the buzzwords that have been going around for some time in the “big data” era. Many companies and people have figured out what a data lake is, have created one, and are using it to great effect. Others are still confused or unsure.
There are many articles and blog posts these days which provide clarity on data lakes. Here’s one definition:
A Data Lake is a data store used for storing and processing large volumes of data. They are often used to collect raw data in native format before datasets are used for analytics purposes
That definition leads to what is, in many ways, the pivotal line, “ELT not ETL”, taken from a post by James Serra:
ELT instead of ETL (loading the data into the data lake and then processing it). This can speed up transformations as the data lake is usually in a Hadoop cluster that can transform data much faster than an ETL tool
That in turn means identifying all the data sources in your organisation and deciding how best to extract and load the data from each of them: spreadsheets, REST services, relational databases, etc.
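To make the “load first, transform later” idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular data lake product works: the local directory standing in for the lake, the `raw/<source>/<date>/` layout, and the function names `load_raw` and `transform_orders` are all assumptions made for this example, as are the `customer` and `amount` columns.

```python
import csv
from datetime import date
from pathlib import Path


def load_raw(lake_root: Path, source: str, filename: str,
             payload: bytes, day: date) -> Path:
    """Land the payload unchanged under raw/<source>/<YYYY-MM-DD>/<filename>.

    The directory layout is a hypothetical convention for this sketch.
    """
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing or cleaning at load time
    return target


def transform_orders(raw_file: Path) -> dict[str, float]:
    """Transform step, run later against the stored raw file.

    Sums the (assumed) 'amount' column per (assumed) 'customer' column.
    """
    totals: dict[str, float] = {}
    with raw_file.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            customer = row["customer"]
            totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    return totals
```

The point of the split is that `load_raw` knows nothing about the content it stores, so a new source can be landed before anyone has decided how it will be used, and the transformation can be rewritten and re-run against the same raw files later.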