For most of data warehousing’s history, ETL (extract, transform and load) has been the primary means of moving data between source systems and target data stores. Its dominance has coincided with the growth and maturity of on-premises physical data warehouses and the need to move and transform data in batch cycles so that target tables are populated efficiently and with minimal resource consumption. The “heavy lifting” of data transformation has been left to ETL tools, which use caching and their own DDL processing to manage target loads.
However, the data warehouse landscape is changing, and it may be time to reconsider the ETL approach in the era of MPP appliances and cloud-hosted data warehouses. These architectures are characterized by high-capacity processing and dynamic scalability. For that reason, it may be more appropriate to move the “T” to the end of the procedure: ELT (extract, load and transform). With ELT, data is not staged and reshaped on an intermediate server as part of the load; instead, raw data is loaded directly into the high-performance target database, and transformation happens there, on an engine optimized for the set-based queries that transformations require.
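To make the pattern concrete, here is a minimal ELT sketch in Python against a Postgres-compatible target. This is an illustration, not a reference implementation: the connection string, file paths, and table names are all hypothetical, and a real pipeline would extract from an actual source system rather than a local file.

```python
# Minimal ELT sketch (hypothetical names throughout; assumes a
# Postgres-compatible target reachable via psycopg2).
import csv
import psycopg2

def extract(source_path: str, staged_path: str) -> None:
    """Extract: pull raw rows from the source with no transformation."""
    with open(source_path, newline="") as src, \
         open(staged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(src):
            writer.writerow(row)  # copied as-is; the target does the work

def load_and_transform(staged_path: str) -> None:
    conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # Load: bulk-copy raw rows straight into a landing table.
        with open(staged_path) as f:
            cur.copy_expert("COPY raw_orders FROM STDIN WITH (FORMAT csv)", f)
        # Transform: one set-based statement run on the target, where the
        # engine's parallelism does the heavy lifting.
        cur.execute("""
            INSERT INTO fact_orders (order_id, customer_id, order_total)
            SELECT o.id, c.id, o.amount * x.rate
            FROM raw_orders o
            JOIN dim_customer c ON c.source_key = o.customer_key
            JOIN fx_rates    x ON x.currency = o.currency
        """)
```

Note that the “T” is just SQL executed where the data already lives; the Python layer only orchestrates.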
ELT works best on large datasets, and it is especially efficient for multi-key joins on non-indexed tables. Where ETL must perform table-by-table, row-by-row lookups to resolve those joins (falling back on cached subsets for particularly large tables), ELT relies on the processing power of the engine where the data is already persisted; as a bonus, joins pick up new data as soon as it lands. Another good use case for ELT is building multiple aggregate tables over the same data at different granularities, as sketched below. In ETL this would require parallel data paths and heavy resource consumption, but pushing the work to the target lets the developer create aggregations “on the fly,” or even virtually through query optimization.
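The following sketch shows that aggregation pattern, again with hypothetical table and column names on a Postgres-compatible target: one fact table feeds several roll-ups, each produced by a single set-based statement, plus a view as the “virtual” variant.

```python
# Target-side aggregation at several granularities (hypothetical schema;
# assumes a Postgres-compatible target).
import psycopg2

AGGREGATIONS = {
    # Each entry is one roll-up of the same fact table at a different grain.
    "agg_sales_daily":   "DATE_TRUNC('day', sold_at)",
    "agg_sales_monthly": "DATE_TRUNC('month', sold_at)",
}

def build_aggregates() -> None:
    conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        for table, grain in AGGREGATIONS.items():
            cur.execute(f"DROP TABLE IF EXISTS {table}")
            # One pass over fact_sales per grain, executed entirely by
            # the target engine -- no parallel extract paths needed.
            cur.execute(f"""
                CREATE TABLE {table} AS
                SELECT {grain} AS period, product_id,
                       SUM(amount) AS total_amount,
                       COUNT(*)    AS order_count
                FROM fact_sales
                GROUP BY 1, 2
            """)
        # A "virtual" aggregate: a view the optimizer resolves at query time.
        cur.execute("""
            CREATE OR REPLACE VIEW agg_sales_yearly AS
            SELECT DATE_TRUNC('year', sold_at) AS period,
                   SUM(amount) AS total_amount
            FROM fact_sales
            GROUP BY 1
        """)
```

Adding a new grain is a one-line change to the dictionary, which is the kind of flexibility that is expensive to retrofit into an ETL job with fixed data paths.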
Does this mean ETL is dead? Far from it, but it’s important to consider the architecture of the target environment. Traditional on-premises data warehouses will still benefit from robust ETL tools: because those systems are write-optimized, they may need to offload the heavy processing of transformations while handling high query traffic. Another big advantage of ETL is target-agnostic modularity, standardization and portability, which can make for real efficiencies in a multi-platform environment. The ideal data integration portfolio will use a mix of ETL and ELT, matching the loading approach to each target to create the optimal enterprise data landscape.
About the Author:
Joe Caparula is a Senior Consultant with Pandata Group, where he works with clients on data integration and business intelligence development.