In a previous blog post, we examined the differences between traditional ETL (extract, transform and load) and ELT, where the “heavy lifting” of data transformation is handled by the robust, scalable (and usually cloud-hosted) target platform. In today’s cloud data warehouse environment, ELT maximizes the speed at which data is staged and ingested, while leveraging massive computing power in the cloud to cleanse, aggregate and otherwise prepare that data for general consumption. But where is the best place to manage that transformation piece: in cloud-friendly ETL tools, or within the management consoles of the cloud DWs themselves?
A common perception among CIOs is that ETL tools are resource-intensive and time-consuming, with low total ROI. This perception stems largely from legacy, on-prem “large footprint” ETL applications that have traditionally required dedicated servers and administration, and it leads IT departments to look for cloud solutions that can natively manage transformation processes without much up-front development. A cloud-native data warehouse platform such as Snowflake or AWS Redshift is instantly scalable and adaptable, and many data transformation processes can be engineered directly on such a platform with enough coding knowledge.
The big advantage of handling data transformations directly on the cloud DW platform is customization and fine-tuning. An astute developer can use techniques such as user-defined functions, common table expressions and stored procedures to achieve the kind of granular transformation results common to a traditional on-prem DBMS. This is a good approach for organizations accustomed to vesting their data management responsibilities in their DBAs and data architects.
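To make the platform-side approach concrete, here is a minimal sketch of a UDF combined with a common table expression. It uses Python’s built-in sqlite3 module as a stand-in for a cloud DW session; the table, column and function names are hypothetical, and on a platform like Snowflake the UDF would instead be registered with CREATE FUNCTION.

```python
import sqlite3

# In-memory database standing in for the cloud DW; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount_cents INTEGER)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("east", 1250), ("east", 2750), ("west", 500)])

# A user-defined function registered on the connection, analogous to a
# UDF created directly on the warehouse.
conn.create_function("cents_to_dollars", 1, lambda c: c / 100.0)

# A common table expression that cleanses the data before aggregating it.
rows = conn.execute("""
    WITH cleansed AS (
        SELECT region, cents_to_dollars(amount_cents) AS amount
        FROM raw_sales
    )
    SELECT region, SUM(amount) AS total
    FROM cleansed
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # [('east', 40.0), ('west', 5.0)]
```

The pattern is the point: cleansing logic lives in the UDF and the CTE, so the final aggregation stays simple and readable.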
On the ETL side, cloud data warehouses are leveraging data integration products that are definitely not “your father’s ETL”. Unlike legacy ETL tools, today’s cloud-hosted data integration tools such as Matillion are far more agile and can be spun up on cloud VMs in minutes. Using ETL tools to manage data transformations provides a degree of enterprise standardization largely unachievable with user-created source code. Pre-packaged ETL components are modular, and their usage is repeatable across multiple workflows with similar structural requirements. Additionally, Matillion takes advantage of data-driven workflows that utilize iterative loops in combination with table-based variable values to automate and simplify processes with a “write-once-use-many” approach.
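The “write-once-use-many” idea can be sketched in a few lines of Python: one parameterized job, driven by rows of variable values from a control table. The control-table columns and the generated statement are hypothetical illustrations, not Matillion’s actual component API.

```python
# Control table of variable values; in a real workflow this would be
# queried from the warehouse rather than hard-coded.
control_table = [
    {"source": "orders.csv",    "target": "stg_orders",    "delimiter": ","},
    {"source": "customers.tsv", "target": "stg_customers", "delimiter": "\t"},
]

def load_job(source: str, target: str, delimiter: str) -> str:
    # Stand-in for a single reusable load/transform component: written once,
    # parameterized entirely by the control-table row it receives.
    return (f"COPY INTO {target} FROM @stage/{source} "
            f"(FIELD_DELIMITER='{delimiter}')")

# The iterator runs the same job once per row of variable values.
statements = [load_job(**row) for row in control_table]
for stmt in statements:
    print(stmt)
```

Adding a new source then means adding a row to the control table, not writing a new workflow.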
In the end, the decision on where data transformations should live comes down to where in the organization their ownership and responsibility reside. CIOs may want to put the onus of transformation remediation solely in the hands of the cloud architects who are already administering and maintaining the platform as a whole. On the other hand, the short learning curve and ease of use of modern, modularized ETL packages make them well suited to data modelers and data analysts, who are likely closer to the business SMEs; a data-driven organization with highly formulated business rules may wish to codify its transformations with this user base. These decisions will largely depend on where a CIO or CTO places his or her comfort and confidence.
About the Author
Joe Caparula is a Senior Consultant with Pandata Group with extensive experience across several data integration platforms. His most recent projects include architecting and developing modern cloud data platforms with Matillion and Snowflake.