As enterprise data volumes grow and data changes ever faster, companies find it harder than ever to provide timely and complete data sets for business analysis. Change Data Capture (CDC) is a long-established methodology for propagating changes into target data environments in near-real time: rather than waiting for regularly scheduled batch cycles, it loads source data as it updates. The objective is for the provisioned data store to reflect the current source system data as closely as possible. Data engineers employ several approaches to CDC, but which is the most effective and least costly strategy that ensures complete and up-to-the-minute data updates?
Most CDC methodologies require extra horsepower from the database management system. Database triggers and application-based event programming demand extra processing on both the operational source systems and the target platforms, which often contain additional tables dedicated to tracking the changes. In the modern cloud-based data stack, this means additional computing cycles that translate to higher metered costs. Furthermore, these approaches are most often “snapshot”-based, i.e., they rely on checking row timestamps at regular intervals, which means that intermediate values of rapidly changing data can be missed when several changes fall between two consecutive checks.
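To make the gap between snapshots concrete, here is a minimal sketch in Python. It assumes a toy in-memory table and a hypothetical `poll_changes` helper; neither corresponds to any particular product. A row is updated twice between two polls, and the timestamp-based check only ever sees the final state.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    status: str
    updated_at: int  # logical clock tick of the last write (illustrative)


def poll_changes(table, last_poll):
    """Return rows whose timestamp is newer than the previous poll."""
    return [row for row in table.values() if row.updated_at > last_poll]


# Toy source table holding a single order.
table = {1: Row(1, "new", 1)}
captured = []

# First poll (covers everything written after tick 0).
captured += [(r.id, r.status) for r in poll_changes(table, last_poll=0)]

# Two updates land between polls; each overwrites the row in place.
table[1] = Row(1, "shipped", 6)
table[1] = Row(1, "delivered", 9)

# Second poll (covers everything written after tick 5).
captured += [(r.id, r.status) for r in poll_changes(table, last_poll=5)]

print(captured)
```

The intermediate "shipped" state never reaches the target, because the row had already been overwritten by the time the second poll ran.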
Log scanning uses the native database transaction logs, which by their very nature document every change, to monitor updates to the source data. It proves to be the most efficient, least costly and simplest approach to CDC. No additional flagging columns, timestamping or custom event coding is needed. Latency is low, and there is little to no impact on the database infrastructure since you’re not “reinventing the wheel” by creating additional tracking mechanisms.
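The idea behind log scanning can be sketched the same way. The example below is a simplified illustration, not a real log reader: actual transaction logs are engine-specific, and the `LogEntry` structure and `apply_log` function here are hypothetical names chosen for the sketch. Each change is an ordered log entry, and the consumer replays everything past its last recorded position.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    lsn: int              # log sequence number: position in the log
    op: str               # "insert", "update" or "delete"
    key: int
    value: Optional[str]  # None for deletes


def apply_log(log, target, last_lsn):
    """Replay entries newer than last_lsn onto target.

    Returns the new log position and the list of entries applied.
    """
    applied = []
    for entry in log:
        if entry.lsn <= last_lsn:
            continue  # already processed on an earlier pass
        if entry.op == "delete":
            target.pop(entry.key, None)
        else:
            target[entry.key] = entry.value
        applied.append(entry)
        last_lsn = entry.lsn
    return last_lsn, applied


# Toy log: every write appears as its own entry, in order.
log = [
    LogEntry(1, "insert", 1, "new"),
    LogEntry(2, "update", 1, "shipped"),
    LogEntry(3, "update", 1, "delivered"),
    LogEntry(4, "insert", 2, "new"),
    LogEntry(5, "delete", 2, None),
]

target = {}
position, applied = apply_log(log, target, last_lsn=0)
print(position, target, [e.value for e in applied if e.key == 1])
```

Because the consumer reads an append-only record rather than the current row state, every intermediate value and every delete is seen, and storing the last position lets a later pass resume without reprocessing earlier entries.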
Matillion Data Loader (MDL), a cloud-native pipeline platform from the leading cloud data integration company, now incorporates log-scanning CDC into its automated data ingestion tool. After you set up the source and target of your scheduled pipeline, you can create a SaaS agent in a container in your VPC (where your MDL deployment resides) using the same wizard-driven Data Loader interface. This agent actively monitors the source logs and initiates the pipeline’s data load whenever changes occur. You can still have separate scheduled batch loads within the same pipeline, so no separate solution is needed to incorporate both batch and CDC.
The CDC agent solution for Matillion Data Loader means that enterprises no longer need to rely on complex code or database customizations to ensure that their data migrations are timely and complete. It is an ideal approach for the modern data stack, where cloud compute usage needs to be minimized and effort is better spent on the business value of data than on the minutiae of process engineering.