In our previous installment on the hybrid data integration portfolio, we looked at the role of data virtualization in a unified, multi-platform approach to creating a managed enterprise data foundation. In this installment, we examine data replication and synchronization, i.e., the ongoing copying of data (without massaging or transformation) from one physical location to another, usually in conjunction with change data capture (CDC).
Data replication is often considered ETL without the "T", though where ETL is usually a batch-based delivery process, replication is often driven by "update-upon-change". Through this process, the target database only updates when changes occur to the source. Often referred to as "just-in-time" data, this represents a compromise between scheduled batch updates and real-time streaming. Replication then becomes an ideal methodology for capturing source transactional data into a separate repository that can be purposed for further reporting, analysis and application activity.
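To make "update-upon-change" concrete, here is a minimal sketch in Python. It assumes the CDC tool hands us an ordered list of change events; the ChangeEvent class, the apply_changes function and the in-memory dictionary standing in for the target database are all hypothetical names chosen for illustration, not any product's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured source change (hypothetical shape, for illustration only)."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # full row image; not needed for deletes


def apply_changes(target: dict, events: list) -> int:
    """Apply captured changes to the target in order; return how many were applied."""
    applied = 0
    for ev in events:
        if ev.op in ("insert", "update"):
            # The row is copied as-is: replication does no transformation.
            target[ev.key] = dict(ev.row)
        elif ev.op == "delete":
            target.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op}")
        applied += 1
    return applied


# The target changes only when the source reports a change.
target = {}
events = [
    ChangeEvent("insert", 1, {"name": "Ada", "status": "new"}),
    ChangeEvent("update", 1, {"name": "Ada", "status": "active"}),
    ChangeEvent("insert", 2, {"name": "Bob", "status": "new"}),
    ChangeEvent("delete", 2),
]
applied = apply_changes(target, events)
```

If no events arrive, nothing is written to the target, which is the practical difference from a scheduled batch load that runs whether or not the source has changed.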
In a hybrid data integration platform, replication is best used in conjunction with other transformative processes that complement its speed and reliability. In the last installment we showed how customer data from multiple systems (CRM, delivery logistics, marketing, etc.) could be replicated to a staging area in near-real-time. A virtualization layer could then be built to present this staged data as a holistic view of the customer.
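The staging-plus-virtualization pattern can be sketched with Python's built-in sqlite3 module, using a SQL view as a stand-in for the virtualization layer. The table names (stg_crm, stg_logistics, stg_marketing), the view name customer_360 and the sample rows are assumptions made up for this example; a real virtualization product would federate across separate systems rather than a single database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging tables, each assumed to be fed by replication from one source system.
conn.executescript("""
CREATE TABLE stg_crm (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stg_logistics (customer_id INTEGER, last_delivery TEXT);
CREATE TABLE stg_marketing (customer_id INTEGER, segment TEXT);

-- The view plays the role of the virtual layer: no data is copied again.
CREATE VIEW customer_360 AS
SELECT c.customer_id, c.name, l.last_delivery, m.segment
FROM stg_crm c
LEFT JOIN stg_logistics l ON l.customer_id = c.customer_id
LEFT JOIN stg_marketing m ON m.customer_id = c.customer_id;
""")

# Hypothetical replicated rows.
conn.execute("INSERT INTO stg_crm VALUES (1, 'Ada')")
conn.execute("INSERT INTO stg_crm VALUES (2, 'Bob')")
conn.execute("INSERT INTO stg_logistics VALUES (1, '2024-01-15')")
conn.execute("INSERT INTO stg_marketing VALUES (1, 'loyal')")

rows = conn.execute(
    "SELECT * FROM customer_360 ORDER BY customer_id"
).fetchall()
```

Because the view is evaluated at query time, any row that replication lands in a staging table shows up in the customer view on the next read, with no additional load step.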
A common and growing use case for replication involves cloud data warehousing. While traditional ETL is used to feed a legacy on-premises physical data warehouse for operational reporting and core analytics, the same data can be replicated via CDC to a cloud data warehouse to reach a wider user audience and serve a broader base of analytic solutions. In this way, complementary data platforms are created from the same single source of truth (preferably a staged ODS).
In essence, replication fits best into low-latency data integration when it can be used to quickly move data from source to target for further transformation. Consider a scenario where replication is used to monitor customer activity and create real-time responses. CDC provides real-time "listening" to changes in account activity from multiple sources (web, mobile, third-party, etc.), and those changes update a single XML data store. This data store can then feed message queues that talk to a fulfilment application to generate incentives, rewards, etc. while the customer is still in the process of placing an order. Thus, replication goes hand-in-hand with real-time, actionable data streams that drive application activity.
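The scenario above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference design: a plain dictionary stands in for the XML data store, the standard library's queue.Queue stands in for a message broker, and the capture and fulfil functions, the 100.0 threshold and the "free-shipping" incentive are all hypothetical.

```python
import queue


def capture(store: dict, outbox: queue.Queue, change: dict) -> None:
    """Record one account change in the shared store and queue a message about it."""
    account = store.setdefault(change["account_id"], {"cart_total": 0.0})
    account["cart_total"] += change["amount"]
    account["last_channel"] = change["channel"]
    outbox.put({
        "account_id": change["account_id"],
        "cart_total": account["cart_total"],
    })


def fulfil(outbox: queue.Queue, threshold: float = 100.0) -> list:
    """Drain the queue and offer one incentive per account that reaches the threshold."""
    offers = []
    offered = set()
    while True:
        try:
            msg = outbox.get_nowait()
        except queue.Empty:
            break
        if msg["cart_total"] >= threshold and msg["account_id"] not in offered:
            offered.add(msg["account_id"])
            offers.append((msg["account_id"], "free-shipping"))
    return offers


store = {}
outbox = queue.Queue()

# Changes arriving from different channels for the same accounts.
changes = [
    {"account_id": "A1", "channel": "web", "amount": 60.0},
    {"account_id": "B2", "channel": "mobile", "amount": 20.0},
    {"account_id": "A1", "channel": "mobile", "amount": 50.0},
]
for change in changes:
    capture(store, outbox, change)

offers = fulfil(outbox)
```

Here account A1 crosses the threshold only after its web and mobile activity are combined in the shared store, which is why the changes are consolidated before the fulfilment step reads them.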
Part three of this series will be a deeper dive into real-time streaming data and its role in the broader, fully-integrated enterprise data architecture.
About the Author:
Joe Caparula is a Managing Partner at Pandata Group and works with clients on their data integration and business intelligence initiatives. Questions? Reach him at jcaparula@pandatagroup.com.