Last month I blogged about the virtues of having separate full-load vs. incremental-load packages when doing data warehousing. Well... that isn't the end-all of that argument...
When you design your data warehouse, if you take table partitioning into account (which I highly recommend, since most BI data is chronological), then you can usually get away with writing a single data flow task that achieves minimal logging even during incremental updates, simply by targeting a staging partition and then switching it in at the end of the data load.
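For that switch to be cheap, the staging table has to be aligned with the live fact table. Here is a minimal sketch of that setup in T-SQL; the table, function and scheme names (and partitioning on SaleID rather than a date column) are my assumptions, not anything from the actual package:

    -- Hypothetical partition function/scheme; boundaries would normally be generated, not hard-coded.
    CREATE PARTITION FUNCTION pfSales (bigint)
        AS RANGE RIGHT FOR VALUES (1000000, 2000000, 3000000);

    CREATE PARTITION SCHEME psSales
        AS PARTITION pfSales ALL TO ([PRIMARY]);

    -- The live fact table and its staging twin sit on the same scheme, so a loaded
    -- staging partition can later be switched into the live table as a metadata-only operation.
    CREATE TABLE dbo.FactVoucherSales (
        SaleID   bigint   NOT NULL,
        SaleDate datetime NOT NULL,
        Amount   money    NOT NULL
    ) ON psSales (SaleID);

    CREATE TABLE dbo.FactVoucherSales_Staging (
        SaleID   bigint   NOT NULL,
        SaleDate datetime NOT NULL,
        Amount   money    NOT NULL
    ) ON psSales (SaleID);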
In one such data flow, I needed to normalize the incoming data into three tables. The challenge is that you don't want to perform lookups on every record during a full load (since this slows the data flow down), but you do need those lookups to avoid duplicates during incremental loads.
A great pattern for solving this is to check whether the target table is empty before starting the data flow. Below, the "Get MaxSaleID" task runs a SQL script to put the row count of the target table into a variable named MaxSaleID...
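The query behind that task can be as simple as the sketch below; the fact table name is an assumption on my part, and the Execute SQL Task simply maps the single-row result onto the package variable User::MaxSaleID:

    -- Hypothetical "Get MaxSaleID" query: returns the row count of the target table,
    -- which the Execute SQL Task maps to the package variable User::MaxSaleID.
    SELECT COUNT_BIG(*) AS MaxSaleID
    FROM dbo.FactVoucherSales;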
... then in the data flow "Sales - Vouchers", a conditional split inspects the MaxSaleID variable and re-routes rows to bypass the lookup when there is guaranteed to be no match from the lookup component...
[Figure: Data flow for full and incremental uploads]
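For what it's worth, the split condition itself can be a one-liner. Assuming the variable is exposed to the data flow as User::MaxSaleID, an output condition along the lines of

    @[User::MaxSaleID] == 0

sends rows straight to the destination when the target table is known to be empty, while the default output carries rows through the lookup during incremental loads.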
The "Scaffold Voucher Sales" destination bulk inserts into a staging table which is subseuently switched in onto the tail end of thew live table.
A quick note about partitions: if you know that the source system you are pulling data from will always append new records and will never (or rarely) update old ones, then you can skip technologies like Change Data Capture or Change Tracking and simply re-import the last n-thousand records into a partition, swap out the old tail partition and swap in the new one. In general this results in a far leaner, simpler package and process than implementing complex data-change detection.
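As a sketch of that swap (again with hypothetical names): the old tail partition is switched out to a holding table, the data flow re-imports the last n-thousand records into staging, and the staging partition is switched in as shown above.

    -- Retire the current tail partition to a holding table on the same filegroup (metadata only).
    ALTER TABLE dbo.FactVoucherSales
        SWITCH PARTITION 4 TO dbo.FactVoucherSales_Old;

    -- ...re-run the data flow to bulk load the last n-thousand source rows into staging,
    -- then switch the staging partition back into the live table as shown earlier.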