SSIS - High Performance Data Warehouse Loading

In a previous post I mentioned how to achieve data transfers at near file-copy speeds using SSIS. In this follow-on, I’m going to show you how to put this information to practical use for quickly loading a data-warehouse.

Generally speaking, the idea behind data warehousing is to reduce the load on transactional systems by providing a central repository of which has been normalized in a manner that is ideal for drawing reports and analytics. Strategies for loading data warehouses typically fall into one of two categories:

1. Periodically resetting and repopulating the data warehouse during off-peak times,

2. Performing an initial data load, followed by on-going incremental loads.

Significant changes to any transactional systems usually necessitate changes to a data warehouse or, at least, changes to how the data warehouse is populated, which means that no matter which strategy you follow, you’ll benefit from being able to achieve fast full-reloads of the data warehouse…

Lobbying for separate packages

A common challenge in data warehousing is building a solution that performs well, can cope with increasing data volumes, and is reasonably maintainable. While it is arguably a matter of preference, I recommend creating separate packages for full reloads vs. incremental loads, because there are glaring differences in how you handle data in either scenario:

	Full Refresh	Incremental Load
Type 1 Slowly Changing Dimensions	Can be bulk loaded with minimal logging.	Should be checked – any change represents an error.
Type 2 Slowly Changing Dimensions	Can be bulk loaded with minimal logging, with historic information (if inferable from source data) being generated using common SSIS normalization tactics.	Must be verified against existing data and handled accordingly.
Fact Tables	Can be bulk loaded with minimal logging.	Can require some sort of “update if already loaded” logic (like SCD2). Can be bulk loaded, but not minimally logged.

The remainder of this post discusses tactics for achieving a full refresh as quickly as possible, with focus on lessons learned in the wild…

My previous post about fast, minimally logged data inserts mentioned having data sorted by a table’s primary key before it arrives at the SSIS destination (OLEDB destination or SQL Destination). Finding an appropriate means of sorting data really depends on your creativity, but there are usually some clever short cuts if you look for them…

Avoid the Merge component

There are plenty of woeful tales afloat on the net about the merge component. Particularly prominent in its list of evils is the fact that it doesn’t scale well… meaning that your package which works fine in the lab may choke and die in the wild. The reason behind this comes from two rather ugly characteristics:

· Merge must see at least one row in both inputs before it can release any output.
This makes sense – the output is a sorted union – a comparison must be made before a decision can be made on which row/s to output first.

· Merge does not respect down-stream throughput.
Unforgivably, the component will consume data as rapidly as possible at both inputs regardless of whether down-stream components are able to consume the output fast enough.

Both of the above create a situation where large volumes of data cause resource exhaustion. Although SSIS does page memory buffers out to disk, there are limits to how many memory buffers can be allocated; Even with plenty of free disk space, a 40GB data load can easily choke on a machine with 8GB RAM when a merge component is used.

Avoid the Sort component

Sorting all but the smallest sets of data inside the pipeline is madness, and leads to the same scalability problems that the merge component does. Instead, consider ways to get the data into the pipeline in an already-sorted (or semi-sorted) state,

Some tricks for achieving this include:

1) Use ORDER BY at source
If the source system has an appropriate index then simply use an ORDER BY clause when selecting data from the source database. Be sure to set the pipeline meta-data in SSIS to let SSIS know that the data is indeed sorted.

2) Use a Staging database
If you can’t sort the data at source, consider moving it into an intermediate table, then let an RDBMS engine perform the sort – this approach scales far better than using a Sort component in SSIS.

3) Look for the natural data order
SSIS doesn’t actually check whether your data is really sorted – it simply takes your word for it. It is common in transactional systems to date fiscal records, but to give them an auto-incrementing primary key (for SQL read: IDENTITY) – such that the natural order of the primary key correlates with the order of the fiscal date.

It is typical in data warehousing need source data ordered by date, not identity, but adding “ORDER BY Date” to the select query causes the source system to choke. In cases like these, it’s fine to order the data by its primary key, but tell SSIS that the data is ordered by date.

Avoid using temporary files for staging

For reading and writing files, SSIS supports two formats - RAW, which perform well, and Text/CSV which is more readable. Various SSIS gurus have posted about the performance advantages of Raw files over Text/CSV, but there tend to be relatively few postings about the tactical advantages/disadvantages of using either one.

In general, RAW files are easier to work with at either end of a pipeline, and unlike using a database, no special effort is required achieve very fast write speed for these files, making them a very tempting solution for intermediate data storage. I recommend giving a staging database preference to using temporary files, however, for the simple reason that staging files are inflexible, which detracts from the maintainability of your package… there is no SELECT/JOIN/WHERE/UNION capability across RAW files, and while this sort of can be achieved inside of SSIS, you will likely find scenarios (as pointed out above) where SSIS simply doesn’t handle large data as easily as a database will.

Use a staging database

A common misconception in data warehousing is that using a staging database is a necessary evil that slows down the overall loading process. At first glance this makes sense… after all - why move data twice when you can move it once?

The answer lies in the prerequisites for fast minimally logged inserts. Given that sorted data allows us to completely by-pass the transaction log and TempDB, it turns out that using a staging database to combine and sort source data can actually improve performance when compared to the alternatives – most particularly when any of the following are true:

· Source data is not naturally sorted per the target table’s key

· Multiple sources need to be combined

· Lookups/Joins are required on large data sets

If this advice seems counter intuitive, then consider the following performance characteristics of different versions of a package which achieves exactly the same goal – to combine 40GB from 5 tables and load them into a fact table with 13 dimension lookups:

Merge Components	Sort Components	RAW Files	Lookups Used?	Staging Tables?	Load Sorted Data?	Execution Time
0	6	8	16	0	Yes	(Out Of Memory)
7	0	8	16	0	Yes	(Out Of Memory)
3	0	2		6	Yes	4:33:01
0	0	8	16	0	No	6:15:32
0	0	0	16	8	No	5:56:47
0	0	0	7	8	Yes	2:01:58

The trick is to use a staging database to eliminate sort and merge components, as well as to eliminate as many lookups as possible and finally order data according to the target table’s primary key. As you can see, the difference in performance is dramatic.

Loading data from an Oracle source

Building Business Intelligence using the Microsoft BI stack in an organization that has an Oracle based transactional system is not uncommon, and here I'll outline a couple of tips and tricks that should ease the building of SSIS packages in that type of environment. Attunity Components Attunity offer several nice solutions for getitng data out of Oracle and into SQL - one of which is their CDC (Change Data Capture) offering . If you don't have the budget or stomach for setting up CDC in an oracle environment, then your next best bet is to use the free Attunity oracle connectors for SSIS , which demonstrate a measurable performance boost. The Oracle Index and Order By Gotcha In previous posts I mention the performance benefits of loading sorted data into target tables. I'm currently loading data from Oracle 10, which exhibits a rather strange characteristic... Oracle does not use make us of an index for ordering data . In other words, Oracle only uses ...

Mark White's BI Blog

Search This Blog

SSIS - High Performance Data Warehouse Loading

Labels

Comments

Popular posts from this blog

Reading Zip files in PowerQuery / M

Loading data from an Oracle source

Power Query: Transforming YYYYMM dates (the quick way)