Skip to main content

SSIS - High Performance Data Warehouse Loading

In a previous post I mentioned how to achieve data transfers at near file-copy speeds using SSIS. In this follow-on, I’m going to show you how to put this information to practical use for quickly loading a data-warehouse.
Generally speaking, the idea behind data warehousing is to reduce the load on transactional systems by providing a central repository of which has been normalized in a manner that is ideal for drawing reports and analytics. Strategies for loading data warehouses typically fall into one of two categories:

1.      Periodically resetting and repopulating the data warehouse during off-peak times,
2.      Performing an initial data load, followed by on-going incremental loads.

Significant changes to any transactional systems usually necessitate changes to a data warehouse or, at least, changes to how the data warehouse is populated, which means that no matter which strategy you follow, you’ll benefit from being able to achieve fast full-reloads of the data warehouse…

Lobbying for separate packages
A common challenge in data warehousing is building a solution that performs well, can cope with increasing data volumes, and is reasonably maintainable. While it is arguably a matter of preference, I recommend creating separate packages for full reloads vs. incremental loads, because there are glaring differences in how you handle data in either scenario:

Full Refresh
Incremental Load
Type 1 Slowly Changing Dimensions
Can be bulk loaded with minimal logging.
Should be checked – any change represents an error.
Type 2 Slowly Changing Dimensions
Can be bulk loaded with minimal logging, with historic information (if inferable from source data) being generated using common SSIS normalization tactics.
Must be verified against existing data and handled accordingly.
Fact Tables
Can be bulk loaded with minimal logging.
Can require some sort of “update if already loaded” logic (like SCD2). Can be bulk loaded, but not minimally logged.

The remainder of this post discusses tactics for achieving a full refresh as quickly as possible, with focus on lessons learned in the wild

My previous post about fast, minimally logged data inserts mentioned having data sorted by a table’s primary key before it arrives at the SSIS destination (OLEDB destination or SQL Destination). Finding an appropriate means of sorting data really depends on your creativity, but there are usually some clever short cuts if you look for them…
Avoid the Merge component
There are plenty of woeful tales afloat on the net about the merge component. Particularly prominent in its list of evils is the fact that it doesn’t scale well… meaning that your package which works fine in the lab may choke and die in the wild. The reason behind this comes from two rather ugly characteristics:
· Merge must see at least one row in both inputs before it can release any output.
This makes sense – the output is a sorted union – a comparison must be made before a decision can be made on which row/s to output first.
· Merge does not respect down-stream throughput.
Unforgivably, the component will consume data as rapidly as possible at both inputs regardless of whether down-stream components are able to consume the output fast enough.

Both of the above create a situation where large volumes of data cause resource exhaustion. Although SSIS does page memory buffers out to disk, there are limits to how many memory buffers can be allocated; Even with plenty of free disk space, a 40GB data load can easily choke on a machine with 8GB RAM when a merge component is used.
Avoid the Sort component
Sorting all but the smallest sets of data inside the pipeline is madness, and leads to the same scalability problems that the merge component does. Instead, consider ways to get the data into the pipeline in an already-sorted (or semi-sorted) state,
Some tricks for achieving this include:

1) Use ORDER BY at source
If the source system has an appropriate index then simply use an ORDER BY clause when selecting data from the source database. Be sure to set the pipeline meta-data in SSIS to let SSIS know that the data is indeed sorted.

2) Use a Staging database
If you can’t sort the data at source, consider moving it into an intermediate table, then let an RDBMS engine perform the sort – this approach scales far better than using a Sort component in SSIS.

3) Look for the natural data order
SSIS doesn’t actually check whether your data is really sorted – it simply takes your word for it. It is common in transactional systems to date fiscal records, but to give them an auto-incrementing primary key (for SQL read: IDENTITY) – such that the natural order of the primary key correlates with the order of the fiscal date.

It is typical in data warehousing need source data ordered by date, not identity, but adding “ORDER BY Date” to the select query causes the source system to choke. In cases like these, it’s fine to order the data by its primary key, but tell SSIS that the data is ordered by date.
Avoid using temporary files for staging
For reading and writing files, SSIS supports two formats - RAW, which perform well, and Text/CSV which is more readable. Various SSIS gurus have posted about the performance advantages of Raw files over Text/CSV, but there tend to be relatively few postings about the tactical advantages/disadvantages of using either one.

In general, RAW files are easier to work with at either end of a pipeline, and unlike using a database, no special effort is required achieve very fast write speed for these files, making them a very tempting solution for intermediate data storage. I recommend giving a staging database preference to using temporary files, however, for the simple reason that staging files are inflexible, which detracts from the maintainability of your package… there is no SELECT/JOIN/WHERE/UNION capability across RAW files, and while this sort of can be achieved inside of SSIS, you will likely find scenarios (as pointed out above) where SSIS simply doesn’t handle large data as easily as a database will.
Use a staging database
A common misconception in data warehousing is that using a staging database is a necessary evil that slows down the overall loading process. At first glance this makes sense… after all - why move data twice when you can move it once?
The answer lies in the prerequisites for fast minimally logged inserts. Given that sorted data allows us to completely by-pass the transaction log and TempDB, it turns out that using a staging database to combine and sort source data can actually improve performance when compared to the alternatives – most particularly when any of the following are true:

· Source data is not naturally sorted per the target table’s key
· Multiple sources need to be combined
· Lookups/Joins are required on large data sets

If this advice seems counter intuitive, then consider the following performance characteristics of different versions of a package which achieves exactly the same goal – to combine 40GB from 5 tables and load them into a fact table with 13 dimension lookups:

Merge Components
Sort Components
RAW Files
Lookups Used?
Staging Tables?
Load Sorted Data?
Execution Time
0
6
8
16
0
Yes
(Out Of Memory)
7
0
8
16
0
Yes
(Out Of Memory)
3
0
2
6
Yes
4:33:01
0
0
8
16
0
No
6:15:32
0
0
0
16
8
No
5:56:47
0
0
0
7
8
Yes
2:01:58

The trick is to use a staging database to eliminate sort and merge components, as well as to eliminate as many lookups as possible and finally order data according to the target table’s primary key. As you can see, the difference in performance is dramatic.
Post a Comment

Popular posts from this blog

Reading Zip files in PowerQuery / M

Being a fan of PowerBI, I recently looked for a way to read zip files directly into the Data Model, and found this blog which showed a usable technique. Inspired by the possibilities revealed in Ken's solution, but frustrated by slow performance, I set out to learn the M language and write a faster alternative.
UnzipContents The result of these efforts is an M function - UnzipContents - that you can paste into any PowerBI / PowerQuery report. It takes the contents of a ZIP file, and returns a list of files contained therein, along with their decompressed data:



If you're not sure how to make this function available in your document, simply:

Open up PowerQuery (either in Excel or in PowerBI)Create a new Blank Query.Open up the Advanced Editor  (found on the View tab in PowerBI).Copy-Paste the above code into the editor, then close the editor.In the properties window, rename the the function to UnzipContents Usage Using the function is fairly straight forward: Choose "New Quer…

Easily Move SQL Tables between Filegroups

Recently during a Data Warehouse project, I had the need to move many tables to a new file group. I didn't like any of the solutions that I found on Google, so decided to create on of my own. The result?

MoveTablesToFilegroupClick here for a nifty stored proc allows you to easily move tables, indexes, heaps and even LOB data to different filegroups without breaking a sweat. To get going, copy-paste the code below into Management Studio, and then run it to create the needed stored procedure.
Hopefully the arguments are self explanatory, but here are some examples:

1. Move all tables, indexes and heaps, from all schemas into the filegroup named SECONDARY:
EXEC dbo.sp_MoveTablesToFileGroup
@SchemaFilter = '%',-- chooses schemas using the LIKE operator
@TableFilter  = '%',-- chooses tables using the LIKE operator
@DataFileGroup = 'SECONDARY',-- The name of the filegroup to move index and in-row data to.
@ClusteredIndexes = 1,-- 1 means "Move all clustered inde…

SQL Server vs Azure Data Warehouse vs Netezza vs Redshift

The Great Database Shoot Out In Jan'17, I teamed up with Marc van der Zon (Data Scientist), to test and compare several database technologies for our organization's future analytics and BI platform. The technologies that made the shortlist were:
SQL Server, because it is the organization's existing BI platform.Azure Data Warehouse, because of its high similarity to SQL Server.Amazon's Redshift, because of its attractive cost, scalability and performance.Netezza, because it is anaffordable on-prem appliance that performs well. Considerations We were primarily looking for the best bang-for-buck option, so price and performance were viewed as more important than how feature complete any particular option was. That said, what we regarded as important included: A shallow learning curve for BI developers (i.e. no need for expert skills in order to get good performance)
Ability to re-use existing code  (i.e. ANSI-92 SQL compatibility)
Time-to-solution.   (i.e. does the platform …