In a previous post I mentioned how to achieve data transfers at near file-copy speeds using SSIS. In this follow-on, I’m going to show you how to put that information to practical use for quickly loading a data warehouse.
Generally speaking, the idea behind data warehousing is to reduce the load on transactional systems by providing a central repository of data which has been normalized in a manner that is ideal for drawing reports and analytics. Strategies for loading data warehouses typically fall into one of two categories:
1. Periodically resetting and repopulating the data warehouse during off-peak times, or
2. Performing an initial data load, followed by on-going incremental loads.
Significant changes to any transactional system usually necessitate changes to a data warehouse or, at least, changes to how the data warehouse is populated, which means that no matter which strategy you follow, you’ll benefit from being able to achieve fast full reloads of the data warehouse…
Lobbying for separate packages
A common challenge in data
warehousing is building a solution that performs well, can cope with increasing
data volumes, and is reasonably maintainable. While it is arguably a matter of
preference, I recommend creating separate packages for full reloads vs.
incremental loads, because there are glaring differences in how you handle data
in either scenario:
|  | Full Refresh | Incremental Load |
| --- | --- | --- |
| Type 1 Slowly Changing Dimensions | Can be bulk loaded with minimal logging. | Should be checked – any change represents an error. |
| Type 2 Slowly Changing Dimensions | Can be bulk loaded with minimal logging, with historic information (if inferable from source data) being generated using common SSIS normalization tactics. | Must be verified against existing data and handled accordingly. |
| Fact Tables | Can be bulk loaded with minimal logging. | Can require some sort of “update if already loaded” logic (like SCD2). Can be bulk loaded, but not minimally logged. |
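To make the fact-table row concrete, here is a minimal T-SQL sketch of the difference between the two strategies. All object names (FactSales, stg.Sales, SalesKey and so on) are hypothetical, the staging table is assumed to already hold the incoming rows, and minimal logging additionally assumes a simple or bulk-logged recovery model:

```sql
-- Full refresh: empty the fact table, then bulk load it.
-- With a TABLOCK hint (and a simple/bulk-logged recovery model)
-- the insert can be minimally logged.
TRUNCATE TABLE dbo.FactSales;

INSERT INTO dbo.FactSales WITH (TABLOCK)
      (SalesKey, DateKey, CustomerKey, Amount)
SELECT  SalesKey, DateKey, CustomerKey, Amount
FROM    stg.Sales
ORDER BY SalesKey;   -- pre-sorted by the clustered key

-- Incremental load: "update if already loaded" logic.
-- These statements are fully logged, which is why the table
-- above notes that incremental fact loads cannot be minimally
-- logged.
UPDATE  f
SET     f.Amount = s.Amount
FROM    dbo.FactSales AS f
JOIN    stg.Sales     AS s ON s.SalesKey = f.SalesKey;

INSERT INTO dbo.FactSales (SalesKey, DateKey, CustomerKey, Amount)
SELECT  s.SalesKey, s.DateKey, s.CustomerKey, s.Amount
FROM    stg.Sales AS s
WHERE   NOT EXISTS (SELECT 1 FROM dbo.FactSales AS f
                    WHERE f.SalesKey = s.SalesKey);
```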
The remainder of this post discusses tactics for achieving a full refresh as quickly as possible, with a focus on lessons learned in the wild…
My previous post about fast, minimally logged data inserts mentioned having data sorted by a table’s primary key before it arrives at the SSIS destination (OLE DB Destination or SQL Server Destination). Finding an appropriate means of sorting data really depends on your creativity, but there are usually some clever shortcuts if you look for them…
Avoid the Merge component
There are plenty of woeful tales afloat on the net about the Merge component. Particularly prominent in its list of evils is the fact that it doesn’t scale well… meaning that your package which works fine in the lab may choke and die in the wild. The reason behind this comes from two rather ugly characteristics:

· Merge must see at least one row in both inputs before it can release any output. This makes sense – the output is a sorted union, so a comparison must be made before a decision can be made on which row(s) to output first.

· Merge does not respect down-stream throughput. Unforgivably, the component will consume data as rapidly as possible at both inputs regardless of whether down-stream components are able to consume the output fast enough.
Both of the above create a situation where large volumes of data cause resource exhaustion. Although SSIS does page memory buffers out to disk, there are limits to how many memory buffers can be allocated; even with plenty of free disk space, a 40GB data load can easily choke on a machine with 8GB RAM when a Merge component is used.
Avoid the Sort component
Sorting all but the smallest sets of data inside the pipeline is madness, and leads to the same scalability problems that the Merge component does. Instead, consider ways to get the data into the pipeline in an already-sorted (or semi-sorted) state. Some tricks for achieving this include:
1) Use ORDER BY at source

If the source system has an appropriate index then simply use an ORDER BY clause when selecting data from the source database, as shown in the sketch after this list. Be sure to set the pipeline meta-data in SSIS to let SSIS know that the data is indeed sorted.
2) Use a Staging database

If you can’t sort the data at source, consider moving it into an intermediate table, then let an RDBMS engine perform the sort – this approach scales far better than using a Sort component in SSIS.
3) Look for the natural data order

SSIS doesn’t actually check whether your data is really sorted – it simply takes your word for it. It is common in transactional systems to date fiscal records, but to give them an auto-incrementing primary key (for SQL Server read: IDENTITY) – such that the natural order of the primary key correlates with the order of the fiscal date.
It is typical in data warehousing to need source data ordered by date, not identity, but adding “ORDER BY Date” to the select query can cause the source system to choke. In cases like these, it’s fine to order the data by its primary key, but tell SSIS that the data is ordered by date.
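As a concrete sketch of tricks 1 and 3 (the table and column names are hypothetical): the source query orders by the IDENTITY primary key, which the source system can satisfy cheaply, and SSIS is then told the output is sorted by the fiscal date that the key correlates with. The sort metadata isn’t set in SQL – it lives on the source component’s output, via the IsSorted and SortKeyPosition properties in the Advanced Editor:

```sql
-- Cheap for the source system: the clustered primary key
-- (an IDENTITY column) already stores rows in this order,
-- so no sort operation is needed.
SELECT  InvoiceID,      -- IDENTITY primary key
        InvoiceDate,    -- fiscal date; correlates with InvoiceID
        CustomerID,
        Amount
FROM    dbo.Invoice
ORDER BY InvoiceID;

-- In the SSIS data flow (Advanced Editor on the source):
--   Output "OLE DB Source Output": IsSorted = True
--   Column InvoiceDate: SortKeyPosition = 1
-- SSIS takes your word for it - it never verifies the order.
```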
Avoid using temporary files for staging
For reading and writing files, SSIS supports two formats – RAW, which performs well, and Text/CSV, which is more readable. Various SSIS gurus have posted about the performance advantages of RAW files over Text/CSV, but there tend to be relatively few postings about the tactical advantages/disadvantages of using either one.
In general, RAW files are easier to work with at either end of a pipeline, and unlike using a database, no special effort is required to achieve very fast write speeds for these files, making them a very tempting solution for intermediate data storage. I recommend giving a staging database preference over temporary files, however, for the simple reason that staging files are inflexible, which detracts from the maintainability of your package… there is no SELECT/JOIN/WHERE/UNION capability across RAW files, and while something of the sort can be achieved inside SSIS, you will likely find scenarios (as pointed out above) where SSIS simply doesn’t handle large data as easily as a database will.
Use a staging database
A common misconception in data
warehousing is that using a staging database is a necessary evil that slows
down the overall loading process. At first glance this makes sense… after all -
why move data twice when you can move it once?
The answer lies in the prerequisites for fast, minimally logged inserts. Given that sorted data allows us to completely bypass the transaction log and TempDB, it turns out that using a staging database to combine and sort source data can actually improve performance when compared to the alternatives (a sketch follows the list below) – most particularly when any of the following are true:

· Source data is not naturally sorted per the target table’s key

· Multiple sources need to be combined

· Lookups/Joins are required on large data sets
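Here is a minimal sketch of the combine-and-sort step. The database, table, and column names are hypothetical, and a simple or bulk-logged recovery model on the staging database is assumed:

```sql
-- Staging database: combine several sources into one staging
-- table. SELECT...INTO is a bulk operation and is minimally
-- logged under the simple or bulk-logged recovery model.
SELECT  a.OrderID, a.OrderDate, a.CustomerID, a.Amount
INTO    Staging.dbo.CombinedOrders
FROM    Staging.dbo.OrdersSystemA AS a
UNION ALL
SELECT  b.OrderID, b.OrderDate, b.CustomerID, b.Amount
FROM    Staging.dbo.OrdersSystemB AS b;

-- Clustering the staging table on the target table's key lets
-- the database engine do the sorting, so the SSIS pipeline can
-- later read the rows back in target-key order with a plain
-- ORDER BY and no Sort component.
CREATE CLUSTERED INDEX IX_CombinedOrders_OrderID
    ON Staging.dbo.CombinedOrders (OrderID);
```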
If this advice seems counterintuitive, then consider the following performance characteristics of different versions of a package which achieves exactly the same goal – to combine 40GB of data from 5 tables and load it into a fact table with 13 dimension lookups:
| Merge Components | Sort Components | RAW Files | Lookups Used | Staging Tables | Load Sorted Data? | Execution Time |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 8 | 16 | 0 | Yes | (Out Of Memory) |
| 7 | 0 | 8 | 16 | 0 | Yes | (Out Of Memory) |
| 3 | 0 | 2 | 6 |  | Yes | 4:33:01 |
| 0 | 0 | 8 | 16 | 0 | No | 6:15:32 |
| 0 | 0 | 0 | 16 | 8 | No | 5:56:47 |
| 0 | 0 | 0 | 7 | 8 | Yes | 2:01:58 |
The trick is to use a staging database to eliminate Sort and Merge components, to eliminate as many lookups as possible, and finally to order data according to the target table’s primary key. As you can see, the difference in performance is dramatic.
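To tie the pieces together, here is a sketch of what the final load in the fastest configuration can look like – joins in the staging database resolve the dimension surrogate keys in place of SSIS Lookup components, and the rows arrive pre-sorted by the fact table’s clustered key. All object names are hypothetical:

```sql
-- Resolve surrogate keys with set-based joins instead of SSIS
-- lookups, and feed the destination rows already ordered by
-- the fact table's clustered key so the insert can be bulk
-- loaded with minimal logging.
INSERT INTO DW.dbo.FactOrders WITH (TABLOCK)
      (OrderID, DateKey, CustomerKey, Amount)
SELECT  o.OrderID,
        d.DateKey,      -- join replaces a date-dimension lookup
        c.CustomerKey,  -- join replaces a customer lookup
        o.Amount
FROM    Staging.dbo.CombinedOrders AS o
JOIN    DW.dbo.DimDate     AS d ON d.CalendarDate = o.OrderDate
JOIN    DW.dbo.DimCustomer AS c ON c.CustomerID   = o.CustomerID
ORDER BY o.OrderID;     -- matches FactOrders' clustered key
```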