
Posts

Record Versions in ETL staging areas

Ever wanted to be able to use the OLEDB Destination component in SSIS to load staging tables and have the underlying data versioned for you? If so, here's a solution which is working very well here in a production environment. Below are some scripts which implement automatic triggers on staging tables so that whenever a newly loaded record collides with the primary key of an existing record, the existing record is quietly moved out into an archive table. In practice, you create two schemas: Staging, which contains all staging tables, and Archive, which contains all archive (historic) tables. When two tables with the same name are detected in these schemas, triggers are placed on the table in the Staging schema to handle moving old data into the archive table. The caveat here is that data is only preserved for columns where the column name and type are identical in both the staging and archive tables. There are two scripts that make al...
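The scripts in the post generate the triggers automatically from the catalog; purely as a sketch of the shape of trigger they produce (Staging.Customer, Archive.Customer and their columns are made up here), it might look something like this:

-- Sketch only: one guess at the trigger shape the generator scripts produce.
-- Staging.Customer / Archive.Customer and the column list are hypothetical.
CREATE TRIGGER Staging.trg_Customer_Archive
ON Staging.Customer
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Copy the rows that are about to be replaced into the archive table.
    INSERT INTO Archive.Customer (CustomerId, CustomerName, Region)
    SELECT s.CustomerId, s.CustomerName, s.Region
    FROM Staging.Customer AS s
    JOIN inserted AS i ON i.CustomerId = s.CustomerId;

    -- Remove the colliding rows from staging...
    DELETE s
    FROM Staging.Customer AS s
    JOIN inserted AS i ON i.CustomerId = s.CustomerId;

    -- ...then let the newly loaded rows through.
    INSERT INTO Staging.Customer (CustomerId, CustomerName, Region)
    SELECT CustomerId, CustomerName, Region
    FROM inserted;
END;

The actual scripts work out the column lists for themselves from whichever columns the two tables have in common - hence the caveat above about matching names and types.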

Conforming Primary Key Names

It's nice to be able to predict the name of a primary key constraint based on the name of the table it constrains, but oftentimes during the design of a database we simply don't have time to worry about such things. Below is a quick T-SQL script which helps to modify all primary key constraints so that they are named according to the table they belong to... e.g. the table Dim.Sales will have its primary key renamed to PK_Dim_Sales. To use the script, paste it into Management Studio, then: press CTRL+T to have the output go to text, press F5 to run the script, cut and paste the query results into a new query window, and press F5 again to run the generated query.
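The script in the post isn't reproduced here, but the idea is simply to read sys.key_constraints and emit one sp_rename statement per primary key; a minimal sketch along those lines (the post's version may differ in detail):

-- Minimal sketch: emit an sp_rename for every primary key whose name
-- doesn't already follow the PK_<schema>_<table> convention.
SELECT 'EXEC sp_rename N''' + s.name + '.' + kc.name + ''', '
     + 'N''PK_' + s.name + '_' + t.name + ''', ''OBJECT'';'
FROM sys.key_constraints AS kc
JOIN sys.tables  AS t ON t.object_id = kc.parent_object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE kc.type = 'PK'
  AND kc.name <> 'PK_' + s.name + '_' + t.name;

Running the generated output then performs the actual renames.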

SSIS Script for handling bad dates

Oracle and SSIS both have the ability to handle dates which SQL Server's DateTime type won't accept. A reasonably simple solution is to leverage the script component to detect and repair the invalid dates... here's a step-by-step guide for doing exactly that. Add a script component to your data flow, in transform mode. Open the editor for the script component. Under Input Columns, select all the date columns you want to repair, and set their usage type to READ/WRITE (this is vital, else the script will break). Under Script, click the edit script button. Select and delete all text. Paste the C# code from here. Close the source code window and click OK to dismiss the script editor. Run the data flow. The script will efficiently detect dates with bad years and attempt to correct them to something which SQL finds palatable. Give it a whirl and let me know what you think!
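The C# itself is linked from the post rather than reproduced here, but the essence of the check is that SQL Server's DATETIME only accepts years from 1753 onwards, while Oracle (and the SSIS pipeline) happily carry earlier dates. As a rough T-SQL illustration of the same repair (the replacement value is an assumption; the post's C# may substitute something else):

-- Rough T-SQL illustration of the repair the script component performs in C#.
-- Assumptions: the incoming column arrives as datetime2 (which allows years
-- 0001-9999) and out-of-range values are clamped to DATETIME's minimum.
SELECT CASE
           WHEN OracleDate < '17530101' THEN CAST('17530101' AS datetime)
           ELSE CAST(OracleDate AS datetime)
       END AS RepairedDate
FROM Staging.OracleExtract;   -- hypothetical table and column names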

Loading data from an Oracle source

Building Business Intelligence using the Microsoft BI stack in an organization that has an Oracle-based transactional system is not uncommon, and here I'll outline a couple of tips and tricks that should ease the building of SSIS packages in that type of environment. Attunity Components: Attunity offers several nice solutions for getting data out of Oracle and into SQL - one of which is their CDC (Change Data Capture) offering. If you don't have the budget or stomach for setting up CDC in an Oracle environment, then your next best bet is to use the free Attunity Oracle connectors for SSIS, which demonstrate a measurable performance boost. The Oracle Index and Order By Gotcha: In previous posts I mention the performance benefits of loading sorted data into target tables. I'm currently loading data from Oracle 10, which exhibits a rather strange characteristic... Oracle does not make use of an index for ordering data. In other words, Oracle only uses ...

Using SSIS with SQL partitions

Last month I blogged about the virtues of having separate full-load vs. incremental-load packages when doing data warehousing. Well... that isn't the final word on that argument... When you design your data warehouse, if you take table partitioning into account (which I highly recommend, since most BI data is chronological), then you can usually get away with writing a single data flow task which achieves minimal logging even during incremental updates, by simply targeting a staging partition and then switching it in at the end of the data load. In one such data flow, I needed to be able to normalize the incoming data into three tables. The challenge comes in that you don't want to perform lookups on every record when you're doing a full load (since this slows the data flow down), but you do need to do those lookups to avoid duplicates during incremental loads. A great pattern for solving this is to check whether the target table is empty before starting the data fl...
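The switch-in at the end of the load is a single metadata operation, which is what keeps the incremental pattern cheap. As a sketch (schema, table and partition names here are hypothetical, and the staging table must match the target's structure, indexes and check constraints for the switch to be allowed):

-- Switch the freshly loaded staging table into one partition of the fact
-- table (names and partition number are hypothetical).
ALTER TABLE Staging.FactSales
SWITCH TO dbo.FactSales PARTITION 42;

-- The partition number for a given load date can be looked up from the
-- partition function (pfSalesDate is a hypothetical function name).
SELECT $PARTITION.pfSalesDate('20100101');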

Threading in SSIS Script component

Alberto Ferrari writes many good blogs on SQL and SSIS. In one of his old posts he mentions the presence of a bug in the SSIS script component, where the output buffer is marked as completed as soon as the ProcessInput method returns from the last row of the last buffer. This is not a bug, and if you think about it, it's actually correct for the stream to be marked as completed once the component returns from processing it. The reason is that when synchronous outputs are used, the output buffer is the input buffer - i.e. they're the same block of RAM - so when the component returns from processing the buffer, that buffer is passed on to the next downstream component... that is... SSIS does not wait for other threads in your component to complete their work before passing the buffer on. Not only is this not a bug, but it's actually a performance booster and, in general, a really good thing. The solution to Alberto's problem is reasonably simple, and s...

Much Needed Improvements To SSIS

SSIS has been my midnight mistress for months now, and at the risk of sounding off-color I've compiled a short list of things which ought to be changed in the next release of SSIS: 1) The diagram auto-layout desperately needs to be rethought and replaced. 2) The Data Viewer should at least allow "live vs. page-at-a-time" modes, and showing bottleneck indicators on pipelines would be at least as useful as showing the row count. 3) Manual tuning of buffers should either go away (replaced by an automated solution), or at the very least be augmented with debug helpers and visual feedback. 4) The Lookup and SCD components need to be binned and replaced with components that add more flexibility in terms of how lookup data is retrieved and how matches are performed (range-value matching would be a great start), allow for multiple joins (lookup), and enable sliding-window lookups (as opposed to partial caching). 5) Exposing only the most useless component properties to the data-flow'...