July 29, 2014

Virtuous Designs for Tabular Data Modelling

Much of the beauty of Microsoft's Tabular model is the apparent ability to escape the weeks of star-schema modelling that are commonplace with OLAP cubes. While Tabular can be blazingly fast, both to develop models for and to query, the performance of the VertiPaq engine varies massively depending on how you present your data to it.

Below are several data modelling patterns you are likely to encounter:


The Monolithic table design involves joining all source tables together into a single denormalized representation. Tabular is able to group / aggregate and filter rows easily in this model, so while care needs to be taken when writing DAX expressions, the resulting cube will perform well.
Pros:
  • Easy to get started. 
  • Performs well.
Cons:
  • DAX expressions trickier to write. 
  • Cube loading times may suffer. 
  • Only similar-grained data can be accommodated.


When facts are derived from disparate sources, a monolithic design is not practical. In this case, multiple fact tables can be conformed by presenting a façade of filtering tables. Unlike traditional OLAP dimensions, these tables do not need to present surrogate keys – only the union of unique column values that appear in the joined fact tables.
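
As a rough illustration of that idea (in Python rather than DAX, and with made-up table and column names), a filtering table can be thought of as the union of the distinct values found in each fact table:

```python
# Two hypothetical fact tables from different sources, each carrying a "region" column.
sales_facts = [
    {"region": "North", "sales": 120},
    {"region": "South", "sales": 80},
    {"region": "North", "sales": 45},
]
claims_facts = [
    {"region": "South", "claims": 3},
    {"region": "West", "claims": 7},
]


def build_filter_table(column, *fact_tables):
    """Return the sorted union of distinct values of `column` across all fact tables."""
    values = set()
    for table in fact_tables:
        values.update(row[column] for row in table)
    return sorted(values)


region_filter = build_filter_table("region", sales_facts, claims_facts)
print(region_filter)  # ['North', 'South', 'West']
```

No surrogate key is involved: the natural values themselves are what each fact table would relate to.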

Pros:
  • Easy to implement.
  • Fact tables perform relatively well.
Cons:
  • Quickly becomes messy due to all the small filter tables, which degrades the user experience.



The end-user experience can, in some cases, be improved by hiding facts behind a chain of filtering tables. When this chain of tables presents behavior that is consistent with an end-user’s understanding of the business, the model becomes easier to consume.

Such chains usually perform moderately when compared to other virtuous patterns, but can provide an ideal end-user experience.

Pros:
  • Balance between performance and end user experience.
Cons:
  • Harder to implement – careful planning needed to figure out how to push each filtering column into just one place upstream of its facts.
  • Is only useful when calculations only rely on one fact table at a time – i.e. where the formula engine can convert the entire expression into an “inner joined” SQL statement.

There are also several design patterns that, with few exceptions, produce poorer performance than their virtuous counterparts:



Avoid designs where the same filter(able) columns appear in multiple places. Inevitably, the business will ask a question that involves both fact tables. The filter can only be (easily) applied to one table or the other, but not both.
Options:
  • Add a common filtering table, or
  • Combine (union) the facts into a monolithic table.

While many-to-many relationships are possible in Analysis Services, they rarely perform well since they are often bridges between large tables with millions of rows.

Alternatives to consider:
  • Combine all of the fact tables.
  • Combine two of the fact tables into a child of the remaining table.


Patterns that rely on large (>1m rows) tables to facilitate joins between facts perform poorly.
The reason for this is that the formula engine cannot resolve the joins in a single query to the storage engine, and so instead it:
  1. Uses the filter to query all keys from the large intermediate table.
  2. Passes the keys back to the storage engine to get the first set of facts.
Characteristics of this pattern:
  • Extremely slow measures (minutes instead of milliseconds)
  • Memory exhaustion failures.
Alternatives to consider:
  • Combine the fact tables.
  • Create a third [combined] fact table.
  • Join the fact tables directly to the filtering table.

For similar reasons to the “large intermediate table” design, using filtered child tables to filter parent tables can result in lacklustre performance, stemming from how the formula engine resolves the query.
Alternatives to consider:
  • Link the parent table directly to the small filtering dimension.
  • Combine the child and parent table.

September 18, 2013

Cloning SQL tables

Plenty of folks have blogged about various techniques for cloning tables in SQL Server, and for good reason... during data loading and data processing it's very useful to be able to build one table while simultaneously reporting off of another. When the processing of the new table is completed, it can be switched in to replace the data of the old table.

To simplify the creation of a build table, I've written a stored procedure which will take any table and clone it and its indexes:




July 19, 2013

Revisiting Earned Premium


In a previous post about earned premium, I outlined how you could calculate a monetary value based on a period of time over which it was earned using DAX.

Serendipitously, the next day a colleague forwarded Alberto Ferrari's paper on understanding DAX query plans, and after giving it a thorough read I fired up the query profiler and set out to optimize our calculated measure for earned premium.

Alberto's paper details a performant solution to the classic events in progress problem, of which earned premium is a close cousin. My excitement at lazily shoplifting Alberto's work came to a grinding halt when I discovered that his 40ms Jedi solution only worked if data was queried at a specific granularity. This wasn't going to cut it... we need an earned premium measure that works at any level of aggregation. Back to the drawing board.

It turns out that much of Alberto's advice is (as always) really valuable. While I strongly recommend reading Alberto's paper, here's the cheat sheet for optimizing any DAX calculated measure:
  1. Help the Formula Engine (FE) to push the heavy lifting down to the Storage Engine (SE).
    FE is single-threaded and non-caching, whereas SE is multithreaded and can cache results.
  2. Avoid complex / inequality predicates that cause SE to call back to the FE.
    This not only slows down data retrieval, but also prevents SE from caching results.
Our original measure used inequality predicates in the expression:
     'Premium'[Start Date] <= LASTDATE('Date'[Date])
  && 'Premium'[End Date] >= FIRSTDATE('Date'[Date])

... which forces SE to call back to FE, slowing down the calculation somewhat.

To recap:
Earned Premium  = Amount Paid * Days In Current Period / Total Days of Cover

Which we can calculate as follows:
Earned Premium:=
SUMX (
    SUMMARIZE (
        'Premium',
        'Premium'[Start Date],
        'Premium'[End Date],
        "Earned Premium",
        -- Total premium paid for this distinct start / end period
        SUM ( 'Premium'[Amount Paid] )
            -- Days of cover that are also in the current Date filter
            * COUNTROWS (
                CALCULATETABLE (
                    'Date',
                    KEEPFILTERS (
                        DATESBETWEEN ( 'Date'[Date], 'Premium'[Start Date], 'Premium'[End Date] )
                    )
                )
            )
            -- Total days of cover, regardless of the current Date filter
            / COUNTROWS (
                DATESBETWEEN ( 'Date'[Date], 'Premium'[Start Date], 'Premium'[End Date] )
            )
    ),
    [Earned Premium]
)


With this formula, we're able to calculate earned premium for 22 million records across 84 months in 2.2 seconds. Happy days.

To explain the calculation:
  • The SUMMARIZE function groups all premium by start & end date.
  • We project "Earned Premium" as the sum of all premium for a distinct start & end period.
  • The summed premium is multiplied by the number of days (records) in the distinct period that are also present in the period currently being observed in the resulting cell. KEEPFILTERS applies this filtering for us - it effectively says "filter out days that aren't in the current month / quarter / year filter being applied to the Date table".
  • The result is divided by the number of days occurring in the distinct period, regardless of filtering.
  • Finally, SUMX adds up all of the summarized values.
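
One way to sanity-check the "works at any level of aggregation" requirement outside of DAX is to confirm that the monthly figures add up to the yearly figure. The sketch below is Python rather than DAX, uses a made-up policy, and assumes a Date table containing every calendar day, so counting Date rows is the same as counting calendar days inclusively:

```python
import calendar
from datetime import date


def earned(amount, cover_start, cover_end, period_start, period_end):
    """Amount * days of cover inside the period / total days of cover (inclusive counts)."""
    total_days = (cover_end - cover_start).days + 1
    start = max(cover_start, period_start)
    end = min(cover_end, period_end)
    overlap_days = max((end - start).days + 1, 0)  # 0 when the periods do not overlap
    return amount * overlap_days / total_days


# Hypothetical policy: 365.00 paid for cover from 10 Mar 2013 to 9 Mar 2014 (365 days).
policy = (365.0, date(2013, 3, 10), date(2014, 3, 9))

# Earned premium for each month of 2013, and for 2013 as a whole.
monthly = [
    earned(*policy, date(2013, m, 1), date(2013, m, calendar.monthrange(2013, m)[1]))
    for m in range(1, 13)
]
yearly = earned(*policy, date(2013, 1, 1), date(2013, 12, 31))

print(sum(monthly), yearly)  # both 297.0: 297 days of cover fall in 2013
```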

July 15, 2013

DAX and Insurance's Earned Premium problem

In the world of short term insurance, "Earned Premium" is a common BI metric. In its simplest form, an amount of money is earned as time elapses through a period of cover. On any given day, you will have earned some, all or none of the premium paid.

Using DAX calculations, solving earned premium turns out to be both easy and efficient, and in this post I'll show you how to do it. Before you start, download and open this Excel 2013 file. Make sure that you have the PowerPivot add-in enabled in Excel.

If you open the PowerPivot cube, you'll find two tables, the first being a regular "Date" table used to represent the hierarchy of days, months, years etc. The second "Premium" table contains the data that's of interest, and to keep things simple, I've only included 4 columns:

  • Product Line - describes a type of insurance product.
  • Amount - the amount of premium paid by a policy holder for a given period of cover.
  • Start Date - the date when insurance cover starts for the premium paid.
  • End Date - the date after which insurance cover ends for the premium paid.

The DAX calculation, Earned Premium, is where it gets interesting. The layman's calculation for earned premium is:

Earned Premium  = Amount Paid * Days In Current Period / Total Days of Cover

For example, if you paid $100 for 1 year of insurance, then in the month of March you will earn $8.49:

Earned Premium = $100 * 31 (Days in March)  / 365 (Days of insurance purchased)
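
For readers who want to check the arithmetic, here is the same layman's formula as a small Python sketch (the function and argument names are mine, not part of the workbook):

```python
def earned_premium(amount_paid, days_in_current_period, total_days_of_cover):
    """Layman's formula: the share of the premium earned in the current period."""
    return amount_paid * days_in_current_period / total_days_of_cover


# $100 for 365 days of cover, observed over the 31 days of March.
march = earned_premium(100.0, 31, 365)
print(round(march, 2))  # 8.49
```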

To move into the world of DAX, the above equation is pseudo-coded as follows:

Earned Premium = SUM(
        Amount
        * [Days in Date table overlapping period of cover]
        / [Days in Date table for total period of cover]
)

Step 1: Calculate Days in Total Period Of Cover

The DAX technique for calculating the number of days in a period is to produce a table of dates that fall within a period using DATESBETWEEN, and then counting the number of rows:

COUNTROWS( DATESBETWEEN( Dates, Start Date, End Date )  ) 

Filling in the actual table and column names, the DAX formula becomes:

COUNTROWS ( DATESBETWEEN( 'Date'[Date], 'Premium'[Start Date], 'Premium'[End Date] )  ) 

Step 2: Calculate Days in Current Period 

Determining which days fall into the current period of cover requires checking for overlap between the period currently being calculated and the total period of cover. In practice, this means using the later of the two start dates and the earlier of the two end dates:

COUNTROWS (
   DATESBETWEEN (
      'Date'[Date],
      IF(FIRSTDATE('Date'[Date]) > 'Premium'[Start Date], FIRSTDATE('Date'[Date]), 'Premium'[Start Date] ) ,
      IF(LASTDATE ('Date'[Date]) < 'Premium'[End Date] , LASTDATE ('Date'[Date]), 'Premium'[End Date] )
   )
)
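
The same overlap logic can be sketched in Python to see the numbers it produces. This is an illustration only: the dates are made up, and it assumes the Date table holds one row per calendar day, so that counting rows equals counting days inclusively.

```python
from datetime import date


def days_in_current_period(period_start, period_end, cover_start, cover_end):
    """Count the days (inclusive) that fall in both the current period and the cover period."""
    start = max(period_start, cover_start)  # later of the two start dates
    end = min(period_end, cover_end)        # earlier of the two end dates
    if end < start:
        return 0                            # the two periods do not overlap
    return (end - start).days + 1           # inclusive of both end points


cover = (date(2013, 1, 15), date(2014, 1, 14))  # hypothetical 365-day cover

# March 2013 lies entirely inside the cover period.
print(days_in_current_period(date(2013, 3, 1), date(2013, 3, 31), *cover))  # 31
# January 2013 only overlaps from the 15th to the 31st.
print(days_in_current_period(date(2013, 1, 1), date(2013, 1, 31), *cover))  # 17
```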

Step 3: Optimizing the input data

No relationship is defined between the Date and Premium tables - so to improve the performance of our calculation, we give DAX a way of quickly eliminating records that are not applicable to our calculation. The logic for doing this is:

  • Filter out records where cover ends before the period we're calculating starts.
  • Filter out records where the cover starts after the period we're calculating ends.
  • Group the remaining records by start and end date, projecting the sum total of premium for the given period.
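
Before looking at the real expressions, the three steps above can be sketched in Python with a handful of made-up premium rows. Note that the sketch filters first and groups second, which gives the same result as grouping first and filtering the groups:

```python
from collections import defaultdict
from datetime import date

# Hypothetical premium rows: (start date, end date, amount).
premiums = [
    (date(2013, 1, 1), date(2013, 12, 31), 100.0),
    (date(2013, 1, 1), date(2013, 12, 31), 250.0),
    (date(2012, 1, 1), date(2012, 12, 31), 400.0),  # cover ends before the period starts
    (date(2013, 6, 1), date(2014, 5, 31), 300.0),   # cover starts after the period ends
]


def candidate_groups(rows, period_start, period_end):
    """Drop rows that cannot overlap the period, then total the amount per (start, end)."""
    totals = defaultdict(float)
    for start, end, amount in rows:
        if end < period_start or start > period_end:
            continue  # no overlap with the period being calculated
        totals[(start, end)] += amount
    return dict(totals)


# Period being calculated: March 2013.
groups = candidate_groups(premiums, date(2013, 3, 1), date(2013, 3, 31))
print(groups)
```

Only one group survives here, so the earned premium expression would be evaluated once instead of four times.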

In DAX this looks somewhat inside out...

FILTER (
   SUMMARIZE (
      'Premium',
      'Premium'[Start Date],
      'Premium'[End Date],
      "EarnedPremium",
      SUM('Premium'[Amount])
   ),
   'Premium'[Start Date] <= LASTDATE('Date'[Date]) && 'Premium'[End Date] >= FIRSTDATE('Date'[Date])
)


In T-SQL, this might look as follows:

SELECT [Start Date], [End Date], [EarnedPremium] = SUM(Amount)
FROM Premium
GROUP BY [Start Date],[End Date]
HAVING [Start Date] <= {Current Cell in Excel's Period End}
AND [End Date] >= {Current Cell in Excel's Period Start}

This calculation efficiently reduces the number of times our earned premium expression needs to be run inside of the SUMX iterator. The completed DAX calculation is then:

Earned Premium:=SUMX (
   FILTER (
      SUMMARIZE (
         'Premium',
         'Premium'[Start Date],
         'Premium'[End Date],
         "EarnedPremium",
         SUM('Premium'[Amount])
      ),
      'Premium'[Start Date] <= LASTDATE('Date'[Date]) && 'Premium'[End Date] >= FIRSTDATE('Date'[Date])
   ),
   [EarnedPremium]
   *
   COUNTROWS (
      DATESBETWEEN (
         'Date'[Date],
         IF(FIRSTDATE('Date'[Date]) > 'Premium'[Start Date], FIRSTDATE('Date'[Date]), 'Premium'[Start Date] ) ,
         IF(LASTDATE ('Date'[Date]) < 'Premium'[End Date] , LASTDATE ('Date'[Date]), 'Premium'[End Date] )
      )
   ) /
   COUNTROWS ( DATESBETWEEN( 'Date'[Date], 'Premium'[Start Date], 'Premium'[End Date] ) )
)