Virtuous Designs for Tabular Data Modelling

Much of the beauty of Microsoft's Tabular model is the apparent ability to escape the weeks of star-schema modelling that are commonplace with OLAP cubes. While Tabular can be blazingly fast, both for developing models and for querying them, the performance of the VertiPaq engine varies massively depending on how you present your data to it.

Below are several data modelling patterns you are likely to encounter:

The Monolithic table design involves joining all source tables together into a single denormalized representation. Tabular is able to group / aggregate and filter rows easily in this model, so while care needs to be taken when writing DAX expressions, the resulting cube will perform well.
  • Easy to get started. 
  • Performs well.
  • DAX expressions trickier to write. 
  • Cube loading times may suffer. 
  • Only similar-grained data can be accommodated.
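To make the idea concrete, here is a minimal sketch in plain Python of what "joining all source tables into a single denormalized representation" amounts to. The table names, columns and helper functions (`denormalize`, `total_by`) are hypothetical and only illustrate the shape of the data; in a real model the join would normally happen in the source query or view that feeds the Tabular table.

```python
from collections import defaultdict

# Hypothetical source tables; names and columns are illustrative only.
products = {
    1: {"product": "Widget", "category": "Hardware"},
    2: {"product": "Gadget", "category": "Hardware"},
    3: {"product": "Manual", "category": "Books"},
}
sales = [
    {"product_id": 1, "amount": 10.0},
    {"product_id": 2, "amount": 5.0},
    {"product_id": 3, "amount": 2.5},
    {"product_id": 1, "amount": 4.0},
]


def denormalize(fact_rows, lookup, key):
    """Join every fact row to its lookup row, producing one wide table."""
    wide = []
    for row in fact_rows:
        merged = dict(row)               # copy the fact columns
        merged.update(lookup[row[key]])  # append the descriptive columns
        wide.append(merged)
    return wide


def total_by(rows, group_col, value_col):
    """Group and aggregate directly on the single wide table."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    return dict(totals)


monolith = denormalize(sales, products, "product_id")
totals = total_by(monolith, "category", "amount")
print(totals)
```

Because every descriptive column already sits on the fact row, grouping and filtering need no relationship traversal. The cost is the wider table that has to be loaded.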

When facts are derived from disparate sources, a monolithic design is not practical. In this case, multiple fact tables can be conformed by presenting a façade of filtering tables. Unlike traditional OLAP dimensions, these tables do not need to present surrogate keys – only the union of the unique column values that appear in the joined fact tables.

  • Easy to implement.
  • Fact tables perform relatively well.
  • Quickly becomes messy because of all the small filter tables, which degrades the user experience.
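The filtering-table idea described above can be sketched as follows: the filter table is simply the union of the distinct values of a shared column across the fact tables, with no surrogate key. The fact tables, the `region` column and the helper names are assumptions made for illustration; in a Tabular model the relationship from the filter table to each fact table does the work that `apply_filter` does here.

```python
# Hypothetical fact tables from two unrelated sources.
sales = [
    {"region": "North", "amount": 100},
    {"region": "South", "amount": 40},
    {"region": "North", "amount": 60},
]
complaints = [
    {"region": "South", "tickets": 3},
    {"region": "East", "tickets": 1},
]


def build_filter_table(column, *fact_tables):
    """Union of the unique values of `column` across all fact tables."""
    values = set()
    for table in fact_tables:
        values.update(row[column] for row in table)
    return sorted(values)


def apply_filter(table, column, selected):
    """Keep only the rows whose `column` value is in the selection."""
    return [row for row in table if row[column] in selected]


region_filter = build_filter_table("region", sales, complaints)

# One selection on the shared filter table slices both fact tables.
selected = {"South"}
south_sales = sum(r["amount"] for r in apply_filter(sales, "region", selected))
south_tickets = sum(r["tickets"] for r in apply_filter(complaints, "region", selected))
print(region_filter, south_sales, south_tickets)
```

Note that "East" appears in the filter table even though it has no sales rows, which is why the union of values is needed rather than the values of any one fact table.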

The end-user experience can, in some cases, be improved by hiding facts behind a chain of filtering tables. When this chain of tables presents behavior that is consistent with an end user's understanding of the business, the model becomes easier to consume.

Such chains usually perform moderately when compared to other virtuous patterns, but can provide an ideal end-user experience.

  • Balance between performance and end user experience.
  • Harder to implement – careful planning needed to figure out how to push each filtering column into just one place upstream of its facts.
  • Only useful when each calculation relies on one fact table at a time – i.e. where the formula engine can convert the entire expression into an “inner joined” SQL statement.

There are also several design patterns that, with few exceptions, produce poorer performance than their virtuous counterparts:

Avoid designs where the same filterable columns appear in multiple fact tables. Inevitably, the business will ask a question that involves both fact tables. The filter can only be (easily) applied to one table or the other, but not both.
Alternatives to consider:
  • Add a common filtering table, or
  • Combine (union) the facts into a monolithic table.
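As a rough illustration of the second alternative, the sketch below stacks two fact tables that share a filterable column into one table, filling the measures a source does not carry with `None` and tagging each row with its origin. The `budget` and `actuals` tables, the `department` column and the `union_facts` helper are all hypothetical.

```python
# Hypothetical fact tables that both carry a "department" column.
budget = [
    {"department": "Ops", "budget": 500},
    {"department": "IT", "budget": 300},
]
actuals = [
    {"department": "Ops", "actual": 450},
    {"department": "IT", "actual": 320},
    {"department": "Ops", "actual": 30},
]


def union_facts(**named_tables):
    """Stack fact tables; measures missing from a source become None."""
    columns = []
    for rows in named_tables.values():
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)
    combined = []
    for source, rows in named_tables.items():
        for row in rows:
            wide = {col: row.get(col) for col in columns}
            wide["source"] = source  # remember where the row came from
            combined.append(wide)
    return combined


combined = union_facts(budget=budget, actuals=actuals)

# A single filter on the shared column now covers both sets of facts.
ops = [r for r in combined if r["department"] == "Ops"]
ops_budget = sum(r["budget"] or 0 for r in ops)
ops_actual = sum(r["actual"] or 0 for r in ops)
print(ops_budget, ops_actual)
```

With the facts in one table, the filterable column exists in exactly one place, so a question that spans both sets of facts needs only one filter.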

While many-to-many relationships are possible in Analysis Services, they rarely perform well, since they are often bridges between large tables with millions of rows.

Alternatives to consider:
  • Combine all of the fact tables.
  • Combine two of the fact tables into a child of the remaining table.

Patterns that rely on large tables (more than 1 million rows) to facilitate joins between facts perform poorly.
The reason for this is that the formula engine cannot resolve the joins in a single query to the storage engine, and so instead it:
  1. Uses the filter to query all keys from the large intermediate table.
  2. Passes the keys back to the storage engine to get the first set of facts.
Characteristics of this pattern:
  • Extremely slow measures (minutes instead of milliseconds)
  • Memory exhaustion failures.
Alternatives to consider:
  • Combine the fact tables.
  • Create a third [combined] fact table.
  • Join the fact tables directly to the filtering table.
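The two-step resolution described for this pattern can be mimicked in plain Python to show where the cost comes from. This is only an analogy under stated assumptions, not the engine's actual implementation: the tables, the `order_id` and `segment` columns and the `facts_via_intermediate` helper are hypothetical.

```python
# Hypothetical intermediate table sitting between the filter and the facts.
intermediate = [
    {"order_id": i, "segment": "Retail" if i % 2 == 0 else "Wholesale"}
    for i in range(1000)
]
# Hypothetical fact table keyed on the same order_id (100 rows).
facts = [{"order_id": i, "amount": 1} for i in range(0, 1000, 10)]


def facts_via_intermediate(segment, intermediate_rows, fact_rows):
    # Step 1: use the filter to collect every matching key
    # from the large intermediate table.
    keys = {
        row["order_id"]
        for row in intermediate_rows
        if row["segment"] == segment
    }
    # Step 2: pass the whole key set back to fetch the facts.
    matched = [row for row in fact_rows if row["order_id"] in keys]
    return keys, matched


keys, matched = facts_via_intermediate("Retail", intermediate, facts)
print(len(keys), len(matched))
```

The key set that has to be materialized and handed over grows with the intermediate table rather than with the facts that are finally returned, which is consistent with the slow measures and memory pressure listed above.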

For reasons similar to those behind the “large intermediate table” design, using filtered child tables to filter parent tables can result in lacklustre performance, stemming from how the formula engine resolves the query.
Alternatives to consider:
  • Link the parent table directly to the small filtering dimension.
  • Combine the child and parent table.