The Great Database Shoot Out
In Jan'17, I teamed up with Marc van der Zon (Data Scientist), to test and compare several database technologies for our organization's future analytics and BI platform. The technologies that made the shortlist were:- SQL Server, because it is the organization's existing BI platform.
- Azure Data Warehouse, because of its high similarity to SQL Server.
- Amazon's Redshift, because of its attractive cost, scalability and performance.
- Netezza, because it is anaffordable on-prem appliance that performs well.
Considerations
We were primarily looking for the best bang-for-buck option, so price and performance were viewed as more important than how feature complete any particular option was. That said, what we regarded as important included:
- A shallow learning curve for BI developers (i.e. no need for expert skills in order to get good performance)
- Ability to re-use existing code (i.e. ANSI-92 SQL compatibility)
- Time-to-solution. (i.e. does the platform help bypass IT department bureaucracy)
Testing Methodology
Important to modern insurers are rating factors - qualities about the insured items that signify a change in the risk of insuring that item. Our test was based on taking 1.8 billion rating factors, presented in 5th normal form using 5 relational tables, and transform them into a series of 5 flattened tables. In short, it is a massive denormalizing exercise which tends to lend itself to traditional relational database technologies.Contenders
In the IT department's corner was an on-prem Dell 970 server, packed with 4 Xeon E7 4th generation quad core processors, teamed with 600GB of error correcting RAM, riding an EMC ExtremeIO solid-state disk SAN on dual 8GB/s HBAs. Including software licenses, the ticket to ride is roughly $551K (ZAR 7,5 million), given that the ExtremeIO is a shared resource.
In the Data Scientist's corner was the other on-prem offering - an IBM Netezza - a 32TB analytics appliance with a solid reputation for performance, and a comparatively reasonable price tag of roughly 88K (ZAR 1.2 million).
Next up was Microsoft's cloud based Azure Data Warehouse - a massively parallel appliance which splits data and workloads across 60 distributions, and is provisioned using DWUs (Data Warehouse units) - which translate into the amount of RAM and query concurrency the appliance will yield.
We tested Azure at several price/performance points:
In the Data Scientist's corner was the other on-prem offering - an IBM Netezza - a 32TB analytics appliance with a solid reputation for performance, and a comparatively reasonable price tag of roughly 88K (ZAR 1.2 million).
Next up was Microsoft's cloud based Azure Data Warehouse - a massively parallel appliance which splits data and workloads across 60 distributions, and is provisioned using DWUs (Data Warehouse units) - which translate into the amount of RAM and query concurrency the appliance will yield.
We tested Azure at several price/performance points:
- DWU 600 - costing ~$24k / yr
- DWU 1200 - costing ~$53k / yr
- DWU 2000 - costing ~$88k / yr
- 1 Node - costing ~$9k/yr (ZAR 122K)
- 12 Nodes - costing ~$12k/yr (ZAR 161K)
- 32 nodes - costing ~$31K/yr (ZAR 429K)
Results
To graphically represent the results, we created two measures:
- Affordability - being the inverse of cost.
- Productivity - being the inverse of the query performance.
When plotted onto a Cartesian plane, the differences between these technologies really become apparent:
Both Azure, Netezza and Redshift utterly annihilate on-prem SQL Server in terms of value proposition, and all of them are hands-down a better choice for building out future BI platforms.
For me, Amazon's offering really stood out both on the performance-to-price-point curve, as well as in it's features and ease of use. The ability to scale while remaining on-line sounds like a nice-to-have, but our experience with Azure quickly revealed that an inability to scale online is a productivity killer.
Query Time | ||
On Prem | Existing Server | 0:59:00 |
New Server | 0:43:40 | |
Netezza | 0:04:50 | |
Azure | DWU600 | 0:43:30 |
DWU1200 | 0:15:25 | |
DWU2400 | 0:09:28 | |
Redshift | 1 node | 0:36:09 |
12 nodes | 0:04:45 | |
32 nodes | 0:01:47 |
Notes
We encountered several minor annoyances during the set up and testing of Azure Data Warehouse - sometimes it would take upwards of 30 minutes for the database to pause, so subsequently scaling operations (which take the database offline) would take inordinately long to complete. When running below DWU600, some of the queries would terminate with errors or else take upwards of 4 hours to run - suggesting that the Azure offering doesn't handle low resource conditions particularly well. There were also significant inconsistencies in the elapsed time required for some queries to complete when run at different times, possibly the result of either congestion or contention? Azure DW's Polybase also proved to be a comparatively fussy affair, with lots of trial and error necessary to get the 5 CSV files loaded into tables.
The testing of Netezza went smoothly and query times were predictable, as one might expect from a dedicated appliance. There wasn't lots of configuration necessary to get things going, and the loading of data was a reasonably simple affair.
Amazon's Redshift has a lot going for it. The documentation is 1st class, and the follow-along tutorials can be completed in a matter of minutes each. Some of the features that really set Redhisft apart from competitors were:
- It stays online when scaling up or down, and it scales reasonably quickly.
- It is an incredibly simple, single step process to load CSV data into Redshift.
- It analyses data as it is loaded and intelligently determines the best compression algorithm to use for storing the data.
- It allows lookup tables to be present, in their entirety, on every compute node.
This significantly improves the performance of relational joins.
Finally, there's on-premises SQL Server machine, which the server team and DBAs say has many, many CPU cores, and gigabytes per second of awesome performance, and will be ready next Friday. Probably. After the latest service pack has been installed.
...It also has column store indexes. Did we mention how many CPU cores it has?
Conclusion
With no specific query tuning, Amazon's Redshift demonstrated performance that we were unable to obtain on any of the other platforms at a comparable price point. With respect to get-it-done productivity it stands head and shoulders above of the competition.Netezza makes for an attractive alternative for use cases that require an on-premise platform, and its important to remember that large volumes of data can be shipped into and out of an on-premise server for free, whereas data charges do apply for both Microsoft and Amazon's cloud offerings.
Seriously, though... if like me you've been a SQL Server stalwart most of your career, take Redshift for a spin. It'll knock your socks off!
Comments