Monday, August 11, 2014

Microsoft Parallel Data Warehouse: POC lessons learned Part 3

The first part of the PDW lessons learned series gave an introduction about the goals of the POC, a customer introduction, an overview of the requirements and the test cases. Within the second part I gave some more insights on how to migrate the existing code base I used for the POC (relational structures, stored procedures, SSIS packages) in order to work with PDW. The last post will concentrate on the results and a summary of our findings.

ETL Performance

As shown in my series “SSIS Design for Performance and Scale” a typically designed SSIS package with only 1 stream in the dataflow will give us a write performance of around 25 MB/s. Because data load performance is a critical issue in the current customer environment one of the original SSIS packages looked like this:
image
What we can see here are 6 sequence container that are executed as 6 threads in parallel, every container handles a subset of the data. But there are several disadvantages with that kind of design:
  • Complex parallel loads of partitions (partition by quarter)
  • Manual parallelism within package design
  • Multiple implementation of business logic per thread
With PDW we wanted to see how much the engine itself can help us in order to parallelize data loads. So we decided to remove all parallelism from the package so that it looks like this:
image
This gives us the following advantages:
  • Parallelism of load processes is handled by PDW (let’s see)
  • Much simpler SSIS Design -> less implementation effort
  • Business logic exists only once -> simpler and faster maintainability
So far so good, but what about the performance numbers? Let’s have a quick look at the 2 scenarios we have tested. We used the Landing Zone as host for Integration Services and executed all packages on this server.
image

SCENARIO 1

In the original customer scenario the SSIS package loads data from on table to another table within the same database. The whole database is stored in PDW.

SCENARIO 2

Because we had some trouble with the read performance we decided to put a backup of the database to the landing zone and load the data from the landing zone to the PDW nodes.

PERFORMANCE RESULTS

The following matrix shows the performance numbers we got during our tests.
image
The column “Production Duration” shows the reference value from the current customer environment. This is our baseline number.
Now we can see 2 interesting behaviors. The first scenario is 2x slower, but the second one up to 4x faster. So what is the reason for these big differences in performance?
  • Read data from PDW is slow, around 8 MB/s (OleDB driver) –> Scenario 1
  • Load data into PDW is fast, around 45 MB/s –> Scenario 2
  • PDW automatically distributes loading processes
Important: The product group is aware of the slow OleDB driver and is working hard to make it faster. We have already seen a much better performance on a later Appliance Update.

SUMMARY

I always hear people saying that SSIS is not the right tool for data loading scenarios with PDW. It’s maybe not the fastest option, but we have seen nevertheless good performance measures. Just to remember the easiest SSIS package on a normal SQL Server gives us a write performance of 25 MB/s. With PDW we get 45MB/s, so performance is nearly doubled. If this performance is good enough for you, SSIS is still a good choice and of course additionally you can apply the same parallelism techniques for further scale. And sometimes you don’t have the option to migrate all existing SSIS packages to a T-SQL based ETL.
My feeling is, that there is much more possible in case of performance, if the driver becomes faster and more efficient in the future.
Very often also the BCP tool (bulk copy utility) is recommended for data loading. This is a very fast option if you get your source data as flat file. But when you first need to create flat files before you can use BCP you better use SSIS for direct connectivity.

Relational Query Performance

Now let’s have a look at the relational query part. We got 3 reference queries from the customer that we used for our tests. In order to make the queries work with the PDW database we only had to change the schema name (because only the “dbo” schema is currently supported) and that’s it. And the performance? Awesome, as you can see here:
image
The column “Production Duration” shows the reference value from the current customer environment. This is our baseline number.
All queries are up to 8 times faster. This is the real power that PDW offers us. A highly scalable SQL engine with lot’s of query power. So no surprises here, it works as we had expect it.

Summary

So what do we learn from this POC. I’d like to address the most important things we have seen again in a short bullet list.

USABILITY

einfachheit
  • PDW is not (yet) a normal SQL Server. So you need to understand the architecture of a MPP system and why it works differently to a SMP system
  • It’s not a no brainer. You need to think about, why would you need to buy a MPP System and which problem should it solve. It’s not a “one size fit’s all” thing.
  • Plan for migration effort. There are features in SQL Server that are currently not supported in PDW. Work closely with the Microsoft team to understand your specific case and there are already a lot of good design pattern you can use to overcome the missing features.
Performance
PowerRanger
  • The relational query engine is fast. You have seen the results, but you can also write statements that do not well perform. Think about data distribution and other techniques and avoid data shuffling.
  • Plan your ETL. As we have see we get good parallelism support for SSIS, although the read problem has to be solved. So the system helps us with built-in parallelism support and we can keep our SSIS packages much simpler for a longer time.
  • Think about ELT.  I think in a MPP environment the SSIS engine itself becomes the bottleneck. We take out the data from a highly parallelized engine pump it through a buffer oriented SSIS engine and put it back to our highly parallelized engine. This works like a funnel which scales only to a certain point. If you can start from scratch or optimize certain packages try to use SSIS to bring the data into PDW and do the transformations within the engine.

For your Reference






























No comments:

Post a Comment