Monday, August 11, 2014

Microsoft Parallel Data Warehouse : Parallel Data Warehouse (PDW) POC – lessons learned Part 2

After introducing the POC itself and the high-level test cases, I will now go into more detail on each test case. But let’s start with the migration first.

Database and SSIS Package migration to PDW

In order to execute our test cases we first had to migrate our POC codebase so that it could run on the PDW machine. This means:
  • Migration of the POC Core Data Warehouse structure (tables, views) to PDW
  • Initial load of the database hosted on PDW
  • Migration of the SSIS packages
  • Migration of Stored Procedures to execute on PDW

Relational Structures

The migration itself went very smoothly. Microsoft provides a tool that analyzes the database structure of a traditional SQL Server design and automatically generates scripts to create the structures on PDW. Features that are currently not supported on PDW (like primary keys or identity columns) are automatically disabled in the generated create scripts. Based on the table size, the tool also suggests whether to replicate or distribute each table.
In order to create a new database on PDW you can use the following script:
CREATE DATABASE Datawarehouse WITH
(
       AUTOGROW = ON, -- make the database as big as you need it
       REPLICATED_SIZE = 25 GB, -- space allocated on each node
       DISTRIBUTED_SIZE = 500 GB, -- total size of all distributed tables
       LOG_SIZE = 25 GB -- should be same size as replicated space
);
REPLICATED_SIZE specifies the space allocated on each compute node for replicated tables. DISTRIBUTED_SIZE is the total size of all distributed tables across all nodes.
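As a rough illustration (the node count is only an assumption, not our POC configuration): on an appliance with 8 compute nodes, REPLICATED_SIZE = 25 GB reserves 25 GB on every node, so replicated tables can consume up to 8 x 25 GB = 200 GB of appliance storage in total, while DISTRIBUTED_SIZE = 500 GB is the combined budget for all distributed tables, which works out to roughly 62.5 GB per node.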
Here is a code sample of a create table script:
CREATE TABLE [dbo].[TableA]
(
    [Attribute1] INT NULL,
    [Attribute2] INT NULL,
    [Attribute3] NVARCHAR(30) COLLATE Latin1_General_100_CI_AS_KS_WS NULL,
    [Attribute4] NVARCHAR(30) COLLATE Latin1_General_100_CI_AS_KS_WS NULL,
    [Attribute5] INT NULL
)
WITH (CLUSTERED INDEX([Attribute1]),
DISTRIBUTION = REPLICATE);
As you can see, there is no need to define file groups anymore; all of this is handled by PDW automatically. The only option you have to specify is whether you want to replicate the table to each compute node or distribute it via a hash key.
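For comparison, a hash-distributed version of the same table would look like the sketch below. The choice of [Attribute1] as the distribution column is only an assumption for illustration; in practice you would pick a column with high cardinality and an even value distribution to avoid data skew.
CREATE TABLE [dbo].[TableA]
(
    [Attribute1] INT NULL,
    [Attribute2] INT NULL,
    [Attribute3] NVARCHAR(30) COLLATE Latin1_General_100_CI_AS_KS_WS NULL,
    [Attribute4] NVARCHAR(30) COLLATE Latin1_General_100_CI_AS_KS_WS NULL,
    [Attribute5] INT NULL
)
WITH (CLUSTERED INDEX([Attribute1]),
DISTRIBUTION = HASH([Attribute1]));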

SSIS Packages

Let’s have a look at the SSIS packages. The migration itself is also not that complex. In order to load data into PDW instead of a “normal” SQL Server, we only had to change the connection and the connection manager within our packages.
The PDW Destination offers four options for data loading: Append, FastAppend, Reload, or Upsert.
In “Append” mode (which works like an insert), the data is first loaded into an intermediate staging table before it is finally inserted into the target table. With “FastAppend”, data is loaded directly into the target table. “Reload” does the same as “Append” but truncates the target table first. The last option, “Upsert”, works like a Merge statement.
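To illustrate what the Upsert mode does conceptually, here is a minimal T-SQL sketch of the equivalent merge logic. The table and column names are hypothetical, and the PDW Destination generates this kind of logic internally, so this is only an illustration of the semantics, not code you would write yourself.
-- Conceptual equivalent of the Upsert load mode
MERGE INTO dbo.TargetTable AS tgt
USING dbo.StagingTable AS src
    ON tgt.BusinessKey = src.BusinessKey      -- the merge key configured in the destination
WHEN MATCHED THEN
    UPDATE SET tgt.Attribute1 = src.Attribute1,
               tgt.Attribute2 = src.Attribute2
WHEN NOT MATCHED THEN
    INSERT (BusinessKey, Attribute1, Attribute2)
    VALUES (src.BusinessKey, src.Attribute1, src.Attribute2);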
Another nice feature of the PDW task is the transaction option, which rolls back the data load if something goes wrong. This is a lot easier than with a traditional SQL Server, where you have to handle transactions manually or work with the Distributed Transaction Coordinator (DTC).

Stored Procedures

For stored procedures the story is a bit trickier, because the current PDW version (AU3) supports only a subset of the stored procedure functionality of a traditional SQL Server. In order to save some time and to minimize code changes, we decided to run the stored procedures on a traditional SQL Server connected to PDW via a Linked Server. The PDW architecture allows this design because we can install a “normal” SQL Server on the Landing Zone and create a Linked Server connection to our PDW database. The script to create the Linked Server connection is listed below:
EXEC master.dbo.sp_addlinkedserver @server = N'PDWLINK1', 
      @srvproduct=N'PDW', 
      @provider=N'Microsoft.SQLServerMPPOLEDBProvider', 
      @provstr=N'Host=0.0.0.0;Port=17000;', 
      @catalog=N'Datawarehouse'

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'PDWLINK1',
      @useself=N'False',
      @locallogin=NULL,
      @rmtuser=N'USER',
      @rmtpassword='XXXXXX'
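Once the Linked Server exists, queries can be issued against PDW from the Landing Zone SQL Server, for example as a pass-through query via OPENQUERY. The sketch below reuses TableA from the earlier create script purely for illustration.
-- Pass-through query executed on PDW through the linked server
SELECT *
FROM OPENQUERY(PDWLINK1, 'SELECT Attribute1, Attribute2 FROM dbo.TableA');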
Unfortunately, the Linked Server connection to PDW was not stable enough, so we were not able to use it. To be fair, this is not a problem of PDW; it is a problem of the Linked Server protocol of SQL Server.

Summary

As described, the migration of the relational structures and the SSIS packages is not a big deal. Because the MPP engine works differently than a normal SQL Server, the overall ETL architecture should additionally be reviewed. PDW makes our lives a lot easier with regard to loads and transactions because this functionality is built into the SSIS task. The SSIS design also becomes a lot clearer, but I will explain this in the next post when we look at the performance numbers for ETL.
The stored procedure / SQL function logic is the hardest part of the migration. Because of the current limitations for stored procedures (which will improve considerably in PDW Version 2), you have to plan some time for the migration effort. The effort mostly depends on the logic and the complexity of your code.

