Monday, August 11, 2014

Microsoft Parallel Data Warehouse : Building a data warehouse in SQL Server: Tips to get started

When building a data warehouse using SQL Server technologies, there are some challenges you might face. I'll offer you some suggestions for overcoming them. Naturally, each environment is likely to add its own twist to the challenges of building a data warehouse, but most of these issues are generic enough to apply to many companies embarking on architecting business intelligence (BI) applications.

1.  Management support. This is the first and foremost challenge you must overcome. It is annoying to technical people (including yours truly) because it has to do with politics, not software. Let's keep in mind who signs your paycheck, however. If management isn't convinced they need a data warehouse, all your technical skills will be in vain.

Unfortunately, upper management often thinks of a data warehouse as yet another system they have to invest in without any immediate return. Your job is to convince the powers that be that they're not paying for bells and whistles – a data warehouse will help them make better, more informed decisions by correlating data scattered across the enterprise. It is not uncommon for an organization to totally change the way it operates after a data warehouse application is implemented. Sometimes it helps to build a proof-of-concept (POC) project to demonstrate the power of BI. A POC data warehouse will only contain a small portion of all the data and will only be exposed to a few users for testing. Even if you don't have approval for a POC, you should demonstrate parts of the warehouse to your users as they become available; doing so keeps them interested and assured that the system will help them be more effective in their work.

2.  Data availability. A data warehouse usually combines data from multiple data sources, which aren't necessarily obvious or easily accessible. Depending on the company size, you might have to travel to multiple offices and speak to numerous employees to figure out where you can get the data necessary to build the analytical views your users want. Again, depending on the political climate of your enterprise, you might find it difficult to get your hands on each necessary data element. The data you work with might contain confidential and highly sensitive details. Furthermore, some data might only be accessible through reports provided by an outside entity that will not grant you access to its data source.

How do you overcome this challenge? First, be sure to convince everyone that you're not taking their job. A data warehouse is supposed to supplement, not replace, existing systems. If you have management on your side, you shouldn't have too many issues with internal employees. But if you cannot get access to a data source – for whatever reason – you may have to get creative. If data is only available on report screens or on paper, you might have to find a way to capture it, perhaps by scraping report screens or scanning documents.

3.  Complexity of data sources. Sometimes you get lucky and all the data elements you need reside in the database management system (DBMS) of your choice. More often, data is dispersed across multiple DBMSs, spreadsheets, email systems, electronic documents and even paper. Yes, we're in the 21st century, but take my word for it - there are still companies out there that keep certain information only on paper. It's your job to figure out how to get data into your warehouse from all these disparate sources and give it a common shape.

4.  Data quality. Many transaction processing applications are thrown together using rapid development techniques, often by people who have very limited knowledge of or experience with the tools they're using. This is not a derogatory statement; one must start somewhere, and junior programmers are often the most affordable resources a company can use. The trouble is that if the application does not validate data, you're likely to encounter string values that are abbreviated, misspelled or completely omitted. For transaction-level reports this might not be a huge deal, but when you're trying to group data and empower your users with the ability to make decisions, data quality is essential.

For example, take a look at the following customer names:

a. ACME Boot Outlet
b. ACME bt outlt
c. A.C.M.E
d. Boots (ACME)
e. A c m e boot store (outlet)


Human eyes can easily spot that each of the above refers to the same business entity. But to a computer program, each of these values represents a separate customer. Unfortunately, there is no easy way to correct all bad data using Integration Services (or Data Transformation Services) packages. SQL Server 2005 Integration Services offers the Fuzzy Lookup transformation, which may greatly simplify your data cleansing efforts. Even so, you might also have to enlist the help of clerical workers to correct the data.
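
If you end up standardizing names in T-SQL rather than inside an SSIS package, one low-tech approach is a manually maintained mapping table. The sketch below is only an illustration; the table and column names (CustomerNameMap, StagingCustomer, DimCustomer) are hypothetical:

    -- Mapping table maintained by hand (or by clerical staff): each raw spelling maps to one clean name
    CREATE TABLE dbo.CustomerNameMap
    (
        RawName   nvarchar(100) NOT NULL PRIMARY KEY,
        CleanName nvarchar(100) NOT NULL
    );

    -- Standardize names in the staging table before they reach the customer dimension
    UPDATE s
    SET    s.CustomerName = m.CleanName
    FROM   dbo.StagingCustomer AS s
    JOIN   dbo.CustomerNameMap AS m ON m.RawName = s.CustomerName;

    -- DIFFERENCE (a SOUNDEX-based score from 0 to 4) can surface likely duplicates for human review
    SELECT s.CustomerName AS StagedName, d.CustomerName AS ExistingName
    FROM   dbo.StagingCustomer AS s
    JOIN   dbo.DimCustomer     AS d ON DIFFERENCE(s.CustomerName, d.CustomerName) >= 3
    WHERE  s.CustomerName <> d.CustomerName;

Matches such as "ACME bt outlt" still need a human decision, which is why the mapping table, not the fuzzy score, should remain the system of record.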

5.  Multitude and technical savvy of your users. Helping a handful of users get to the data they need in order to make decisions is no small feat. Historically, the data warehouse was reserved for upper-level management's use. However, data warehousing has slowly made its way to the masses. It's not uncommon to see data warehousing and BI applications used by people who don't necessarily make strategic decisions.

Once your users see the power of data warehousing, they'll want to use it for everything from data entry to retrieving transaction-level reports. The larger your user community, the tougher it becomes to keep everyone happy and to educate them on appropriate uses of your application. BI applications perform great when used appropriately, but they are not suitable for every business requirement. Often, you must be the bearer of bad news that the data warehouse is not the tool of choice for the functionality your users desire.

6.  Slowly changing dimensions. Your application must reflect data changes that occur in data sources populating the warehouse. Data within dimension tables is particularly prone to change. But how do you maintain the history of such changes?

    a. The first and easiest method is to overwrite the existing record without tracking changes. Fortunately, this method is acceptable for many dimensions. For example, if a department name changes from "finance" to "finance and accounting," you probably won't be required to maintain the history of that change. However, in a customer or student dimension it is often necessary to keep track of changing names, marital status, education level, and other attributes - your application must be able to retrieve current as well as historical values.

    b. The second method for managing slowly changing dimensions is to create a new record when the value changes and tag the older record as obsolete (a minimal T-SQL sketch of this method follows the list).

    c. The third and final method is to maintain the historical value(s) of the changing attribute in separate columns of the same row within the dimension table.
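
To make the second method concrete, here is a minimal sketch using SQL Server 2008-or-later T-SQL syntax. It assumes a hypothetical DimCustomer with a business key, ValidFrom/ValidTo dates and an IsCurrent flag, loaded from a hypothetical StagingCustomer table; a real load (or the SSIS Slowly Changing Dimension transformation) would handle more attributes and edge cases:

    DECLARE @Today date = CAST(GETDATE() AS date);

    -- 1. Find customers whose tracked attributes have changed
    SELECT s.CustomerBusinessKey
    INTO   #Changed
    FROM   dbo.StagingCustomer AS s
    JOIN   dbo.DimCustomer     AS d
           ON  d.CustomerBusinessKey = s.CustomerBusinessKey
           AND d.IsCurrent = 1
    WHERE  s.LastName <> d.LastName
       OR  s.MaritalStatus <> d.MaritalStatus;

    -- 2. Tag the old rows as obsolete
    UPDATE d
    SET    d.IsCurrent = 0, d.ValidTo = @Today
    FROM   dbo.DimCustomer AS d
    JOIN   #Changed AS c ON c.CustomerBusinessKey = d.CustomerBusinessKey
    WHERE  d.IsCurrent = 1;

    -- 3. Insert new current rows carrying the changed values
    INSERT INTO dbo.DimCustomer
           (CustomerBusinessKey, LastName, MaritalStatus, ValidFrom, ValidTo, IsCurrent)
    SELECT s.CustomerBusinessKey, s.LastName, s.MaritalStatus, @Today, NULL, 1
    FROM   dbo.StagingCustomer AS s
    JOIN   #Changed AS c ON c.CustomerBusinessKey = s.CustomerBusinessKey;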

So far we've covered the handling of changing values in data warehouse dimensions. But what about Analysis Services dimensions? To reflect changes in attribute values, you must reprocess dimensions. But if you process dimensions fully, you'd also have to reprocess cubes, which might be impractical if your cube contains large volumes of data. Fortunately, you can often use the incremental update (ProcessUpdate) option to avoid reprocessing cubes.

7.  Changing facts. Normally one would think that records in the fact tables are static – once a record gets to the warehouse your job is finished, right? Unfortunately the answer is "it depends." In some cases, such as a data warehouse tracking patients' hospital stays, records are generally static. If your hospital stay lasted from January 1 to January 5, that record isn't likely to change.

But consider the retail industry; not all sales are final – I'm sure you know people who often take the goods they purchased back to the store for various reasons. Some companies would manage such transactions as a series of credits and debits that balance each other out. But in other cases you must update or delete fact table records even after they're added to the data warehouse. For example, if a stock trade is recorded incorrectly, balancing it with a corresponding negative trade might not be acceptable. Here's another thing to consider: you might not want your customers to know about problems in your transactional systems. Rather, you want them to see the data only after it is corrected.

One method for dealing with changing facts is to keep the records in the staging area until the data is examined for quality and then migrate it to the warehouse. Even the most thorough testing won't catch all errors in the data sources, however. Occasionally, you might have to refresh cubes by processing those partitions that contain incorrect data. This is why it's essential to keep your Analysis Services partitions small so that processing will complete relatively quickly.
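
When corrections do have to flow into facts that are already in the warehouse, a MERGE from the staging table (available since SQL Server 2008) is one way to apply them. The table and key names below are hypothetical:

    MERGE dbo.FactSales AS tgt
    USING dbo.StagingSales AS src
          ON  tgt.SalesOrderID    = src.SalesOrderID
          AND tgt.SalesLineNumber = src.SalesLineNumber
    WHEN MATCHED AND (tgt.Quantity <> src.Quantity OR tgt.Amount <> src.Amount)
        THEN UPDATE SET tgt.Quantity = src.Quantity,    -- facts corrected at the source
                        tgt.Amount   = src.Amount
    WHEN NOT MATCHED BY TARGET
        THEN INSERT (SalesOrderID, SalesLineNumber, Quantity, Amount)
             VALUES (src.SalesOrderID, src.SalesLineNumber, src.Quantity, src.Amount);  -- brand-new facts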

Another way to deal with this challenge is through a write-back partition. With cube write-back, you're not really changing data in the underlying relational warehouse; instead, you're adding a record to a separate partition. When a user queries a particular measure group, Analysis Services combines data from the read-only partitions with the data in the write-back partition and exposes the result. Not surprisingly, performing such calculations at query execution time imposes additional overhead on Analysis Services and can cause performance degradation.

8.  Implementing security. Much like with any other business application, it's imperative to secure access to your data warehouse and cubes. Each user might only be permitted to see a small portion of the warehouse, depending on their job title or area of responsibility. Analysis Services implements security through Windows accounts which you can associate with Analysis Services roles. You can secure data at a very granular level, all the way down to individual cube cells.

Unfortunately, creating a separate role for each user is not only tedious and cumbersome but could also cause performance problems – this is particularly true of Analysis Services versions prior to 2005. The advice is to group your users into a handful of roles, if possible. If every single user must have her own set of permissions, you'll have to implement a security dimension. Once you have such a dimension, every MDX query will have to filter the output based on the identity of the current user.
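
The same security-dimension idea can be prototyped against the relational warehouse before you wire it into MDX. The sketch below uses hypothetical tables and filters on the caller's Windows login via SUSER_SNAME():

    -- Bridge table listing which sales territories each Windows login may see
    CREATE TABLE dbo.UserTerritorySecurity
    (
        WindowsLogin nvarchar(128) NOT NULL,
        TerritoryKey int           NOT NULL,
        PRIMARY KEY (WindowsLogin, TerritoryKey)
    );

    -- Reporting views (or every query) join through the bridge on the caller's identity
    SELECT   f.TerritoryKey, SUM(f.Amount) AS TotalSales
    FROM     dbo.FactSales             AS f
    JOIN     dbo.UserTerritorySecurity AS s
             ON  s.TerritoryKey = f.TerritoryKey
             AND s.WindowsLogin = SUSER_SNAME()
    GROUP BY f.TerritoryKey;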

Naturally I can't cover every data warehousing challenge in a single tip. Nor can any one person claim to be familiar with every difficulty involved in building data warehousing solutions. However, if you're contemplating an attempt to implement such solutions, you should carefully consider what you're up against. Do your homework and be prepared for a battle.



Microsoft Parallel Data Warehouse : SQL Server 2012 Parallel Data Warehouse (PDW) – What’s new?

I want to share some insights about the new SQL Server 2012 version of Parallel Data Warehouse. In this post I will talk about some new features as well as architecture changes.

Architecture Update

The SQL Server 2012 version of PDW introduces some major architecture changes:
  • The Landing Zone and the Backup node have been removed from the appliance, and there are good reasons for this. The standard sizing of these components didn't meet most customer requirements, and it was very hard to find a configuration that fit 80% of PDW customers. So the decision was made to remove these components; customers now have more flexibility to size the ETL and backup nodes to their needs.

  • Expensive storage components have been replaced by economical high-density Direct Attached Storage
  • Virtualization of the Software Components based on Windows Server 2012 and Hyper-V
  • Modular Design and Smarter Scale Out: SQL Server 2012 PDW has reshaped the hardware specifications required of an appliance through software innovations to deliver optimal value to customers. The new version introduces a new modular design.

The smallest PDW appliance consists of a Base Scale Unit (with a Passive Scale Unit for HA) and can be extended by a number of Capacity Units. If you reach the physical space limit of the rack, you can add up to six more racks.
Depending on the vendor, the units are shipped differently: HP offers a Base Scale Unit with 2 compute nodes, and every Capacity Unit also ships with 2 compute nodes. Dell offers the Base Scale Unit and the Capacity Unit with 3 compute nodes each.
Looking at capacity, the smallest PDW appliance from HP (a Quarter Rack with only the Base Scale Unit) offers a capacity of 53–227 TB (depending on the compression rate) and 45 TB of raw disk space (with 3 TB disks). The Dell Quarter Rack provides a capacity of 79–347 TB (depending on the compression rate) and 68 TB of raw disk space (with 3 TB disks).
  • Further Scale: Based on the new modular hardware design and a hardware refresh, PDW now offers scale-out capacity of up to 5 petabytes.

Software Architecture Overview

The new version of PDW also provides some major software updates. As the product name states, PDW now runs on SQL Server 2012. The operating system on all hosts is Windows Server 2012 Standard Edition. All fabric and workload activity happens in Hyper-V virtual machines, which also run Windows Server 2012 Standard Edition. A PDW agent runs on all hosts and all VMs and collects appliance health data on fabric and workload.


A special PDW version of SQL Server 2012 Enterprise Edition is used on the Control node and on all Compute nodes to provide high scale database capabilities.

New Features

In addition to the new hardware and software architecture, I want to highlight some more features that are very interesting for customers:
  • Columnar Storage: Microsoft continues the rollout of xVelocity and, with SQL Server 2012 PDW, provides a new primary storage type for databases. Customers can now choose between a row store and a new updateable version of the xVelocity memory-optimized columnstore as the table storage format. This means we can define a writable clustered columnstore index on a table (updates and bulk load are fully supported), so that the whole table is stored in the column-oriented format and we benefit from much higher compression. A DDL sketch follows this list.
  • Visual Studio 2012 Integration: As you probably know, the tool used for database administration and database development in PDW version 1 was Nexus. With the new version of PDW we now have full support for SQL Server Data Tools for Visual Studio 2012. The project types for SSIS, SSRS and SSAS are also fully supported with Visual Studio 2012.
  • Monitoring Enhancements: The monitoring of sessions, queries, loads, appliance health status and performance information has been completely redesigned. Microsoft did a very good job here, and the new web user interface resembles the Azure Portal. For customers this also means that monitoring and operations look the same on-premises as in the cloud on Windows Azure.
    • Polybase & Hadoop Integration: Today almost every big database vendor also provide a “Big Data” solutions based on Hadoop. Whether it’s a vendor specific distribution like IBM’s BigInsights solution, which is based on Cloudera or Microsoft’s specific implementation of Hortonworks Hadoop distribution called HDInsight or a full appliance which comes pre-installed with Hadoop. No matter which Hadoop platform you choose you still have the challenge of integration. Some type of data and analytics will happen on Hadoop (like text, log or sensor analysis) but we will still use databases for BI & Reporting.
While this might be something you have decided to do, there is a big learning curve when your IT department needs to re-orient itself around HDFS, MapReduce, Hive, HBase and so on rather than T-SQL and a standard RDBMS design. It requires significant re-training around Hadoop and its ecosystem as well as a major effort to integrate the Hadoop implementation with the data warehouse.
To meet the requirements of a modern data platform, a solution must provide insights to your end users without requiring yet another third-party tool or another expensive appliance to purchase.
The unfortunate reality is that no single vendor can deliver all the options you need at a cost you want to pay. They either have a data warehouse solution but no BI, or provide BI but no data warehousing. Some vendors provide a Big Data solution that is disconnected from their data warehouse solution. Finally, some vendors might have a solution for each workload and will happily charge you millions of euros or dollars for all of them.
Microsoft goes a different way and, with SQL Server 2012 PDW, delivers an integrated query layer called "Polybase" which enables queries across Hadoop and SQL Server. Data structures stored in Hadoop are described by tables in PDW; customers can consume this data with standard T-SQL and can also join those tables with normal relational ones. To end users who only consume this data it is totally transparent where the data comes from, and IT departments can use their normal database skill set to work with Hadoop. What do such tables look like, and how are they queried? A sketch follows this list.
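
First, a hedged sketch of the columnstore option in PDW 2012 DDL. The table and column names are made up; the WITH clause is the point here:

    CREATE TABLE dbo.FactSales
    (
        DateKey    int   NOT NULL,
        StoreKey   int   NOT NULL,
        ProductKey int   NOT NULL,
        Quantity   int   NOT NULL,
        Amount     money NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (ProductKey),   -- spread the rows across the compute nodes
        CLUSTERED COLUMNSTORE INDEX         -- store the whole table in the xVelocity column format
    );

And a sketch of a Polybase external table plus a query that joins it to a normal PDW table. The exact DDL options differ between PDW releases, and the HDFS path, table and column names here are purely illustrative:

    -- Describe a delimited file in Hadoop as a PDW table
    CREATE EXTERNAL TABLE dbo.ClickStream
    (
        url        varchar(500),
        event_date date,
        user_ip    varchar(50)
    )
    WITH
    (
        LOCATION = 'hdfs://hadoop-head-node:8020/logs/clickstream/',
        FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
    );

    -- End users just write T-SQL; PDW fetches the Hadoop data behind the scenes
    SELECT TOP 10 u.LastName, COUNT(*) AS Visits
    FROM     dbo.ClickStream AS c
    JOIN     dbo.DimUser     AS u ON u.UserIP = c.user_ip
    GROUP BY u.LastName
    ORDER BY COUNT(*) DESC;
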
Summary
As you can see from this post, Microsoft has invested heavily in its modern Big Data platform and gives customers the possibility to build high-scale solutions on SQL Server 2012 PDW as well as on HDInsight for Hadoop-based workloads. With Polybase, customers get a Hadoop query engine fully integrated into the data warehouse layer that can easily be consumed with T-SQL knowledge.
The new architecture and the modular design of the appliance give customers the possibility to start with a small investment and scale cost-efficiently on demand.
In my next post I will talk in more detail about a POC I did with the new version and the results and lessons learned.
Source: Microsoft slide decks of SQL Server 2012 Parallel Data Warehouse


Microsoft Parallel Data Warehouse: Connecting and Configuring SQL Server Parallel Data Warehouse (PDW) Clients

SQL Server 2008 R2 Parallel Data Warehouse (PDW) isn't just about the appliance. It's also about the users who will query and analyze the data stored in it. But users can't simply log on to the appliance and access the data. Client connectivity tools and the tools to query PDW need to be installed on their computers.

I'll walk you through how to install and configure the client connectivity tools and the primary query interface. I'll also show you how to connect to the PDW Administration Console and how to configure some SQL Server business intelligence (BI) tools to work with PDW.

Installing the Client Connectivity Tools

In PDW 1.0, the client connectivity tools and the tools to query and manage PDW are separate from the SQL Server 2008 R2 toolset. You must install the PDW drivers, then separately install the Nexus Query Chameleon tool for graphical query access.

The first step is to install the client libraries. You'll find the client library setup programs on the Landing Zone of the appliance at C:\Program Files\Microsoft SQL Server Parallel Data Warehouse\Redist. A 32-bit package (ClientTools-x86.msi) and a 64-bit package (ClientTools-amd64.msi) are included. Note that the Landing Zone is the component node of a PDW appliance where you can stage files to load into PDW and run SQL Server Integration Services (SSIS) packages to load data into PDW.

Setting up the client libraries is simple. For the standard setup, start the setup program, click Next in the Welcome dialog box, accept the license agreement, and click Next again. For the type of setup, select Typical and click Install. You'll be prompted to elevate to administrative rights. After you do so, the setup program copies and registers the client libraries for PDW. When setup completes, click Finish to exit the wizard. If you want to install both the 32-bit and 64-bit drivers, simply follow these steps for both the x86 and x64 packages.

After you have installed the client tools components, you can make connections using the provided drivers (OLE-DB, ODBC, and ADO.NET). If you plan to use SSIS with BIDS in your PDW appliance, you'll need to install the PDW SSIS Destination Adapter. You'll find the setup file for a 32-bit installation (SSISSQLPDWDest-x86.msi) and the 64-bit installation (SSISSQLPDWDest-amd64.msi) on the Landing Zone. If you don't plan to use BIDS on your client computer, you don't need to install the destination adapter.

At this point, you can use the command-line query tool for PDW, dwsql.exe. Much like osql.exe or sqlcmd.exe, dwsql.exe lets you query PDW from a command-line environment. You query PDW using the Dynamic SQL (DSQL) query language, which is similar to but not quite the same as T-SQL. The DSQL syntax, including connectivity parameters and supported language constructs, is available in PDW Books Online, which you receive with PDW.
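
As a small illustration of how DSQL differs from T-SQL (and not an exhaustive one), table DDL in PDW declares how the data is placed across the appliance. The table and column names below are hypothetical; consult PDW Books Online for the exact options supported by your release:

    -- A large fact table is typically hash-distributed across the compute nodes
    CREATE TABLE dbo.FactOrders
    (
        OrderDateKey int   NOT NULL,
        ProductKey   int   NOT NULL,
        Quantity     int   NOT NULL,
        Amount       money NOT NULL
    )
    WITH (DISTRIBUTION = HASH (ProductKey));

    -- A small dimension is usually replicated in full to every node so joins stay local
    CREATE TABLE dbo.DimProduct
    (
        ProductKey  int           NOT NULL,
        ProductName nvarchar(100) NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE);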

Besides running queries from the command line, you can use the graphical query tool, Nexus Query Chameleon. Before you can use it, though, you need to install it.

Installing the Primary Query Interface

Nexus Query Chameleon is the primary query interface you'll use for graphical queries against PDW. You install this query tool from the same location as the other files on the Landing Zone. The files are NexusSetup32.msi and NexusSetup64.msi for 32-bit and 64-bit environments, respectively. Unlike the client components, you must set up the appropriate Nexus client for your computer (i.e., if you have a 64-bit Windows installation, you must install the 64-bit version of the Nexus client).

When you start the setup program, you'll get the traditional Welcome screen. Click Next, accept the EULA, accept the default file location and visibility choices, and click Next to confirm the installation. You'll be prompted to elevate to administrative permissions during the installation, after which the setup will complete.

Now that you have the Nexus client installed, you can connect to your PDW installation. If you're the first person connecting to the appliance, you need to use the standard SQL Server security login of "sa" as your username and the password for your PDW installation. The hardware vendor will have set the password for you during the appliance installation.

If you're not the first person using the appliance and someone has created an account for you, a login must have also been created for you on the appliance (with the CREATE LOGIN statement) and you must have been granted connect permissions (with the GRANT CONNECT statement). For the Nexus client to work fully, you should have three more permissions granted to your account. All three permissions can be granted with the GRANT SELECT ON OBJECT statement.


  • GRANT SELECT ON OBJECT::information_schema.schemata TO <login>
  • GRANT SELECT ON OBJECT::sys.databases TO <login>
  • GRANT SELECT ON OBJECT::sys.types TO <login>
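
Pulled together, the administrator-side setup for a new user might look like the following sketch. The login name and password are placeholders, and the exact CREATE LOGIN options accepted by your PDW release may differ:

    -- Create the login and allow it to connect to the appliance
    CREATE LOGIN AnalystJane WITH PASSWORD = 'P@ssw0rd!ChangeMe';
    GRANT CONNECT TO AnalystJane;

    -- Metadata permissions the Nexus client expects
    GRANT SELECT ON OBJECT::information_schema.schemata TO AnalystJane;
    GRANT SELECT ON OBJECT::sys.databases TO AnalystJane;
    GRANT SELECT ON OBJECT::sys.types TO AnalystJane;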

Configuring the Query Client

The next step is to start the freshly installed Nexus query client. On your Windows Start menu, you'll find the shortcut to the Nexus Query Chameleon in the \Coffing Data Warehousing\Nexus by Tera-Tom folder. You should also find a shortcut created on your desktop. Note that the Nexus tool is a third-party query tool and completely separate from SSMS and the other query tools provided with the other SQL Server editions.

When you open Nexus Query Chameleon for the first time, a wizard will guide you through the initial setup. In the wizard's opening screen, click Next. You'll be prompted to add a Data Source Connection, as shown in Figure 1. The Nexus query tool connects to many different kinds of databases, so its configuration options are not specific to PDW.

                                            
Figure 1: Adding a data source connection

In the Source Type drop-down list, select SQL Server PDW as your source type. Because this is probably the first time anyone has configured PDW on your computer, click the Add New button to create a new ODBC Data Source Name (DSN). This action will launch the ODBC Data Source Administrator. Click Add, select Microsoft SQL Server 2008 R2 Parallel Data Warehouse ODBC Client, and click Finish.

This process launches the Microsoft SQL Server 2008 R2 Parallel Data Warehouse ODBC Client Setup dialog box. Give your data source a user-friendly name (I used MyPDW) and a description if desired, then enter the PDW connection information. For this, you need either the computer name or the IP address of your PDW control node. Your administrator will specify the port number. (The system defaults to port 17000.) The default system database is dwsys, but you'll likely be assigned a particular database to use by default. Figure 2 shows the completed ODBC client setup for my test PDW system.


Figure 2: Displaying a completed ODBC client setup for PDW

Click Test Connect to verify that you entered the information correctly. When prompted for your PDW username and password, enter your security credentials and click OK. You should get the message that the connection test was successful. Click OK to dismiss that dialog box and OK again to get back to the ODBC Data Source Administrator screen. Your User DSN has now been created and tested, so click OK one more time to return to the Add Data Source Connection dialog box. Select your newly created ODBC DSN in the drop-down list (you might have to click the Refresh button to see the new entry), then enter your username and password. If this is your first time ever logging on to the PDW appliance, select the Alter Login after connection check box because all passwords are set by default to expire on first usage. You should now see a screen like that in Figure 3, completed and ready for you to proceed.  


                                            
Figure 3: Displaying a completed data source connection

Click Save. If you selected the Alter Login after connection check box, you are prompted to enter your new password. Enter your new password, and then click OK. Congratulations, you have now successfully connected to your PDW appliance.

There will be a tree of your available systems on the left side of your screen, as shown in Figure 4. If you expand a PDW system, you'll see a list of the available databases. If you expand a particular database, you'll see the tables and views in it.        

                                      
Figure 4: Exploring the available PDW systems

As you can see in Figure 4, I entered a query to find out how many rows are in the Sales_Item table in the Sample_Retail_5X database. When you enter a query, the Nexus query tool will highlight the System field on the toolbar. (I circled it so you can find it more easily.) You need to select the active system you're using for the query (MyPDW in this example). The Database field will then default to the system database dwsys.

You run a query by clicking Execute on the toolbar. When I ran my query, the USE command changed the database context to the specified database (Sample_Retail_5X), just like in a typical SQL Server database query. The number of rows in the Sales_Item table was then counted. As you can see on the Results 1 tab, the table has more than 7.6 billion rows.
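
For reference, the query behind Figure 4 amounts to something like the sketch below. COUNT_BIG is the safer choice here because a plain COUNT returns an int, which overflows well before 7.6 billion rows:

    USE Sample_Retail_5X;       -- switch the database context, as described above

    SELECT COUNT_BIG(*) AS SalesItemRows
    FROM   Sales_Item;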

Accessing the PDW Administration Console

The PDW Administration Console is a website hosted on the control node of the appliance. The console provides a variety of information about your PDW system. To access the console, open Microsoft Internet Explorer (IE), then navigate using a secure connection to the IP address of your PDW control node. You'll be prompted to log on to the appliance, just as when you were using the Nexus query tool. Once connected, you'll see a screen like that in Figure 5.


Figure 5: Exploring the PDW Administration Console

You have access to a significant amount of information in the PDW Administration Console from the following tabs:


  • Sessions: This tab provides information about the active sessions on your appliance. You have the ability to disconnect any session if you have the administrative rights to do so.
  • Queries: This tab shows queries that are currently running and provides a history of recent queries. You can also access showplans from this tab, which I'll discuss shortly.
  • Loads: This tab contains information about current and historical data-loading sessions. For example, you can find out the commands that were run in a past data-loading session, the steps processed during that data load, the start and end times of the load, and the tables affected.
  • Backups/Restores: This tab contains the history of every backup and restore operation that has occurred on the appliance. You can drill down to see the database affected, the backup or restore mode, who ran the backup or restore operation, and the commands used for that operation.
  • Appliance State: This tab lets you quickly check the health state of the appliance. If anything is in a trouble or error state, you'll get a graphical indication. You can drill down to see more detailed health-state information.
  • Alerts: This tab reflects any Critical Alerts or Warning Alerts, again with the option to drill down into detailed diagnostics information about the error or warning.
  • Locks: This tab gives you a view of all locks affecting the appliance, letting you drill into the details of the lock. You also can cancel a lock by terminating the query causing it.
  • Performance Monitor: This tab lets you configure custom graphs of information about disks, memory, processes, processors, and more. After you select the graphs you want to monitor, they'll refresh automatically, providing you with a custom troubleshooting console for your PDW appliance.

Looking at Showplans

A key use of the PDW Administration Console is to look at showplans, which can help you optimize queries and database designs. Showplans are automatically captured for queries run in PDW. I ran a more complicated query and received the showplan in Figure 6. You can drill down into these steps, down to the level of finding out what operations were performed on each PDW node to accomplish the query. When queries are run against a distributed table, you'll see that operations occur on every PDW node, as Figure 7 shows.


Figure 6: Examining a sample showplan



Figure 7: Drilling down into the sample showplan

With the showplan information in hand, you can begin the process of tuning and optimizing both your database design and your queries to improve performance.

Using SQL Server's BI Tools with PDW

As you might expect, you can use SQL Server's BI tools in PDW. Let's step through the main BI tools and look at how you can connect to PDW with them.

SQL Server Analysis Services (SSAS). To configure SSAS to work with PDW, start a new SSAS project in BIDS. Right-click the Data Sources folder and select New Data Source to launch the Data Source Wizard. Navigate to the Define the Connection dialog box and click New to bring up the Connection Manager dialog box. From the Provider drop-down list, select the ADO.NET provider named Microsoft SQL Server 2008 R2 Parallel Data Warehouse.

Next, you need to enter the IP address of your PDW installation for the Host parameter. Be sure the port is set to 17000 (or whatever your server is configured to use). Specify which database to use in the Database property, which is at the top of the dialog box under Advanced selections. You also need to enter your username for User ID and your PDW password. Click the Test Connection button to confirm connectivity, and click OK. Click Finish to complete the Data Source Wizard.

At this point, you'll proceed through the Data Source View Wizard as you would with any SQL Server installation. You can then build your cubes and dimensions on top of PDW.

SSIS. Before you can configure an SSIS package to work in PDW, you need to enable the SSIS PDW SQL Server Destination connector. Open BIDS and select the Choose Toolbox Items option on the Tools menu. Go to the SSIS Data Flow Items tab. Find and select the SQL Server PDW Destination check box. Click OK to enable this connector in your graphical interface.

If you right-click Connection Manager in your SSIS package, you'll see that your PDW connector is already available. To use it, you simply drag the SQL Server PDW Destination icon from the Data Flow Destinations list to the Data Flow design surface, just like you would any other data destination.

SQL Server Reporting Services (SSRS). Configuring SSRS to work with PDW is similar to the processes used for configuring SSAS and SSIS. Connectivity is through the ADO.NET provider. After you define the connection, you are good to go, just as you would be with any other SQL Server connection. Keep in mind that programmability through stored procedures in PDW is not the same as it is in SQL Server 2008 R2, so your reports will not necessarily work unchanged between SQL Server systems and PDW systems.

The Microsoft SQL Server PDW team has released a series of connectors to simplify connectivity between PDW and several third-party data products. At the time of this writing, connectors are available for Apache Hadoop, Informatica PowerCenter, the MicroStrategy BI Platform, and SAP BusinessObjects Data Integrator. With these connectors you can interoperate between PDW and the other BI products for optimal integration and speed of connectivity.

PDW Is Also About the Clients

Basic connectivity to the PDW appliance requires that you install and set up client drivers and client connectivity components. I explained how to set up that connectivity. I also explained how to install and set up the Nexus Query Chameleon on the client computer so that you can run queries. Finally, I showed you how to access the PDW Administration Console so that you can view and analyze showplans and other BI data. By taking advantage of the showplans, you can optimize the queries you run against your PDW appliance.

