Saturday, July 05, 2014

Parallel Data Warehouse

The SQL Server 2008 R2 Parallel Data Warehouse (PDW) edition is Microsoft's first product in the Massively Parallel Processing (MPP) data warehouse space.
PDW uniquely combines MPP software acquired from DATAllegro, parallel compute nodes, commodity servers, and disk storage.
PDW lets you scale out enterprise data warehouse solutions into the hundreds of terabytes and even petabytes of data for the most demanding customer scenarios. In addition, because the parallel compute nodes work concurrently, it often takes only seconds to get the results of queries run against tables containing trillions of rows. For many customers, the large data sets and the fast query response times against those data sets are game-changing opportunities for competitive advantage.
The simplest way to think of PDW is as a layer of integrated software that logically forms an umbrella over the parallel compute nodes. Each compute node is a single physical server that runs its own instance of the SQL Server 2008 relational engine in a shared-nothing architecture. In other words, compute node 1 doesn't share CPU, memory, or storage with compute node 2.
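A practical consequence of shared-nothing is that each table's rows must be divided among the compute nodes, typically by hashing a distribution column, so every node can scan and join its own slice of the data independently. The short Python sketch below illustrates that idea conceptually; the node count (8) and the order_id distribution column are assumptions for illustration, and PDW's actual hashing and storage layout are internal to the appliance.

import hashlib

NUM_COMPUTE_NODES = 8  # assumed: one shared-nothing SQL Server instance per node

def node_for_row(distribution_key) -> int:
    """Map a row's distribution-column value to a compute node."""
    digest = hashlib.md5(str(distribution_key).encode()).hexdigest()
    return int(digest, 16) % NUM_COMPUTE_NODES

# Rows with the same order_id always land on the same node, so joins and
# aggregations keyed on order_id need no data movement between nodes.
rows = [{"order_id": i, "amount": 10.0 * i} for i in range(1, 11)]
placement = {}
for row in rows:
    placement.setdefault(node_for_row(row["order_id"]), []).append(row)

for node, node_rows in sorted(placement.items()):
    print(f"compute node {node}: {len(node_rows)} rows")

Because each node holds a disjoint slice, adding nodes adds both storage and scan throughput, which is what lets the appliance scale out rather than up.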
The smallest PDW will take up two full racks of space in a data center, and you can add storage and compute capacity to PDW one data rack at a time. A data rack contains 8 to 10 compute servers from vendors such as Bull, Dell, HP, and IBM, and Fibre Channel storage arrays from vendors such as EMC, HP, and IBM. PDW is sold as preconfigured, pretested software and hardware, tuned to deliver balanced throughput and I/O for very large databases. Microsoft and these hardware vendors provide full planning, implementation, and configuration support for PDW.
The collection of physical servers and disk storage arrays that make up the MPP data warehouse is often referred to as an appliance. Although the appliance is often thought of as a single box or single database server, a typical PDW appliance consists of dozens of physical servers and disk storage arrays all working together, often in parallel and under the orchestration of a single server called the control node. The control node accepts client query requests and creates an MPP execution plan that can call on one or more compute nodes to execute different parts of the query, often in parallel. The retrieved results are sent back to the client as a single result set.
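To make the control node's role concrete, the following Python sketch models the scatter-gather flow in simplified form: a query is dispatched to every compute node in parallel, each node computes a partial result against its own data slice, and the control node merges the partials into the single result set returned to the client. The node count, sample data, and partial-SUM aggregation are illustrative assumptions, not PDW's actual execution-plan machinery.

from concurrent.futures import ThreadPoolExecutor

# Pretend each compute node holds its own slice of a sales table (assumed data).
NODE_SLICES = [
    [("2014-07", 120.0), ("2014-07", 80.0)],   # compute node 1
    [("2014-07", 200.0), ("2014-06", 50.0)],   # compute node 2
    [("2014-06", 75.0)],                        # compute node 3
]

def run_on_node(slice_rows):
    """Each node aggregates only its local rows (partial SUM per month)."""
    partial = {}
    for month, amount in slice_rows:
        partial[month] = partial.get(month, 0.0) + amount
    return partial

def control_node_query():
    """Scatter the query to all nodes in parallel, then gather and merge partials."""
    with ThreadPoolExecutor(max_workers=len(NODE_SLICES)) as pool:
        partials = list(pool.map(run_on_node, NODE_SLICES))
    final = {}
    for partial in partials:
        for month, total in partial.items():
            final[month] = final.get(month, 0.0) + total
    return final  # single result set handed back to the client

print(control_node_query())  # {'2014-07': 400.0, '2014-06': 125.0}

The key point the sketch captures is that most of the work happens close to the data on each node, and the control node only combines relatively small partial results, which is why queries over trillions of rows can return in seconds.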
