Sunday, August 09, 2015

What is Parallelism in SSIS ?

Parallelism is a great technique to improve the performance of your data integration operations. SSIS natively supports the parallel execution of packages, tasks, and transformations. The trick to successful parallelism is to configure operations within the constraints of your system resources.

Configurable settings

Within SSIS, the control flow for each package is controlled by a setting called MaxConcurrentExecutables, which specifies the maximum number of SSIS threads that can execute in parallel per package. By default, this is set to -1, which translates to the number of logical machine processors plus 2.
If SSIS runs on a dedicated server and you have a lot of operations that run in parallel, you will likely want to increase this setting if some of the operations do a lot of waiting for external systems to reply. On the other hand, if you do not have a dedicated SSIS machine and your data integration application runs alongside several other applications, you may need to reduce this setting to avoid resource conflicts.

Design approaches

As you design packages for parallelism, you need to decide whether to run some or all of the operations in the package in parallel. As with buffer sizing decisions, decisions about parallelism are best made when you take into account available system resources.
Consider the tradeoffs of different design approaches that apply parallelism to a package that reads data from a source database, aggregates the data four different ways, and then loads each aggregated data set into a different destination table.

  • Parallelize Destination Operations - In the design approach pictured in Figure 3, data is read from a source, passed through one aggregation transformation with four different sets of group by operations, and then loaded into four destination tables. The only parallel operation in this design is the loading of the four destination tables from the aggregate output. The parsing of the source file and the aggregate calculation are not parallel operations.
    Figure 3: Parallelize destination operations
    If you have a machine that has multiple CPUs, then this approach is not going to enable you to leverage them effectively. This design approach is best when you have a memory constrained machine and if you have multiple aggregates that can be derived from each other. The aggregate transform will create the aggregate with the lowest granularity and then derive all related aggregates from that. For example, if you have two aggregates: (1) sales data aggregated by year and (2) sales aggregated by year by geography, the aggregate transform automatically creates the sales aggregated by year and geography and then derives the yearly summary from that.
  • Partially Parallelize Operations - In the design approach pictured in Figure 4, data is read from a source, passed through a multicast transformation that creates four identical result sets that are passed to four different aggregation transformations, and then loaded into four destination tables.
    Figure 4: Partially parallelize operations
    In spite of what you might think, the only parallelism in this scenario is the loading of the destination tables. Reading the data source, performing the multicast, and aggregating data each operate in the same execution tree and therefore share the same memory and threading.
    If you would like to run the aggregates in parallel, in this scenario, you can introduce a Union All transformation after the Multicast to create a new execution tree. Remember that Union All is a partially blocking transformation and always creates a new execution tree. When you introduce a Union All, data will be copied into additional buffer, but you will also gain additional threads to perform the aggregate in parallel.
  • Parallelize All Operations - In the design approach pictured in Figure 5, there are four independent sets of operations that each read data from the source, aggregate data, and load data into a distinct destination.
    Figure 5 – Parallelize All Operations
    In this scenario, all operations are performed in parallel: data source extraction, aggregate, and insertion to the database. If your server is not constrained by memory and has multiple CPUs, this approach provides a high-performance solution; however, you may be wasting resources by treating every operation uniformly and by reading the same data set four different times.
  • Optimize the Slowest - As an alternative to uniform parallelism, you may want to consider the targeted design approach displayed in Figure 6 where you focus on the slowest component in your package that is most likely to benefit from parallelism. In this design approach, one of the four aggregates has been identified as the slowest aggregate operation and has been broken out into a separate data stream.
    Figure 6: Optimize the slowest
    To speed up performance in the new data stream, the aggregate has been broken down into two “partitions” by using a conditional split that segregates the source data based on key ranges. After each aggregate is calculated, the data sets are combined with a Union All transformation. The resulting data is loaded into the destination.
    This hybrid design approach demonstrates how you can apply parallelism to the specific operations that can benefit the most from the performance gain without potentially wasting machine resources.

No comments:

Post a Comment