Wednesday, April 14, 2010

Partitioning data (Best Practices) in DataStage E.E. 8.1

In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine chooses the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. Based on requirements, partitioning can be optimized within a job and across job flows.

Objective 1

Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
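As a quick illustration, partition balance can be checked by comparing per-partition row counts (visible in the Director job monitor) against a perfectly even split. The following Python sketch is conceptual only; the row counts and 4-node configuration are hypothetical.

rows_per_partition = [251000, 249500, 250800, 248700]  # hypothetical counts from a 4-node run

total = sum(rows_per_partition)
ideal = total / len(rows_per_partition)

# Skew: relative excess of the largest partition over an even split.
# Values near zero indicate a well-balanced workload.
skew = max(rows_per_partition) / ideal - 1
print("ideal rows/partition: %.0f, skew: %.2f%%" % (ideal, skew * 100))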

Objective 2
The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required.

Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method. This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records.
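To make the grouping requirement concrete, here is a minimal Python sketch (not DataStage code) of keyed partitioning: because every record with the same key value is sent to the same partition, a stage such as Aggregator can process each group entirely within one partition. The customer IDs and partition count are hypothetical.

from collections import defaultdict

NUM_PARTITIONS = 4  # assumed degree of parallelism

def hash_partition(key):
    # Keyed partitioning: records sharing a key value always map
    # to the same partition.
    return hash(key) % NUM_PARTITIONS

rows = [("CUST001", 10.00), ("CUST002", 5.50), ("CUST001", 7.25)]
partitions = defaultdict(list)
for cust_id, amount in rows:
    partitions[hash_partition(cust_id)].append((cust_id, amount))

# Both CUST001 records land in the same partition, so a per-customer
# total computed independently on each partition is still correct.
for p in sorted(partitions):
    print(p, partitions[p])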

Objective 3

Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations.

Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.

Objective 4
The partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance. Using the above objectives as a guide, the following methodology can be applied:

Start with Auto partitioning (the default).
Specify Hash partitioning for stages that require groups of related records, as follows (the sketch after this list illustrates these methods):
· Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient
· Use Modulus partitioning if the grouping is on a single integer key column
· Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused)

If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions.

· Especially useful if the input Data Set is highly skewed or sequential

Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning.

· Be mindful that Same partitioning retains the degree of parallelism of the upstream stage
· Within a flow, examine upstream partitioning and sort order and attempt to preserve them for downstream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).
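The partitioner choices above can be summarized in a short Python sketch. This is conceptual, not the engine's actual implementation, and the partition count is an assumption:

from itertools import count

NUM_PARTITIONS = 4  # assumed degree of parallelism

def hash_partition(key):
    # Hash: keyed, works for any key type; same key -> same partition.
    return hash(key) % NUM_PARTITIONS

def modulus_partition(int_key):
    # Modulus: keyed, cheaper than hashing, but requires a single
    # integer key column.
    return int_key % NUM_PARTITIONS

_counter = count()
def round_robin_partition(record):
    # Round Robin: keyless; deals records out evenly regardless of
    # content, rebalancing skewed or sequential input.
    return next(_counter) % NUM_PARTITIONS

Range partitioning would instead look each key up in a precomputed Range Map of key boundaries, which is why it only pays off when the key distribution is stable enough for that map to be reused.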


Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partitioning and sort order.
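As a conceptual illustration only (persistent Data Sets are a proprietary on-disk format; the file names and helper functions below are hypothetical), storing each partition separately and reading it back with the same partition count is what lets a downstream job skip re-partitioning:

import pickle
from pathlib import Path

def write_dataset(partitions, directory):
    # Store each partition's rows in its own file so the partitioning
    # (and any per-partition sort order) is retained on disk.
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    for i, rows in enumerate(partitions):
        (path / ("part%d.pkl" % i)).write_bytes(pickle.dumps(rows))

def read_dataset(directory, num_partitions):
    # A downstream "job" running with the same degree of parallelism
    # reads each partition straight back, with no re-partitioning needed.
    path = Path(directory)
    return [pickle.loads((path / ("part%d.pkl" % i)).read_bytes())
            for i in range(num_partitions)]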
