Nuts & Bolts of DataStage: Partitioning & Pipelining in DataStage

Thursday, June 28, 2012

Partitioning & Pipelining in DataStage

Introduction to Enterprise Edition

• Parallel processing = executing your application on multiple CPUs

– Scalable processing = add more resources (CPUs and disks) to increase system performance

• Example system containing 6 CPUs (or processing nodes) and disks

• Run an application in parallel by executing it on 2 or more CPUs

• Scale up system by adding more CPUs

• Can add new CPUs as individual nodes, or add CPUs to an SMP node

Traditional Batch Processing

Write to disk and read from disk before each processing operation

• Sub-optimal utilization of resources

– A 10 GB stream leads to 70 GB of I/O

– Processing resources can sit idle during I/O

• Very complex to manage (lots and lots of small jobs)

• Becomes impractical with big data volumes

– Disk I/O consumes most of the processing time

– Terabytes of disk required for temporary staging

Pipeline Multiprocessing

Think of a conveyor belt moving rows from process to process!

• Transform, clean and load processes are executing simultaneously

• Rows are moving forward through the flow

• Start a downstream process while an upstream process is still running.

• This eliminates intermediate storing to disk, which is critical for big data.

• This also keeps the processors busy.

• Pipelining alone still has limits on scalability
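The conveyor-belt idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how DataStage implements pipelining: the stage names (transform, clean, load) come from the bullets above, while `stage`, `run_pipeline`, `buffer_size` and the function bodies are hypothetical. Each stage runs in its own thread and hands rows to the next stage through a small in-memory queue, so a downstream stage starts working as soon as the first row arrives and nothing is staged to disk.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the row stream


def stage(func, inbox, outbox):
    """Apply func to each row as it arrives and pass the result downstream."""
    while True:
        row = inbox.get()
        if row is SENTINEL:
            outbox.put(SENTINEL)  # tell the next stage we are done
            return
        outbox.put(func(row))


def run_pipeline(rows, funcs, buffer_size=4):
    """Run every function in funcs as its own thread, linked by bounded queues."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]

    def feed():
        for row in rows:
            queues[0].put(row)
        queues[0].put(SENTINEL)

    workers.append(threading.Thread(target=feed))
    for worker in workers:
        worker.start()

    # Collect rows from the last queue while all stages are still running.
    results = []
    while True:
        row = queues[-1].get()
        if row is SENTINEL:
            break
        results.append(row)
    for worker in workers:
        worker.join()
    return results


# Hypothetical stage logic, for illustration only.
def transform(row):
    return row.strip().upper()


def clean(row):
    return row.replace("_", " ")


def load(row):
    return f"LOADED:{row}"


loaded = run_pipeline([" a_b ", "c_d"], [transform, clean, load])
print(loaded)
```

The bounded queues play the role of the belt: a fast upstream stage blocks when the buffer is full instead of piling rows up. There is still only one worker per stage, which is one way to see why pipelining by itself has a scalability limit.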

Partition Parallelism

• Divide large data into smaller subsets (partitions) across resources

– Goal is to evenly distribute data

– Some transforms require all data within same group to be in same partition

• Requires the same transform on all partitions

– BUT: each partition is independent of the others; there is no concept of “global” state

• Facilitates near-linear scalability (performance grows in proportion to hardware resources)

– Roughly 8X faster on 8 processors

– Roughly 24X faster on 24 processors
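A minimal Python sketch of the partitioning idea, assuming a hash-style partitioner on a key column. The names `hash_partition` and `total_by_customer` and the sample rows are made up for illustration and are not DataStage APIs. Rows with the same key always land in the same partition, the same transform runs on every partition, and no partition looks at another partition's data.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def hash_partition(rows, key, num_partitions):
    """Send every row with the same key value to the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's hash() for strings
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def total_by_customer(partition):
    """The same transform runs on each partition and sees only its own rows."""
    totals = defaultdict(int)
    for row in partition:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# Hypothetical sample data.
rows = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "alice", "amount": 7},
    {"customer": "carol", "amount": 3},
    {"customer": "bob", "amount": 1},
]

partitions = hash_partition(rows, "customer", num_partitions=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    per_partition = list(pool.map(total_by_customer, partitions))

# No key spans two partitions, so the partial results can simply be combined.
totals = {}
for partial in per_partition:
    totals.update(partial)
print(totals)
```

Hashing on the key keeps each group together, which is what a grouped transform needs, but it does not guarantee an even split: a few very frequent keys can leave one partition much larger than the others.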

Enterprise Edition Combines Partition and Pipeline Parallelisms

• Within the Parallel Framework, pipelining and partitioning are always automatic

– The job developer need only identify:

• Sequential vs. Parallel operations (by stage)

• Method of data partitioning

• Configuration file (there are advanced topics here)

• Advanced per-stage options (buffer tuning, operator combination, etc.)

Job Design vs. Execution

The user assembles the flow once using DataStage Designer

… at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)
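To make the design-versus-execution split concrete, here is a small Python sketch, again under assumptions: `job` stands in for the designed flow and `num_nodes` stands in for what the configuration file would supply; both names and the round-robin split are hypothetical. The job logic is written once, and only the number of partitions changes between runs.

```python
from concurrent.futures import ThreadPoolExecutor


def job(partition):
    """The designed flow: the same row-level logic runs on every partition."""
    return [row * 2 for row in partition if row % 3 != 0]


def run(rows, num_nodes):
    """Split rows round-robin across num_nodes partitions and run job on each."""
    partitions = [rows[i::num_nodes] for i in range(num_nodes)]
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        outputs = list(pool.map(job, partitions))
    # Row order across partitions is not meaningful, so sort for comparison.
    return sorted(row for part in outputs for row in part)


rows = list(range(1, 21))
results = {n: run(rows, n) for n in (1, 4, 8)}
print(results[1] == results[4] == results[8])
```

The job definition never mentions how many nodes exist; the degree of parallelism is a runtime parameter. Python threads are used here only to keep the sketch short and do not give true multi-CPU speedup for this kind of work.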

Enjoy the simplicity...
Atul Singh
