We have moved to www.dataGenX.net. Keep learning with us.

Thursday, June 28, 2012

Partitioning & Pipelining in DataStage


Introduction to Enterprise Edition

- Parallel processing = executing your application on multiple CPUs
- Scalable processing = adding more resources (CPUs and disks) to increase system performance

Example: a system containing 6 CPUs (or processing nodes) and disks
- Run an application in parallel by executing it on 2 or more CPUs
- Scale up the system by adding more CPUs
- New CPUs can be added as individual nodes, or as extra CPUs within an SMP node





Traditional Batch Processing


Data is written to disk and read back from disk before each processing operation:
- Sub-optimal utilization of resources
  - a 10 GB stream can lead to 70 GB of I/O once intermediate results are repeatedly written and re-read
  - processing resources sit idle during I/O
- Very complex to manage (lots and lots of small jobs)
- Becomes impractical at big data volumes
  - disk I/O dominates the processing time
  - terabytes of disk are required for temporary staging




Pipeline Multiprocessing

Think of a conveyor belt moving rows from process to process!
- Transform, clean, and load processes execute simultaneously
- Rows move forward through the flow
- A downstream process starts while the upstream process is still running
- This eliminates intermediate staging to disk, which is critical for big data
- It also keeps the processors busy
- Scalability is still limited (by the number of stages in the flow)
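The conveyor-belt idea can be sketched with Python generators, where each stage pulls rows from the previous one as they arrive, so no intermediate result is landed to disk. This is a simplified analogy of pipelining, not DataStage code; the stage names and row layout are invented for illustration.

```python
# A toy pipeline: each stage is a generator that pulls rows from the
# previous stage, so extract, transform, and load overlap in time and
# nothing is staged to disk between stages.

def extract():
    # Hypothetical source: yields rows one at a time
    for i in range(5):
        yield {"id": i, "value": i * 10}

def transform(rows):
    # Derives a column as rows stream past
    for row in rows:
        row["value"] = row["value"] + 1
        yield row

def load(rows):
    # Consumes rows as soon as upstream stages produce them
    return [row["value"] for row in rows]

result = load(transform(extract()))
print(result)  # [1, 11, 21, 31, 41]
```

Because the stages are chained lazily, the first row reaches `load` before `extract` has finished producing the last one, which is exactly the conveyor-belt behaviour described above.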


Partition Parallelism

- Divide large data into smaller subsets (partitions) across resources
  - the goal is to distribute data evenly
  - some transforms require all rows within the same group to be in the same partition
- Requires the same transform on all partitions
  - BUT: each partition is independent of the others; there is no concept of "global" state
- Facilitates near-linear scalability (in proportion to hardware resources)
  - 8X faster on 8 processors
  - 24X faster on 24 processors
Enterprise Edition Combines Partition and Pipeline Parallelism

Within the Parallel Framework, pipelining and partitioning are always automatic. The job developer need only identify:
- Sequential vs. parallel operations (by stage)
- The method of data partitioning
- The configuration file (there are advanced topics here)
- Advanced per-stage options (buffer tuning, operator combination, etc.)
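Combining the two ideas, each partition runs its own pipeline of stages: rows flow stage-to-stage within a partition while the partitions run side by side. Again a hypothetical Python sketch (the partition contents and stage logic are invented), not the actual framework:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(partition):
    # A per-partition pipeline: generator expressions chain the stages,
    # so rows stream through without intermediate staging.
    cleaned = (x for x in partition if x is not None)   # "clean" stage
    doubled = (x * 2 for x in cleaned)                  # "transform" stage
    return sum(doubled)                                 # "load" stage

partitions = [[1, None, 2], [3, 4], [5, None]]          # e.g. a 3-node run
with ThreadPoolExecutor(max_workers=len(partitions)) as ex:
    totals = list(ex.map(run_pipeline, partitions))
print(totals)  # [6, 14, 10]
```

The same `run_pipeline` works unchanged whether there is 1 partition or N, which mirrors the point below: the job design stays the same for any configuration.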







Job Design vs. Execution

User assembles the flow using DataStage Designer




 … at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)




njoy the simplicity.......
Atul Singh
