Introduction to Enterprise Edition
•  Parallel processing = executing your application on multiple CPUs
–  Scalable processing = adding more resources (CPUs and disks) to increase system performance
•  Example: a system containing 6 CPUs (or processing nodes) and disks
•  Run an application in parallel by executing it on 2 or more CPUs (see the sketch after this list)
•  Scale up the system by adding more CPUs
•  New CPUs can be added as individual nodes, or added to an SMP node
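To make this concrete, here is a minimal sketch in plain Python (not DataStage) of running the same application logic on several CPUs at once. The work function and the six-chunk split are illustrative, echoing the 6-CPU example system above.

    from multiprocessing import Pool, cpu_count

    def work(chunk):
        # Stand-in for real application logic.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        # Six equal chunks of work for a six-CPU example system.
        chunks = [range(i, i + 1_000_000)
                  for i in range(0, 6_000_000, 1_000_000)]
        with Pool(min(6, cpu_count())) as pool:
            print(sum(pool.map(work, chunks)))  # chunks run in parallel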
Traditional Batch Processing
Write to disk and read from disk before each processing operation:
•  Sub-optimal utilization of resources
–  a 10 GB stream leads to 70 GB of I/O (seven full passes over the data; see the accounting sketched after this list)
–  processing resources can sit idle during I/O
•  Very complex to manage (lots and lots of small jobs)
•  Becomes impractical with big data volumes
–  disk I/O consumes the processing time
–  terabytes of disk are required for temporary staging
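The 70 GB figure follows from counting full passes over the data. Below is a minimal sketch of one plausible accounting, assuming four processing steps with the stream landed to disk between each step (three temporary staging files); the step count is an illustrative assumption.

    # Landing to disk between steps: each staging file is written once
    # and read back once, on top of the initial source read.
    STREAM_GB = 10
    STEPS = 4                    # e.g. extract, transform, clean, load

    io_gb = STREAM_GB            # read the 10 GB source
    for _ in range(STEPS - 1):   # one staging file between each pair of steps
        io_gb += STREAM_GB       # write the staging file
        io_gb += STREAM_GB       # read it back for the next step
    print(io_gb, "GB of I/O")    # -> 70 GB of I/O for a 10 GB stream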
Pipeline Multiprocessing
Think of a conveyor belt moving rows from process to process!
•  Transform, clean, and load processes execute simultaneously
•  Rows move forward through the flow
•  A downstream process starts while an upstream process is still running (see the sketch after this list)
•  This eliminates intermediate storing to disk, which is critical for big data
•  This also keeps the processors busy
•  Still has limits on scalability
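A minimal sketch of pipeline multiprocessing in plain Python (not DataStage): three stages run at the same time, connected by small in-memory queues, so each row moves downstream as soon as it is produced and nothing is staged to disk. The stage functions are illustrative stand-ins.

    import threading
    import queue

    DONE = object()  # sentinel: the upstream stage has finished

    def stage(func, inbox, outbox):
        # Consume rows as they arrive, apply func, push results downstream.
        for row in iter(inbox.get, DONE):
            outbox.put(func(row))
        outbox.put(DONE)  # propagate end-of-data

    q_source, q_clean, q_load = (queue.Queue(maxsize=8) for _ in range(3))

    workers = [
        threading.Thread(target=stage, args=(str.upper, q_source, q_clean)),  # transform
        threading.Thread(target=stage, args=(str.strip, q_clean, q_load)),    # clean
    ]
    for w in workers:
        w.start()

    for row in (" alice ", " bob ", " carol "):  # source feeds rows one at a time
        q_source.put(row)
    q_source.put(DONE)

    # The "load" step runs while the upstream stages are still draining.
    for row in iter(q_load.get, DONE):
        print("loaded:", row)

    for w in workers:
        w.join()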
Partition Parallelism
•  Divide large data into smaller subsets (partitions) across resources
–  The goal is to distribute the data evenly
–  Some transforms require all data within the same group to be in the same partition
•  Requires the same transform on all partitions (see the sketch after this list)
–  BUT: each partition is independent of the others; there is no concept of "global" state
•  Facilitates near-linear scalability (in proportion to hardware resources)
–  8X faster on 8 processors
–  24X faster on 24 processors
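A minimal sketch of partition parallelism in plain Python (not DataStage): rows are hash-partitioned on a grouping key, so all rows for the same key land in the same partition, and the identical transform then runs on every partition independently, with no global state. The field names and the four-way split are illustrative.

    from multiprocessing import Pool

    def hash_partition(rows, key, n):
        # Spread rows across n partitions; same key -> same partition.
        parts = [[] for _ in range(n)]
        for row in rows:
            parts[hash(row[key]) % n].append(row)
        return parts

    def transform(part):
        # The identical transform runs on every partition independently:
        # total amount per customer, using only this partition's rows.
        totals = {}
        for row in part:
            totals[row["cust"]] = totals.get(row["cust"], 0) + row["amt"]
        return totals

    if __name__ == "__main__":
        rows = [{"cust": c, "amt": a}
                for c, a in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]]
        with Pool(4) as pool:  # four partitions on four CPUs
            print(pool.map(transform, hash_partition(rows, "cust", 4)))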
Enterprise Edition Combines Partition and Pipeline Parallelism
•  Within the Parallel Framework, pipelining and partitioning are always automatic
–  The job developer need only identify:
•  Sequential vs. parallel operations (by stage)
•  The method of data partitioning
•  The configuration file (there are advanced topics here; a sample is sketched after this list)
•  Advanced per-stage options (buffer tuning, operator combination, etc.)
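For reference, a parallel configuration file is a plain text file naming the processing nodes and their disk resources; the engine reads it at runtime to decide the degree of parallelism. A minimal sketch of a two-node configuration on a single host (the host name and paths are illustrative assumptions):

    {
        node "node1"
        {
            fastname "etl_host"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
        node "node2"
        {
            fastname "etl_host"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
    }

Adding nodes to this file scales the same job out without changing the job design.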
Job Design vs. Execution
The user assembles the flow using DataStage Designer … and at runtime, the job runs in parallel for any configuration (1 node, 4 nodes, N nodes).
Atul Singh






