Introduction to Enterprise Edition

• Parallel processing = executing your application on multiple CPUs
– Scalable processing = add more resources (CPUs and disks) to increase system performance
• Example system containing 6 CPUs (or processing nodes) and disks
• Run an application in parallel by executing it on 2 or more CPUs
• Scale up the system by adding more CPUs
• Can add new CPUs as individual nodes, or add CPUs to an SMP node
Traditional Batch Processing

Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
– a 10 GB stream leads to 70 GB of I/O
– processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
– disk I/O consumes the processing time
– terabytes of disk required for temporary staging
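The "10 GB stream leads to 70 GB of I/O" figure can be reproduced with a little arithmetic. One plausible breakdown (an assumption, not stated above) is seven full passes over disk: one initial read plus three intermediate staging files, each written once and read back once.

```python
# Hedged arithmetic sketch: how 10 GB of data could turn into 70 GB
# of disk I/O under land-between-every-step batch processing.
# Assumption: 3 intermediate staging files, each written and re-read.
stream_gb = 10
staging_files = 3
passes = 1 + staging_files * 2  # initial read + (write + read) per staging file
total_io_gb = stream_gb * passes
print(total_io_gb)  # 70
```

The exact step count is illustrative; the point is that every batch boundary multiplies the I/O by a full pass over the data.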
Pipeline Multiprocessing

Think of a conveyor belt moving rows from process to process!
• Transform, clean, and load processes are executing simultaneously
• Rows are moving forward through the flow
• Start a downstream process while an upstream process is still running
• This eliminates intermediate storing to disk, which is critical for big data
• This also keeps the processors busy
• Still have limits on scalability
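The conveyor-belt idea can be sketched with Python generators: each stage pulls one row at a time from the stage upstream, so rows stream through transform, clean, and load without any intermediate file. (The stage names mirror the text; the row operations themselves are hypothetical placeholders, and generators model the streaming dataflow rather than true concurrent execution.)

```python
# Minimal pipeline sketch: rows flow stage-to-stage, one at a time,
# with no intermediate staging to disk.

def extract(rows):
    for row in rows:          # source stage: emit rows one by one
        yield row

def transform(rows):
    for row in rows:
        yield row.upper()     # hypothetical transform: uppercase

def clean(rows):
    for row in rows:
        yield row.strip()     # hypothetical clean: trim whitespace

def load(rows):
    return list(rows)         # terminal stage: collect results

source = ["  alpha ", " beta  "]
result = load(clean(transform(extract(source))))
print(result)  # ['ALPHA', 'BETA']
```

Because each stage yields rows on demand, the downstream stage starts consuming while the upstream stage is still producing, which is exactly the conveyor-belt behavior described above.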
Partition Parallelism

• Divide large data into smaller subsets (partitions) across resources
– Goal is to evenly distribute the data
– Some transforms require all data within the same group to be in the same partition
• Requires the same transform on all partitions
– BUT: each partition is independent of the others; there is no concept of “global” state
• Facilitates near-linear scalability (in correspondence to hardware resources)
– 8X faster on 8 processors
– 24X faster on 24 processors
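A rough Python sketch of the same idea: split the data into evenly distributed partitions, run the identical transform on every partition with no shared state, then recombine. (The names here are illustrative, not DataStage APIs; a thread pool stands in for separate processing nodes.)

```python
# Partition parallelism sketch: same transform on every partition,
# each partition independent of the others.
from multiprocessing.dummy import Pool  # thread pool stands in for nodes

def transform_partition(partition):
    # the identical transform applied to every partition
    return [x * 2 for x in partition]

data = list(range(12))
n_partitions = 4
# round-robin partitioning: distributes rows evenly across partitions
partitions = [data[i::n_partitions] for i in range(n_partitions)]
with Pool(n_partitions) as pool:
    results = pool.map(transform_partition, partitions)
# recombine the independent partition outputs
merged = sorted(x for part in results for x in part)
print(merged)
```

Round-robin is only one partitioning method; as the text notes, grouping transforms need a key-based method so that all rows of a group land in the same partition.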
Enterprise Edition Combines Partition and Pipeline Parallelisms

• Within the Parallel Framework, pipelining and partitioning are always automatic
– Job developer need only identify:
• Sequential vs. parallel operations (by stage)
• Method of data partitioning
• Configuration file (there are advanced topics here)
• Advanced per-stage options (buffer tuning, combination, etc.)
Job Design vs. Execution

User assembles the flow using DataStage Designer
… at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)
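The "same job, any configuration" property can be sketched by parameterizing the node count: the job logic (a small pipelined flow per partition) never changes, only the number of partitions does, and the result is identical for 1 node or N nodes. (All names and stage operations here are illustrative assumptions, not DataStage constructs.)

```python
# Sketch: one job definition, run unchanged for any node count.

def pipeline(rows):
    transformed = (r * 10 for r in rows)    # pipelined stage 1
    cleaned = (r + 1 for r in transformed)  # pipelined stage 2
    return list(cleaned)                    # stage 3 collects

def run(data, nodes):
    # partition the data, run the same pipeline in every partition
    partitions = [data[i::nodes] for i in range(nodes)]
    return sorted(x for p in partitions for x in pipeline(p))

data = list(range(8))
# identical results whether the "configuration" is 1 node or 4 nodes
assert run(data, 1) == run(data, 4)
```

This mirrors the slide's point: the developer designs the flow once, and the framework maps it onto whatever configuration file describes the hardware at runtime.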
Atul Singh