1. When configuring an MPP system, you specify the physical nodes on which the parallel engine will run your parallel jobs. The node from which you start the engine is called the conductor node; for the other nodes, you do not need to specify the physical node. You also need to copy the configuration (.apt) file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switches, while the other nodes are connected to each other through very high-speed network switches. How do you configure your system so that you can achieve optimized parallelism?
1. Make sure that none of the stages is specified to run on the conductor node.
2. Use the conductor node only to start the execution of the parallel job.
3. Make sure that the conductor node is not part of the default node pool.
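These three points can be captured in the configuration (.apt) file itself: give the conductor node its own named pool and leave it out of the default pool (the empty string ""), so the engine never places stage processes on it. A minimal sketch; the host names and paths below are hypothetical:

```
{
  node "conductor" {
    fastname "slow_host"
    pools "conductor"
    resource disk "/ds/cond/disk" {pools ""}
    resource scratchdisk "/ds/cond/scratch" {pools ""}
  }
  node "node1" {
    fastname "fast_host1"
    pools ""
    resource disk "/ds/n1/disk" {pools ""}
    resource scratchdisk "/ds/n1/scratch" {pools ""}
  }
  node "node2" {
    fastname "fast_host2"
    pools ""
    resource disk "/ds/n2/disk" {pools ""}
    resource scratchdisk "/ds/n2/scratch" {pools ""}
  }
}
```

Because "conductor" appears only in the "conductor" pool and not in the default "" pool, stage processes run on node1 and node2 only, while the job is still started from the conductor machine.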
2. Although parallelization increases the throughput and speed of a process, why is maximum parallelization not necessarily the optimal parallelization?
1. DataStage creates one process for every stage on each processing node. Hence, if the hardware resources are not available to support the maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with four stages, and we define three logical nodes (one corresponding to each CPU). DataStage will start 3 × 4 = 12 processes, which all have to be managed by a single operating system, so significant time will be spent in context switching and process scheduling.
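The 3-CPU SMP in this example would be described by three logical nodes that all share one fastname, since they live on the same physical machine (the host name and paths are hypothetical):

```
{
  node "node1" {
    fastname "smp_host"
    pools ""
    resource disk "/ds/disk1" {pools ""}
    resource scratchdisk "/ds/scratch1" {pools ""}
  }
  node "node2" {
    fastname "smp_host"
    pools ""
    resource disk "/ds/disk2" {pools ""}
    resource scratchdisk "/ds/scratch2" {pools ""}
  }
  node "node3" {
    fastname "smp_host"
    pools ""
    resource disk "/ds/disk3" {pools ""}
    resource scratchdisk "/ds/scratch3" {pools ""}
  }
}
```

With a four-stage job, each of these three logical nodes runs one process per stage, which is where the 12 processes above come from.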
3. Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. How do you decide which node is suitable for which stage?
1. If a stage performs a memory-intensive task, it should be run on a node that has more memory available to it (and enough scratch disk space, since operations such as sorts spill to scratch disk when the data does not fit in memory). For example, sorting data is a memory-intensive task and should be run on such nodes.
2. If a stage depends on licensed software (e.g., the SAS stage or RDBMS-related stages), you need to associate that stage with a processing node that is physically mapped to the machine on which the licensed software is installed. (Assumption: the machine on which the licensed software is installed is connected to the other machines through a high-speed network.)
3. If a job contains stages that exchange large amounts of data, they should be assigned to nodes where the stages can communicate in the most optimized manner, either through shared memory (SMP) or over a high-speed link (MPP).
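Such stage-to-node mappings are normally expressed through named node pools: list a specialized node in an extra pool, then constrain the relevant stage to that pool in the job design. A sketch with hypothetical pool names, hosts, and paths:

```
{
  node "node1" {
    fastname "bigmem_host"
    pools "" "sort"
    resource disk "/ds/n1/disk" {pools ""}
    resource scratchdisk "/ds/n1/scratch" {pools ""}
  }
  node "node2" {
    fastname "sas_host"
    pools "" "sas"
    resource disk "/ds/n2/disk" {pools ""}
    resource scratchdisk "/ds/n2/scratch" {pools ""}
  }
}
```

A Sort stage constrained to the "sort" pool will run only on bigmem_host, and a SAS stage constrained to the "sas" pool will run only on the machine holding the SAS license; both nodes still take ordinary work through the default "" pool.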
4. Nodes are essentially a set of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node. The conductor node creates a shell on the remote machines (depending on the processing nodes) and copies the same environment onto them. However, it is possible to create a startup script that selectively changes the environment on a specific node. This script has the default name startup.apt, but, like the main configuration file, we can have many startup scripts; the appropriate one can be picked using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?
1. Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the parallel engine not to run the startup script on the remote shells.
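Both settings are ordinary environment variables exported before the job is started; the script path used below is a hypothetical example:

```shell
# Point the engine at a non-default startup script for this run
# (the path here is a hypothetical example)
export APT_STARTUP_SCRIPT=/opt/dstage/etc/startup_node_env.apt

# Or instruct the engine to run no startup script at all
# on the remote shells
export APT_NO_STARTUP_SCRIPT=1
```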
5. What generic guidelines must one follow while creating a configuration file so that optimal parallelization can be achieved?
1. Consider avoiding the disk(s) that your input files reside on.
2. Ensure that the different file systems mentioned as disk and scratchdisk resources hit disjoint sets of spindles, even if they are located on a RAID (Redundant Array of Inexpensive Disks) system.
3. Know what is real and what is NFS:
1. Real disks are directly attached, or are reachable over a SAN (storage area network: dedicated, just for storage, low-level protocols).
2. Never use NFS file systems for scratchdisk resources; remember that scratch disks are also used for temporary storage of files/data during processing.
3. If you use NFS file system space for disk resources, you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that does not mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a "final" disk pool, constrain the result sequential file or data set to reside there, and let intermediate storage go to local or SAN resources, not NFS.
4. Know which file systems are striped (RAID) and which are not. Where possible, avoid striping across file systems that are already striped at the spindle level.
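Putting these guidelines together in a resource sketch: disk and scratchdisk resources sit on separate local or SAN file systems, and an NFS area is exposed only through a named "final" disk pool for result data sets (all paths, host names, and pool names are hypothetical):

```
{
  node "node1" {
    fastname "host1"
    pools ""
    resource disk "/fs_local1/ds/disk" {pools ""}
    resource disk "/nfs/results" {pools "final"}
    resource scratchdisk "/fs_local2/ds/scratch" {pools ""}
  }
}
```

Scratch space never touches NFS here; only data sets explicitly constrained to the "final" disk pool land on /nfs/results.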
njoy the simplicity.......
Atul Singh