Nuts & Bolts of DataStage: Parallelism


Wednesday, May 20, 2015

Disable auto insertion of Partition and Sort

Partitioner insertion and sort insertion each make writing a flow easier by alleviating the need for a user to think about either partitioning or sorting data. By examining the requirements of operators in the flow, the parallel engine can insert partitioners, collectors and sorts as necessary within a data flow.

However, there are some situations where these features can be avoided or are not needed.
If data is pre-partitioned and pre-sorted, and the InfoSphere DataStage job is unaware of this, you could disable automatic partitioning and sorting for the whole job by setting the following environment variables while the job runs:
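The variable list itself did not survive in this excerpt. As a minimal sketch, the snippet below assumes the two names that IBM's parallel engine documentation gives for this purpose, APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION; please verify them against your release. The value "True" and the helper name env_without_auto_insertion are illustrative assumptions, not part of the original post.

```python
import os

# Assumed variable names (taken from IBM's parallel engine documentation,
# not from the original post). The value "True" is also an assumption.
DISABLE_AUTO_INSERTION = {
    "APT_NO_PART_INSERTION": "True",  # no automatic partitioner/collector insertion
    "APT_NO_SORT_INSERTION": "True",  # no automatic sort insertion
}


def env_without_auto_insertion(base_env=None):
    """Return a copy of base_env (default: os.environ) with both variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(DISABLE_AUTO_INSERTION)
    return env


if __name__ == "__main__":
    env = env_without_auto_insertion()
    for name in sorted(DISABLE_AUTO_INSERTION):
        print(f"{name}={env[name]}")
```

In practice these would more likely be added to the job as environment-variable job parameters, so that they apply to one job rather than to the whole project.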

Things to consider while developing a DataStage job

Datasets
Datasets are the best choice for storing intermediate results. A dataset keeps its partitioning and sort order, if set. This saves re-partitioning and re-sorting in downstream jobs and makes the job more robust.

Job performance can be improved by following these guidelines:
1) Remove unnecessary columns from the upstream and downstream links.
2) Removing these unnecessary columns helps reduce memory consumption.
3) Always specify the list of columns in the select statement when reading from a database. This keeps unnecessary column data out of the job, which saves memory and network bandwidth.
4) Use runtime column propagation (RCP) very carefully.
5) Understand the data types before using them in the job. Profile the data before bringing it into the job.

DataStage Configuration file : Explained - 3

Below is the sample diagram for 1 node and 4 node resource allocation:

DataStage Configuration file : Explained - 2

1. When configuring an MPP system, you specify the physical nodes on which the parallel engine will run your parallel jobs. The node from which you start a job is called the conductor node. For other nodes, you do not need to specify the physical node. Also, you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. Suppose the conductor node is not connected to the high-speed network switches, while the other nodes are connected to each other through very high-speed network switches. How do you configure your system to achieve optimized parallelism?

1. Make sure that none of the stages are configured to run on the conductor node.

2. Use the conductor node only to start the execution of the parallel job.

3. Make sure that the conductor node is not part of the default pool.
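The three rules above can be sketched as a configuration in which the conductor's node entry lists only a named pool, while the compute nodes list the default pool (the empty string ""). The host names, paths, and the pool name "conductor" are hypothetical. The small checker is an illustration only: it assumes one statement per line and is not a full parser for the configuration file syntax.

```python
import re

# Hypothetical three-node layout: the conductor is kept out of the default pool.
SAMPLE_CONFIG = '''
{
  node "conductor"
  {
    fastname "head_host"
    pools "conductor"
    resource disk "/data/ds/datasets" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
  node "node1"
  {
    fastname "compute_host_1"
    pools ""
    resource disk "/data/ds/datasets" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "compute_host_2"
    pools ""
    resource disk "/data/ds/datasets" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
}
'''


def node_pools(config_text):
    """Map each node name to the pool names on its node-level 'pools' line."""
    pools = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("node "):
            current = re.findall(r'"([^"]*)"', line)[0]
            pools[current] = []
        elif line.startswith("pools ") and current is not None:
            # Resource-level pools sit inside "resource ..." lines and are skipped.
            pools[current] = re.findall(r'"([^"]*)"', line)
    return pools


def default_pool_nodes(config_text):
    """Return the nodes that belong to the default pool ("")."""
    return [name for name, names in node_pools(config_text).items() if "" in names]


if __name__ == "__main__":
    print(default_pool_nodes(SAMPLE_CONFIG))
```

Stages that are not constrained to a specific pool run on the default pool, so leaving the conductor out of it keeps processing on the nodes that share the fast interconnect.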

DataStage Configuration file : Explained - 1

The DataStage configuration file is a master control file for jobs (a text file that sits on the server side) which describes the parallel system resources and architecture. The configuration file provides the hardware configuration for supporting architectures such as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster or MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage understands the architecture of the system through this file.

This is one of the biggest strengths of DataStage. If you change your processing configuration, or change servers or platform, you generally do not have to change your jobs: every job depends on this configuration file for execution, so only the file needs to be updated. DataStage jobs determine which node to run a process on, where to store temporary data, and where to store dataset data, based on the entries provided in the configuration file. A default configuration file is available whenever the server is installed.

Using Configuration Files in Data Stage Best Practices & Performance Tuning

The configuration file tells DataStage Enterprise Edition (EE) how to exploit underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources such as databases and buffer storage. At runtime, EE first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources.

When you modify the system, by adding or removing nodes or disks, you must modify the DataStage EE configuration file accordingly. Since EE reads the configuration file every time it runs a job, it automatically scales the application to fit the system without having to alter the job design.
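To illustrate how the degree of parallelism lives in the configuration file rather than in the job design, here is a minimal sketch that renders a configuration with one logical node per host entry. The function name render_config, the host name, and the directory paths are hypothetical. The layout uses the commonly documented node, fastname, pools and resource entries, and should be checked against the default file shipped with your installation.

```python
def render_config(hosts, dataset_dir, scratch_dir):
    """Render a parallel configuration file with one logical node per entry in hosts.

    Repeating the same host name yields several logical nodes on one machine,
    which is the usual way to describe an SMP server.
    """
    blocks = []
    for i, host in enumerate(hosts, start=1):
        blocks.append(
            f'  node "node{i}"\n'
            "  {\n"
            f'    fastname "{host}"\n'
            '    pools ""\n'
            f'    resource disk "{dataset_dir}" {{pools ""}}\n'
            f'    resource scratchdisk "{scratch_dir}" {{pools ""}}\n'
            "  }"
        )
    return "{\n" + "\n".join(blocks) + "\n}\n"


if __name__ == "__main__":
    # Same job design, two different degrees of parallelism.
    print(render_config(["etl_host"], "/data/ds/datasets", "/data/ds/scratch"))
    print(render_config(["etl_host"] * 4, "/data/ds/datasets", "/data/ds/scratch"))
```

Pointing a job at the 1-node or the 4-node file (typically through the APT_CONFIG_FILE environment variable) changes how many partitions it runs with, without any change to the job itself.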
