The DataStage configuration file is a master control file (a text file that
sits on the server side) that describes the parallel system resources and
architecture available to jobs. It supplies the hardware configuration for
architectures such as SMP (a single machine with multiple CPUs, shared memory
and shared disk), Grid, Cluster, or MPP (multiple nodes, each with its own CPUs
and dedicated memory). DataStage understands the architecture of the system
through this file. This is one of the biggest strengths of DataStage.
If you change your processing configuration, or move to different servers or a
different platform, you usually do not have to touch your jobs, since every job
relies on this configuration file for execution. Based on the entries provided
in the configuration file, DataStage jobs determine which nodes to run
processes on, where to store temporary data, and where to store dataset data.
A default configuration file is available whenever the server is installed.
Configuration files have the extension ".apt". The main benefit of the
configuration file is that it separates software and hardware configuration
from job design, so hardware and software resources can change without changing
a job design. DataStage jobs can point to different configuration files by
using job parameters, which means that a job can use different hardware
architectures without being recompiled.
The configuration file lists the processing nodes and specifies the disk space
provided for each one. These are logical processing nodes, so having more than
one CPU does not mean that the nodes in your configuration file correspond to
those CPUs. It is possible to have more than one logical node on a single
physical node. However, you should be careful when choosing the number of
logical nodes on a single physical node. Increasing the number of nodes
increases the degree of parallelism, but it does not necessarily mean better
performance, because it also results in more processes. If the underlying
system does not have the capacity to handle this load, you will end up with a
very inefficient configuration.
1. APT_CONFIG_FILE is the environment variable that DataStage uses to
   determine which configuration file to use (a project can have many
   configuration files). In fact, this is what is generally used in
   production. However, if this environment variable is not defined, how does
   DataStage determine which file to use?
   1. If the APT_CONFIG_FILE environment variable is not defined, DataStage
      looks for the default configuration file (config.apt) in the following
      locations, in order:
      1. The current working directory.
      2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level
         directory of the DataStage installation.
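The lookup order above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not DataStage code: the function name resolve_config_file and its parameters are made up for the example, and the environment is passed in as a plain dictionary so the logic is easy to try out.

```python
import os
from typing import Mapping, Optional


def resolve_config_file(env: Mapping[str, str], cwd: str) -> Optional[str]:
    """Return the configuration file path using the lookup order described above.

    Hypothetical helper for illustration only; it is not part of DataStage.
    """
    # 1. An explicitly defined APT_CONFIG_FILE always wins.
    explicit = env.get("APT_CONFIG_FILE")
    if explicit:
        return explicit

    # 2. Otherwise look for config.apt in the current working directory,
    #    then in $APT_ORCHHOME/etc.
    candidates = [os.path.join(cwd, "config.apt")]
    orch_home = env.get("APT_ORCHHOME")
    if orch_home:
        candidates.append(os.path.join(orch_home, "etc", "config.apt"))

    for path in candidates:
        if os.path.isfile(path):
            return path

    # Nothing found: the caller has to decide how to report this.
    return None


if __name__ == "__main__":
    print(resolve_config_file(os.environ, os.getcwd()))
```

Passing the environment and working directory as arguments (instead of reading them inside the function) keeps the sketch easy to test against a temporary directory.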
2. Define a node in the configuration file.
   A node is a logical processing unit. Each node in a configuration file is
   distinguished by a virtual name and defines the number and speed of CPUs,
   memory availability, page and swap space, network connectivity details, etc.
3. What are the different options a logical node can have in the
   configuration file?
   1. fastname – The fastname is the physical node name that stages use to
      open connections for high-volume data transfers. The value of this
      option is often the network name. Typically, you can get this name with
      the Unix command 'uname -n'.
   2. pools – The names of the pools to which the node is assigned. Based on
      the characteristics of the processing nodes, you can group nodes into
      sets of pools.
      1. A pool can be associated with many nodes, and a node can be part of
         many pools.
      2. A node belongs to the default pool unless you explicitly specify a
         pools list for it and omit the default pool name ("") from the list.
      3. A parallel job, or a specific stage in the parallel job, can be
         constrained to run on a pool (a set of processing nodes).
         1. If a job and a stage within the job are both constrained to run
            on specific processing nodes, the stage runs on the nodes that
            are common to the stage and the job.
   3. resource – The syntax is: resource resource_type "location" [{pools
      "disk_pool_name"}] | resource resource_type "value". The resource_type
      can be:
      1. canonicalhostname – the quoted Ethernet name of a node in a cluster
         that is not connected to the conductor node by the high-speed
         network.
      2. disk – a directory to which persistent data is read and written.
      3. scratchdisk – the quoted absolute path name of a directory on a file
         system where intermediate data is temporarily stored. It is local to
         the processing node.
      4. RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE, etc.).
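To make the options above concrete, here is a small Python sketch that prints what a configuration file with two logical nodes on the same physical machine might look like. The host name etlhost, the "sort" pool and the directory paths are invented for the example, and the helper functions are hypothetical; only the keywords (node, fastname, pools, resource disk, resource scratchdisk) come from the configuration file format itself.

```python
def render_node(name, fastname, pools, disk, scratchdisk):
    """Render one logical node entry as configuration-file text."""
    # Pool names are written as a space-separated list of quoted strings;
    # "" is the default pool.
    pool_list = " ".join(f'"{p}"' for p in pools)
    return "\n".join([
        f'  node "{name}"',
        "  {",
        f'    fastname "{fastname}"',
        f"    pools {pool_list}",
        f'    resource disk "{disk}" {{pools ""}}',
        f'    resource scratchdisk "{scratchdisk}" {{pools ""}}',
        "  }",
    ])


def render_config(nodes):
    """Wrap all node entries in the outer braces of the file."""
    body = "\n".join(render_node(**n) for n in nodes)
    return "{\n" + body + "\n}\n"


# Two logical nodes that share one physical host (same fastname).
# All names and paths below are made up for illustration.
nodes = [
    {"name": "node1", "fastname": "etlhost", "pools": ["", "sort"],
     "disk": "/data/ds/node1", "scratchdisk": "/scratch/node1"},
    {"name": "node2", "fastname": "etlhost", "pools": [""],
     "disk": "/data/ds/node2", "scratchdisk": "/scratch/node2"},
]

config_text = render_config(nodes)
print(config_text)
```

Both entries use the same fastname, which is how more than one logical node ends up on a single physical node; node1 additionally belongs to the hypothetical "sort" pool.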
4. How does DataStage decide on which processing nodes a stage should run?
   1. If a job or stage is not constrained to run on specific nodes, the
      parallel engine executes a parallel stage on all nodes defined in the
      default node pool (the default behavior).
   2. If the job or stage is constrained, the parallel stage is executed only
      on the constrained processing nodes.
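The selection rules can be modelled roughly as set operations on pool membership. The sketch below is a simplified model based only on the rules in this post (default pool when unconstrained, common nodes when both job and stage are constrained); the function names, node names and pool names are hypothetical, and the real parallel engine is certainly more involved.

```python
DEFAULT_POOL = ""  # the default pool is written as "" in the configuration file


def nodes_in_pool(nodes, pool):
    """Return the names of all nodes whose pools list contains the given pool."""
    return {name for name, pools in nodes.items() if pool in pools}


def select_nodes(nodes, job_pool=None, stage_pool=None):
    """Pick the nodes a stage would run on under the simplified rules above."""
    # No constraint at all: every node in the default pool is used.
    if job_pool is None and stage_pool is None:
        return sorted(nodes_in_pool(nodes, DEFAULT_POOL))

    # Otherwise keep only nodes that satisfy every constraint that is set,
    # i.e. the nodes common to the job pool and the stage pool.
    selected = set(nodes)
    for pool in (job_pool, stage_pool):
        if pool is not None:
            selected &= nodes_in_pool(nodes, pool)
    return sorted(selected)


# Hypothetical pool assignments; node3 omits "" and so is not in the default pool.
nodes = {
    "node1": ["", "sort"],
    "node2": ["", "db"],
    "node3": ["sort", "db"],
}

print(select_nodes(nodes))                                  # unconstrained
print(select_nodes(nodes, job_pool="sort"))                 # job constrained
print(select_nodes(nodes, job_pool="sort", stage_pool="db"))  # both constrained
```

With these made-up assignments, an unconstrained stage uses node1 and node2, a job constrained to "sort" uses node1 and node3, and a stage constrained to "db" inside that job is left with node3 only.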
njoy the simplicity.......
Atul Singh