The configuration file
tells DataStage Enterprise Edition how to exploit underlying system resources
(processing, temporary storage, and dataset storage). In more advanced
environments, the configuration file can also define other resources such as
databases and buffer storage. At runtime, EE first reads the configuration file
to determine what system resources are allocated to it, and then distributes
the job flow across these resources.
When you modify the system, by adding or removing nodes or disks,
you must modify the DataStage EE configuration file accordingly. Since EE reads
the configuration file every time it runs a job, it automatically scales the
application to fit the system without having to alter the job design.
There is not necessarily one ideal configuration file for a given
system, because different jobs place very different demands on system resources. For
this reason, multiple configuration files should be used to optimize overall
throughput and to match job characteristics to available hardware resources. At
runtime, the configuration file is specified through the environment variable
$APT_CONFIG_FILE.
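For example, on a UNIX system you might select a configuration file for a given run by exporting the variable before the job is invoked (the path below is a hypothetical example, not a standard location):

```shell
# Point parallel jobs at a particular configuration file.
# NOTE: the path is a hypothetical example; substitute your own location.
export APT_CONFIG_FILE=/opt/ds/configs/4node.apt

# Confirm which configuration file the next run will read.
echo "Using config: $APT_CONFIG_FILE"
```

Because the variable is read at runtime, the same job design can be pointed at a 2-node development file or a larger production file without any change to the job itself.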
Logical Processing Nodes
The configuration file defines one or more EE processing nodes on
which parallel jobs will run. EE processing nodes are a logical rather than a
physical construct. For this reason, it is important to note that the number of
processing nodes does not necessarily correspond to the actual number of CPUs
in your system.
Within a configuration file, the number of processing nodes
defines the degree of parallelism and resources that a particular job will use
to run. It is up to the UNIX operating system to actually schedule and run the
processes that make up a DataStage job across physical processors. A
configuration file with a larger number of nodes generates a larger number of
processes that use more memory (and perhaps more disk activity) than a
configuration file with a smaller number of nodes.
While the DataStage documentation suggests creating half the
number of nodes as physical CPUs, this is a conservative starting point that is
highly dependent on system configuration, resource availability, job design,
and other applications sharing the server hardware. For example, if a job is
highly I/O-bound or depends on external (e.g., database) sources or
targets, it may be appropriate to have more nodes than physical CPUs.
For typical production environments, a good starting point is to
set the number of nodes equal to the number of CPUs. For development
environments, which are typically smaller and more resource-constrained, create
smaller configuration files (e.g., 2-4 nodes). Note that even in the smallest
development environments, a 2-node configuration file should be used to verify
that job logic and partitioning will work in parallel (as long as the test data
can sufficiently identify data discrepancies).
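As a sketch of such a minimal development file (the hostname and file-system paths are hypothetical placeholders; adjust them to your environment), a 2-node configuration might look like:

```
{
  node "node1" {
    pools ""
    fastname "devhost" /* hypothetical development server name */
    resource scratchdisk "/fs0/ds/scratch" {}
    resource disk "/fs0/ds/disk" {}
  }
  node "node2" {
    pools ""
    fastname "devhost"
    resource scratchdisk "/fs1/ds/scratch" {}
    resource disk "/fs1/ds/disk" {}
  }
}
```

Even this small file exercises partitioning, so logic errors that only appear when data is split across nodes will surface in development rather than production.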
Optimizing Parallelism
The degree of parallelism of a DataStage EE application is determined
by the number of nodes you define in the configuration file. Parallelism should
be optimized rather than maximized. Increasing parallelism may better
distribute your work load, but it also adds to your overhead because the number
of processes increases. Therefore, you must weigh the gains of added
parallelism against the potential losses in processing efficiency. The CPUs,
memory, disk controllers and disk configuration that make up your system
influence the degree of parallelism you can sustain.
Keep in mind that the more evenly data is partitioned, the better the
overall performance of an application running in
parallel. For example, when hash partitioning, try to ensure that the resulting
partitions are evenly populated. This is referred to as minimizing skew.
When business requirements dictate a partitioning strategy that is
excessively skewed, remember to change the partition strategy to a more
balanced one as soon as possible in the job flow. This will minimize the effect
of data skew and significantly improve overall job performance.
Configuration File Examples
Given the large number of considerations for building a
configuration file, where do you begin? For starters, the default configuration
file (default.apt) created when DataStage is installed is appropriate for only
the most basic environments.
The default configuration file has the following characteristics:
number of nodes = ½ number of physical CPUs
disk and scratchdisk storage use subdirectories within the
DataStage install filesystem
You should create and use a new configuration file that is
optimized to your hardware and file systems. Because different job flows have
different needs (CPU-intensive, memory-intensive, disk-intensive,
database-intensive, sort-heavy, or sharing resources with other jobs, databases, or
applications), it is often appropriate to have multiple configuration
files optimized for particular types of processing.
Given the interplay among hardware (number of
CPUs, speed, cache, available system memory, number and speed of I/O
controllers, local vs. shared disk, RAID configurations, disk size and speed,
network configuration and availability), software topology (local vs. remote
database access, SMP vs. Clustered processing), and job design, there is no
definitive science for formulating a configuration file. This section attempts
to provide some guidelines based on experience with actual production
applications.
IMPORTANT: Follow the order of all sub-items
within the individual node specifications in the example configuration files given
in this section.
Example for Any Number of CPUs and Any Number of Disks
Assume you are running on a shared-memory multi-processor system,
an SMP server, which is the most common platform today. Let’s assume these
properties:
computer host name “fastone”
6 CPUs
4 separate file systems on 4 drives named /fs0, /fs1, /fs2, /fs3
You can adjust the sample to match your precise environment.
The configuration file you would use as a starting point would
look like the one below. Assuming that the system load from processing outside
of DataStage is minimal, it may be appropriate to create one node per CPU as a
starting point.
In the following example, the important point is how the disk and
scratchdisk resources are handled.
{ /* Config files allow C-style comments. */
  /* Configuration files do not have flexible syntax. Keep all the sub-items
     of the individual node specifications in the order shown here. */
  node "n0" {
    pools "" /* on an SMP, pools aren't used often */
    fastname "fastone"
    resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 */
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource disk "/fs0/ds/disk" {} /* start with fs0 */
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
  }
  node "n1" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource disk "/fs1/ds/disk" {} /* start with fs1 */
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
  }
  node "n2" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs2/ds/scratch" {} /* start with fs2 */
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource scratchdisk "/fs1/ds/scratch" {}
    resource disk "/fs2/ds/disk" {} /* start with fs2 */
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
    resource disk "/fs1/ds/disk" {}
  }
  node "n3" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs3/ds/scratch" {} /* start with fs3 */
    resource scratchdisk "/fs0/ds/scratch" {}
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource disk "/fs3/ds/disk" {} /* start with fs3 */
    resource disk "/fs0/ds/disk" {}
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
  }
  node "n4" {
    pools ""
    fastname "fastone"
    /* Now we have rotated through starting with a different disk, but the
     * fundamental problem in this scenario is that there are more nodes than
     * disks. So what do we do now? The answer: something that is not perfect.
     * We're going to repeat the sequence. You could shuffle differently,
     * i.e., use /fs0 /fs2 /fs1 /fs3 as an order, but that most likely won't
     * matter. */
    resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 again */
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource disk "/fs0/ds/disk" {} /* start with fs0 again */
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
  }
  node "n5" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource disk "/fs1/ds/disk" {} /* start with fs1 */
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
  }
} /* end of entire config */
The pattern of the configuration file above is a "give every
node all the disks" approach, with the disk order varied per node to minimize I/O
contention. This configuration method works well when the job flow is complex
enough that it is difficult to determine and precisely plan for good I/O
utilization.
Within each node, EE does not “stripe” the data across multiple
filesystems. Rather, it fills the disk and scratchdisk filesystems in the order
specified in the configuration file. In the six-node example above, the order of
the disks is purposely shifted for each node, in an attempt to minimize I/O
contention.
Even in this example, giving every partition (node) access to all
the I/O resources can cause contention, but EE attempts to minimize this by using
fairly large I/O blocks.
This configuration style works for any number of CPUs and any
number of disks since it doesn't require any particular correspondence between
them. The heuristic here is: “When it’s too difficult to figure out precisely,
at least go for achieving balance.”
Enjoy the simplicity.
Atul Singh