Introduction
This document contains DataStage best practices and recommendations that can be used to improve the quality of DataStage jobs. It will be enhanced further to include specific DataStage problems and their troubleshooting.
Recommendations
• DataStage Version
• DataStage Job Naming Conventions
• DataStage Job Stage and Link Naming Conventions
• DataStage Job Descriptions
• DataStage Job Complexity
• DataStage Job Design
• Error and/or Reject Handling
• Process Exception Handling
• Standards

Part - 2
• Development Guidelines
• Component Usage
• DataStage Data Types
• Partitioning Data
• Collecting Data
• Sorting
• Stage Specific Guidelines
Development Guidelines
Modular development techniques should be used to maximize re-use of DataStage jobs and components:
• Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
• A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and directories where files are stored. To ease re-use, these standard parameters and settings should be made part of Designer Parameter Sets (a runtime sketch follows this list).
• Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth.
• Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
• DataStage template jobs should be created with:
– Standard parameters such as source and target file paths, and database login properties
– Environment variables and their default settings
– Annotation blocks
• Job Parameters should always be used for file paths, file names, and database login settings.
• Standardized Error Handling routines should be followed to capture errors and rejects.
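As an illustration of how these standard parameters can surface at run time, the following Python sketch wraps the dsjob command-line client. It is an assumption-laden example: the project name, job name, and the Parameter Set names psDB and psPaths are invented for illustration and should be replaced with your own standards.

# Hypothetical sketch: launching a parameterized job through the dsjob
# command-line client. Project, job, and parameter names are examples only.
import subprocess

def run_job(project, job, params):
    # Build "dsjob -run", passing one -param name=value per job parameter.
    cmd = ["dsjob", "-run", "-jobstatus", "-wait"]
    for name, value in params.items():
        cmd += ["-param", f"{name}={value}"]
    cmd += [project, job]
    return subprocess.run(cmd, check=True)

# Parameters held in a Parameter Set are referenced as SetName.ParameterName.
run_job("DEV_PROJECT", "jbLoadCustomer", {
    "psDB.DSN": "DWH_DSN",
    "psDB.UserID": "etl_user",
    "psPaths.SourceDir": "/data/source",
    "psPaths.TargetDir": "/data/target",
})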
Component Usage
The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:
• Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC Routines are appropriate only for job control sequences.
• Always use parallel Data Sets for intermediate storage between jobs, unless that specific data also needs to be shared with other applications.
• Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions.
• Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
• Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.
DataStage Datatypes
The following guidelines should be followed with DataStage data types:
• Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types. If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility. An illustrative subset of the mapping follows.
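For quick reference, the Python sketch below lists a commonly used subset of that mapping. Treat it as illustrative only: the exact internal types depend on extended properties (for example unicode or unsigned), so confirm against the documentation for your DataStage release.

# Illustrative subset of the DataStage (SQL) to internal DS/EE type mapping.
# Verify against the product documentation for your release.
SQL_TO_DSEE = {
    "Char(n)":      "string[n]",      # fixed-length string
    "VarChar(n)":   "string[max=n]",  # bounded variable-length string
    "SmallInt":     "int16",
    "Integer":      "int32",
    "BigInt":       "int64",
    "Decimal(p,s)": "decimal[p,s]",
    "Real":         "sfloat",
    "Double":       "dfloat",
    "Date":         "date",
    "Time":         "time",
    "Timestamp":    "timestamp",
}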
Partitioning Data
In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine will choose the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. As the job developer, you have visibility into requirements, and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning:
• Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
• Objective 2: The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required. Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method. This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records.
• Objective 3: Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations. Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.
• Objective 4: The partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.
Using the above objectives as a guide, the following methodology can be applied:
a. Start with Auto partitioning (the default).
b. Specify Hash partitioning for stages that require groups of related records, as follows (see the sketch after this list):
• Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient.
• Use Modulus partitioning if the grouping is on a single integer key column.
• Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused).
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions.
• This is especially useful if the input Data Set is highly skewed or sequential.
d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning.
• Be mindful that Same partitioning retains the degree of parallelism of the upstream stage.
• Within a flow, examine up-stream partitioning and sort order and attempt to preserve them for down-stream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).
Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.
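The Python sketch below is a conceptual illustration only, not how the parallel engine is implemented: it contrasts keyed Hash partitioning, which sends every row with the same key value to the same partition, with keyless Round Robin, which simply balances row counts. The column names are invented sample data.

# Conceptual sketch of keyed vs. keyless partitioning (sample data only).
from itertools import cycle

def hash_partition(rows, key, n_parts):
    # All rows sharing a key value land in the same partition.
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def round_robin_partition(rows, n_parts):
    # Rows are dealt out evenly with no regard to key values.
    parts = [[] for _ in range(n_parts)]
    for target, row in zip(cycle(range(n_parts)), rows):
        parts[target].append(row)
    return parts

rows = [{"cust_id": i % 3, "amount": i * 10} for i in range(9)]
print(hash_partition(rows, "cust_id", 3))   # related records stay together
print(round_robin_partition(rows, 3))       # even counts, groups split up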
Collecting Data
Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:
1. When output order does not matter, use the Auto collector (the default).
2. Consider how the input Data Set has been sorted:
– When the input Data Set has been sorted in parallel, use the Sort Merge collector to produce a single, globally sorted stream of rows (see the sketch after this list).
– When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.
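As a conceptual picture of the Sort Merge collector, the Python sketch below merges partition streams that are already sorted on the key into a single, globally sorted stream without re-sorting. The partition contents are made-up sample rows.

# Conceptual sketch of a Sort Merge collector over pre-sorted partitions.
from heapq import merge

partition_0 = [(1, "a"), (4, "d"), (7, "g")]
partition_1 = [(2, "b"), (5, "e"), (8, "h")]
partition_2 = [(3, "c"), (6, "f"), (9, "i")]

collected = list(merge(partition_0, partition_1, partition_2,
                       key=lambda row: row[0]))
print(collected)   # one globally ordered stream: keys 1..9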
Sorting Data
Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow:
1. Start with a link sort.
2. Specify only the necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a link sort for options that are not available on a link sort:
– Restrict Memory Usage: if you want more memory available for the sort, you can only set it on the Sort stage, not on a link sort. The environment variable $APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition.
– Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output Statistics (a conceptual sketch of the key change column follows this list).
– Always specify the "DataStage" Sort Utility for stand-alone Sort stages.
– Use "Sort Key Mode = Don't Sort (Previously Sorted)" to resort a sub-grouping of a previously-sorted input Data Set.
5. Be aware of automatically-inserted sorts:
– Set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequential ordered result set, use a parallel Sort and a Sort Merge collector.
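As a conceptual illustration of the Create Key Change Column behaviour on previously sorted input, the Python sketch below flags the first row of each key group with 1 and every other row in the group with 0. The column names are invented for the example; this is not the engine's implementation.

# Conceptual sketch: key change flag over an input already sorted on "cust".
def add_key_change(rows, key):
    previous = object()   # sentinel that never equals a real key value
    for row in rows:
        row["keyChange"] = 1 if row[key] != previous else 0
        previous = row[key]
        yield row

sorted_rows = [{"cust": 1}, {"cust": 1}, {"cust": 2}, {"cust": 3}, {"cust": 3}]
print(list(add_key_change(sorted_rows, "cust")))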
Stage Specific Guidelines
Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
– Always convert nullable columns to in-band values before using them in an expression or derivation (see the sketch after this list).
– Always place a reject link on a parallel Transformer to capture and audit possible rejects.
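A minimal Python sketch of the first point, using a hypothetical derivation on an order amount: the nullable discount column is converted to an in-band default before it enters the calculation, so a NULL cannot abort the derivation or silently reject the row.

# Conceptual sketch: convert a nullable column to an in-band value first.
def derive_discounted(amount, discount_pct):
    discount_pct = 0 if discount_pct is None else discount_pct   # in-band default
    return amount * (1 - discount_pct / 100)

print(derive_discounted(200, None))   # 200.0 instead of a rejected row
print(derive_discounted(200, 15))     # 170.0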
Lookup
The Lookup stage is most appropriate when the reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage.
Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller than the number of reference rows (for example, 1:100 or more), or when exception processing is required.
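Conceptually, the difference can be pictured with the Python sketch below: a Normal Lookup loads the reference data into memory once, while a Sparse Lookup issues one query per input row, which is why the 1:100-or-more guideline applies. The query_database callable and the column names are placeholders, not DataStage APIs.

# Conceptual sketch: in-memory (Normal) lookup vs. per-row (Sparse) lookup.
reference = {"GB": "United Kingdom", "IN": "India", "US": "United States"}

def normal_lookup(input_rows):
    # Reference data is resident in memory; one dictionary probe per row.
    for row in input_rows:
        yield {**row, "country_name": reference.get(row["country_code"])}

def sparse_lookup(input_rows, query_database):
    # query_database stands in for a SELECT fired against the database per row.
    for row in input_rows:
        yield {**row, "country_name": query_database(row["country_code"])}

print(list(normal_lookup([{"country_code": "IN"}, {"country_code": "GB"}])))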
Join
Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.
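The Python sketch below illustrates the point conceptually with a left outer join on made-up data: unmatched rows carry None (NULL) in the non-key columns taken from the missing side, and that is the only way they can be identified downstream.

# Conceptual sketch: unmatched outer-join rows are detected via NULL columns.
left  = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
right = [{"id": 1, "city": "Pune"}]
right_by_id = {r["id"]: r for r in right}

for row in left:
    match = right_by_id.get(row["id"])
    joined = {**row, "city": match["city"] if match else None}
    status = "unmatched" if joined["city"] is None else "matched"
    print(joined, status)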
Aggregators
Use Hash method Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.
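A conceptual Python sketch of the trade-off, on made-up rows: the Hash method keeps one in-memory bucket per distinct key, so memory grows with key cardinality, while the Sort method streams one group at a time off pre-sorted input.

# Conceptual sketch: hash-style vs. sort-style aggregation.
from itertools import groupby

rows = [{"dept": d, "salary": s}
        for d, s in [("HR", 10), ("IT", 20), ("HR", 30), ("IT", 40)]]

# Hash method: one bucket per distinct key held in memory.
totals = {}
for row in rows:
    totals[row["dept"]] = totals.get(row["dept"], 0) + row["salary"]
print(totals)

# Sort method: input sorted on the key, groups emitted one at a time.
for dept, group in groupby(sorted(rows, key=lambda r: r["dept"]),
                           key=lambda r: r["dept"]):
    print(dept, sum(r["salary"] for r in group))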
Database Stages
The following guidelines apply to database stages:
– Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability.
– The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
– When using Oracle, DB2, or Informix databases, use the Orchestrate Schema Importer (orchdbutil) to properly import design metadata.
– Take care to observe the data type mappings.

Till then.....
njoy the simplicity.......