Something about DataStage: administration, job design and development, troubleshooting, installation and configuration, ETL, data warehousing, DB2, Teradata, Oracle, and scripting.
Showing posts with label Parallelism.
Wednesday, May 20, 2015
Disable auto insertion of Partition and Sort
Partitioner insertion and sort insertion each make writing a flow easier by alleviating the need for a user to think about either partitioning or sorting data. By examining the requirements of operators in the flow, the parallel engine can insert partitioners, collectors and sorts as necessary within a data flow.
However, there are situations where these features are unnecessary and can be avoided. If the data is already partitioned and sorted but the InfoSphere DataStage job is unaware of this, you can disable automatic partitioning and sorting for the whole job by setting the following environment variables while the job runs: APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.
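A minimal sketch of setting them from a shell before starting a job (in practice they are more often set project-wide in the DataStage Administrator or as job parameters; the project and job names below are placeholders):

    # Suppress automatic partitioner and sort insertion for jobs
    # started from this shell session.
    export APT_NO_PART_INSERTION=1
    export APT_NO_SORT_INSERTION=1

    # Start the job from the command line (hypothetical project/job names).
    dsjob -run -jobstatus MyProject MyJob

With both variables set, the engine trusts the partitioning and sort order of the incoming data instead of inserting its own partitioners and sorts.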
Labels: DataStage, disable, environment, hash, insertion, Parallelism, partition, sort, variables
Sunday, June 15, 2014
Things to consider while developing a DataStage job
Datasets
Datasets are the best choice for storing intermediate results. A dataset preserves partitioning and sort order, so downstream jobs can avoid re-partitioning and re-sorting, which makes the overall design more robust.
Performance of a job can be improved if:
1) Unnecessary columns are removed from the upstream and downstream links.
2) Removing these unnecessary columns reduces memory consumption.
3) The select statement lists only the required columns when reading from a database. This keeps unnecessary column data out of the job, saving memory and network bandwidth (see the sketch after this list).
4) Runtime Column Propagation (RCP) is used very carefully.
5) Data types are understood before they are used in the job; profile the data before bringing it into the job.
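As an illustration of point 3, a hedged sketch with made-up table and column names:

    -- Select only the columns the job actually needs,
    -- rather than SELECT * (table and columns are hypothetical).
    SELECT cust_id,
           cust_name,
           order_date
    FROM   orders
    WHERE  order_date >= '2014-01-01';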
Labels: Data, DataSet, DataStage, develop, Parallel, Parallelism, partition, performance, RCP, sort
Thursday, October 18, 2012
DataStage Configuration file : Explained - 3
Labels: Administration, application, Configuration, DataStage, Documentation, file, install, Logical, Parallelism, Physical, Resource, Scratch, Server, tips, Tutorial
DataStage Configuration file : Explained - 2
1. When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. The node from which you start a parallel engine application is called the conductor node; for the other nodes, you do not need to specify the physical node. You need to copy the configuration (.apt) file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switch, while the other nodes are connected to each other through very high-speed network switches. How do you configure your system so that you can still achieve optimized parallelism?
1. Make sure that none of the stages are specified to run on the conductor node.
2. Use the conductor node only to start the execution of the parallel job.
3. Make sure that the conductor node is not part of the default node pool (see the configuration sketch below).
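A hedged sketch of such a configuration file: the conductor host is given its own "conductor" pool and omits the default pool (the unnamed pool ""), so no stage is placed on it by default. Host names and paths are placeholders.

    {
        node "conductor"
        {
            fastname "conductor_host"
            pools "conductor"
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/scratch" {pools ""}
        }
        node "node1"
        {
            fastname "compute_host1"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/scratch" {pools ""}
        }
        node "node2"
        {
            fastname "compute_host2"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/scratch" {pools ""}
        }
    }

Because only node1 and node2 belong to the default pool, the stages run on the hosts attached to the high-speed switch, while the conductor merely starts and coordinates the job.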
Labels: Administration, application, Configuration, DataStage, Documentation, file, install, Logical, Parallelism, Physical, Resource, Scratch, Server, tips, Tutorial
DataStage Configuration file : Explained - 1
The DataStage configuration file is a master control file (a text file that sits on the server side) for jobs; it describes the parallel system resources and architecture. The configuration file provides the hardware configuration for supporting architectures such as SMP (a single machine with multiple CPUs, shared memory, and disk), Grid, Cluster, or MPP (multiple CPUs, multiple nodes, and dedicated memory per node). DataStage understands the architecture of the system through this file, and this is one of its biggest strengths. If you change your processing configuration, servers, or platform, you never have to worry about it affecting your jobs, since all jobs depend on this configuration file for execution. DataStage jobs determine which node to run a process on, where to store temporary data, and where to store dataset data based on the entries provided in the configuration file. A default configuration file is created whenever the server is installed.
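As a hedged sketch, a minimal single-node configuration file might look like this (the fastname and paths are typical install-time values and may differ on your system):

    {
        node "node1"
        {
            fastname "etl_server"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }

Here fastname is the host name used to reach the node, resource disk is where dataset data is stored, and resource scratchdisk is where temporary data is written.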
Labels: Administration, application, Configuration, DataStage, Documentation, file, install, Logical, Parallelism, Physical, Resource, Scratch, Server, tips, Tutorial
Saturday, October 13, 2012
Using Configuration Files in DataStage: Best Practices & Performance Tuning
The configuration file tells DataStage Enterprise Edition how to exploit underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources such as databases and buffer storage. At runtime, EE first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources.
When you modify the system by adding or removing nodes or disks, you must modify the DataStage EE configuration file accordingly. Since EE reads the configuration file every time it runs a job, it automatically scales the application to fit the system without having to alter the job design.
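A hedged sketch of a node entry that also reserves fast local storage for buffering (host name and paths are invented; "buffer" is the reserved scratch-disk pool the engine prefers for buffer overflow):

    {
        node "node1"
        {
            fastname "etl_server"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/ssd/scratch" {pools "buffer"}
            resource scratchdisk "/scratch" {pools ""}
        }
    }

Scaling out is then a matter of adding more node stanzas to this file; the job design itself stays untouched.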
Labels: Administration, application, Configuration, DataStage, Logical, Optimizing, Parallelism, Practices, Resource, Scratch