Nuts & Bolts of DataStage: stages

Showing posts with label stages. Show all posts

Wednesday, May 27, 2015

Data sources in DataStage

IIS Datastage connectivity options give us a wide scope to connect with different source or targets. It's support RDBMS, ERP, z/OS DB, OLAP system and many more.

Below listed Data Sources are available in IIS v11.3

Execution Steps in Transformer Stage - Explanation

You can access Part1 Here - Execution Steps in Transformer Stage

Certain constructs are inefficient if they are included in output column derivations, because they are evaluated once for every output column that uses them. The following examples describe these constructs:

The same part of an expression is used in multiple column derivations.

For example, if you want to use the same substring of an input column in multiple columns in output links, you might use the following test in a number of output columns derivations:

IF (DSLINK1.col1[1,3] = "001") THEN ...

In this case, the evaluation of the substring of DSLINK1.col1[1,3] is repeated for each column that uses it. The evaluation can be made more efficient by moving the substring calculation into a stage variable. The substring is then evaluated once for every input row. This example has thus stage variable definition for StageVar1:

Execution Steps in Transformer Stage

I've been asked this questions so many times in interviews and by different practitioner also that What are the data processing steps when datastage is processing transformer, So here I tried to compiled. Have a look -

To write efficient Transformer stage derivations, it helps to understand what items get evaluated and when.

Order of temporary disk space uses by Sort stage

The Sort stage uses temporary disk space when performing a sort. It looks in the following locations, in the
following order, for this temporary space.

1. Scratch disks in the disk pool sort (you can create these pools in the configuration file).
2. Scratch disks in the default disk pool (scratch disks are included here by default).

Modify Stage - TRIM function

For trimming spaces or a char from a string can be done in modify stage. Below, I have shared the function syntax and some example, go Play ;-)

string_trim[character,direction,justify](string)
or
string_trim[character,direction](string)
or
string_trim[character](string)
or
string_trim(string)

Read Fixed Width Flat File with Sequential File Stage

If we have a fixed width Flat file, we can read it via Sequential File stage only, no need to use Import Stage. For reading a Fixed Width Flat file, you should know the format of Fixed width data. How are the data splitting into two or more output columns.

Here, I demonstrate How to read a Fixed Width Flat file from Sequential File Stage

A. Input File :

Below is the Input file which we are going to use in our job. This file contains a column RollNo which is concatenation of 4 fixed width columns which we have to split and read.

Peek Stage in DataStage - 3

In today's post, we will discuss a important property of PEEK stage. What if you want to peek some data as well as want to take action or wanna do something with data after peeking. For enabling this, datastage gives you a option to do this.

1. Job Design :

for this demo, we will use below job design to explain, as you can see, There are 2 link coming out from PEEK stage.

First Link --> This link contain the PEEK data as per property defined in peek stage. (here you can see 40 rows ( 10 rows from each partition x 4 nodes )

Second Link --> this link contain the Whole stream data. ( All 100 rows )

Peek Stage in DataStage - 1

The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links.

The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.

1. Design :

We are using below design to demonstrate the functionality of PEEK stage in datastage. Job design is having 3 stages, Row Generator , data source, for generating the dummy data for the job. Here, we can use any data source whether it is DB, dataset or flat files. Transformer, just as a stage, creating negative values of source data. Peek for Peeking :-)

DataStage Scenario - Design10 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem10

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. In this design, we are reading the data from flat file, generating a dummy column in column generator stage, doing aggregate on that and sending the output to seq file.

Column Import Stage - Read a Fixed width File

The Column Export stage is a restructure stage. It can have a single input link, a single output link and a single rejects link. Typically used to divide data arriving in a single column into multiple columns.
Here, I tried to explain How to use it for importing data...

a) Design :

As you can see, Here we have followed this design to elaborate the stage. We are reading a file, Importing Columns with Import Stage and then loading them into Sequential File.

Surrogate Key Generator - Generate Surrogate Key for Data

In this post, We will see how to generate surrogate key for data, where we have to use surrogate key stage.

A) Design :
Below design is a demo design of job. Here our data source is a row generator which is generating rows. In real time scenario, Source can be a flat file, DB stages, Passive Stage or can be a Active stage also.
In Row Generator Stage, we are generating a col "Name".

Surrogate Key Generator - Delete State File

Adding this post as well with Surrogate Key walk through, In this post we will see how to delete a Surrogate Key State file. Its quite simple as creation.

a) Design :
Again, for the deletion of state file, we need only 1 stage in design and that is Surrogate Key Stage.

How to Create Custom Stages in Datastage Parallel Jobs ?

In addition to the wide range of parallel stage types available, the Designer allows you to define your own stage types, which you can then use in parallel jobs.
For creating custom stage in datastage which will functions as per our requirement and we will be able to modify that stage as we may want. Here is some information on custom stage functionality in datastage in parallel jobs and how can we create our own wrapped custom stage.

Performance Tunings in DataStage

JOB LEVEL

Parametrize all the inputs needed for the jobs; avoid hard coding of parameters like username, password, directory paths etc.
By using the environment variable ‘APT_CONFIG_FILE’ as a job parameter, user can dynamically change the number of nodes being used for processing a particular job.
For reading/writing data from large tables/files, make use of the environment variable ‘APT_BUFFER_MAXIMUM_MEMORY’. It can be used to change the memory buffer size being used for each stage.
It is recommended to set the environment variable $APT_DUMP_SCORE to a value of 1. When this environment variable is set, an entry is placed in the WebSphere DataStage job log showing the actual runtime structure (processes, their associated internal operators, datasets, nodes, etc) used to execute the job flow.

DataStage Scenario - Design5 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem5

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

DataStage Scenario - Design4 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem4

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

Some more design tips for DataStage Job Development

• Common information like home directory, system date, username, password should be initialized in a global variable and then variable should be referred everywhere.

• Stage Variables allow you to hold data from a previous record when the next record, allowing you to compare between previous and current records. Stage variables also allow you return multiple errors for a record of information. By being able to evaluate all data in a record and not just error on the first exception that is found, the cleanup of data is more efficient and requires less iteration.

DataStage Naming Standard

For maintaining the design and easy understanding of what happening inside the job, we are giving meaningful name to stages, links, jobs and sequencer. So that other developer/designer can have a idea about purpose of job/seq by seeing the design.

Here I have shared some DataStage Naming Standard we usually follow.

Datastage Coding Checklist

Ensure that the null handling properties are taken care for all the nullable fields. Do not set the null field value to some value which may be present in the source.
Ensure that all the character fields are trimmed before any processing. Normally extra spaces in the data may lead to some errors like lookup mismatch which are hard to detect.
Always save the metadata (for source, target or lookup definitions) in the repository to ensure re usability and consistency.

DataStage Parallel job: Retrieve sql codes on a failed upsert

When an enterprise database stage such as DB2 or Oracle is set to upsert it is possible to create a reject link to trap rows that fail any update or insert statements. By default this reject link holds just the columns written to the stage, they do not show any columns indicating why the row was rejected and often no warnings or error messages appear in the job log.

Subscribe to: Posts ( Atom )

Wednesday, May 27, 2015

Monday, May 25, 2015

Thursday, May 21, 2015

Friday, February 06, 2015

Wednesday, February 04, 2015

Wednesday, September 17, 2014

A. Input File :

Monday, September 08, 2014

1. Job Design :

Tuesday, September 02, 2014

1. Design :

Saturday, August 09, 2014

Wednesday, July 09, 2014

Wednesday, June 04, 2014

Tuesday, June 03, 2014

Wednesday, April 30, 2014

Thursday, April 24, 2014

Tuesday, April 15, 2014

Wednesday, April 09, 2014

Friday, April 04, 2014

Monday, February 17, 2014

Thursday, February 13, 2014