We have moved to www.dataGenX.net, Keep Learning with us.

Tuesday, December 30, 2014

Delete Duplicate Rows in DB2 Database

Many times and Many places, this question is being asked :-) that How to delete the duplicate row from a table in different DBs. Here, we will see How to do this in DB2 DB.

Wednesday, December 10, 2014

Create table from another table in TeraData database

Same Queries for Teradata DBs -


Tuesday, December 09, 2014

Create table from another table in Oracle database

How to do the same in Oracle DB as we did in DB2 -Create table from another table in DB2 database
In Oracle, it is quite simple to create the duplicate tables with data :-)

Let's see How --

Monday, December 08, 2014

Create table from another table in DB2 database

We can create a table from another table definition but without data in DB2. We need write a another sql to insert the data into newly created table.

Let's see how to do this --

Monday, November 17, 2014

DataStage Error - Cannot authenticate user

When you attempt to start one of the InfoSphere DataStage and QualityStage Administrator clients, the following message is displayed:
Failed to authenticate
the current user against the selected Domain:  Invalid user name (username)
or password.
There are several possible causes of this problem.
  • The user name is invalid.
  • The password is invalid or has expired.
  • The user has no suite user role.
  • Credential mapping is required, but has not been defined for this user.
  • The user has no DataStage role or has the incorrect DataStage role.

Friday, November 14, 2014

Linux Shell session setup for DataStage Command Line

To setup Linux Shell for Datastage command line is very important if you are a DataStage Developer and want to work hassle free on DataStage Command Line.

Follow below steps to setup your profile ---

a) First, we will create .dshome file which is containing the path of DS Engine. If you know what it is, good, create a file in root dir with DS Engine path
otherwise execute below scripts to do it by itself

Thursday, November 13, 2014

Now We are on Google Groups

Thanks guys for your overwhelming support and interest. Now, we are on Google Groups for some more topics and discussion.

Please Join

Like the Facebook Page & join Group

For WHATSAPP group , drop a msg to 91-88-00-906098

Sunday, November 09, 2014

TimeStamp Comparison in Linux

Comparing Date, Time or TimeStamp is tricky thing in Linux. This is how you can achieve this -

By converting date to Seconds
$ date +%s
the above command will current time in seconds since 1970-01-01 00:00:00 UTC

if you have time in string form you can use below command to convert them like -

Friday, November 07, 2014

DataStage Error - Client logins fail with Error: 39125

DataStage client logins fail with error: 39125: the directory you are connecting to either is not a uv account or does not exist.

This error usually indicates that one of the 6 files required to make a UniVerse account is missing from the directory of the DataStage project which has the problem or that the user does not have permission to access these files. Those 6 files/directories are:

Wednesday, November 05, 2014

Read Sequential File with SCHEMA file

In Datastage, Schemas are an alternative way for you to specify column definitions for the data used by parallel jobs. You do not need to define any column and data format for reading a file. We are using this method when we needed a generic job for scenario like reading multiple files having different metadata defined.

Here, I am going to show you how to read the seq file with help of schema

A. Design

I am using below design to demonstrate this functionality, I am reading the data and dumping it to another file with different format.

Tuesday, October 28, 2014

VI Cheat Sheet

Cursor movement

  • h - move left
  • j - move down
  • k - move up
  • l - move right
  • w - jump by start of words (punctuation considered words)
  • W - jump by words (spaces separate words)
  • e - jump to end of words (punctuation considered words)
  • E - jump to end of words (no punctuation)
  • b - jump backward by words (punctuation considered words)
  • B - jump backward by words (no punctuation)
  • 0 - (zero) start of line
  • ^ - first non-blank character of line
  • $ - end of line
  • G - Go To command (prefix with number - 5G goes to line 5)
Note: Prefix a cursor movement command with a number to repeat it. For example, 4j moves down 4 lines.

Wednesday, October 22, 2014

DataStage Warning - When deleting data set previous delete attempt not complete

main_program: When deleting data set /full_Path_of_dataset.ds, previous delete attempt not complete; removing /full_Path_of_dataset.ds.being_deleted in order to proceed.

Solution -

This error usually comes when a previous run of respective job got aborted and the auto cleanup process hasn't been completed properly.

DataStage Error - irreconcilable constraints on the number of partitions

Fatal Error: There are irreconcilable constraints on the number of partitions of an operator: parallel <stage name>. The number of partitions is already constrained to 1, but an eSame partitioned input virtual dataset produced by parallel {natural=MOD_Nulls, synthetic="modify(0)"} has 2. 

This Error is simply Data Partitioning Error, The Stage mentioned in Error is not getting the Partition as defined for it. This is usually happen in SAME partitioning.

Monday, October 20, 2014

Create Fixed Width Data in DataStage

In day to day life of DataStage, you are facing lot of business scenario and technical scenario. Today, we are going to discuss how to generate Fixed Width data from variable length data in datastage.

Input :

Output :

Thursday, October 16, 2014

DataStage job fails with code=-14 [Timed out while waiting for an event]

DataStage job fails with code=-14 [Timed out while waiting for an event] - this generally happens when jobs are called from sequencers 


Solution :

Follow steps 1-3 in resolving this error.
1) DSWaitStartup - environment variable -
When multiple instances of a job are run from a sequence, and one or more of the job instances are set to reset, the sequence might report a status=99 or -14. This can occur because the controlling sequence did not give the job instances enough time to reset before polling its status. The start up time for a job reset must be increased. The environment variable DSWaitResetStartup can be used for this purpose.

(The maximum value that can be set for DSWaitResetStartup is the value of DSWaitStartup (default is 60). For example, if a value of 120 is required for DSWaitResetStartup, then ensure that DSWaitStartup is also set to a minimum of 120.)

Tuesday, October 14, 2014

Auto Generate Table Schema in DataStage

Few days ago, I got a request to explain how to generate the SCHEMA file of a table ? So here we goes.....

a. Design :

We can generate the SCHEMA file with this job. As we can see there are only two stages. One is DB connector stage ( it can be any DB) and a seq file stage which will hold the SCHEMA definition.

Monday, October 13, 2014

\n not creating new line in Linux

When we are using Linux, Some Shell are quite good to understand the Escape characters we want to print but some are not.

This is how we have to make them understood these characters -

$ echo "Atul \n Singh"
Atul \n Singh

Sunday, October 12, 2014

WAS Profile Creation on Linux (pctLinux.bin)

1. Change to the folder
was installation-root/bin/ProfileCreator which will be normally as follows
in my example /home/opt/IBM/WebSphere/AppServer/bin/ProfileCreator

2. run the command ./pctLinux.bin

3. fill the Profile Name eg. AppSrv01

Saturday, October 11, 2014

Cut, Copy n Paste in VI editor

We Engineers are well trained in Cut, Copy and Paste things :-) and who the hell don't know the Ctrl-X, C n V. But when we come to same but in Linux Platform we are behaving like dumb ;P. Don't feel offended...I'll help u here

So Lets learn some command which can do the same in Linux VI editor --

Cut - Ctrl-X

The command 'dd' is used for cutting text. Cutting current word, lines or many more summarized below -

dd Delete current line D Delete from cursor to end of line
d$ Delete from cursor to end of line
d0 Delete from cursor to beginning of line
dw Delete from cursor to end of current word
db Delete from cursor to beginning of current word
Ndd Delete N no of lines including current line

Thursday, October 02, 2014

How to UPDATE row based on a row in same table

How you will update the Data of a table based on same table ??

update Table_DUMMY T1 
set COL_UPDT = (select T2.COL_FROM from Table_DUMMY T2 where T1.KEY=T2.KEY )  

Like the Facebook Page & join Group

For WHATSAPP group , drop a msg to 91-88-00-906098

Wednesday, October 01, 2014

Difference Between Normal Lookup and Sparse Lookup

Normal Lookup :-
  • Normal Lookup data needs to be in memory
  • Normal might provide poor performance if the reference data is huge as it has to put all the data in memory.
  • Normal Lookup can have more than one reference link.
  • Normal lookup can be used with any database

Tuesday, September 30, 2014

DataStage Warning : When checking operator: Operator of type "APT_TSortOperator": will partition despite the preserve-partitioning flag on the data set on input port 0

DataStage Warning : When checking operator: Operator of type "APT_TSortOperator": will partition despite the preserve-partitioning flag on the data set on input port 0.

These warnings are obtained when we are using some specific partitioning method in a stage in the job e.g hash/same etc but in the preceding stage , preserve portioning(also default) is set in the stage>advanced tab.

Sunday, September 28, 2014

DataStage Warning - APT_CombinedOperatorController

While we are running DataStage job, we got this warning like..

APT_CombinedOperatorController(1),0:  XXXXXXXXXXXXXXXXXXXXXXX
APT_CombinedOperatorController(2),1:  XXXXXXXXXXXXXXXXXXXXXXX

We don't get the stage name which is causing this issue ? How we will get the Culprit Stage :-) ?

Solution :

Friday, September 26, 2014

DataStage Warning - Agg,1: Hash table has grown to 16384 entries

Sometime when we are running Datastage job in which we are using Aggregator stage, we get below warning -

Agg,1: Hash table has grown to 16384 entries.
Agg,0: Hash table has grown to 16384 entries. 

Solution :

Thursday, September 25, 2014

File descripters in Unix

A file descriptor is an abstract value of type Unix.file_descr, containing information necessary to use a file: a pointer to the file, the access rights, the access modes (read or write), the current position in the file
or we can say
Under the covers of the Unix operating system, files are referenced, copied, and moved by unique numbers known as file descriptors.

Three descriptors are predefined. They correspond to standard input, standard output, and standard error.

0 - stdin (Standard Input)  "<"
1 - stdout ( Standard Output) "1>" or ">"
2 - stderr ( Standard Error) "2>"

Tuesday, September 23, 2014

Show Hidden Character in VI Unix

If you need to show hidden characters such as tab, return carriage, you can use the following command in vi

    In vi (command mode -> Escape) , type

    :set list
    to show all the hidden characters.

    You can also type

    :set nolist 
    to turn it off

    Like the Facebook Page & join Group

    For WHATSAPP group , drop a msg to 91-88-00-906098

    Wednesday, September 17, 2014

    Read Fixed Width Flat File with Sequential File Stage

    If we have a fixed width Flat file, we can read it via Sequential File stage only, no need to use Import Stage.   For reading a Fixed Width Flat file, you should know the format of Fixed width data. How are the data splitting into two or more output columns.

    Here, I demonstrate How to read a Fixed Width Flat file from Sequential File Stage

    A. Input File :

    Below is the Input file which we are going to use in our job. This file contains a column RollNo which is concatenation of 4 fixed width columns which we have to split and read.

    Monday, September 08, 2014

    Peek Stage in DataStage - 3

    In today's post, we will discuss a important property of PEEK stage. What if you want to peek some data as well as want to take action or wanna do something with data after peeking. For enabling this, datastage gives you a option to do this.

    1. Job Design :

    for this demo, we will use below job design to explain, as you can see, There are 2 link coming out from PEEK stage. 
    First Link  --> This link contain the PEEK data as per property defined in peek stage. (here you can see 40 rows ( 10 rows from each partition x 4 nodes )
    Second Link --> this link contain the Whole stream data.  ( All 100 rows )

    Friday, September 05, 2014

    Peek Stage in DataStage - 2

    Lets continue our journey with PEEK stage, today in this post, I am going to share one more option about PEEK stage. Sending the Peek data into a output file..

    1. Job design :  

    For this demo job, we will use below job design. You can see we have attached a seq file after the peek stage which will capture the Peeked data

    Tuesday, September 02, 2014

    Peek Stage in DataStage - 1

    The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links.
    The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.

    1. Design :

    We are using below design to demonstrate the functionality of PEEK stage in datastage. Job design is having 3 stages, Row Generator , data source, for generating the dummy data for the job. Here, we can use any data source whether it is DB, dataset or flat files. Transformer, just as a stage, creating negative values of source data. Peek for Peeking :-) 

    Monday, August 18, 2014

    Oracle table is Partitioned or Not ?

    For getting whether Oracle table is Partitioned or not, we can use below query on ALL_TABLES -

    Saturday, August 09, 2014

    DataStage Scenario - Design10 - job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem10 

    Solution Design :

    a) Job Design :  

       Below is the design which can achieve the output as we needed. In this design, we are reading the data from flat file, generating a dummy column in column generator stage, doing aggregate on that and sending the output to seq file.

    Friday, August 08, 2014

    WAS server location for IBM Information Server

    How to find Correct WAS server using by your Information Server ?

    IBM Information Server uses a Websphere Application Server. If you have multiple WAS servers installed, to find the one that IIS is using you can look in:


    Thursday, August 07, 2014

    Python Tutorial for Beginners

    Python is powerful... and fast;
    plays well with others;
    runs everywhere;
    is friendly & easy to learn;
    is Open.

    Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C. The language provides constructs intended to enable clear programs on both a small and large scale.

    Monday, August 04, 2014

    Oracle SQL Tuning Tips - 3

    • Oracle automatically performs simple column type conversions(or casting) when it compares columns of different types. Depending on the type of conversion, indexes may not be used. Make sure you declare your program variables as the same type as your Oracle columns, if the type is supported in the programming language you are using.Use:
      SELECT emp_no, emp_name, sal FROM emp WHERE emp_no = ’123′;
      HERE if emp_no indexed numeric, then after implicit conversion query will be:
      SELECT emp_no, emp_name, sal FROM emp WHERE emp_no = TO_NUMBER(’123′);
      Thus, index is used in this case.
      Don’t use:
      SELECT emp_no, emp_name, sal FROM emp WHERE emp_type = 123;
      HERE if emp_type is indexed varchar2, then after implicit conversion query will be:

      SELECT emp_no, emp_name, sal FROM emp WHERE TO_NUMBER(emp_type) = 123;
      Thus, index will not be used in this case.

    Thursday, July 31, 2014

    Oracle SQL Tuning Tips - 2

    • Don’t forget to tune views. Views are SELECT statements and can be tuned in just the same way as any other type of SLECT statement can be. All tuning applicable to any SQL statement are equally applicable to views.
    • Avoid including a HAVING clause in SELECT statements. The HAVING clause filters selected rows only after all rows have been fetched. Using a WHERE clause helps reduce overheads in sorting, summing, etc. HAVING clauses should only be used when columns with summary operations applied to them are restricted by the clause.Use:
      SELECT city FROM country WHERE city!= ‘Vancouver’ AND city!= ‘Toronto’; GROUP BY city;
      Rather than:
      SELECT city FROM country GROUP BY city HAVING city!= ‘Vancouver’ AND city!= ‘Toronto’;

    Tuesday, July 29, 2014

    Oracle SQL Tuning Tips - 1

    Consideration when writing an SQL statement is that it returns a correct result. The second is that it be the most efficient for a given situation.  You can use many different SQL statements to achieve the same result. It is often the case that only one statement will be the most efficient choice in a given situation.

    Remember that processing SQL is a sequence of Parse (syntax check and object resolution), Execution (required reads and writes), and Fetch (row results retrieved, listed, sorted, and returned). SQL “tuning” consists, quite simply, of reducing one or more of them.

    Note: generally Parse is the greatest time and resource hog. Parse overhead can be minimized by the use of Procedures, Functions, Packages, Views, etc.

    Friday, July 25, 2014

    DataStage Custom Routine to Get a File Size

    Below is a DataStage custom transform routine to get the size of a file. The full path of the file is passed in as a parameter called "Filename".

    Thursday, July 24, 2014

    Connect to Oracle DB with Linux Shell Script

    In this post, we will see how to connect to Oracle DB in Linux Shell Script. There are some prerequisite for a Linux server to connect to Oracle DB. There should be oracle db driver installed on the Linux server from which you want to connect to Oracle DB. Check whether you can call/connect SQLPLUS from command line.
    $ sqlplus
    if you are not able to, check whether oracle path is set in .profile file. Lets assume, Oracle DB driver is installed on Server and Sqlplus path is set.

    Now, there are some basic steps to connect to Oracle DB-
    a) Connect to Oracle DB with UserID and Password
    b) Execute the SQLs
    c) Store the output in a file or send to standard output (terminal)

    Monday, July 21, 2014

    Navigating the many paths of metadata for DataStage 8

    Source : Navigating the many paths of metadata for DataStage 8

    Looking at the methods for importing metadata table definitions into DataStage 8 ETL jobs.
    All of the metadata import methods of DataStage 7 are in DataStage 8 and all execute in the same way.  Developers familiar with previous versions will be right at home!  What is tricky to come to terms with are all the new ways to get metadata into DataStage 8 and the Metadata Server.  The Metadata Server can provide reporting on metadata from products outside of DataStage such as BI tools so in some cases you might be importing the metadata for reporting and not for ETL. 
    The list that follows covers just the techniques for importing metadata to be used by DataStage jobs.

    Friday, July 18, 2014

    Some DataStage Job design

    Updating the Master Reports List (MRL) in Datastage

    Create a Job which has a Row Generator stage linking to an ODBC or DB2 stage. Change the parameter in the Row Generator stage to output zero records. Write the SQL to update the MRL table as 'After SQL' in the output ODBC or DB2 stage.

    Thursday, July 17, 2014

    Tuesday, July 15, 2014

    Handling Filename with Spaces in Linux

    Spaces in Unix/Linux Filename is not a good habit as usually it is used as delimiter for field values rather than being in filename itself. In Linux, SPACE is used as a IFS ( Internal Field Separator) that is used for word splitting after expansion and to split lines into words with the read build in command. The default values of IFS is SPACE, TAB, NEW LINE.

         This is why whenever we are doing something with File-name having spaces in Linux, we got error like..
    : No such File or Directory

       What if you got some files from a "Unix dumb" Server source, How you will handle the filename with spaces, Here are some simple ways to deal with it.

    Wednesday, July 09, 2014

    Column Import Stage - Read a Fixed width File

    The Column Export stage is a restructure stage. It can have a single input link, a single output link and a single rejects link. Typically used to divide data arriving in a single column into multiple columns.
        Here, I tried to explain How to use it for importing data...

    a) Design :

     As you can see, Here we have followed this design to elaborate the stage. We are reading a file, Importing Columns with Import Stage and then loading them into Sequential File.

    Monday, July 07, 2014

    DataStage Scenario - Design9 - job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem9

    Solution Design :

    a) Job Design :   
    Below design will achieve the output as per our requirement.

    Saturday, July 05, 2014

    Count Rows and Calculate Sum in same Aggregator

    Generally, we are using Fork n Join method ( divide the data into 2 links which goes to 2 Aggregator ) to calculate Sum and Count Rows of Data because Aggregator Stage is providing one type of aggregation at a time, You can Count the rows or can do some other aggregation.
                Here, I have describe a way in which we can Count Rows and Calculate Sum in a Single Aggregator Stage with some limitations :-)

    Design :

    As per design, we used Column Generator and Aggregator Stage to calculate and count.

    Wednesday, June 25, 2014

    Fork n Join in DataStage

    Algorithm :-
    Fork/join parallelism is a style of parallel programming useful for exploiting the parallelism inherent in divide and conquer algorithms on shared memory multiprocessors. The idea is quite simple: a larger task can be divided into smaller tasks whose solutions can then be combined. As long as the smaller tasks are independent, they can be executed in parallel. One important concept to note in this framework is that ideally no worker thread is idle.

    Monday, June 23, 2014

    Purpose of Parameters in ETL

    In an ETL process, a parameter represents an attribute that is not hard coded or sourced from the transactional system. Parameters provide flexibility so that you can adapt ETL processes to fit your business requirements.
    or in Simple Words, A job parameter is a way to change a property within a job without having to alter and recompile it. These are input values to job/process which can be changed time to time as per need.

    Saturday, June 21, 2014

    List all RPMs in Linux

    Use this command

    $ rpm -qa --qf '%11{SIZE} %{NAME}\n' | sort -k1nr

    Thursday, June 19, 2014

    APT_CONFIG_FILE : Configuration File

    APT_CONFIG_FILE is the file using which DataStage determines the configuration file (one can have many configuration files for a project) to be used. In fact, this is what is generally used in production. However, if this environment variable is not defined then how DataStage determines which file to use?

    1)If the APT_CONFIG_FILE environment variable is not defined then DataStage look for default configuration file (config.apt) in following path:
    1)Current working directory.
    2)INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of DataStage installation.

    What are the different options a logical node can have in the configuration file?

    Tuesday, June 17, 2014

    FastTrack Makes Your DataStage Development Faster

    IBM introduced a tool called FastTrack that is a source to target mapping tool that is plugged straight into the Information Server and runs inside a browser. 
    The tool was introduced with the Information Server and is available in the 8.1 version.
    As the name suggests IBM are using it to help in the analysis and design stage of a data integration project to do the source to target mapping and the definition of the transform rules.  Since it is an Information Server product it runs against the Metadata Server and can share metadata with the other products and it can run inside a browser.
    I have talked about it previously in New Product: IBM FastTrack for Source To Target Mapping and FastTrack Excel out of your DataStage project but now I have had the chance to see it in action on a Data Warehouse project.  We have been using the tool for a few weeks now and we are impressed.  It’s been easier to learn than other Information Server products and it manages to fit most of what you need inside frames on a single browse screen.  Very few bugs and it has been in the hands of someone who doesn’t know a lot about DataStage and they have been able to complete mappings and generate DataStage jobs.
    I hope to get some screenshots up in the weeks to come but here are some observations in how we have saved time with FastTrack:

    Sunday, June 15, 2014

    Things need to consider while developing a Datastage job

    Datasets are the best when storing the results intermediately. Datasets will keep the partitions and sort order if set. This will save re-partitioning, sorting and would make the job more robust.

    Performance of the job can be improved if:
    1) Unnecessary column are removed from the up and down stream links.

    2) Removing these unnecessary columns will help reducing the memory consumption.
    3) Always specify the list of columns in the select statement when reading from database. This will not bring unnecessary column data in the job which will save memory and network consumption.
    4) Use RCP very carefully.
    5) Understand the data-type before using them in the job. Do the data profiling before bringing data in the job.

    Why Data Warehouses are Hard to Deploy ?

    Inherent Complexity of Data Warehouse Databases 

    Data Warehouse Databases are Large
    Since data warehouse databases are constructed by combining selected data from several operational databases, data warehouse databases are inherently large. They are often the largest databases within an organization. The very size of these databases can make them very difficult and expensive to query. 

    Tuesday, June 10, 2014

    Interview Questions : Unix/Linux : Part-8

    1. Display all the files in current directory sorted by size?
    ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'

    2. Write a command to search for the file 'map' in the current directory?
    find -name map -type f

    3. How to display the first 10 characters from each line of a file?
    cut -c -10 filename

    4. Write a command to remove the first number on all lines that start with "@"?
    sed '\,^@, s/[0-9][0-9]*//' < filename

    Wednesday, June 04, 2014

    Surrogate Key Generator - Generate Surrogate Key for Data

    In this post, We will see how to generate surrogate key for data, where we have to use surrogate key stage.

    A) Design :  
    Below design is a demo design of job. Here our data source is a row generator which is generating rows. In real time scenario, Source can be a flat file, DB stages, Passive Stage or can be a Active stage also.
    In Row Generator Stage, we are generating a col "Name".

    Tuesday, June 03, 2014

    Surrogate Key Generator - Delete State File

    Adding this post as well with Surrogate Key walk through, In this post we will see how to delete a Surrogate Key State file. Its quite simple as creation.

    a) Design :
    Again, for the deletion of state file, we need only 1 stage in design and that is Surrogate Key Stage.

    Saturday, May 31, 2014

    Surrogate Key Generator - Create/Update State File

    In last post ( Surrogate Key Generator - Create State File ) we have seen how to generate the surrogate key file, In this post we will discuss about what are the other ways to generate it and how to update state file

    Update Surrogate State File (Flat File) :-
    To update the state file, add a Surrogate Key Generator stage to a job with a single input link from another stage.

    Friday, May 30, 2014

    Surrogate Key Generator - Create State File

    Surrogate keys are numeric keys that can replace business keys in the datawarehouse.  A surrogate key has the following characteristics:

    1) It is typically an integer.
    2) Surrogate keys are unique.
    3) Surrogate keys allow you to combine data from tables with incompatible keys.
    4) It has no meaning. You will not be able to know the meaning of that row of data based on the surrogate key value.
    5) It is not visible to end users. End users should not see a surrogate key in a report.

    I will not go in details of surrogate key, I will post it later, But we will see how to generate surrogate keys in DataStage.

    Tuesday, May 27, 2014

    Interview Questions : DataWareHouse - Part6

    What is BUS Schema?
    BUS Schema is composed of a master suite of confirmed dimension and standardized definition if facts.

    What are the methodologies of Data Warehousing?
    Every company has methodology of their own. However, to name a few SDLC Methodology, AIM methodology is standard used.

    Monday, May 26, 2014

    DataStage Scenario - Design8 - job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem8

    Solution Design :

    a) Job Design : 
    Below is the design which can achieve the output as we needed. Here, we are reading 2 seq file as a input, then data is passing through a Join and Filter stage to achieve the output.

    Tuesday, May 20, 2014

    Data Warehouse Testing Checklist

    Unit testing checklist
     A checklist will aid database programmers to systematically test their code before formal QA testing.

    • Check the mapping of fields that support data staging and in data marts.
    • Check for duplication of values generated using sequence generators.
    • Check the correctness of surrogate keys that uniquely identify rows of data.
    • Check for data-type constraints of the fields present in staging and core levels.
    • Check the data loading status and error messages after ETLs (extracts, transformations, loads).
    • Look for string columns that are incorrectly left or right trimmed.
    • Make sure all tables and specified fields were loaded from source to staging.
    • Verify that not-null fields were populated.
    • Verify that no data truncation occurred in each field.
    • Make sure data types and formats are as specified during database design.
    • Make sure there are no duplicate records in target tables.
    • Make sure data transformations are correctly based on business rules.
    • Verify that numeric fields are populated precisely.
    • Make sure every ETL session completed with only planned exceptions.
    • Verify all data cleansing, transformation, and error and exception handling.
    • Verify stored procedure calculations and data mappings.

    Sunday, May 18, 2014

    DataStage Scenario - Design7- job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem7

    Solution Design :

    a) Job Design :  Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Transformer and Peek stage to achieve the output.


    Friday, May 16, 2014

    DataStage Scenario - Design6 - job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem6
    Solution Design :

    a) Job Design :

    Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Transformer stage to achieve the output.

    Thursday, May 08, 2014

    What ETL is not?

    ETL should not be confused with a data creation process. It never creates new data. If a list of hundred employees is being loaded, one more employee cannot be added to the list and make it hundred and one. Or if last name of customer is absent an arbitrary last name cannot be substituted.

    Data warehouses are not OLTP systems. Duplication of calculations in Source system & the data warehouse should not be attempted, as the numbers will be very difficult to match during QA. Also in future the process in the source system can change that will result in asynchronous data.

    Wednesday, May 07, 2014

    Framework ( usually followed ) in ETL Testing

    To discuss how testing relates to the different phases of data mart design,  this frame-work includes eight phases:

    Requirement analysis:

    Requirements are elicited from users and represented either informally by means of proper glossaries or formally.

    Analysis and reconciliation:

    Data sources are inspected, normalized, and integrated to obtain a reconciled schema,

    Monday, May 05, 2014

    ETL Testing : Trends & Challenges

    ETL Testing : Trends

    Count Validation
    - Record Count Verification DWH backend/Reporting queries against source and target as a initial check.

    Dimensional Validation
    Data integrity between the various source tables and relationships. Also we should test the slowly changing dimension as given below. The most common slowly changing dimensions are Types 1, 2, and 3.

    Sunday, May 04, 2014

    ETL Testing : Approach

    Testing is undoubtedly an essential part of DW life-cycle but it received a few attention with respect to other design phases.

    DW testing specification:
     – Software testing is predominantly focused on program code, while Software testing is predominantly focused on program code, while DW testing is directed at data and information DW testing is directed at data and information.  .
     – DW testing focuses on the correctness and usefulness of the information delivered to users information delivered to users
     – Differently from generic software systems, DW testing involves a huge data volume huge data volume, which significantly impacts performance and productivity.
     – DW systems are aimed at supporting any views of data, so the possible number of use scenarios is practically infinite and only few them are known from the beginning.
     – It is almost impossible to predict all the possible types of errors that will be encountered in real operational data.

    Thursday, May 01, 2014

    DataStage Scenario - Problem19

    Goal : Reverse duplicate Removal
    Input :
    source | destination | distance

    Wednesday, April 30, 2014

    How to Create Custom Stages in Datastage Parallel Jobs ?

    In addition to the wide range of parallel stage types available, the Designer allows you to define your own stage types, which you can then use in parallel jobs.
    For creating custom stage in datastage which will functions as per our requirement and we will be able to modify that stage as we may want. Here is some information on custom stage functionality in datastage in parallel jobs and how can we create our own wrapped custom stage.

    Performance Tunings in DataStage


    • Parametrize all the inputs needed for the jobs; avoid hard coding of parameters like username, password, directory paths etc.
    • By using the environment variable ‘APT_CONFIG_FILE’ as a job parameter, user can dynamically change the number of nodes being used for processing a particular job.
    • For reading/writing data from large tables/files, make use of the environment variable ‘APT_BUFFER_MAXIMUM_MEMORY’. It can be used to change the memory buffer size being used for each stage.
    • It is recommended to set the environment variable $APT_DUMP_SCORE to a value of 1. When this environment variable is set, an entry is placed in the WebSphere DataStage job log showing the actual runtime structure (processes, their associated internal operators, datasets, nodes, etc) used to execute the job flow.

    Friday, April 25, 2014

    SQL Best Practices - Part1

    Never put SQL scalar functions on columns in the WHERE predicates.
    For example: WHERE YEAR(HIREDATE) = 2009 should be recoded as WHERE HIREDATE BETWEEN '2009-01-01'and '2009-12-31'.
    When placing SQL scalar functions on columns in the SELECT portion of the SQL statement does incur some minimal overhead, but applying it to a column in the WHERE clause causes the predicate to become stage to non indexable.

    Same applies for mathematics.
    For example: WHERE HIREDATE – 7 DAYS > :HV-DATE should be recoded as WHERE HIREDATE > :HV_DATE + 7 days. Having the mathematics on the host variable is not a problem.

    Thursday, April 24, 2014

    DataStage Scenario - Design5 - job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem5
    Solution Design :

    a) Job Design :

    Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

    Tuesday, April 15, 2014

    DataStage Scenario - Design4 - job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem4

    Solution Design :

    a) Job Design :

    Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

    Monday, April 14, 2014

    DataStage Scenario - Design3 - job1

    DataStage Scenario Problem -->  DataStage Scenario - Problem3

    Solution Design :

    a) Job Design :

    Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

    Wednesday, April 09, 2014

    Some more design tips for DataStage Job Development

    • Common information like home directory, system date, username, password should be initialized in a global variable and then variable should be referred everywhere.

    • Stage Variables allow you to hold data from a previous record when the next record, allowing you to compare between previous and current records. Stage variables also allow you return multiple errors for a record of information. By being able to evaluate all data in a record and not just error on the first exception that is found, the cleanup of data is more efficient and requires less iteration.

    Tuesday, April 08, 2014

    Setting up "CRON" Jobs in Nix

    If you want to use the emacs editor for editing cron jobs, then, set the following in your "/home/user/.bash_profile"


    Then, to edit cron jobs

    $ crontab -e

    You may want to put in the following header

    Friday, April 04, 2014

    DataStage Naming Standard

    For maintaining the design and easy understanding of what happening inside the job, we are giving meaningful name to stages, links, jobs and sequencer. So that other developer/designer can have a idea about purpose of job/seq by seeing the design.

    Here I have shared some DataStage Naming Standard we usually follow.

    Thursday, April 03, 2014

    DataStage error upon login: DSR.ADMIN: Failed to add user to UV_USERS file

    When user first logs in, they get following error message:
    DSR.ADMIN: Failed to add user to UV_USERS file.

    Thursday, March 27, 2014

    Get list of locked item on DataStage Server with Script

    This script can be run with or without any argument and provide the list of Job, item locked in DataStage environment by any user. This script will help release the locks if user doesn't locked the item.

    Script Name : DSgetJobLocks.sh
    Usage : ./DSgetJobLocks.sh [one or more argument]

    Monday, March 24, 2014

    Linux & Unix Basics – Day 2

    Directory Structure

    /dev  -  Special files that represent devices
    /sbin -  System utilities for system startup
    /etc  -  System configuration files used by System Admins
    /usr  -  contains the subdirectories bin, include, share etc
    /home – contains all users’ home directories
    /var  -  contains user’s mail files, crontab files etc.
    /tmp – directory used by system processes to keep their temporary files. 


    • User interface - Shell provides an interface to the user, wherein the user could issue his commands, and Shell displays output and error messages to the user.
    • Command interpreter - Shell accepts command from user, and interprets it to the kernel.
    • Command processor - Shell parses the command line arguments, expands the special meaning of meta characters, searches for the command, and if the command is found, then transfers control to the command.
    • Programming language - Shell provides a native programming language.

    Sunday, March 23, 2014

    Linux & Unix Basics – Day 1

    Operating System

    • It is the interface between hardware and user.
    • It is responsible for management and coordination of activities and sharing of resources of computer.
    • It acts as host for computing applications that run on the machine.
    • The applications access OS services through APIs or System Calls.
    • It provides two types of interfaces to the users, namely Command Line Interface and Graphical User Interface.

    Wednesday, March 19, 2014

    DataStage Scenario - Problem18

    Goal : Get the y'day date as a column and change the column format

    Input :

    Thursday, March 13, 2014

    DOS Batch Script to Export DataStage Jobs Automatically from a Project

    Copy the following script into a text file and rename into ".bat" file.

     This batch script is used to Export all jobs from a Projects
     This script must be run from a DataStage client machine and the parameters below should be given

     Host is server name
     User is DataStage username
     Password is DataStage password
     Imp Location is the directory where the datastage dsx's are stored

    Tuesday, March 11, 2014

    Interview Questions : DataWareHouse - Part 5

    How Many different schemas or DW Models can be used in Siebel Analytics. I know Only STAR and SNOW FLAKE and any other model that can be used?
    Integrated schema design is also used to define an integrated schema design we have to define the following concepts

    ► Fact constellation
    ► Act less fact table
    ► Onformed dimension

    A: A fact constellation is the process of joining two or more fact tables

    B: A fact table with out any facts is known as fact less fact table

    C:A dimension which is re useful and fixed is known as conformed dimensionA dimension, which is, shared with multiple fact tables known as conformed dimension

    What is an error log table in Informatica occurs and how to maintain it in mapping?
    Error Log in Informatica is a one of output file created by Informatica Server while running the session for error messages. It is created in Informatica home directory.

    Monday, March 10, 2014

    DataStage Scenario - Problem17

    Goal : Count the data in each column

    Input :
    col1 col2 col3
    a {NULL} b
    f k {NULL}
    h {NULL} n
    i d {NULL}
    {NULL} s {NULL}
    g u m
    l x o
    m {NULL} {NULL}
    c d z

    Thursday, March 06, 2014

    Oracle Interview Questions - Part-4

    76. What is the OPTIMAL parameter?
    It is used to set the optimal length of a rollback segment.

    77. What is the functionality of SYSTEM table space?
    To manage the database level transactions such as modifications of the data dictionary table that record information about the free space usage.

    78. How will you create multiple rollback segments in a database?
    - Create a database, which implicitly creates a SYSTEM rollback segment in a SYSTEM tablespace.
    - Create a second rollback segment name R0 in the SYSTEM tablespace.
    - Make new rollback segment available (after shutdown, modify init.ora file and start database)
    - Create other tablespaces (RBS) for rollback segments.
    - Deactivate rollback segment R0 and activate the newly created rollback segments.

    Friday, February 28, 2014

    Check whether DataStage Job is Multi-Instance or not with Sctipt

    The following script can be used to check if the given datastage job is multi instance or not.

    Arguments to the Scripts:

    Arg1:Datastage Project Name

    Wednesday, February 26, 2014

    Get DataStage Job Information without using Director

    With the help of this script, You can get the job no, category and other information without opening DataStage Director.

    This Script need two arguments :
    1. DataStage Project Name
    2. DataStage Job Name

    Script :

    Monday, February 24, 2014

    How to find the patch History of IBM Infosphere DataStage - Version.xml

    Version 8.0.1 onwards IBM Infosphere DataStage maintains the version/patch history in Version.xml file. Version.xml file contains all the information about the product version, modules of product installed, installed date etc.

    Version.xml file is existing in $ISHOME path. So

    a) cd $ISHOME
    b) If this will not work, try below steps to get the IS Home dir
    c) Execute below command to get ..

    Sunday, February 23, 2014

    DataWareHouse (ETL) Designing Steps

    A) Planning and Designing Steps

    i) Requirement and Realities Gathering

                           1. Business Needs
                           2. Data Profiling and other Data source realities
    3. Compliance Requirement
    4. Security Requirement
    5. Data Integration
    6. Data Latency
    7. Archiving and Lineage
    8. End user delivery interfaces
    9. Available Development Skills
    10. Available management Skills
    11. Legacy licenses

    Thursday, February 20, 2014

    Dimension Table and Its Type in Data WareHouse

    A dimension is a structure that categorizes data in order to enable users to answer business questions. It contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information to the analyst. Dimension tables contain hierarchies of attributes that aid in summarization. Calculations on fact table are performed through dimensions.

    Dimension table fields

    The dimension tables should have at least fields listed below and contain fields used to group data during the database inquiry process.

    Those are three types of fields:

    Monday, February 17, 2014

    Datastage Coding Checklist

    1. Ensure that the null handling properties are taken care for all the nullable fields. Do not set the null field value to some value which may be present in the source.
    2. Ensure that all the character fields are trimmed before any processing. Normally extra spaces in the data may lead to some errors like lookup mismatch which are hard to detect.
    3. Always save the metadata (for source, target or lookup definitions) in the repository to ensure re usability and consistency.

    Friday, February 14, 2014

    List of strong points for InterView :-)

    Most of time we stuck when Interviewer ask about our Strong Point, there are lot of reasons behind this ;-) anyways, Here sharing some words which can help you to decide. Pick which suits you but know them ;-) before using.

    • A activating, adapting, administering, analyzing information, arranging, advising

    • B budgeting, building teams, briefing, balancing,

    • C communicating, controlling, co-ordinating, creating, checking, counseling, compiling, coaching

    • D deciding, detailing, developing people, directing, devising, discovering, data input

    Thursday, February 13, 2014

    DataStage Parallel job: Retrieve sql codes on a failed upsert

    When an enterprise database stage such as DB2 or Oracle is set to upsert it is possible to create a reject link to trap rows that fail any update or insert statements. By default this reject link holds just the columns written to the stage, they do not show any columns indicating why the row was rejected and often no warnings or error messages appear in the job log.

    Wednesday, February 12, 2014

    Interview Questions : DataWareHouse - Part 4

    What are the types of Synonyms?
    There are two types of Synonyms Private and Public

    What is a Redo Log?
    The set of Redo Log files YSDATE, UID, USER or USERENV SQL functions, or the pseudo columns LEVEL or ROWNUM.

    What is an Index Segment?
    Each Index has an Index segment that stores all of its data.

    Explain the relationship among Database, Table space and Data file?
    Each databases logically divided into one or more table spaces one or more data files are explicitly created for each table space.

    Tuesday, February 11, 2014

    DataStage Error : Loop final value not numeric - cannot execute it

    Generally, you will face this Error when you are getting values from a file or Unix commands and using that value in Loop ( either transformer loop or sequencer loop ).

    Solution : 


    Friday, February 07, 2014

    How to find Agents Ports in IBM InfoSphere Server - Version.xml

    The list of installed products can be obtained from the Version.xml file that is located in ISHOME directory on the client, engine and domain machines where InfoSphere Information Server is installed.

    Default locations
      UNIX /opt/IBM/InformationServer,
      WINDOWS C:\IBM\InformationServer

    Below is a example listing of the Version.xml file:

    Wednesday, February 05, 2014

    How to find the product installed with IBM Infosphere Server - Version.xml

    The Installed product details we can get from Version.xml file, this file contains all the information about Server Installation, version etc.

    Version.xml file is existing in $ISHOME path. So

     a) cd $ISHOME
     b) If this will not work, try below steps to get the DSEngine dir

     Execute below command to get ..
     cd `ps -ef| grep dsrpc | grep -v grep | awk '{print $NF}' | sed 's/.\{26\}$//'`

    Tuesday, February 04, 2014

    DataStage Scenario - Design2 - job2

    DataStage Scenario Problem -->  DataStage Scenario - Problem2

    Solution Design :

    a) Job Design :
    In job design, we are using Copy, Aggregator, Filter and Join stage to get the output.

    Monday, February 03, 2014

    DataStage Scenario - Design 2 - job1

     DataStage Scenario Problem -->  DataStage Scenario - Problem2

    Solution Design :

    a) Job Design :

    Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through Aggregator and Filter stage to achieve the output.

    DataStage Scenario - Problem16

    1st I/P file 
    BRAND, MT, RE 
    OV ,1 ,RE 
    VG ,2 ,RE 
    WU ,3, RE 
    2nd I/P file 
    BRAND, MT, CX 
    OV ,4, CX 
    VG ,5, CX 
    WU ,6, CX 

    Thursday, January 30, 2014

    DataStage Scenario - Problem15

    Goal : Get the Previous column value in current value 

    Input file : 

    Sq, No 

    DataStage Scenario - Problem14

    Goal : get below outputs

    Input : 
    dept, emp
    20,            R
    10,            A
    10,            D
    20,            P
    10,            B
    10,            C
    20,            Q
    20,            S 

    Wednesday, January 29, 2014

    DataStage Scenario - Problem13

    Goal : Repeat the input as it is in input

    Repeat the input in output, Here what we have to do is repeat the same row no of times in output file. Suppose if 2 is there then repeat 2 two times, 4 four times in output files.

    Input :

    Monday, January 27, 2014

    DataStage Scenario - Problem12

    Goal : Add a comment line with each row of input file

    Input1 :

    Name, Dept, Salary

    Sunday, January 26, 2014

    DataStage Scenario - Problem11

    Which stages are needed to achieve below output ??

    Input :


    Thursday, January 23, 2014

    DataStage Scenario - Problem10

    Goal :  Get the max salary from data file ( Seq file )

    We know that max function is use to get the max value in a column of a table, but here you have to design a datastage job which get the max value from seq file.

    Input Seq File :

    Wednesday, January 22, 2014

    DataStage Scenario - Problem9

    Goal : Distribute the data is below requirement

    Input file's data is distributed like 1st 3 row in file1, next 3 row in file2, next 3 rows in file3 again next 3 rows in file1 and continues.

    Tuesday, January 21, 2014

    DataStage Scenario - Problem8

    Goal : Distribute the data as below condition -

    Given : FileA and FileB contains some data in single column.
    Target : File_A : Which contains the data which is available in A but not in B
    Target : File_B : Which contains the data which is available in B but not in A
    Target : File_AB : Which contains the data which is available in A and B Both

    Monday, January 20, 2014

    Tuesday, January 14, 2014

    Removing last or more character of string

    Want to remove last character from a string in Linux, We can do this with help of SED commands.

    Use this

    a) Remove last char
    sed 's/.$//' input
    sed 's/.\{1\}$//' input

    i.e. -
    echo 123456 | sed 's/.$//'

    echo 123456 | sed  's/.\{1\}$//'

    Friday, January 10, 2014

    Datastage Common Errors and Solutions

    While running ./NodeAgents.sh start command... getting the following error: “LoggingAgent.sh process stopped unexpectedly”

    SOL:   needs to kill LoggingAgentSocketImpl
                  Ps –ef |  grep  LoggingAgentSocketImpl   (OR)
                  PS –ef |               grep Agent  (to check the process id of the above)

    2.     Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0
    SOL:    Clear the preserve partition flag before Sequential file stages.

    Thursday, January 02, 2014

    Sample Stage in DataStage

    Sample stage is yet another development stage. It can have a single input link and any number of output links when operating in percent mode and have one o/p link when operating in period mode.

    a) Job Design