Nuts & Bolts of DataStage: 2014

Tuesday, December 30, 2014

Delete Duplicate Rows in DB2 Database

Many times and Many places, this question is being asked :-) that How to delete the duplicate row from a table in different DBs. Here, we will see How to do this in DB2 DB.

Create table from another table in TeraData database

Same Queries for Teradata DBs -

===========================================================================
===========================================================================

Create table from another table in Oracle database

How to do the same in Oracle DB as we did in DB2 -Create table from another table in DB2 database
In Oracle, it is quite simple to create the duplicate tables with data :-)

Let's see How --

Create table from another table in DB2 database

We can create a table from another table definition but without data in DB2. We need write a another sql to insert the data into newly created table.

Let's see how to do this --

DataStage Error - Cannot authenticate user

When you attempt to start one of the InfoSphere DataStage and QualityStage Administrator clients, the following message is displayed:

Failed to authenticate
the current user against the selected Domain:  Invalid user name (username)
or password.

There are several possible causes of this problem.

The user name is invalid.
The password is invalid or has expired.
The user has no suite user role.
Credential mapping is required, but has not been defined for this user.
The user has no DataStage role or has the incorrect DataStage role.

Linux Shell session setup for DataStage Command Line

To setup Linux Shell for Datastage command line is very important if you are a DataStage Developer and want to work hassle free on DataStage Command Line.

Follow below steps to setup your profile ---

a) First, we will create .dshome file which is containing the path of DS Engine. If you know what it is, good, create a file in root dir with DS Engine path
otherwise execute below scripts to do it by itself

Now We are on Google Groups

Thanks guys for your overwhelming support and interest. Now, we are on Google Groups for some more topics and discussion.

Please Join
https://groups.google.com/d/forum/datastage4you

Like the Facebook Page & join Group
https://www.facebook.com/DataStage4you
https://www.facebook.com/groups/DataStage4you
https://twitter.com/datastage4you
For WHATSAPP group , drop a msg to 91-88-00-906098

Sunday, November 09, 2014

TimeStamp Comparison in Linux

Comparing Date, Time or TimeStamp is tricky thing in Linux. This is how you can achieve this -

By converting date to Seconds
$ date +%s
the above command will current time in seconds since 1970-01-01 00:00:00 UTC

if you have time in string form you can use below command to convert them like -

DataStage Error - Client logins fail with Error: 39125

DataStage client logins fail with error: 39125: the directory you are connecting to either is not a uv account or does not exist.

This error usually indicates that one of the 6 files required to make a UniVerse account is missing from the directory of the DataStage project which has the problem or that the user does not have permission to access these files. Those 6 files/directories are:

Read Sequential File with SCHEMA file

In Datastage, Schemas are an alternative way for you to specify column definitions for the data used by parallel jobs. You do not need to define any column and data format for reading a file. We are using this method when we needed a generic job for scenario like reading multiple files having different metadata defined.

Here, I am going to show you how to read the seq file with help of schema

A. Design

I am using below design to demonstrate this functionality, I am reading the data and dumping it to another file with different format.

VI Cheat Sheet

Cursor movement

h - move left
j - move down
k - move up
l - move right
w - jump by start of words (punctuation considered words)
W - jump by words (spaces separate words)
e - jump to end of words (punctuation considered words)
E - jump to end of words (no punctuation)
b - jump backward by words (punctuation considered words)
B - jump backward by words (no punctuation)
0 - (zero) start of line
^ - first non-blank character of line
$ - end of line
G - Go To command (prefix with number - 5G goes to line 5)

Note: Prefix a cursor movement command with a number to repeat it. For example, 4j moves down 4 lines.

DataStage Warning - When deleting data set previous delete attempt not complete

main_program: When deleting data set /full_Path_of_dataset.ds, previous delete attempt not complete; removing /full_Path_of_dataset.ds.being_deleted in order to proceed.

Solution -

This error usually comes when a previous run of respective job got aborted and the auto cleanup process hasn't been completed properly.

DataStage Error - irreconcilable constraints on the number of partitions

Fatal Error: There are irreconcilable constraints on the number of partitions of an operator: parallel <stage name>. The number of partitions is already constrained to 1, but an eSame partitioned input virtual dataset produced by parallel {natural=MOD_Nulls, synthetic="modify(0)"} has 2.

This Error is simply Data Partitioning Error, The Stage mentioned in Error is not getting the Partition as defined for it. This is usually happen in SAME partitioning.

Create Fixed Width Data in DataStage

In day to day life of DataStage, you are facing lot of business scenario and technical scenario. Today, we are going to discuss how to generate Fixed Width data from variable length data in datastage.

Input :
23
3345
546576
535
80
2342

Output :
0000023
0003345
0546576
0000535
0000080
0002342

DataStage job fails with code=-14 [Timed out while waiting for an event]

DataStage job fails with code=-14 [Timed out while waiting for an event] - this generally happens when jobs are called from sequencers

Solution :

Follow steps 1-3 in resolving this error.
1) DSWaitStartup - environment variable -
When multiple instances of a job are run from a sequence, and one or more of the job instances are set to reset, the sequence might report a status=99 or -14. This can occur because the controlling sequence did not give the job instances enough time to reset before polling its status. The start up time for a job reset must be increased. The environment variable DSWaitResetStartup can be used for this purpose.

(The maximum value that can be set for DSWaitResetStartup is the value of DSWaitStartup (default is 60). For example, if a value of 120 is required for DSWaitResetStartup, then ensure that DSWaitStartup is also set to a minimum of 120.)

Auto Generate Table Schema in DataStage

Few days ago, I got a request to explain how to generate the SCHEMA file of a table ? So here we goes.....

a. Design :

We can generate the SCHEMA file with this job. As we can see there are only two stages. One is DB connector stage ( it can be any DB) and a seq file stage which will hold the SCHEMA definition.

\n not creating new line in Linux

When we are using Linux, Some Shell are quite good to understand the Escape characters we want to print but some are not.

This is how we have to make them understood these characters -

$ echo "Atul \n Singh"
Atul \n Singh

WAS Profile Creation on Linux (pctLinux.bin)

1. Change to the folder
was installation-root/bin/ProfileCreator which will be normally as follows
/opt/IBM/WebSphere/AppServer/bin/ProfileCreator
in my example /home/opt/IBM/WebSphere/AppServer/bin/ProfileCreator

2. run the command ./pctLinux.bin

3. fill the Profile Name eg. AppSrv01

Cut, Copy n Paste in VI editor

We Engineers are well trained in Cut, Copy and Paste things :-) and who the hell don't know the Ctrl-X, C n V. But when we come to same but in Linux Platform we are behaving like dumb ;P. Don't feel offended...I'll help u here

So Lets learn some command which can do the same in Linux VI editor --

Cut - Ctrl-X

The command 'dd' is used for cutting text. Cutting current word, lines or many more summarized below -

dd Delete current line D Delete from cursor to end of line
d$ Delete from cursor to end of line
d0 Delete from cursor to beginning of line
dw Delete from cursor to end of current word
db Delete from cursor to beginning of current word
Ndd Delete N no of lines including current line

How to UPDATE row based on a row in same table

How you will update the Data of a table based on same table ??

update Table_DUMMY T1
set COL_UPDT = (select T2.COL_FROM from Table_DUMMY T2 where T1.KEY=T2.KEY )

Like the Facebook Page & join Group
https://www.facebook.com/DataStage4you
https://www.facebook.com/groups/DataStage4you
https://twitter.com/datastage4you
For WHATSAPP group , drop a msg to 91-88-00-906098

Wednesday, October 01, 2014

Difference Between Normal Lookup and Sparse Lookup

Normal Lookup :-

Normal Lookup data needs to be in memory
Normal might provide poor performance if the reference data is huge as it has to put all the data in memory.
Normal Lookup can have more than one reference link.
Normal lookup can be used with any database

DataStage Warning : When checking operator: Operator of type "APT_TSortOperator": will partition despite the preserve-partitioning flag on the data set on input port 0

DataStage Warning : When checking operator: Operator of type "APT_TSortOperator": will partition despite the preserve-partitioning flag on the data set on input port 0.

These warnings are obtained when we are using some specific partitioning method in a stage in the job e.g hash/same etc but in the preceding stage , preserve portioning(also default) is set in the stage>advanced tab.

DataStage Warning - APT_CombinedOperatorController

While we are running DataStage job, we got this warning like..

APT_CombinedOperatorController(1),0: XXXXXXXXXXXXXXXXXXXXXXX
APT_CombinedOperatorController(2),1: XXXXXXXXXXXXXXXXXXXXXXX

We don't get the stage name which is causing this issue ? How we will get the Culprit Stage :-) ?

Solution :

DataStage Warning - Agg,1: Hash table has grown to 16384 entries

Sometime when we are running Datastage job in which we are using Aggregator stage, we get below warning -

Agg,1: Hash table has grown to 16384 entries.

Agg,0: Hash table has grown to 16384 entries.

Solution :

File descripters in Unix

A file descriptor is an abstract value of type Unix.file_descr, containing information necessary to use a file: a pointer to the file, the access rights, the access modes (read or write), the current position in the file
or we can say
Under the covers of the Unix operating system, files are referenced, copied, and moved by unique numbers known as file descriptors.

Three descriptors are predefined. They correspond to standard input, standard output, and standard error.

0 - stdin (Standard Input) "<"
1 - stdout ( Standard Output) "1>" or ">"
2 - stderr ( Standard Error) "2>"

Show Hidden Character in VI Unix

If you need to show hidden characters such as tab, return carriage, you can use the following command in vi

In vi (command mode -> Escape) , type

:set list
to show all the hidden characters.

You can also type

:set nolist
to turn it off

Like the Facebook Page & join Group
https://www.facebook.com/DataStage4you
https://www.facebook.com/groups/DataStage4you
https://twitter.com/datastage4you
For WHATSAPP group , drop a msg to 91-88-00-906098

Wednesday, September 17, 2014

Read Fixed Width Flat File with Sequential File Stage

If we have a fixed width Flat file, we can read it via Sequential File stage only, no need to use Import Stage. For reading a Fixed Width Flat file, you should know the format of Fixed width data. How are the data splitting into two or more output columns.

Here, I demonstrate How to read a Fixed Width Flat file from Sequential File Stage

A. Input File :

Below is the Input file which we are going to use in our job. This file contains a column RollNo which is concatenation of 4 fixed width columns which we have to split and read.

Peek Stage in DataStage - 3

In today's post, we will discuss a important property of PEEK stage. What if you want to peek some data as well as want to take action or wanna do something with data after peeking. For enabling this, datastage gives you a option to do this.

1. Job Design :

for this demo, we will use below job design to explain, as you can see, There are 2 link coming out from PEEK stage.

First Link --> This link contain the PEEK data as per property defined in peek stage. (here you can see 40 rows ( 10 rows from each partition x 4 nodes )

Second Link --> this link contain the Whole stream data. ( All 100 rows )

Peek Stage in DataStage - 2

Lets continue our journey with PEEK stage, today in this post, I am going to share one more option about PEEK stage. Sending the Peek data into a output file..

1. Job design :

For this demo job, we will use below job design. You can see we have attached a seq file after the peek stage which will capture the Peeked data

Peek Stage in DataStage - 1

The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links.

The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.

1. Design :

We are using below design to demonstrate the functionality of PEEK stage in datastage. Job design is having 3 stages, Row Generator , data source, for generating the dummy data for the job. Here, we can use any data source whether it is DB, dataset or flat files. Transformer, just as a stage, creating negative values of source data. Peek for Peeking :-)

Oracle table is Partitioned or Not ?

For getting whether Oracle table is Partitioned or not, we can use below query on ALL_TABLES -

DataStage Scenario - Design10 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem10

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. In this design, we are reading the data from flat file, generating a dummy column in column generator stage, doing aggregate on that and sending the output to seq file.

WAS server location for IBM Information Server

How to find Correct WAS server using by your Information Server ?

IBM Information Server uses a Websphere Application Server. If you have multiple WAS servers installed, to find the one that IIS is using you can look in:

C:\IBM\InformationServer\ASBServer\bin\setupEnv.bat
or
/opt/IBM/InformationServer/ASBServer/bin/setupEnv.sh

Python Tutorial for Beginners

Python is powerful... and fast;
plays well with others;
runs everywhere;
is friendly & easy to learn;
is Open.

Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C. The language provides constructs intended to enable clear programs on both a small and large scale.

Oracle SQL Tuning Tips - 3

Oracle automatically performs simple column type conversions(or casting) when it compares columns of different types. Depending on the type of conversion, indexes may not be used. Make sure you declare your program variables as the same type as your Oracle columns, if the type is supported in the programming language you are using.Use:
SELECT emp_no, emp_name, sal FROM emp WHERE emp_no = ’123′;
HERE if emp_no indexed numeric, then after implicit conversion query will be:
SELECT emp_no, emp_name, sal FROM emp WHERE emp_no = TO_NUMBER(’123′);
Thus, index is used in this case.
Don’t use:
SELECT emp_no, emp_name, sal FROM emp WHERE emp_type = 123;
HERE if emp_type is indexed varchar2, then after implicit conversion query will be:
SELECT emp_no, emp_name, sal FROM emp WHERE TO_NUMBER(emp_type) = 123;
Thus, index will not be used in this case.

Oracle SQL Tuning Tips - 2

Don’t forget to tune views. Views are SELECT statements and can be tuned in just the same way as any other type of SLECT statement can be. All tuning applicable to any SQL statement are equally applicable to views.

Avoid including a HAVING clause in SELECT statements. The HAVING clause filters selected rows only after all rows have been fetched. Using a WHERE clause helps reduce overheads in sorting, summing, etc. HAVING clauses should only be used when columns with summary operations applied to them are restricted by the clause.Use:
SELECT city FROM country WHERE city!= ‘Vancouver’ AND city!= ‘Toronto’; GROUP BY city;
Rather than:
SELECT city FROM country GROUP BY city HAVING city!= ‘Vancouver’ AND city!= ‘Toronto’;

Oracle SQL Tuning Tips - 1

Consideration when writing an SQL statement is that it returns a correct result. The second is that it be the most efficient for a given situation. You can use many different SQL statements to achieve the same result. It is often the case that only one statement will be the most efficient choice in a given situation.

Remember that processing SQL is a sequence of Parse (syntax check and object resolution), Execution (required reads and writes), and Fetch (row results retrieved, listed, sorted, and returned). SQL “tuning” consists, quite simply, of reducing one or more of them.

Note: generally Parse is the greatest time and resource hog. Parse overhead can be minimized by the use of Procedures, Functions, Packages, Views, etc.

DataStage Custom Routine to Get a File Size

Below is a DataStage custom transform routine to get the size of a file. The full path of the file is passed in as a parameter called "Filename".

Connect to Oracle DB with Linux Shell Script

In this post, we will see how to connect to Oracle DB in Linux Shell Script. There are some prerequisite for a Linux server to connect to Oracle DB. There should be oracle db driver installed on the Linux server from which you want to connect to Oracle DB. Check whether you can call/connect SQLPLUS from command line.
$ sqlplus
if you are not able to, check whether oracle path is set in .profile file. Lets assume, Oracle DB driver is installed on Server and Sqlplus path is set.

Now, there are some basic steps to connect to Oracle DB-
a) Connect to Oracle DB with UserID and Password
b) Execute the SQLs
c) Store the output in a file or send to standard output (terminal)

Navigating the many paths of metadata for DataStage 8

Source : Navigating the many paths of metadata for DataStage 8

Looking at the methods for importing metadata table definitions into DataStage 8 ETL jobs.
All of the metadata import methods of DataStage 7 are in DataStage 8 and all execute in the same way. Developers familiar with previous versions will be right at home! What is tricky to come to terms with are all the new ways to get metadata into DataStage 8 and the Metadata Server. The Metadata Server can provide reporting on metadata from products outside of DataStage such as BI tools so in some cases you might be importing the metadata for reporting and not for ETL.
The list that follows covers just the techniques for importing metadata to be used by DataStage jobs.

Some DataStage Job design

Updating the Master Reports List (MRL) in Datastage

Create a Job which has a Row Generator stage linking to an ODBC or DB2 stage. Change the parameter in the Row Generator stage to output zero records. Write the SQL to update the MRL table as 'After SQL' in the output ODBC or DB2 stage.

DataStage Scenario - Problem20

Goal : Reverse the Column

Input is like this :

i/p
col1,col2
101,a
102,b
103,c
104,d

Handling Filename with Spaces in Linux

Spaces in Unix/Linux Filename is not a good habit as usually it is used as delimiter for field values rather than being in filename itself. In Linux, SPACE is used as a IFS ( Internal Field Separator) that is used for word splitting after expansion and to split lines into words with the read build in command. The default values of IFS is SPACE, TAB, NEW LINE.

This is why whenever we are doing something with File-name having spaces in Linux, we got error like..
: No such File or Directory

What if you got some files from a "Unix dumb" Server source, How you will handle the filename with spaces, Here are some simple ways to deal with it.

Column Import Stage - Read a Fixed width File

The Column Export stage is a restructure stage. It can have a single input link, a single output link and a single rejects link. Typically used to divide data arriving in a single column into multiple columns.
Here, I tried to explain How to use it for importing data...

a) Design :

As you can see, Here we have followed this design to elaborate the stage. We are reading a file, Importing Columns with Import Stage and then loading them into Sequential File.

DataStage Scenario - Design9 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem9

Solution Design :

a) Job Design :
Below design will achieve the output as per our requirement.

Count Rows and Calculate Sum in same Aggregator

Generally, we are using Fork n Join method ( divide the data into 2 links which goes to 2 Aggregator ) to calculate Sum and Count Rows of Data because Aggregator Stage is providing one type of aggregation at a time, You can Count the rows or can do some other aggregation.
Here, I have describe a way in which we can Count Rows and Calculate Sum in a Single Aggregator Stage with some limitations :-)

Design :

As per design, we used Column Generator and Aggregator Stage to calculate and count.

Fork n Join in DataStage

Algorithm :-
Fork/join parallelism is a style of parallel programming useful for exploiting the parallelism inherent in divide and conquer algorithms on shared memory multiprocessors. The idea is quite simple: a larger task can be divided into smaller tasks whose solutions can then be combined. As long as the smaller tasks are independent, they can be executed in parallel. One important concept to note in this framework is that ideally no worker thread is idle.

Purpose of Parameters in ETL

In an ETL process, a parameter represents an attribute that is not hard coded or sourced from the transactional system. Parameters provide flexibility so that you can adapt ETL processes to fit your business requirements.
or in Simple Words, A job parameter is a way to change a property within a job without having to alter and recompile it. These are input values to job/process which can be changed time to time as per need.

Saturday, June 21, 2014

List all RPMs in Linux

Use this command

$ rpm -qa --qf '%11{SIZE} %{NAME}\n' | sort -k1nr

Thursday, June 19, 2014

APT_CONFIG_FILE : Configuration File

APT_CONFIG_FILE is the file using which DataStage determines the configuration file (one can have many configuration files for a project) to be used. In fact, this is what is generally used in production. However, if this environment variable is not defined then how DataStage determines which file to use?

1)If the APT_CONFIG_FILE environment variable is not defined then DataStage look for default configuration file (config.apt) in following path:

1)Current working directory.

2)INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of DataStage installation.

What are the different options a logical node can have in the configuration file?

FastTrack Makes Your DataStage Development Faster

IBM introduced a tool called FastTrack that is a source to target mapping tool that is plugged straight into the Information Server and runs inside a browser.
The tool was introduced with the Information Server and is available in the 8.1 version.
As the name suggests IBM are using it to help in the analysis and design stage of a data integration project to do the source to target mapping and the definition of the transform rules. Since it is an Information Server product it runs against the Metadata Server and can share metadata with the other products and it can run inside a browser.
I have talked about it previously in New Product: IBM FastTrack for Source To Target Mapping and FastTrack Excel out of your DataStage project but now I have had the chance to see it in action on a Data Warehouse project. We have been using the tool for a few weeks now and we are impressed. It’s been easier to learn than other Information Server products and it manages to fit most of what you need inside frames on a single browse screen. Very few bugs and it has been in the hands of someone who doesn’t know a lot about DataStage and they have been able to complete mappings and generate DataStage jobs.
I hope to get some screenshots up in the weeks to come but here are some observations in how we have saved time with FastTrack:

Things need to consider while developing a Datastage job

Datasets
Datasets are the best when storing the results intermediately. Datasets will keep the partitions and sort order if set. This will save re-partitioning, sorting and would make the job more robust.

Performance of the job can be improved if:
1) Unnecessary column are removed from the up and down stream links.
2) Removing these unnecessary columns will help reducing the memory consumption.
3) Always specify the list of columns in the select statement when reading from database. This will not bring unnecessary column data in the job which will save memory and network consumption.
4) Use RCP very carefully.
5) Understand the data-type before using them in the job. Do the data profiling before bringing data in the job.

Why Data Warehouses are Hard to Deploy ?

Inherent Complexity of Data Warehouse Databases

Data Warehouse Databases are Large

Since data warehouse databases are constructed by combining selected data from several operational databases, data warehouse databases are inherently large. They are often the largest databases within an organization. The very size of these databases can make them very difficult and expensive to query.

Interview Questions : Unix/Linux : Part-8

1. Display all the files in current directory sorted by size?
ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'

2. Write a command to search for the file 'map' in the current directory?
find -name map -type f

3. How to display the first 10 characters from each line of a file?
cut -c -10 filename

4. Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename

Surrogate Key Generator - Generate Surrogate Key for Data

In this post, We will see how to generate surrogate key for data, where we have to use surrogate key stage.

A) Design :
Below design is a demo design of job. Here our data source is a row generator which is generating rows. In real time scenario, Source can be a flat file, DB stages, Passive Stage or can be a Active stage also.
In Row Generator Stage, we are generating a col "Name".

Surrogate Key Generator - Delete State File

Adding this post as well with Surrogate Key walk through, In this post we will see how to delete a Surrogate Key State file. Its quite simple as creation.

a) Design :
Again, for the deletion of state file, we need only 1 stage in design and that is Surrogate Key Stage.

Surrogate Key Generator - Create/Update State File

In last post ( Surrogate Key Generator - Create State File ) we have seen how to generate the surrogate key file, In this post we will discuss about what are the other ways to generate it and how to update state file

Update Surrogate State File (Flat File) :-
To update the state file, add a Surrogate Key Generator stage to a job with a single input link from another stage.

Surrogate Key Generator - Create State File

Surrogate keys are numeric keys that can replace business keys in the datawarehouse. A surrogate key has the following characteristics:

1) It is typically an integer.
2) Surrogate keys are unique.
3) Surrogate keys allow you to combine data from tables with incompatible keys.
4) It has no meaning. You will not be able to know the meaning of that row of data based on the surrogate key value.
5) It is not visible to end users. End users should not see a surrogate key in a report.

I will not go in details of surrogate key, I will post it later, But we will see how to generate surrogate keys in DataStage.

Some Oracle SQL should known by Developer - 1

========================================================================

Interview Questions : DataWareHouse - Part6

What is BUS Schema?
BUS Schema is composed of a master suite of confirmed dimension and standardized definition if facts.

What are the methodologies of Data Warehousing?
Every company has methodology of their own. However, to name a few SDLC Methodology, AIM methodology is standard used.

DataStage Scenario - Design8 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem8

Solution Design :

a) Job Design :
Below is the design which can achieve the output as we needed. Here, we are reading 2 seq file as a input, then data is passing through a Join and Filter stage to achieve the output.

Data Warehouse Testing Checklist

Unit testing checklist

A checklist will aid database programmers to systematically test their code before formal QA testing.

Check the mapping of fields that support data staging and in data marts.
Check for duplication of values generated using sequence generators.
Check the correctness of surrogate keys that uniquely identify rows of data.
Check for data-type constraints of the fields present in staging and core levels.
Check the data loading status and error messages after ETLs (extracts, transformations, loads).
Look for string columns that are incorrectly left or right trimmed.
Make sure all tables and specified fields were loaded from source to staging.
Verify that not-null fields were populated.
Verify that no data truncation occurred in each field.
Make sure data types and formats are as specified during database design.
Make sure there are no duplicate records in target tables.
Make sure data transformations are correctly based on business rules.
Verify that numeric fields are populated precisely.
Make sure every ETL session completed with only planned exceptions.
Verify all data cleansing, transformation, and error and exception handling.
Verify stored procedure calculations and data mappings.

DataStage Scenario - Design7- job1

DataStage Scenario Problem --> DataStage Scenario - Problem7

Solution Design :

a) Job Design : Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Transformer and Peek stage to achieve the output.

DataStage Scenario - Design6 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem6

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Transformer stage to achieve the output.

What ETL is not?

ETL should not be confused with a data creation process. It never creates new data. If a list of hundred employees is being loaded, one more employee cannot be added to the list and make it hundred and one. Or if last name of customer is absent an arbitrary last name cannot be substituted.

Data warehouses are not OLTP systems. Duplication of calculations in Source system & the data warehouse should not be attempted, as the numbers will be very difficult to match during QA. Also in future the process in the source system can change that will result in asynchronous data.

Framework ( usually followed ) in ETL Testing

To discuss how testing relates to the different phases of data mart design, this frame-work includes eight phases:

Requirement analysis:

Requirements are elicited from users and represented either informally by means of proper glossaries or formally.

Analysis and reconciliation:

Data sources are inspected, normalized, and integrated to obtain a reconciled schema,

ETL Testing : Trends & Challenges

ETL Testing : Trends

Count Validation
- Record Count Verification DWH backend/Reporting queries against source and target as a initial check.

Dimensional Validation
Data integrity between the various source tables and relationships. Also we should test the slowly changing dimension as given below. The most common slowly changing dimensions are Types 1, 2, and 3.

ETL Testing : Approach

Testing is undoubtedly an essential part of DW life-cycle but it received a few attention with respect to other design phases.

DW testing specification:
– Software testing is predominantly focused on program code, while Software testing is predominantly focused on program code, while DW testing is directed at data and information DW testing is directed at data and information. .
– DW testing focuses on the correctness and usefulness of the information delivered to users information delivered to users
– Differently from generic software systems, DW testing involves a huge data volume huge data volume, which significantly impacts performance and productivity.
– DW systems are aimed at supporting any views of data, so the possible number of use scenarios is practically infinite and only few them are known from the beginning.
– It is almost impossible to predict all the possible types of errors that will be encountered in real operational data.

DataStage Scenario - Problem19

Goal : Reverse duplicate Removal

Input :
source | destination | distance
city1|city2|500
city2|city1|500
city3|city4|500
city4|city3|500

How to Create Custom Stages in Datastage Parallel Jobs ?

In addition to the wide range of parallel stage types available, the Designer allows you to define your own stage types, which you can then use in parallel jobs.
For creating custom stage in datastage which will functions as per our requirement and we will be able to modify that stage as we may want. Here is some information on custom stage functionality in datastage in parallel jobs and how can we create our own wrapped custom stage.

Performance Tunings in DataStage

JOB LEVEL

Parametrize all the inputs needed for the jobs; avoid hard coding of parameters like username, password, directory paths etc.
By using the environment variable ‘APT_CONFIG_FILE’ as a job parameter, user can dynamically change the number of nodes being used for processing a particular job.
For reading/writing data from large tables/files, make use of the environment variable ‘APT_BUFFER_MAXIMUM_MEMORY’. It can be used to change the memory buffer size being used for each stage.
It is recommended to set the environment variable $APT_DUMP_SCORE to a value of 1. When this environment variable is set, an entry is placed in the WebSphere DataStage job log showing the actual runtime structure (processes, their associated internal operators, datasets, nodes, etc) used to execute the job flow.

SQL Best Practices - Part1

Never put SQL scalar functions on columns in the WHERE predicates.
For example: WHERE YEAR(HIREDATE) = 2009 should be recoded as WHERE HIREDATE BETWEEN '2009-01-01'and '2009-12-31'.
When placing SQL scalar functions on columns in the SELECT portion of the SQL statement does incur some minimal overhead, but applying it to a column in the WHERE clause causes the predicate to become stage to non indexable.

Same applies for mathematics.
For example: WHERE HIREDATE – 7 DAYS > :HV-DATE should be recoded as WHERE HIREDATE > :HV_DATE + 7 days. Having the mathematics on the host variable is not a problem.

DataStage Scenario - Design5 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem5

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

DataStage Scenario - Design4 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem4

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

DataStage Scenario - Design3 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem3

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through a Sort and Transformer stage to achieve the output.

Some more design tips for DataStage Job Development

• Common information like home directory, system date, username, password should be initialized in a global variable and then variable should be referred everywhere.

• Stage Variables allow you to hold data from a previous record when the next record, allowing you to compare between previous and current records. Stage variables also allow you return multiple errors for a record of information. By being able to evaluate all data in a record and not just error on the first exception that is found, the cleanup of data is more efficient and requires less iteration.

Setting up "CRON" Jobs in Nix

If you want to use the emacs editor for editing cron jobs, then, set the following in your "/home/user/.bash_profile"

EDITOR=emacs

Then, to edit cron jobs

$ crontab -e

You may want to put in the following header

DataStage Naming Standard

For maintaining the design and easy understanding of what happening inside the job, we are giving meaningful name to stages, links, jobs and sequencer. So that other developer/designer can have a idea about purpose of job/seq by seeing the design.

Here I have shared some DataStage Naming Standard we usually follow.

DataStage error upon login: DSR.ADMIN: Failed to add user to UV_USERS file

When user first logs in, they get following error message:
DSR.ADMIN: Failed to add user to UV_USERS file.
ACCESS DENIED: INTERNAL SQL ERROR FAILED TO REGISTER THIS USER FOR SQL.

Get list of locked item on DataStage Server with Script

This script can be run with or without any argument and provide the list of Job, item locked in DataStage environment by any user. This script will help release the locks if user doesn't locked the item.

Script Name : DSgetJobLocks.sh
Usage : ./DSgetJobLocks.sh [one or more argument]

Linux & Unix Basics – Day 2

Directory Structure

/dev - Special files that represent devices
/sbin - System utilities for system startup
/etc - System configuration files used by System Admins
/usr - contains the subdirectories bin, include, share etc
/home – contains all users’ home directories
/var - contains user’s mail files, crontab files etc.
/tmp – directory used by system processes to keep their temporary files.

Shell

User interface - Shell provides an interface to the user, wherein the user could issue his commands, and Shell displays output and error messages to the user.
Command interpreter - Shell accepts command from user, and interprets it to the kernel.
Command processor - Shell parses the command line arguments, expands the special meaning of meta characters, searches for the command, and if the command is found, then transfers control to the command.
Programming language - Shell provides a native programming language.

Linux & Unix Basics – Day 1

Operating System

It is the interface between hardware and user.
It is responsible for management and coordination of activities and sharing of resources of computer.
It acts as host for computing applications that run on the machine.
The applications access OS services through APIs or System Calls.
It provides two types of interfaces to the users, namely Command Line Interface and Graphical User Interface.

DataStage Scenario - Problem18

Goal : Get the y'day date as a column and change the column format

Input :

Country|Currency|ValidFrom|ExRate
CNH|USD|1-Jan-12|0.11
CNH|USD|5-Jan-12|0.8
CNH|USD|10-Jan-12|0.15
CNH|USD|15-Jan-12|0.14
CNH|USD|20-Jan-12|0.16

DOS Batch Script to Export DataStage Jobs Automatically from a Project

Copy the following script into a text file and rename into ".bat" file.

This batch script is used to Export all jobs from a Projects
This script must be run from a DataStage client machine and the parameters below should be given

Host is server name
User is DataStage username
Password is DataStage password
Imp Location is the directory where the datastage dsx's are stored

Interview Questions : DataWareHouse - Part 5

How Many different schemas or DW Models can be used in Siebel Analytics. I know Only STAR and SNOW FLAKE and any other model that can be used?
Integrated schema design is also used to define an integrated schema design we have to define the following concepts

► Fact constellation
► Act less fact table
► Onformed dimension

A: A fact constellation is the process of joining two or more fact tables

B: A fact table with out any facts is known as fact less fact table

C:A dimension which is re useful and fixed is known as conformed dimensionA dimension, which is, shared with multiple fact tables known as conformed dimension

What is an error log table in Informatica occurs and how to maintain it in mapping?
Error Log in Informatica is a one of output file created by Informatica Server while running the session for error messages. It is created in Informatica home directory.

DataStage Scenario - Problem17

Goal : Count the data in each column

Input :

col1	col2	col3
a	{NULL}	b
f	k	{NULL}
h	{NULL}	n
i	d	{NULL}
{NULL}	s	{NULL}
g	u	m
l	x	o
m	{NULL}	{NULL}
c	d	z

Wishing all of our special ladies...a Happy Women's Day !!

Thursday, March 06, 2014

Oracle Interview Questions - Part-4

76. What is the OPTIMAL parameter?
It is used to set the optimal length of a rollback segment.

77. What is the functionality of SYSTEM table space?
To manage the database level transactions such as modifications of the data dictionary table that record information about the free space usage.

78. How will you create multiple rollback segments in a database?
- Create a database, which implicitly creates a SYSTEM rollback segment in a SYSTEM tablespace.
- Create a second rollback segment name R0 in the SYSTEM tablespace.
- Make new rollback segment available (after shutdown, modify init.ora file and start database)
- Create other tablespaces (RBS) for rollback segments.
- Deactivate rollback segment R0 and activate the newly created rollback segments.

Check whether DataStage Job is Multi-Instance or not with Sctipt

The following script can be used to check if the given datastage job is multi instance or not.

Arguments to the Scripts:

Arg1:Datastage Project Name
Arg2:JobName

Get DataStage Job Information without using Director

With the help of this script, You can get the job no, category and other information without opening DataStage Director.

This Script need two arguments :
1. DataStage Project Name
2. DataStage Job Name

Script :

How to find the patch History of IBM Infosphere DataStage - Version.xml

Version 8.0.1 onwards IBM Infosphere DataStage maintains the version/patch history in Version.xml file. Version.xml file contains all the information about the product version, modules of product installed, installed date etc.

Version.xml file is existing in $ISHOME path. So

a) cd $ISHOME

b) If this will not work, try below steps to get the IS Home dir

c) Execute below command to get ..

DataWareHouse (ETL) Designing Steps

A) Planning and Designing Steps

i) Requirement and Realities Gathering

1. Business Needs

2. Data Profiling and other Data source realities

3. Compliance Requirement

4. Security Requirement

5. Data Integration

6. Data Latency

7. Archiving and Lineage

8. End user delivery interfaces

9. Available Development Skills

10. Available management Skills

11. Legacy licenses

Dimension Table and Its Type in Data WareHouse

A dimension is a structure that categorizes data in order to enable users to answer business questions. It contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information to the analyst. Dimension tables contain hierarchies of attributes that aid in summarization. Calculations on fact table are performed through dimensions.

Dimension table fields

The dimension tables should have at least fields listed below and contain fields used to group data during the database inquiry process.

Those are three types of fields:

Datastage Coding Checklist

Ensure that the null handling properties are taken care for all the nullable fields. Do not set the null field value to some value which may be present in the source.
Ensure that all the character fields are trimmed before any processing. Normally extra spaces in the data may lead to some errors like lookup mismatch which are hard to detect.
Always save the metadata (for source, target or lookup definitions) in the repository to ensure re usability and consistency.

List of strong points for InterView :-)

Most of time we stuck when Interviewer ask about our Strong Point, there are lot of reasons behind this ;-) anyways, Here sharing some words which can help you to decide. Pick which suits you but know them ;-) before using.

• A activating, adapting, administering, analyzing information, arranging, advising

• B budgeting, building teams, briefing, balancing,

• C communicating, controlling, co-ordinating, creating, checking, counseling, compiling, coaching

• D deciding, detailing, developing people, directing, devising, discovering, data input

DataStage Parallel job: Retrieve sql codes on a failed upsert

When an enterprise database stage such as DB2 or Oracle is set to upsert it is possible to create a reject link to trap rows that fail any update or insert statements. By default this reject link holds just the columns written to the stage, they do not show any columns indicating why the row was rejected and often no warnings or error messages appear in the job log.

Interview Questions : DataWareHouse - Part 4

What are the types of Synonyms?
There are two types of Synonyms Private and Public

What is a Redo Log?
The set of Redo Log files YSDATE, UID, USER or USERENV SQL functions, or the pseudo columns LEVEL or ROWNUM.

What is an Index Segment?
Each Index has an Index segment that stores all of its data.

Explain the relationship among Database, Table space and Data file?
Each databases logically divided into one or more table spaces one or more data files are explicitly created for each table space.

DataStage Error : Loop final value not numeric - cannot execute it

Generally, you will face this Error when you are getting values from a file or Unix commands and using that value in Loop ( either transformer loop or sequencer loop ).

Solution :

How to find Agents Ports in IBM InfoSphere Server - Version.xml

The list of installed products can be obtained from the Version.xml file that is located in ISHOME directory on the client, engine and domain machines where InfoSphere Information Server is installed.

Default locations

Below is a example listing of the Version.xml file:

DataStage Scenario - Design2 - job3

DataStage Scenario Problem --> DataStage Scenario - Problem2

Solution Design :

a) Job Design :

How to find the product installed with IBM Infosphere Server - Version.xml

The Installed product details we can get from Version.xml file, this file contains all the information about Server Installation, version etc.

Version.xml file is existing in $ISHOME path. So

a) cd $ISHOME
b) If this will not work, try below steps to get the DSEngine dir

Execute below command to get ..
cd `ps -ef| grep dsrpc | grep -v grep | awk '{print $NF}' | sed 's/.\{26\}$//'`

DataStage Scenario - Design2 - job2

DataStage Scenario Problem --> DataStage Scenario - Problem2

Solution Design :

a) Job Design :
In job design, we are using Copy, Aggregator, Filter and Join stage to get the output.

DataStage Scenario - Design 2 - job1

DataStage Scenario Problem --> DataStage Scenario - Problem2

Solution Design :

a) Job Design :

Below is the design which can achieve the output as we needed. Here, we are reading seq file as a input, then data is passing through Aggregator and Filter stage to achieve the output.

1st I/P file 
BRAND, MT, RE 
OV ,1 ,RE 
VG ,2 ,RE 
WU ,3, RE 


2nd I/P file 
BRAND, MT, CX 
OV ,4, CX 
VG ,5, CX 
WU ,6, CX

DataStage Scenario - Problem15

Goal : Get the Previous column value in current value

Input file :

Sq, No
1,1000
2,2200
3,3030
4,5600

DataStage Scenario - Problem14

Goal : get below outputs

Input : 
dept, emp
----------------------------
20,            R
10,            A
10,            D
20,            P
10,            B
10,            C
20,            Q
20,            S

DataStage Scenario - Problem13

Goal : Repeat the input as it is in input

Repeat the input in output, Here what we have to do is repeat the same row no of times in output file. Suppose if 2 is there then repeat 2 two times, 4 four times in output files.

Input :
2
3
4
1

DataStage Scenario - Problem12

Goal : Add a comment line with each row of input file

Input1 :

Name, Dept, Salary
CHRISTINE,A00,152750.00
MICHAEL,B01,94250.00
SALLY,C01,98250.00
JOHN,E01,80175.00
IRVING,D11,72250.00
EVA,D21,96170.00
EILEEN,E11,89750.00
THEODORE,E21,86150.00

DataStage Scenario - Problem11

Which stages are needed to achieve below output ??

Input :

col1,col2
1,1
2,rajesh
3,15000
4,2
5,suresh
6,16000
7,3
8,veeru
9,17000

DataStage Scenario - Problem10

Goal : Get the max salary from data file ( Seq file )

We know that max function is use to get the max value in a column of a table, but here you have to design a datastage job which get the max value from seq file.

Input Seq File :

DataStage Scenario - Problem9

Goal : Distribute the data is below requirement

Input file's data is distributed like 1st 3 row in file1, next 3 row in file2, next 3 rows in file3 again next 3 rows in file1 and continues.

DataStage Scenario - Problem8

Goal : Distribute the data as below condition -

Given : FileA and FileB contains some data in single column.
Target : File_A : Which contains the data which is available in A but not in B
Target : File_B : Which contains the data which is available in B but not in A
Target : File_AB : Which contains the data which is available in A and B Both

DataStage Scenario - Problem7

Goal : Divide the input into 3 columns

source  

c1      
c2      
c3      
c4
c5
c6
c7
c8

Removing last or more character of string

Want to remove last character from a string in Linux, We can do this with help of SED commands.

Use this

a) Remove last char
sed 's/.$//' input
or
sed 's/.\{1\}$//' input

i.e. -
echo 123456 | sed 's/.$//'
12345

echo 123456 | sed 's/.\{1\}$//'
12345

Datastage Common Errors and Solutions

1. While running ./NodeAgents.sh start command... getting the following error: “LoggingAgent.sh process stopped unexpectedly”

SOL: needs to kill LoggingAgentSocketImpl

Ps –ef | grep LoggingAgentSocketImpl (OR)

PS –ef | grep Agent (to check the process id of the above)

2. Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0

SOL: Clear the preserve partition flag before Sequential file stages.

Subscribe to: Posts ( Atom )

Tuesday, December 30, 2014

Wednesday, December 10, 2014

Tuesday, December 09, 2014

Monday, December 08, 2014

Monday, November 17, 2014

Friday, November 14, 2014

Thursday, November 13, 2014

Please Joinhttps://groups.google.com/d/forum/datastage4you

Sunday, November 09, 2014

Friday, November 07, 2014

Wednesday, November 05, 2014

A. Design

Tuesday, October 28, 2014

Cursor movement

Wednesday, October 22, 2014

Monday, October 20, 2014

Thursday, October 16, 2014

Solution :

Tuesday, October 14, 2014

a. Design :

Monday, October 13, 2014

Sunday, October 12, 2014

Saturday, October 11, 2014

Cut - Ctrl-X

Thursday, October 02, 2014

Wednesday, October 01, 2014

Tuesday, September 30, 2014

Sunday, September 28, 2014

Friday, September 26, 2014

Thursday, September 25, 2014

Tuesday, September 23, 2014

Wednesday, September 17, 2014

A. Input File :

Monday, September 08, 2014

1. Job Design :

Friday, September 05, 2014

1. Job design :

Tuesday, September 02, 2014

1. Design :

Monday, August 18, 2014

Saturday, August 09, 2014

Friday, August 08, 2014

How to find Correct WAS server using by your Information Server ?

Thursday, August 07, 2014

Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open.

Monday, August 04, 2014

Thursday, July 31, 2014

Tuesday, July 29, 2014

Friday, July 25, 2014

Thursday, July 24, 2014

Monday, July 21, 2014

Friday, July 18, 2014

Updating the Master Reports List (MRL) in Datastage

Thursday, July 17, 2014

Goal : Reverse the Column

Tuesday, July 15, 2014

Wednesday, July 09, 2014

Monday, July 07, 2014

Saturday, July 05, 2014

Design :

Wednesday, June 25, 2014

Monday, June 23, 2014

Saturday, June 21, 2014

Thursday, June 19, 2014

Tuesday, June 17, 2014

Sunday, June 15, 2014

Tuesday, June 10, 2014

Wednesday, June 04, 2014

Tuesday, June 03, 2014

Saturday, May 31, 2014

Friday, May 30, 2014

Wednesday, May 28, 2014

Tuesday, May 27, 2014

Monday, May 26, 2014

Tuesday, May 20, 2014

Sunday, May 18, 2014

Friday, May 16, 2014

Thursday, May 08, 2014

Wednesday, May 07, 2014

Requirement analysis:

Please Join
https://groups.google.com/d/forum/datastage4you

Python is powerful... and fast;
plays well with others;
runs everywhere;
is friendly & easy to learn;
is Open.