Nuts & Bolts of DataStage: Sort stage to remove duplicate

Thursday, March 21, 2013

Sort stage to remove duplicate

1)what is the advantage of using sort stage over remove duplicate stage in removing duplicates.
2) Is there any way in which we can specify which record to retain(like retaining the last record or retaining the first) when we remove duplicate using transform stage, similarly in sort stage also.

Ans :

1)The advantage of using sort stage over remove duplicate stage is that sort stage allows us to capture the duplicate records whereas remove duplicate stage does not.

2) Using a sort stage we can only retain the first record.
Normally we go for retaining last when we sort a particular field in ascending order and try to get the last rec. The same can be done using sort stage by sorting in descending order to retain the first record.

********
Yes indeed...
Use a sort stage with the option "Create Key Change Column" set to True.
Thil will create a column Keychange at the output of the sort as below.

Input
-----
col_a
------
1
1
1
2
2
3
4
5
6
6

output
-------
col_a Keychange
------------------
1 1
1 0
1 0
2 1
2 0
3 1
4 1
5 1
6 1
6 0

For the first occurrence of a particular value, it assigns 1 and for subsequent occurances of the same value it assigns 0.
Now you can use a transformer with constraint Keychange=0 to capture duplicate records.

*********
Can u tell me what is create cluster key change value in sort stage ... if my requirement is that i have to remove duplicate, i dont want to capture the duplicate records. Is that fine to remove duplicate using input tab sort ,unique option of any stage .
******

the options spcd above for retaining duplicate values using sort stage will obviously work........but as far as I remember there is an option in the properties of the sort stage namely ::ALLOW DUPLICATES.If this property is set to true I think it will satisfy the requirement(provided that u r using datastage 8.x.

If yhe property "cluster key change column" is set to true it will eventually create a cluster key change column in the output.concentrate on the following example::

i/p_col cluster _key _change_col
1 1
1 0
1 0
2 1
2 0
3 1
1 1
2 1
6 1
6 0

the logic is that every i/p value will check the value immediete above of it.If it finds a match the o/p is 0 else 1.doesn't matter if it is appeared previously or not.