1)what is the advantage of using sort stage over remove duplicate stage in removing duplicates.
2) Is there any way in which we can specify which record to retain(like
retaining the last record or retaining the first) when we remove
duplicate using transform stage, similarly in sort stage also.
Ans :
1)The advantage of using sort stage over remove duplicate stage is that
sort stage allows us to capture the duplicate records whereas remove
duplicate stage does not.
2) Using a sort stage we can only retain the first record.
Normally we go for retaining last when we sort a particular field in
ascending order and try to get the last rec. The same can be done using
sort stage by sorting in descending order to retain the first record.
********
Yes indeed...
Use a sort stage with the option "Create Key Change Column" set to True.
Thil will create a column Keychange at the output of the sort as below.
Input
-----
col_a
------
1
1
1
2
2
3
4
5
6
6
output
-------
col_a Keychange
------------------
1 1
1 0
1 0
2 1
2 0
3 1
4 1
5 1
6 1
6 0
For the first occurrence of a particular value, it assigns 1 and for subsequent occurances of the same value it assigns 0.
Now you can use a transformer with constraint Keychange=0 to capture duplicate records.
*********
Can u tell me what is create cluster key change value in sort stage ...
if my requirement is that i have to remove duplicate, i dont want to
capture the duplicate records. Is that fine to remove duplicate using
input tab sort ,unique option of any stage .
******
the options spcd above for retaining duplicate values using sort stage
will obviously work........but as far as I remember there is an option
in the properties of the sort stage namely ::ALLOW DUPLICATES.If this
property is set to true I think it will satisfy the requirement(provided
that u r using datastage 8.x.
If yhe property "cluster key change column" is set to true it will
eventually create a cluster key change column in the output.concentrate
on the following example::
i/p_col cluster _key _change_col
1 1
1 0
1 0
2 1
2 0
3 1
1 1
2 1
6 1
6 0
the logic is that every i/p value will check the value immediete above
of it.If it finds a match the o/p is 0 else 1.doesn't matter if it is
appeared previously or not.
Something about DataStage, DataStage Administration, Job Designing,Developing, DataStage troubleshooting, DataStage Installation & Configuration, ETL, DataWareHousing, DB2, Teradata, Oracle and Scripting.