This article covers DataStage debugging techniques. They can be applied to a job that
- is not producing the expected output data, or
- is aborting or generating warnings.
- Use the Data Set Management utility, available in the Tools menu of the DataStage Designer or the DataStage Manager, to examine the schema, look at row counts, view the data itself, and delete a Parallel Data Set.
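  The same checks can also be scripted with the orchadmin command-line utility, assuming it is available in your engine installation; the dataset path below is a placeholder:

  ```
  # Show the schema and partitioning of a parallel dataset (path is hypothetical)
  orchadmin describe /data/project/customers.ds

  # Dump the records to stdout for inspection
  orchadmin dump /data/project/customers.ds

  # Remove the dataset descriptor AND its data files
  # (a plain "rm" on the .ds file would orphan the underlying data files)
  orchadmin rm /data/project/customers.ds
  ```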
- Check the DataStage job log for warning or abort messages; these may indicate an underlying logic problem or an unexpected data type conversion. Check all the messages: PX jobs almost always generate a number of warnings in addition to the one that points at the actual problem area.
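  If you prefer the command line to the Director client, a log summary can be pulled with the dsjob CLI; this is a sketch assuming dsjob is on your PATH, with placeholder project and job names:

  ```
  # Summarize recent log entries for a job (MyProject/MyJob are placeholders)
  dsjob -logsum -max 200 MyProject MyJob
  ```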
- Run the job with message handling disabled (both at the job level and at the project level) to find out whether any warnings are being unnecessarily converted to information messages or dropped from the logs.
- Enable APT_DUMP_SCORE, which lets you see how the different stages are combined. Some error logs report that the error is in an APT_CombinedOperatorController stage; the stages that form part of that controller can be identified from the dump score written after enabling this environment variable. The variable causes DataStage to add one log entry that tells how stages are combined into operators and what virtual datasets are used; it also tells how the operators are partitioned and how many partitions are created.
- One can also enable the APT_RECORD_COUNTS environment variable, and enable OSH_PRINT_SCHEMAS to ensure that the runtime schema of the job matches the design-time schema that was expected.
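  These variables are usually set in the DataStage Administrator or added to the job as environment-variable parameters; as a shell-level sketch of typical values (the engine generally treats such flags as enabled once they are defined, and 1 is a common choice):

  ```
  # Write the score to the log: operator combination, virtual datasets, partitioning
  export APT_DUMP_SCORE=1

  # Log per-operator record counts
  export APT_RECORD_COUNTS=1

  # Print runtime schemas so they can be compared with the design-time schemas
  export OSH_PRINT_SCHEMAS=1
  ```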
- Sometimes the underlying data contains special characters (like null characters) in the database or files, and this can also cause trouble during execution. If the data is in a table or dataset, export it to a sequential file (using a DataStage job), then use the command “cat -tev” or “od -xc” to find the special characters.
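  For example (the exported file name is a placeholder):

  ```
  # Show non-printing characters: tabs as ^I, line ends as $, others as ^X / M-X
  cat -tev export.txt | head

  # Hex plus character dump; NUL bytes appear as \0 in the character column
  od -xc export.txt | head
  ```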
- One can also use “wc -lc filename”, which displays the number of lines and characters in the specified ASCII text file. Sometimes this is also useful.
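  For example:

  ```
  # Lines and bytes in the file; for pure ASCII data, bytes equal characters
  wc -lc export.txt
  ```

  Comparing the line count against the source row count is a quick way to spot dropped or split rows.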
- Modular approach: if the job is very bulky, with many stages in it, and you are unable to locate the error, one option is to go for a modular approach, executing the job step by step. E.g. if a job has 10 stages, create a copy of the job, keep only (say) the first 3 stages, and run it. Check the result; if it is fine, add some more stages (maybe one or two) and run again. Repeat until you locate the error.
- Partitioned approach with data: this approach is very useful if the job runs fine for some sets of data and fails for others, or fails for a large number of rows. Here one runs the job on a selected number of rows and/or partitions using the DataStage system variables @INROWNUM (and @PARTITIONNUM in PX), as sketched below. E.g. a job runs fine with 10K rows but fails with 1M rows: one can use @INROWNUM to run the job for, say, the first 0.25 million rows; if the first 0.25 million are fine, then for 0.26 million to 0.5 million, and so on. Please note that in a parallel job one also has to consider the number of partitions.
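  As a sketch, the narrowing can be done with Transformer output-link constraints like these (the row limits are illustrative; in a parallel job @INROWNUM counts rows per partition, so on an N-node configuration each limit covers roughly N times that many rows overall):

  ```
  First window:           @INROWNUM <= 250000
  Second window:          @INROWNUM > 250000 And @INROWNUM <= 500000
  Single partition only:  @PARTITIONNUM = 0
  ```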
- The other option in such cases is to run the job on only one node, either by setting APT_EXECUTION_MODE to sequential execution or by using a configuration file with one node (both are covered in the next item).
- Execution mode: sometimes, if the partitions are confusing, one can run the job in sequential mode. There are two ways to achieve this, sketched below:
  - Use the environment variable APT_EXECUTION_MODE and set it to sequential mode.
  - Use a configuration file with only one node.
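  A sketch of both options; APT_EXECUTION_MODE's accepted values can vary by version (ONE_PROCESS is the commonly documented single-process sequential setting), and the hostname and resource paths in the configuration file are placeholders:

  ```
  # Option 1: sequential, single-process execution (value assumed; check your docs)
  export APT_EXECUTION_MODE=ONE_PROCESS

  # Option 2: point APT_CONFIG_FILE at a one-node configuration file
  export APT_CONFIG_FILE=/tmp/one_node.apt
  ```

  where /tmp/one_node.apt contains a single node definition:

  ```
  {
    node "node1"
    {
      fastname "myhost"
      pools ""
      resource disk "/tmp/ds/data" {pools ""}
      resource scratchdisk "/tmp/ds/scratch" {pools ""}
    }
  }
  ```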
- A parallel job fails and the errors do not tell which row it failed for: in this case, if the job is simple, try building the equivalent server job and running it. Server jobs can report errors along with the rows that caused them. This is very useful when DB errors like primary/unique key violations, or any other DB error, are reported by the PX job.
- Sometimes, when dealing with a DB, if rows are not getting loaded as expected, adding reject links to the DB stages can help in locating the rows with issues.
- In a big job, adding some intermediate datasets or Peek stages to find out the data values at certain points can help. E.g. suppose there are 10 stages before the data goes to a dataset, with different operations done at different stages: after every 2 or 3 stages, add Peek stages or send the data to datasets using Copy stages, then check the values at these intermediate points and see if they shed some light on the issue.