This article covers DataStage debugging techniques. They can be applied to a job that
- is not producing the expected output data, or
- is aborting or generating warnings.
- Use the Data Set Management utility, available in the Tools menu of the DataStage Designer or the DataStage Manager, to examine the schema, look at row counts, view the data itself, and delete a Parallel Data Set.
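  The same checks can also be scripted with the orchadmin command-line utility, assuming it is available in your engine installation; the dataset path below is a placeholder:

  ```
  # Show the schema and partitioning of a parallel dataset (path is hypothetical)
  orchadmin describe /data/project/customers.ds

  # Dump the records to stdout for inspection
  orchadmin dump /data/project/customers.ds

  # Remove the dataset descriptor AND its data files
  # (a plain "rm" on the .ds file would orphan the underlying data files)
  orchadmin rm /data/project/customers.ds
  ```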
- Check the DataStage job log for warning or abort messages; these may indicate an underlying logic problem or an unexpected data type conversion. Check all the messages: PX jobs almost always generate a number of warnings in addition to the one that points at the actual problem area.
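  If you prefer the command line to the Director client, a log summary can be pulled with the dsjob CLI; this is a sketch assuming dsjob is on your PATH, with placeholder project and job names:

  ```
  # Summarize recent log entries for a job (MyProject/MyJob are placeholders)
  dsjob -logsum -max 200 MyProject MyJob
  ```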
- Run the job with message handling disabled (both at the job level and at the project level) to find out whether any warnings are being unnecessarily converted to information messages or dropped from the logs.
- Enable APT_DUMP_SCORE, which lets you see how the different stages are combined. Some error logs report that the error is in an APT_CombinedOperatorController stage; the stages that form part of that controller can be identified from the dump score written after enabling this environment variable. The variable causes DataStage to add one log entry that tells how stages are combined into operators and what virtual datasets are used; it also tells how the operators are partitioned and how many partitions are created.
- One can also enable the APT_RECORD_COUNTS environment variable, and enable OSH_PRINT_SCHEMAS to ensure that the runtime schema of the job matches the design-time schema that was expected.
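  These variables are usually set in the DataStage Administrator or added to the job as environment-variable parameters; as a shell-level sketch of typical values (the engine generally treats such flags as enabled once they are defined, and 1 is a common choice):

  ```
  # Write the score to the log: operator combination, virtual datasets, partitioning
  export APT_DUMP_SCORE=1

  # Log per-operator record counts
  export APT_RECORD_COUNTS=1

  # Print runtime schemas so they can be compared with the design-time schemas
  export OSH_PRINT_SCHEMAS=1
  ```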
- Sometimes the underlying data contains special characters (like null characters) in the database or files, and this can also cause trouble during execution. If the data is in a table or dataset, export it to a sequential file (using a DataStage job), then use the command “cat -tev” or “od -xc” to find the special characters.
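  For example (the exported file name is a placeholder):

  ```
  # Show non-printing characters: tabs as ^I, line ends as $, others as ^X / M-X
  cat -tev export.txt | head

  # Hex plus character dump; NUL bytes appear as \0 in the character column
  od -xc export.txt | head
  ```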
- One can also use “wc -lc filename”, which displays the number of lines and characters in the specified ASCII text file. Sometimes this is also useful.
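  For example:

  ```
  # Lines and bytes in the file; for pure ASCII data, bytes equal characters
  wc -lc export.txt
  ```

  Comparing the line count against the source row count is a quick way to spot dropped or split rows.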
- Modular approach: if the job is very bulky, with many stages in it, and you are unable to locate the error, one option is to go for a modular approach, executing the job step by step. E.g. if a job has 10 stages, create a copy of the job, keep only (say) the first 3 stages, and run it. Check the result; if it is fine, add some more stages (maybe one or two) and run again. Repeat until you locate the error.
- Partitioned approach with data: this approach is very useful if the job runs fine for some sets of data and fails for others, or fails for a large number of rows. Here one runs the job on a selected number of rows and/or partitions using the DataStage system variables @INROWNUM (and @PARTITIONNUM in PX), as sketched below. E.g. a job runs fine with 10K rows but fails with 1M rows: one can use @INROWNUM to run the job for, say, the first 0.25 million rows; if the first 0.25 million are fine, then for 0.26 million to 0.5 million, and so on. Please note that in a parallel job one also has to consider the number of partitions.
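  As a sketch, the narrowing can be done with Transformer output-link constraints like these (the row limits are illustrative; in a parallel job @INROWNUM counts rows per partition, so on an N-node configuration each limit covers roughly N times that many rows overall):

  ```
  First window:           @INROWNUM <= 250000
  Second window:          @INROWNUM > 250000 And @INROWNUM <= 500000
  Single partition only:  @PARTITIONNUM = 0
  ```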
- The other option in such cases is to run the job on only one node, either by setting APT_EXECUTION_MODE to sequential execution or by using a configuration file with one node (both are covered in the next item).
- Execution mode: sometimes, if the partitions are confusing, one can run the job in sequential mode. There are two ways to achieve this, sketched below:
  - Use the environment variable APT_EXECUTION_MODE and set it to sequential mode.
  - Use a configuration file with only one node.
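  A sketch of both options; APT_EXECUTION_MODE's accepted values can vary by version (ONE_PROCESS is the commonly documented single-process sequential setting), and the hostname and resource paths in the configuration file are placeholders:

  ```
  # Option 1: sequential, single-process execution (value assumed; check your docs)
  export APT_EXECUTION_MODE=ONE_PROCESS

  # Option 2: point APT_CONFIG_FILE at a one-node configuration file
  export APT_CONFIG_FILE=/tmp/one_node.apt
  ```

  where /tmp/one_node.apt contains a single node definition:

  ```
  {
    node "node1"
    {
      fastname "myhost"
      pools ""
      resource disk "/tmp/ds/data" {pools ""}
      resource scratchdisk "/tmp/ds/scratch" {pools ""}
    }
  }
  ```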
- A parallel job fails and the errors do not tell which row it failed for: in this case, if the job is simple, try building the equivalent server job and running it. Server jobs can report errors along with the rows that caused them. This is very useful when DB errors like primary/unique key violations, or any other DB error, are reported by the PX job.
- Sometimes, when dealing with a DB, if rows are not getting loaded as expected, adding reject links to the DB stages can help in locating the rows with issues.
- In a big job, adding some intermediate datasets or Peek stages to find out the data values at certain points can help. E.g. suppose there are 10 stages before the data goes to a dataset, with different operations done at different stages: after every 2 or 3 stages, add Peek stages or send the data to datasets using Copy stages, then check the values at these intermediate points and see if they shed some light on the issue.