
Saturday, July 28, 2012

DataStage Documentation Best Practices


Introduction

This document contains DataStage best practices and recommendations that can be used to improve the quality of DataStage jobs. It will be enhanced further to include specific DataStage problems and their troubleshooting.


Recommendations


 Part - 1 -------->> Here

• DataStage Version
• DataStage Job Naming Conventions
• DataStage Job Stage and Link Naming Conventions
• DataStage Job Descriptions
• DataStage Job Complexity
• DataStage Job Design
• Error and/or Reject Handling
• Process Exception Handling
• Standards
Part - 2 -------->> Here
 
• Development Guidelines
• Component Usage
• DataStage Data Types
• Partitioning Data
• Collecting Data
• Sorting
• Stage Specific Guidelines



DataStage Version

Use DataStage Enterprise Edition v8.1 rather than Server Edition for any future production development. Enterprise Edition has a more robust set of job stages, provides better performance through parallel processing, and offers greater future flexibility through its scalability.


DataStage Job Naming Conventions

For DataStage job development, a standard naming convention should be used for all job names. The convention can encode the type of job, the source of data, the type of data, the category, etc.
E.g.: jbCd_Aerodrome_Type_Code_RDS
jbDtl_Prty_Address_CAAMS

The job naming conventions for the conversion jobs don’t really need to change at this point. These jobs are likely executed only during the initial conversion and not used again after that. If any of these are to become part of the production process, then changing the job name to a standard format would be preferred.
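Such a convention is easy to check automatically. Below is a minimal Python sketch that validates job names against a jb<Type>_<Subject>_<SourceSystem> pattern; the prefix list and pattern are assumptions inferred from the two examples above, not an official standard:

import re

# Assumed convention, inferred from the examples above (not an official standard):
#   jb<Type>_<Subject>_<SourceSystem>, e.g. jbCd_Aerodrome_Type_Code_RDS
JOB_NAME = re.compile(
    r"^jb(Cd|Dtl)"          # job type prefixes (only the two seen above; extend as needed)
    r"(_[A-Za-z0-9]+)+"     # subject area words
    r"_[A-Z]{2,10}$"        # source system suffix, e.g. RDS, CAAMS
)

def is_standard(name: str) -> bool:
    """Return True if a job name follows the assumed convention."""
    return JOB_NAME.match(name) is not None

for job in ("jbCd_Aerodrome_Type_Code_RDS", "jbDtl_Prty_Address_CAAMS", "convJob1"):
    print(job, "->", "OK" if is_standard(job) else "non-standard")

A check like this can run as part of a code review or an export scan, so naming drift is caught before jobs reach production.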


DataStage Job Stage and Link Naming Conventions

It is recommended that a standard naming convention be adopted and adhered to for all jobs and sequences. Even where deviations are only minor variations and oversights, there should be consistency and completeness in naming stages and links.

Again, the conversion jobs are only to be executed once and addressing any inconsistencies does not make sense at this point. Future development should adhere to the defined standard.


DataStage Job Descriptions

Descriptions should be included in every DataStage job and job stage. For the job, this is facilitated in the Job Properties window, which allows both a Short and a Long Job Description; each stage has a similar field in its Stage Properties. Descriptions allow other developers, and those reviewing the code, to better understand the purpose of the job, how it accomplishes it, and any special processing. The more complex the job or stage, the more detail should be included. Even simple, self-explanatory jobs or stages require some sort of description.


DataStage Job Complexity

Production jobs, unlike one-off conversion jobs, should not be overly complex. They should complete a specific task with a minimal number of stages. Typically, data processing is broken up into Extraction, Staging, Validation, Transformation, Load Ready, and Load jobs. Each job in each category typically deals with one source or target table at a time, with DataSets used to pass data between jobs. The end result is that many more DataStage jobs are required to complete the same process, but each one is smaller, easier to understand, and easier to restart.
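As an illustration of the same staged pattern outside DataStage, here is a minimal Python sketch in which each step is a small, single-purpose unit that hands data to the next through an intermediate file, the way DataSets pass data between jobs. All file names, data, and functions are hypothetical:

import csv, pathlib

STAGING = pathlib.Path("staging")   # hand-off area playing the role of DataSets
STAGING.mkdir(exist_ok=True)

def extract():
    """Extraction step: land source rows untouched."""
    rows = [{"id": "1", "code": "yyz"}, {"id": "2", "code": "caams"}]  # stand-in source
    save("extract.csv", rows)

def transform():
    """Transformation step: reads only the previous step's output."""
    rows = [{**r, "code": r["code"].upper()} for r in load("extract.csv")]
    save("transform.csv", rows)

def save(name, rows):
    with open(STAGING / name, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "code"])
        writer.writeheader()
        writer.writerows(rows)

def load(name):
    with open(STAGING / name, newline="") as f:
        return list(csv.DictReader(f))

# Each step is a separate, restartable unit, like one DataStage job per category.
extract()
transform()

Because every step persists its output, a failure in a later step can be fixed and rerun without repeating the earlier ones.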


DataStage Job Design

Continue this design approach for any new development where there are similarities between jobs. It is always quicker to develop a new job if a similar job can be leveraged as a starting point. In addition, there is an opportunity to create Shared Containers with common code that can be reused across a number of jobs. This simplifies the development of each similar job, and only requires changes and maintenance of one version of the common code (the Shared Container). Any new development should consider job designs that allow Shared Containers to be utilized for common coding elements.
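The Shared Container idea maps directly onto ordinary code reuse. As a loose Python analogy (all names hypothetical), the common rule lives in one place and every "job" calls it, so a fix is made exactly once:

# shared_rules.py -- plays the role of a Shared Container: one copy of common logic
def standardize_code(value: str) -> str:
    """Common cleansing rule reused by many jobs."""
    return value.strip().upper()

# Any number of "jobs" reuse the same function instead of copying the logic:
print(standardize_code(" yyz "))    # job A -> YYZ
print(standardize_code("caams "))   # job B -> CAAMS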


Error and/or Reject Handling

Implement error handling to manage records that cannot be processed for various reasons. This includes records with bad data, missing data in not-null attributes, orphaned child records, missing code table entries, and other business rules that require excluding specific data or a complete UOW (unit of work). A reject process should also be considered if records are to be reprocessed at a later date, such as when code tables get updated, or when the missing parent records are finally processed. The staging area can be used to maintain record status so that successful and failed records can be tracked.
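A minimal Python sketch of such a reject process follows; the record layout, reject reasons, and status values are assumptions for illustration. Failed records keep a status and a reason so they can be retried later, for example after a code table is updated:

from dataclasses import dataclass

VALID_CODES = {"AD1", "AD2"}      # stand-in for a code table that may be updated later

@dataclass
class Record:
    key: str
    code: str
    status: str = "NEW"           # NEW -> LOADED or REJECTED (assumed status model)
    reject_reason: str = ""

def process(records):
    """Load valid records; mark the rest REJECTED with a reason for later reprocessing."""
    for rec in records:
        if not rec.code:
            rec.status, rec.reject_reason = "REJECTED", "missing not-null attribute"
        elif rec.code not in VALID_CODES:
            rec.status, rec.reject_reason = "REJECTED", "missing code table entry"
        else:
            rec.status = "LOADED"
    return records

batch = [Record("1", "AD1"), Record("2", ""), Record("3", "ZZ9")]
for rec in process(batch):
    print(rec.key, rec.status, rec.reject_reason)

# After the code table is updated, only the rejected records need to be re-run:
VALID_CODES.add("ZZ9")
process([r for r in batch if r.status == "REJECTED"])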


Process Exception Handling

Exception handling should be implemented in any production sequence job called by another sequence. Since the conversion process is likely a manual process, any failure could be dealt with manually as required.

On the other hand, in a production environment, dependent job sequences should not be executed if their predecessor job sequences do not complete successfully. The called sequence should include the Exception Handler and Terminator stages to prevent further processing when a job fails. This allows the problem to be addressed and the sequence restarted with fewer issues.
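The stop-on-failure behaviour can be sketched outside DataStage as well. In this hypothetical Python orchestrator (the commands are placeholders), a failing step is handled the way the Exception Handler and Terminator stages handle it: the failure is logged and dependent steps never run:

import subprocess, sys

# Hypothetical ordered sequence of dependent jobs (commands are placeholders).
SEQUENCE = [
    ["echo", "extract"],
    ["echo", "transform"],
    ["echo", "load"],
]

def run_sequence(steps):
    for step in steps:
        result = subprocess.run(step)
        if result.returncode != 0:
            # Exception handler: record the failure for investigation...
            print(f"step failed: {step}", file=sys.stderr)
            # ...terminator: stop here so dependent steps never run.
            sys.exit(result.returncode)

run_sequence(SEQUENCE)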


Standards

It is important to establish and follow consistent standards in:

• Directory structures for installation and application support directories
• Naming conventions, especially for DataStage Project categories, stage names, and links
• Documentation: all DataStage jobs should be documented with the Short Description field, as well as Annotation fields

It is the DataStage developer’s responsibility to make personal backups of their work on their local workstation, using DataStage's DSX export capability. This can also be used for integration with source code control systems.
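These personal backups can be scripted. The sketch below wraps a client-side export command from Python; the command name and flags vary by DataStage version and are assumptions here (check the export utility shipped with your client), but the pattern of dated, per-project DSX files holds:

import datetime, pathlib, subprocess

PROJECT = "MY_PROJECT"                            # hypothetical project name
BACKUP_DIR = pathlib.Path.home() / "dsx_backups"
BACKUP_DIR.mkdir(exist_ok=True)

# One dated DSX file per project per day keeps a simple local history.
stamp = datetime.date.today().isoformat()
target = BACKUP_DIR / f"{PROJECT}_{stamp}.dsx"

# NOTE: command and flags are version-dependent assumptions; substitute the
# export utility provided with your DataStage client.
subprocess.run(
    ["dscmdexport", "/H=myserver", "/U=myuser", "/P=mypassword", PROJECT, str(target)],
    check=True,
)
print("exported", PROJECT, "to", target)

Dated exports also make it easy to check a day's work into a source code control system.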


We continue in Part 2 ------>> Here




njoy the simplicity.......
Atul Singh