Nuts & Bolts of DataStage: ETL Job Design Standards

When using an off-the-shelf ETL tool, principles for software development do not change: we want our code to be reusable, robust, flexible, and manageable. To assist in the development, a set of best practices should be created for the implementation to follow. Failure to implement these practices usually result in problems further down the track, such as a higher cost of future development, increased time spent on administration tasks, and problems with reliability.

Although these standards are listed as taking place in ETL Physical Design, it is ideal that they be done before the prototype if possible. Once they are established once, they should be able to be re-used for future increments and only need to be reviewed.

Listed below are some standard best practice categories that should be identified on a typical project.

Naming Conventions that will be used across the ETL integration environment.

Release Management: The ETL version control approach that will be used; including version control within the tool itself.

Environments: How the ETL environment will be physically deployed in development, testing and production. This will generally be covered in the Solution Architecture.

Failover and Recovery: the strategy for handling load failures. This will include recommendations on whether milestone points and staging will not be required for restarts.

Error Handling: proposed standards for error trapping of jobs. This should be at a standards level, with detail of the design covered explicitly in a separate section of the physical design.

Process Reporting: status and row counts of jobs should be retrieved for accurate process reporting.

Notification: Identification is the manner in which information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.

Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications.

Optimization: Standards for improving performance such as parallelism or hash files. The more detailed design aspects of this approach is a separate section of the physical design.

Reusability: Standards around simplified design and use of shared components.

Metadata Management: Standards around metadata management as they apply to the ETL design.

Listed below are some of the major standards that apply to each of these categories.

Contents

1 Naming Conventions

1.1 Job Naming
1.2 Stage Naming
1.3 Link Naming
1.4 Database Action Types

2 Parameter Management Standards
3 Performance Optimization Design Standards
4 Reuse Standards for Common Jobs
5 Data Sourcing Standards
6 Data Loading Standards
7 Exception Handling Standards
8 Process Reporting and Job Statistics Standards
9 Notification Standards

Naming Conventions

There are a number of types of naming conventions to be used across the ETL environment. ETL naming conventions are important for giving all projects a consistent look and feel. A naming convention makes metadata reporting more successful by making it easy to determine data lineage and to identify ETL stages within metadata reports and job diagrams.

Typically ETL is executed as a set of jobs, each job processing a single source data entity and writing it to one or more output entities. A job is made up of stages and links. A stage carries out an action on data and a link transfers the data to the next stage.

Below is a suggested set of naming standards. Vendor-specific considerations could dictate variations from this set.

Job Naming

The job name uses underscores to identify different labels to describe the job. The following job naming template shows all the types of labels that can build a job name:

JobType_SourceSystem_TargetSystem_SourceEntity_TargetEntity_Action

The number of labels used depends on the specific requirements of the project and the nature of the particular job.

JobType indicates what type of job depending on what ETL tool is being used. Some example job types include Server, Parallel and Sequence. In this instance the job types can be abbreviations such as ser_, par_ and seq_.

SourceSystem and TargetSystem indicate which database or application or database type owns the source or target entity. This is typically a code or abbreviation that uniquely identifies the system. These are optional labels and are usually included to make job names unique across all folders and projects in an enterprise.

SourceEntity is a plain English description of what is being extracted. If it is a table or file name the underscores can be removed to form an unbroken entity name. If the source table has a technical encoded name the job name describes it more descriptively.

TargetEntity is optional and is only used if one type of data entity is outputted from the job. When the ETL job splits data and writes to different tables this label becomes misleading.

Action is used for jobs that write to databases and describes the action of the database write. Action label is chosen from the list of Database Action Codes below.

Fully qualified job name examples where the job name identifies the transition between systems:

par_sap_staging_customers
par_sap_staging_sales
par_staging_ods_customers_insupd
par_staging_ods_customers_ldr

Entity-only job name examples where the name identifies what entity transformation is occurring:

customers_customertemp
par_customers
customers_customerhistory
customers_insupd
customers_ldr

In these examples the name of the project and the name of the folder the job resides in indicates the what source and target system is being affected. For example, the folder is named SAP to Staging Loads.

Stage Naming

The stage name consists of a prefix that identifies the stage type followed by a description of the stage.

The prefix is the first two letters of the stage type or the first two initials of the stage type if multiple words occur.
For source and target stages the stage name includes the name of the table or file being used.
For transformation stages the stage name includes the primary objective of the stage or an important feature of the stage.

Link Naming

Links are named after the content of the data going down that link. For links that write to a data target a suffix indicates what type of write action from the Database Action Types below.

Database Action Types

This list shows the abbreviations that describe an action against a target table or file. These abbreviations are used in job names and link names where appropriate.

Ins - Insert
Upd - Update
Ups - Upsert, performs either an update or an insert
Del - Delete
Aug - Augment
App - Append
Ldr - Database Load

A combination of action types can be included in a name if they are performed in different stages, e.g., Customer_InsUpd.

Continued..............
Part 2 --> ETL Job Design Standards - 2

Nuts & Bolts of DataStage

Tuesday, November 12, 2013

ETL Job Design Standards - 1