
Friday, November 22, 2013

ETL Job Design Standards - 2

Part 1 --> ETL Job Design Standards - 1

Parameter Management Standards
This section defines standards to manage job parameters across environments. Jobs should use parameters liberally to avoid hard-coding values wherever possible. Some categories of parameters include:

  • Environmental parameters, such as directory names, file names, etc.
  • Database connection parameters
  • Notification email addresses
  • Processing options, such as degree of parallelism

The purposes of Parameter Management are: 

  • Each environment’s staging files, parameter lists, etc. are isolated from other environments.
  • Components can be migrated between environments (Development, Test, Production, etc.) without requiring any modifications.
  • Environmental values, such as directory names and database connection parameters, can be changed with minimal effort.
Parameters must be stored in a manner that is easy to maintain yet protected from inadvertent or malicious modification. Routines are created to read the parameters and set them for job executions.
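As a minimal sketch of this idea, the snippet below keeps one parameter file per environment (the file names, section names and `load_job_parameters` helper are illustrative assumptions, not part of any specific ETL tool), so Development, Test and Production values stay isolated and a job can be migrated without modification:

```python
import configparser
import os

def load_job_parameters(environment: str, param_dir: str = "params") -> dict:
    """Load parameters for one environment from its own INI file,
    keeping Development/Test/Production values isolated from each other."""
    path = os.path.join(param_dir, f"{environment}.ini")
    parser = configparser.ConfigParser()
    if not parser.read(path):
        raise FileNotFoundError(f"No parameter file for environment: {path}")
    # Flatten sections such as [database] and [paths] into dotted names
    # so jobs can look up e.g. params["database.host"].
    params = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            params[f"{section}.{key}"] = value
    return params
```

Because each environment owns its own file, changing a directory name or database connection for Production means editing one file, not every job.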

Performance Optimisation Design Standards
Performance Optimisation defines standards for optimising performance, such as using parallelism or in-memory lookup files. These design aspects are usually vendor-specific.

Reuse Standards for Common Jobs
Reuse standards define the approach for using shared components to simplify design. Identifying common jobs during physical design is the next iteration of the same task performed during logical design. As the design moves to the physical stage, opportunities for reuse become more apparent; common job opportunities should be identified and the team made aware of their capabilities.

Data Sourcing Standards
This section defines standards related to Data Sourcing, which involves reading data from a source file or database table, or collecting data from an API or messaging queue. For database sources, it can involve a join query that combines several tables into a flattened source. For example, when the data source is a database table the following is recommended:

  • Try to filter out rows that are not required. Database SQL filters can be very efficient and reduce the volume of data being brought onto the ETL server.
  • Where table joins and sorts are appropriate in a source query, it may be more efficient to have the database server do this processing rather than the ETL server.
  • Do not alter the metadata of the source data during this phase. Column renames and column functions in a sourcing ETL statement work, but this technique can hide the derivations from metadata management and reporting and break the chain of data lineage.
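The filter-pushdown point above can be sketched as follows. This is an illustrative example only, using an in-memory SQLite database to stand in for the source system; the table and column names are made up:

```python
import sqlite3

# In-memory SQLite stands in for the source database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "OPEN", 10.0), (2, "CLOSED", 20.0), (3, "OPEN", 30.0)])

# The WHERE clause filters on the database server, so only the rows the
# ETL process actually needs cross the network, and the columns keep
# their source names, preserving data lineage.
rows = conn.execute(
    "SELECT id, status, amount FROM orders WHERE status = 'OPEN' ORDER BY id"
).fetchall()
```

The same principle applies to joins and sorts: the database can often do them more cheaply than the ETL server can after pulling full tables.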
When the data source is a text file consider the following: 

  • Comma-separated files can be unreliable where the source data contains free-text fields. Data entry operators can add commas, quotes and even carriage-return characters to these fields, which disrupts the formatting of the file.
  • Complex flat files from a mainframe usually require a definition file in order to be readable by ETL tools. Files from COBOL applications are defined by a COBOL definition file. ETL tools use the definition file to determine the formatting of the file.
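To illustrate the free-text problem above: a proper CSV parser with quoting can round-trip fields containing commas, quotes and line breaks, which naive splitting on commas cannot. A small sketch using Python's standard `csv` module (the field values are invented):

```python
import csv
import io

# A free-text comment containing a comma, a quote and a newline.
comment = 'Customer said, "call back"\ntomorrow'

# Write one record with all fields quoted.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["1001", comment, "HIGH"])

# Reading with the csv module recovers the original fields intact;
# splitting the raw text on commas or newlines would mangle the record.
buf.seek(0)
row = next(csv.reader(buf))
```

This only works when the producing system quotes its output correctly; if it does not, a more robust delimiter (e.g. pipe or tab) or a fixed-width layout is safer.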

Data Loading Standards
Data Loading Standards define a common approach for loading data into the target environment; these choices impact performance, data integrity and error handling. For database targets there are multiple types of write actions:

  • Insert/Append
  • Update
  • Insert or Update
  • Update or Insert
  • Delete
Usually the stage also has options to clear or truncate the table prior to delivery. The choice between insert/update and update/insert is important from a performance point of view. For an insert/update action the job attempts to insert the row first and, if it is already present, performs an update; update/insert does the reverse. Choose the order based on whether most incoming rows are new or already exist in the target. Database-specific options can include transaction size, truncation before insert, etc. For non-database targets, the standards for file, messaging, API or other output are defined here.
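The insert-first write action described above can be sketched like this, again using SQLite as a stand-in target and an invented `target` table; real ETL tools implement this stage natively:

```python
import sqlite3

def insert_or_update(conn, key, value):
    """Insert/update write action: try the insert first, fall back to an
    update if the row already exists. Efficient when most rows are new."""
    try:
        conn.execute("INSERT INTO target (id, val) VALUES (?, ?)", (key, value))
    except sqlite3.IntegrityError:  # primary key already present
        conn.execute("UPDATE target SET val = ? WHERE id = ?", (value, key))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
insert_or_update(conn, 1, "a")   # new row: the insert succeeds
insert_or_update(conn, 1, "b")   # existing row: falls back to an update
```

If most incoming rows already existed in the target, an update-first (update/insert) routine would avoid the cost of the failed inserts.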

Exception Handling Standards
This section outlines standards for common routines and procedures for trapping and handling errors and for reporting process statistics. The objectives are to:

  • Find all problems with a row, not just the first problem detected.
  • Avoid row leakage, where rows are dropped or rejected without notification.
  • Report all problems.
  • Report process statistics.
  • Interface with other processes which work to resolve exception issues.
It also defines what error detection mechanisms are required, and what is to be done when an exception is detected.
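The first two objectives above can be sketched as follows. This is a hypothetical validation routine (the field names and rules are invented); the key points are that it collects every problem with a row rather than stopping at the first, and that every input row is accounted for as either loaded or rejected-with-reason, so no row leaks away silently:

```python
def validate_row(row: dict) -> list:
    """Return ALL problems found with a row, not just the first one."""
    errors = []
    if not row.get("id"):
        errors.append("missing id")
    if row.get("amount") is not None and row["amount"] < 0:
        errors.append("negative amount")
    if not row.get("currency"):
        errors.append("missing currency")
    return errors

def process(rows):
    good, rejected = [], []
    for row in rows:
        errors = validate_row(row)
        if errors:
            rejected.append((row, errors))  # rejected WITH notification
        else:
            good.append(row)
    # No row leakage: every input row is either loaded or rejected.
    assert len(good) + len(rejected) == len(rows)
    return good, rejected
```

The rejected list, with its per-row error details, is what downstream exception-resolution processes and reporting would consume.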

Process Reporting and Job Statistics Standards
Process Reporting Standards define how status and row counts of jobs are to be retrieved, formatted and stored for accurate exception and process reporting.
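As a rough illustration of such a record (the field layout is an assumption, not a standard), one statistics row per job run might look like:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class JobStats:
    """One process-reporting record per job run (hypothetical layout)."""
    job_name: str
    status: str          # e.g. "FINISHED" or "ABORTED"
    rows_read: int
    rows_written: int
    rows_rejected: int
    run_at: str          # ISO-8601 timestamp of the run

def record_stats(job_name, status, rows_read, rows_written, rows_rejected):
    stats = JobStats(job_name, status, rows_read, rows_written, rows_rejected,
                     datetime.now(timezone.utc).isoformat())
    return asdict(stats)  # a dict, ready to store or format for reporting
```

A useful cross-check on such records is that rows read should reconcile with rows written plus rows rejected; a mismatch indicates row leakage.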

Notification Standards
Notification Standards define how information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.