This section defines standards for managing job parameters
across environments. Jobs should use parameters liberally, avoiding hard-coded
values wherever possible. Common categories of parameters include:
- Environmental parameters, such as directory names, file names, etc.
- Database connection parameters
- Notification email addresses
- Processing options, such as degree of parallelism
The purposes of Parameter Management are to ensure that:
- Each environment’s staging files, parameter lists, etc. are isolated from other environments.
- Components can be migrated between environments (Development, Test, Production, etc.) without requiring any modifications.
- Environmental values, such as directory names and database connection parameters, can be changed with minimal effort.
Parameters must be stored in a manner that is easy to
maintain yet protected from inadvertent or malicious modification. Common
routines read the parameters and set them for each job execution.
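As a sketch of this approach, the routine below reads a per-environment parameter file. The file layout and names (`config/dev.json`, `staging_dir`, `ETL_ENV`) are illustrative assumptions, not prescribed names:

```python
import json
import os

def load_parameters(environment: str, config_dir: str = "config") -> dict:
    """Read the parameter file for the given environment (dev, test, prod).

    Each environment keeps its own file, e.g. config/dev.json, so jobs can
    migrate between environments without modification.
    """
    path = os.path.join(config_dir, f"{environment}.json")
    with open(path) as fh:
        return json.load(fh)

# A job would then resolve every environmental value from the parameter set:
# params = load_parameters(os.environ.get("ETL_ENV", "dev"))
# staging_dir = params["staging_dir"]
```

Keeping one file per environment keeps each environment's values isolated, and file permissions on the config directory provide the protection from modification.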
Performance Optimisation Design Standards
This section defines standards for optimising job
performance, such as using parallelism or in-memory lookup files. These design
aspects are usually vendor-specific.
Reuse Standards for Common Jobs
Reuse standards define the approach for using shared
components to simplify design. Identifying common jobs during physical design
is the next iteration of the same task begun during logical design: as the
design becomes more concrete, opportunities for reuse become more apparent. At
this stage, candidate common jobs should be identified and the team made aware
of their capabilities.
Data Sourcing Standards
This section defines standards related to Data Sourcing,
which involves reading data from a source file or database table, or collecting
data from an API or messaging queue. For database sources this can involve a
join query that combines several tables to provide a flattened source. When the
data source is a database table, the following is recommended:
- Try to filter out rows that are not required. Database SQL filters can be very efficient and reduce the volume of data being brought onto the ETL server.
- Where table joins and sorts are appropriate in a source query, it may be more efficient to have the database server do this processing rather than the ETL server.
- Do not alter the metadata of the source data during this phase. Column renames and column functions in a sourcing statement will work, but they can hide derivations from metadata management and reporting and break the chain of data lineage.
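The recommendations above can be illustrated with a sketch. The table and column names (orders, customers) are hypothetical; the point is that filtering, joining and sorting are pushed to the database server while every column keeps its source name, preserving lineage:

```python
import sqlite3

# Filter, join and sort run on the database server; only the rows the ETL
# actually needs cross the wire, and no column is renamed or derived here.
SOURCE_QUERY = """
    SELECT o.order_id, o.customer_id, o.order_total, c.customer_name
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.order_total > ?
    ORDER BY o.order_id
"""

def read_source(conn: sqlite3.Connection, min_total: float) -> list:
    """Return only the rows required, already joined and sorted."""
    return conn.execute(SOURCE_QUERY, (min_total,)).fetchall()
```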
When the data source is a text file consider the following:
- Comma separated files can be unreliable where the source data contains free-text fields. Data entry operators can add commas, quotes and even carriage-return characters to these fields, which disrupts the formatting of the file.
- Complex flat files from a mainframe usually require a definition file in order to be readable by ETL tools. Files from COBOL applications are defined by a COBOL definition file. ETL tools use the definition file to determine the formatting of the file.
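The free-text problem above can be demonstrated with Python's standard csv module. The sample record is invented; a naive split on commas would mis-parse it, while a quote-aware reader recovers the fields intact:

```python
import csv
import io

# A free-text comment field containing a comma, embedded quotes and a
# carriage return - exactly the data-entry artefacts that break naive parsing.
raw = 'id,name,comment\r\n1,Smith,"Called, said ""urgent""\r\nwill retry"\r\n'

def parse_rows(text: str) -> list:
    # csv.reader understands quoting, so the embedded comma, quotes and
    # line break survive as part of a single field.
    return list(csv.reader(io.StringIO(text)))
```

By contrast, `raw.split(",")` would yield the wrong number of fields, and splitting on line endings would cut the quoted comment in half.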
Data Loading Standards
Data Loading Standards define a common approach for loading
data into the target environment, covering options that affect performance,
data integrity and error handling. For database targets there are multiple
types of write actions:
- Insert/Append
- Update
- Insert or Update
- Update or Insert
- Delete
Usually the load stage also has options to clear or truncate
the target table prior to delivery. The order of the combined write actions
(insert/update versus update/insert) matters for performance: an insert/update
action attempts to insert each row first and falls back to an update when the
row already exists, while update/insert does the reverse. Choose insert/update
when most incoming rows are new, and update/insert when most rows are already
present in the target. Database-specific options can include transaction size,
truncation before insert, and so on. For non-database targets, the standards
define equivalent conventions for file, messaging, API or other outputs.
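The insert/update versus update/insert trade-off can be sketched as two small routines. SQLite is used purely for illustration; the table and column names are hypothetical:

```python
import sqlite3

def insert_or_update(conn, key, value):
    """Insert first, update on conflict: efficient when most rows are new."""
    try:
        conn.execute("INSERT INTO target (id, val) VALUES (?, ?)", (key, value))
    except sqlite3.IntegrityError:
        conn.execute("UPDATE target SET val = ? WHERE id = ?", (value, key))

def update_or_insert(conn, key, value):
    """Update first, insert if absent: efficient when most rows exist."""
    cur = conn.execute("UPDATE target SET val = ? WHERE id = ?", (value, key))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO target (id, val) VALUES (?, ?)", (key, value))
```

Each routine does one round trip in its common case and two in its fallback case, which is why the expected mix of new versus existing rows should drive the choice.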
Exception Handling Standards
This section outlines standards for the common routines and
procedures that trap and handle errors and report process statistics. The
objectives are to:
- Find all problems with a row, not just the first problem detected.
- Avoid row leakage, where rows are dropped or rejected without notification.
- Report all problems.
- Report process statistics.
- Interface with other processes which work to resolve exception issues.
It also defines what error detection mechanisms are
required, and what is to be done when an exception is detected.
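As an illustrative sketch of these objectives, the routine below collects every problem with a row (not just the first), routes each row without silent leakage, and reconciles the counts. The field names and validation rules are invented for the example:

```python
def validate_row(row: dict) -> list:
    """Collect every problem with the row, not just the first one detected."""
    problems = []
    if not row.get("customer_id"):
        problems.append("missing customer_id")
    if row.get("amount", 0) < 0:
        problems.append("negative amount")
    if len(row.get("country", "")) != 2:
        problems.append("country code must be 2 characters")
    return problems

def process(rows: list) -> tuple:
    """Route every row to accepted or rejected so none leaks silently."""
    accepted, rejected = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejected.append((row, problems))  # row kept, with all its problems
        else:
            accepted.append(row)
    # Row-count reconciliation: input must equal accepted plus rejected.
    assert len(rows) == len(accepted) + len(rejected)
    return accepted, rejected
```

The rejected list, carrying every problem per row, is the natural interface to downstream processes that resolve exceptions.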
Process Reporting and Job Statistics Standards
Process Reporting Standards define how status and row counts
of jobs are to be retrieved, formatted and stored for accurate exception and
process reporting.
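A minimal sketch of such a statistics record, with invented field names, might look like this:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class JobStats:
    job_name: str
    rows_read: int = 0
    rows_written: int = 0
    rows_rejected: int = 0

    def reconciles(self) -> bool:
        # Rows read must equal rows written plus rows rejected;
        # anything else means rows leaked without notification.
        return self.rows_read == self.rows_written + self.rows_rejected

    def to_json(self) -> str:
        # Stored in a structured form so exception reports and process
        # reports can be produced from the same record.
        return json.dumps(asdict(self))
```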
Notification Standards
Notification Standards define how information about
successful and unsuccessful runs is delivered to the administrator and relevant
stakeholders.