Source : Navigating the many paths of metadata for DataStage 8
Looking at the methods for importing metadata table definitions into DataStage 8 ETL jobs.
All of the metadata import methods of DataStage 7 are in DataStage 8 and all execute in the same way. Developers familiar with previous versions will be right at home! What is tricky to come to terms with are all the new ways to get metadata into DataStage 8 and the Metadata Server. The Metadata Server can provide reporting on metadata from products outside of DataStage such as BI tools so in some cases you might be importing the metadata for reporting and not for ETL.
The list that follows covers just the techniques for importing metadata to be used by DataStage jobs.
Something about DataStage, DataStage Administration, Job Designing,Developing, DataStage troubleshooting, DataStage Installation & Configuration, ETL, DataWareHousing, DB2, Teradata, Oracle and Scripting.
Showing posts with label Vincent McBurney. Show all posts
Showing posts with label Vincent McBurney. Show all posts
Monday, July 21, 2014
Friday, July 13, 2012
DataStage tip for beginners: developer short cuts
Hi guys,
Last night I am googling for some shortcut to execute jobs, some tool
tricks. I found this another great Article by Vincent McBurney. So sharing that
here with you all…will join you soon with the DS command line tricks.
Import Export
* When you do an export cut and paste the export file name. When you go to your project and run an import paste the file name instead of having to browse for it. While export and import independently remember the last file name used they do not share that name between each other.
* When you switch
export type between category and individual job it is quick to switch the type,
close the export form and open it again. That way the job name or category you
have highlighted will be automatically picked.
There is an Export option to export by individual job name or export by category name. This is on the second tab in the export form. Often when you go to export something it is on the wrong option,
There is an Export option to export by individual job name or export by category name. This is on the second tab in the export form. Often when you go to export something it is on the wrong option,
Labels:
Administration
,
copy
,
DataStage
,
ETL
,
export
,
import
,
Job
,
modify
,
sequence
,
stages
,
switch
,
transformer
,
Utility
,
Vincent McBurney
Sunday, July 01, 2012
17 Mistakes that ETL Designers make with Very Large Data
Source : 17
Mistakes that ETL Designers make with Very Large Data -
Vincent McBurney
These are
the mistakes that ETL designers can make when processing scary high data
volumes.
Dan Lindstedt is a very large data
guru and he has a series of outstanding blog posts on very large databases, the
latest is ETL Engines: VLDW & Loading / Transforming. I first
came across Dan on ETL/ELT forums where he has over 20,000 forum
posts. He popped up on B-Eye-Network blogs. Dan has had no comments
on his latest post series yet as B-Eye don't have a reader friendly design
and it discourages reader participation. For example I just got three
comments over a weekend on three old archived ETL posts. ITToolbox is a
friendly place for reader participation.
My favorite part of Dan's latest post
is the 17 mistakes that ETL Designers make with very large data. In fact
that's the title I would have gone for! For some reason blog titles with
a number in them attract more hits. I've shown Dan's list of 17 below
with my own comments on how that impacts DataStage developers. I would
love to hear your own contributions for common ETL design mistakes.
1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.
Agree 100%. I believe in have
at least three bands in your processing: Extract from source to file, process
file to load ready dataset and load load ready dataset. For a standard
target table load I would have an insert job, an update job, a delete job (if
needed), a bulk load job (for large volumes).
Friday, June 29, 2012
10 Ways to Make DataStage Run Slower
Source : 10 Ways to Make DataStage Run Slower - Vincent McBurney
Everyone wants to tell you how to make your ETL jobs run faster, well here is how to make them slower!
Everyone wants to tell you how to make your ETL jobs run faster, well here is how to make them slower!
The Structured Data blog
has posted a list Top
Ways How Not To Scale Your Data Warehouse that is a great chat about bad
ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make
DataStage jobs slower! How do you puts the breaks on a DataStage job that
supposed to be running on a massively scalable parallel architecture.
1. Use the same
configuration file for all your jobs.
You may have two nodes
configured for each CPU on your DataStage server and this allows your high
volume jobs to run quickly but this works great for slowing down your small
volume jobs. A parallel job with a lot of nodes to partition across is a bit
like the solid wheel on a aerodrome racing bike, they take a lot of time to
crank up to full speed but once you are there they are lightning fast. If you
are processing a handful of rows the configuration file will instruct the job
to partition those rows across a lot of processes and then repartition them at
the end. So a job that would take a second or less on a single node can run for
5-10 seconds across a lot of nodes and a squadron of these jobs will slow down
your entire DataStage batch run!
2. Use a sparse database
lookup on high volumes.
This is a great way to slow
down any ETL tool, it works on server jobs or parallel jobs. The main
difference is that server jobs only do sparse database lookups - the only way
to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs
by default do cached lookups where the entire database table is moved into a
lookup fileset either in memory of if it's too large into scratch space on the
disk. You can slow parallel jobs down by changing the lookup to a sparse lookup
and for every row processed it will send a lookup SQL statement to the
database. So if you process 10 million rows you can send 10 million SQL
statements to the database! That will put the brakes on!
3. Keep resorting your
data.
Sorting is the Achilles
heel of just about any ETL tool, the average ETL job is like a busy restaurant,
it makes a profit by getting the diners in and out quickly and serving multiple
seatings. If the restaurant fits 100 people can feed several hundred in a
couple hours by processing each diner quickly and getting them out the door.
The sort stage is like having to waiting until every person who is going to eat
at that restaurant for that night has arrived and has been put in order of
height before anyone gets their food. You need to read every row before you can
output your sort results. You can really slow your DataStage parallel jobs down
by putting in more than one sort, or giving a job data that is already sorted
by the SQL select statement but sorting it again anyway!
4. Design single threaded
bottlenecks
This is really easy to do
in server edition and harder (but possible) in parallel edition. Devise a step
on the critical path of your batch processing that takes a long time to finish
and only uses a small part of the DataStage engine. Some good bottlenecks: a
large volume Server Job that hasn't been made parallel by multiple instance or
interprocess functionality. A script FTP of a file that keeps an entire
DataStage Parallel engine waiting. A bulk database load via a single update
stream. Reading a large sequential file from a parallel job without using
multiple readers per node.
5. Turn on debugging and
forget that it's on
In a parallel job you can
turn on a debugging setting that forces it to run in sequential mode, forever!
Just turn it on to debug a problem and then step outside the office and get run
over by a tram. It will be years before anyone spots the bottleneck!
6. Let the disks look after
themselves
Never look at what is
happening on your disk I/O - that's a Pandora's Box of better performance! You
can get some beautiful drag and slow down by ignoring your disk I/O as parallel
jobs write a lot of temporary data and datasets to the scratch space on each
node and write out to large sequential files. Disk striping or partitioning or
choosing the right disk type or changing the location of your scratch space are
all things that stand between you and slower job run times.
7. Keep Writing that Data
to Disk
Staging of data can be a
very good idea. It can give you a rollback point for failed jobs, it can give
you a transformed dataset that can be picked up and used by multiple jobs, it
can give you a modular job design. It can also slow down Parallel Jobs like no
tomorrow - especially if you stage to sequential files! All that repartitioning
to turn native parallel datasets into a stupid ASCII metadata dumb file and
then import and repartition to pick it up and process it again. Sequential
files are the Forest Gump of file storage, simple and practical but dumb as all
hell. It costs time to write to one and time to read and parse them so
designing an end to end process that writes data to sequential files repeatedly
will give you massive slow down times.
8. Validate every field
A lot of data comes from
databases. Often DataStage pulls straight out of these databases or saves the
data to an ASCII file before being processed by DataStage. One way to slow down
your job and slow down your ETL development and testing is to validate and
transform metadata even though you know there is nothing wrong with it. For
example, validating that a field is VARCHAR(20) using DataStage functions even
though the database defines the source field as VARCHAR(20). DataStage has
implicit validation and conversion of all data imported that validates that
it's the metadata you say it is. You can then do explicit metadata conversion
and validation on top of that. Some fields need explicit metadata conversion -
such as numbers in VARCHAR fields and dates in string fields and packed fields,
but most don't. Adding a layer of validation you don't need should slow those
jobs down.
9. Write extra steps in
database code
The same phrase gets
uttered on many an ETL project. "I can write that in SQL", or "I
can write that in Java", or "I can do that in an Awk script".
Yes, we know, we know that just about any programming language can do just
about anything - but leaving a complex set of steps as a prequel or sequel to
an ETL job is like leaving a turd on someones doorstep. You'll be long gone
when someone comes to clean it up. This is a sure fire way to end up with a
step in the end to end integration that is not scalable, is poorly documented,
cannot be easily modified and slows everything down. If someone starts saying
"I can write that in..." just say "okay, if you sign a binding
contract to support it for every day that you have left on this earth".
10. Don't do Performance
Testing
Do not take your highest
volume jobs into performance testing, just keep the default settings, default
partitioning and your first draft design and throw that into production and get
the hell out of there.
Atul Singh
Tuesday, June 26, 2012
Ten Reasons Why You Need DataStage 8.5
Source: it.toolbox.com - VincentI have taken a look through the new functions and capabilities of DataStage 8.5 and come up with a top ten list of why you should upgrade to it.
Information Server 8.5 came out a couple weeks ago and is currently available on IBM Passport Advantage for existing customers and from IBM PartnerWorld for IM partners. The XML pack described below is available as a separate download from the IBM Fix Central website.
This is a list of the ten best things in DataStage 8.5. Most of these are improvements in DataStage Parallel Jobs only while a couple of them will help Server Job customers as well.
The new XML read and transform stages are much better at reading large and complex XML files and processing them in parallel:

With looping inside a Transformer you can output multiple rows for each input row. In this example a record has a company name and four revenue sales figures for four regions – the loop will go through each column and output a row for each value if it is populated:

Transformer Remembering
DataStage 8.5 Transformer has Remembering and key change detection which is something that ETL experts have been manually coding into DataStage for years using some well known workarounds. A key change in a DataStage job involves a group of records with a shared key where you want to process that group as a type of array inside the overall recordset.
I am going to make a longer post about that later but there are two new cache objects inside a Transformer – SaveInputRecord() and GetSavedInputRecord(0 where you can save a record and retrieve it later on to compare two or more records inside a Transformer.
There are new system variables for looping and key change detection - @ITERATION, LastRow() indicates the last row in a job, LastTwoInGroup(InputColumn) indicates a particular column value will change in the next record.
Here is an aggregation example where rows are looped through and an aggregate row is written out when the key changes:
Information Server 8.5 came out a couple weeks ago and is currently available on IBM Passport Advantage for existing customers and from IBM PartnerWorld for IM partners. The XML pack described below is available as a separate download from the IBM Fix Central website.
This is a list of the ten best things in DataStage 8.5. Most of these are improvements in DataStage Parallel Jobs only while a couple of them will help Server Job customers as well.
1. IT’S FASTER
Faster, faster, faster. A lot of tasks in DataStage 8.5 are at least 40% faster than 8.1 such as starting DataStage, opening a job, running a Parallel job and runtime performance have all improved.2. IT' IS NOW AN XML ETL TOOL
Previous versions of DataStage were mediocre at processing XML. DataStage 8.5 is a great XML processing tool. It can open, understand and store XML schema files. I did a longer post about just this pack in New Hierarchical Transformer makes DataStage great a XML Tool and if you have XML files without schemas you can follow a tip at the DataStage Real Time blog: The new XMLPack in 8.5….generating xsd’s….The new XML read and transform stages are much better at reading large and complex XML files and processing them in parallel:
3. TRANSFORMER LOOPING
The best Transformer yet. The DataStage 8.5 parallel transformer is the best version yet thanks to new functions for looping inside a transformer and performing transformations across a grouping of records.With looping inside a Transformer you can output multiple rows for each input row. In this example a record has a company name and four revenue sales figures for four regions – the loop will go through each column and output a row for each value if it is populated:
Transformer Remembering
DataStage 8.5 Transformer has Remembering and key change detection which is something that ETL experts have been manually coding into DataStage for years using some well known workarounds. A key change in a DataStage job involves a group of records with a shared key where you want to process that group as a type of array inside the overall recordset.
I am going to make a longer post about that later but there are two new cache objects inside a Transformer – SaveInputRecord() and GetSavedInputRecord(0 where you can save a record and retrieve it later on to compare two or more records inside a Transformer.
There are new system variables for looping and key change detection - @ITERATION, LastRow() indicates the last row in a job, LastTwoInGroup(InputColumn) indicates a particular column value will change in the next record.
Here is an aggregation example where rows are looped through and an aggregate row is written out when the key changes:
4. EASY TO INSTALL
Easier to install and more robust. DataStage 8.5 has the best installer of any version of DataStage ever. Mind you – I jumped aboard the DataStage train in version 3.6 so I cannot vouch for earlier installers but 8.5 has the best wizard, the best pre-requisite checking and the best recovery. It also has the IBM Support Assistant packs for Information Server that make debugging and reporting of PMRs to IBM much easier. There is also a Guide to Migrating to InfoSphere Information Serve 8.5 that explains how to migrate from most earlier versions.See my earlier blog post Why Information Server 8.5 is Easier to Install than Information Server 8.1.
Patch Merge – that’s right, patch merge. The new installer has the ability to merge patches and fixes into the install for easier management of patches and fixes.
5. CHECK IN AND CHECK OUT JOBS
Check in and Check out version control. DataStage 8.5 Manager comes with direct access to the source control functions of CVS and Rational ClearCase in an Eclipse workspace. You can send artefacts to the source control system and replace a DataStage component from out of the source control system.DataStage 8.5 comes with out of the box menu integration with CVS and Rational ClearCase but for other source control systems you need to use the Eclipse source control plugins.
6. HIGH AVAILABILITY EASIER THAN EVER
High Availability – the version 8.5 installation guide has over thirty pages on Information Server topologies including a bunch of high availability scenarios across all tiers of the product. On top of that there are new chapters for the high availability of the metadata repository, the services layer and the DataStage engine.- Horizontal and vertical scaling and load balancing.
- Cluster support for WebSphere Application Server.
- Cluster support for XMETA repository: DB2 HADR/Cluster or Oracle RAC.
- Improved failover support on the engine.
7. NEW INFORMATION ARCHITECTURE DIAGRAMMING TOOL
InfoSphere Blueprint Direct – DataStage 8.5 comes with a free new product for creating diagrams of an information architecture and linking elements in the diagram directly into DataStage jobs and Metadata Workbench metadata. Solution Architects can draw a diagram of a data integration solution including sources, Warehouses and repositories.8. VERTICAL PIVOT
There are people out there who have been campaigning for vertical pivot for a long time – you know who you are! It is now available and it can pivot multiple input rows with a common key into output rows with multiple columns. Key based groups, columnar pivot and aggregate functions.You can also do this type of vertical pivoting in the new Transformer using the column change detection and row cache – but the Vertical pivot stage makes it easier as a specialised stage.
9. Z/OS FILE STAGE
Makes it easier to process complex flat files by providing native support for mainframe files. Use it for VSAM files – KSDS, ESDS, RRDS. Sequential QSAM, BDAM, BSAM. Fixed and variable length records. Single or multiple record type files.10. BALANCED OPTIMIZER COMES HOME
In DataStage 8.5 the Balanced Optimizer has been merged into the Designer and it has a number of usability improvements that turns DataStage into a better ETLT or ELT option. Balanced Optimizer looks at a normal DataStage job and comes up with a version that pushes some of the steps down onto a source or target database engine. IE it balances the load across the ETL engine and the database engines.Version 8.5 has improved logging, improved impact analysis support and easier management of optimised versions of jobs in terms of creating, deleting, renaming, moving, compiling and deploying them.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.
njoy the simplicity.......
Atul Singh
Subscribe to:
Posts
(
Atom
)