Showing posts with label Vincent McBurney. Show all posts

Monday, July 21, 2014

Navigating the many paths of metadata for DataStage 8

Source : Navigating the many paths of metadata for DataStage 8

Looking at the methods for importing metadata table definitions into DataStage 8 ETL jobs.
All of the metadata import methods of DataStage 7 are in DataStage 8 and all execute in the same way. Developers familiar with previous versions will be right at home! What is tricky to come to terms with are all the new ways to get metadata into DataStage 8 and the Metadata Server. The Metadata Server can provide reporting on metadata from products outside of DataStage such as BI tools so in some cases you might be importing the metadata for reporting and not for ETL.
The list that follows covers just the techniques for importing metadata to be used by DataStage jobs.

DataStage tip for beginners: developer short cuts

Hi guys,

Last night I am googling for some shortcut to execute jobs, some tool tricks. I found this another great Article by Vincent McBurney. So sharing that here with you all…will join you soon with the DS command line tricks.

Import Export

* When you do an export cut and paste the export file name. When you go to your project and run an import paste the file name instead of having to browse for it. While export and import independently remember the last file name used they do not share that name between each other.

* When you switch export type between category and individual job it is quick to switch the type, close the export form and open it again. That way the job name or category you have highlighted will be automatically picked.
There is an Export option to export by individual job name or export by category name. This is on the second tab in the export form. Often when you go to export something it is on the wrong option,

17 Mistakes that ETL Designers make with Very Large Data

Source : 17 Mistakes that ETL Designers make with Very Large Data - Vincent McBurney

These are the mistakes that ETL designers can make when processing scary high data volumes.

Dan Lindstedt is a very large data guru and he has a series of outstanding blog posts on very large databases, the latest is ETL Engines: VLDW & Loading / Transforming. I first came across Dan on ETL/ELT forums where he has over 20,000 forum posts. He popped up on B-Eye-Network blogs. Dan has had no comments on his latest post series yet as B-Eye don't have a reader friendly design and it discourages reader participation. For example I just got three comments over a weekend on three old archived ETL posts. ITToolbox is a friendly place for reader participation.

My favorite part of Dan's latest post is the 17 mistakes that ETL Designers make with very large data. In fact that's the title I would have gone for! For some reason blog titles with a number in them attract more hits. I've shown Dan's list of 17 below with my own comments on how that impacts DataStage developers. I would love to hear your own contributions for common ETL design mistakes.

1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.

Agree 100%. I believe in have at least three bands in your processing: Extract from source to file, process file to load ready dataset and load load ready dataset. For a standard target table load I would have an insert job, an update job, a delete job (if needed), a bulk load job (for large volumes).

10 Ways to Make DataStage Run Slower

Source : 10 Ways to Make DataStage Run Slower - Vincent McBurney

Everyone wants to tell you how to make your ETL jobs run faster, well here is how to make them slower!

The Structured Data blog has posted a list Top Ways How Not To Scale Your Data Warehouse that is a great chat about bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do you puts the breaks on a DataStage job that supposed to be running on a massively scalable parallel architecture.

1. Use the same configuration file for all your jobs.

You may have two nodes configured for each CPU on your DataStage server and this allows your high volume jobs to run quickly but this works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a aerodrome racing bike, they take a lot of time to crank up to full speed but once you are there they are lightning fast. If you are processing a handful of rows the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes and a squadron of these jobs will slow down your entire DataStage batch run!

2. Use a sparse database lookup on high volumes.

This is a great way to slow down any ETL tool, it works on server jobs or parallel jobs. The main difference is that server jobs only do sparse database lookups - the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups where the entire database table is moved into a lookup fileset either in memory of if it's too large into scratch space on the disk. You can slow parallel jobs down by changing the lookup to a sparse lookup and for every row processed it will send a lookup SQL statement to the database. So if you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!

3. Keep resorting your data.

Sorting is the Achilles heel of just about any ETL tool, the average ETL job is like a busy restaurant, it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people can feed several hundred in a couple hours by processing each diner quickly and getting them out the door. The sort stage is like having to waiting until every person who is going to eat at that restaurant for that night has arrived and has been put in order of height before anyone gets their food. You need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or giving a job data that is already sorted by the SQL select statement but sorting it again anyway!

4. Design single threaded bottlenecks

This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality. A script FTP of a file that keeps an entire DataStage Parallel engine waiting. A bulk database load via a single update stream. Reading a large sequential file from a parallel job without using multiple readers per node.

5. Turn on debugging and forget that it's on

In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!

6. Let the disks look after themselves

Never look at what is happening on your disk I/O - that's a Pandora's Box of better performance! You can get some beautiful drag and slow down by ignoring your disk I/O as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out to large sequential files. Disk striping or partitioning or choosing the right disk type or changing the location of your scratch space are all things that stand between you and slower job run times.

7. Keep Writing that Data to Disk

Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed dataset that can be picked up and used by multiple jobs, it can give you a modular job design. It can also slow down Parallel Jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into a stupid ASCII metadata dumb file and then import and repartition to pick it up and process it again. Sequential files are the Forest Gump of file storage, simple and practical but dumb as all hell. It costs time to write to one and time to read and parse them so designing an end to end process that writes data to sequential files repeatedly will give you massive slow down times.

8. Validate every field

A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII file before being processed by DataStage. One way to slow down your job and slow down your ETL development and testing is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). DataStage has implicit validation and conversion of all data imported that validates that it's the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion - such as numbers in VARCHAR fields and dates in string fields and packed fields, but most don't. Adding a layer of validation you don't need should slow those jobs down.

9. Write extra steps in database code

The same phrase gets uttered on many an ETL project. "I can write that in SQL", or "I can write that in Java", or "I can do that in an Awk script". Yes, we know, we know that just about any programming language can do just about anything - but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someones doorstep. You'll be long gone when someone comes to clean it up. This is a sure fire way to end up with a step in the end to end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".

10. Don't do Performance Testing

Do not take your highest volume jobs into performance testing, just keep the default settings, default partitioning and your first draft design and throw that into production and get the hell out of there.

njoy the simplicity.......
Atul Singh

Tuesday, June 26, 2012

Ten Reasons Why You Need DataStage 8.5

Source: it.toolbox.com - VincentI have taken a look through the new functions and capabilities of DataStage 8.5 and come up with a top ten list of why you should upgrade to it.
Information Server 8.5 came out a couple weeks ago and is currently available on IBM Passport Advantage for existing customers and from IBM PartnerWorld for IM partners. The XML pack described below is available as a separate download from the IBM Fix Central website.
This is a list of the ten best things in DataStage 8.5. Most of these are improvements in DataStage Parallel Jobs only while a couple of them will help Server Job customers as well.

1. IT’S FASTER

Faster, faster, faster. A lot of tasks in DataStage 8.5 are at least 40% faster than 8.1 such as starting DataStage, opening a job, running a Parallel job and runtime performance have all improved.

2. IT' IS NOW AN XML ETL TOOL

Previous versions of DataStage were mediocre at processing XML. DataStage 8.5 is a great XML processing tool. It can open, understand and store XML schema files. I did a longer post about just this pack in New Hierarchical Transformer makes DataStage great a XML Tool and if you have XML files without schemas you can follow a tip at the DataStage Real Time blog: The new XMLPack in 8.5….generating xsd’s….
The new XML read and transform stages are much better at reading large and complex XML files and processing them in parallel:

3. TRANSFORMER LOOPING

The best Transformer yet. The DataStage 8.5 parallel transformer is the best version yet thanks to new functions for looping inside a transformer and performing transformations across a grouping of records.
With looping inside a Transformer you can output multiple rows for each input row. In this example a record has a company name and four revenue sales figures for four regions – the loop will go through each column and output a row for each value if it is populated:

Transformer Remembering
DataStage 8.5 Transformer has Remembering and key change detection which is something that ETL experts have been manually coding into DataStage for years using some well known workarounds. A key change in a DataStage job involves a group of records with a shared key where you want to process that group as a type of array inside the overall recordset.
I am going to make a longer post about that later but there are two new cache objects inside a Transformer – SaveInputRecord() and GetSavedInputRecord(0 where you can save a record and retrieve it later on to compare two or more records inside a Transformer.
There are new system variables for looping and key change detection - @ITERATION, LastRow() indicates the last row in a job, LastTwoInGroup(InputColumn) indicates a particular column value will change in the next record.
Here is an aggregation example where rows are looped through and an aggregate row is written out when the key changes:

4. EASY TO INSTALL

Easier to install and more robust. DataStage 8.5 has the best installer of any version of DataStage ever. Mind you – I jumped aboard the DataStage train in version 3.6 so I cannot vouch for earlier installers but 8.5 has the best wizard, the best pre-requisite checking and the best recovery. It also has the IBM Support Assistant packs for Information Server that make debugging and reporting of PMRs to IBM much easier. There is also a Guide to Migrating to InfoSphere Information Serve 8.5 that explains how to migrate from most earlier versions.
See my earlier blog post Why Information Server 8.5 is Easier to Install than Information Server 8.1.
Patch Merge – that’s right, patch merge. The new installer has the ability to merge patches and fixes into the install for easier management of patches and fixes.

5. CHECK IN AND CHECK OUT JOBS

Check in and Check out version control. DataStage 8.5 Manager comes with direct access to the source control functions of CVS and Rational ClearCase in an Eclipse workspace. You can send artefacts to the source control system and replace a DataStage component from out of the source control system.

DataStage 8.5 comes with out of the box menu integration with CVS and Rational ClearCase but for other source control systems you need to use the Eclipse source control plugins.

6. HIGH AVAILABILITY EASIER THAN EVER

High Availability – the version 8.5 installation guide has over thirty pages on Information Server topologies including a bunch of high availability scenarios across all tiers of the product. On top of that there are new chapters for the high availability of the metadata repository, the services layer and the DataStage engine.

Horizontal and vertical scaling and load balancing.
Cluster support for WebSphere Application Server.

Cluster support for XMETA repository: DB2 HADR/Cluster or Oracle RAC.
Improved failover support on the engine.

7. NEW INFORMATION ARCHITECTURE DIAGRAMMING TOOL

InfoSphere Blueprint Direct – DataStage 8.5 comes with a free new product for creating diagrams of an information architecture and linking elements in the diagram directly into DataStage jobs and Metadata Workbench metadata. Solution Architects can draw a diagram of a data integration solution including sources, Warehouses and repositories.

8. VERTICAL PIVOT

There are people out there who have been campaigning for vertical pivot for a long time – you know who you are! It is now available and it can pivot multiple input rows with a common key into output rows with multiple columns. Key based groups, columnar pivot and aggregate functions.
You can also do this type of vertical pivoting in the new Transformer using the column change detection and row cache – but the Vertical pivot stage makes it easier as a specialised stage.

9. Z/OS FILE STAGE

Makes it easier to process complex flat files by providing native support for mainframe files. Use it for VSAM files – KSDS, ESDS, RRDS. Sequential QSAM, BDAM, BSAM. Fixed and variable length records. Single or multiple record type files.

10. BALANCED OPTIMIZER COMES HOME

In DataStage 8.5 the Balanced Optimizer has been merged into the Designer and it has a number of usability improvements that turns DataStage into a better ETLT or ELT option. Balanced Optimizer looks at a normal DataStage job and comes up with a version that pushes some of the steps down onto a source or target database engine. IE it balances the load across the ETL engine and the database engines.
Version 8.5 has improved logging, improved impact analysis support and easier management of optimised versions of jobs in terms of creating, deleting, renaming, moving, compiling and deploying them.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.