Source : 10 Ways to Make DataStage Run Slower - Vincent McBurney
Everyone wants to tell you how to make your ETL jobs run faster; well, here is how to make them slower!
The Structured Data blog has posted a list, Top Ways How Not To Scale Your Data Warehouse, that is a great rundown of bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do you put the brakes on a DataStage job that is supposed to be running on a massively scalable parallel architecture?
1. Use the same configuration file for all your jobs.
You may have two nodes configured for each CPU on your DataStage server, and while that lets your high volume jobs run quickly, it works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid disc wheel on a velodrome racing bike: it takes a lot of time to crank up to full speed, but once you are there it is lightning fast. If you are processing a handful of rows, the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes, and a squadron of these jobs will slow down your entire DataStage batch run!
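Of course, the spoiler for this one is that you can keep more than one configuration file and point low-volume jobs at a single-node file through the $APT_CONFIG_FILE job parameter. A rough sketch of a one-node configuration file, with a made-up host name and made-up paths, looks something like this:

    {
      node "node1"
      {
        fastname "etlserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }

High volume jobs keep the multi-node file; the small fry get this one and skip the partition and repartition overhead.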
2. Use a sparse database lookup on high volumes.
This is a great way to slow down any ETL tool, and it works on server jobs and parallel jobs alike. The main difference is that server jobs only do sparse database lookups; the only way to avoid a sparse lookup there is to dump the table into a hash file. Parallel jobs by default do cached lookups, where the entire database table is moved into a lookup fileset, either in memory or, if it is too large, in scratch space on disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, so that for every row processed it sends a lookup SQL statement to the database. If you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!
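To picture the damage, this is roughly the statement a sparse lookup fires once per input row - the table and column names here are invented for the example:

    SELECT cust_name, cust_status
    FROM   customer_dim
    WHERE  cust_id = ?    -- bound to the current row's key, once for every row

Ten million input rows means ten million executions and ten million round trips to the database, versus one bulk read of customer_dim for a normal cached lookup.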
3. Keep resorting your data.
Sorting is the Achilles heel of just about any ETL tool. The average ETL job is like a busy restaurant: it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people, it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of height before anyone gets their food: you need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by giving a job data that is already sorted by the SQL select statement and sorting it again anyway!
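The classic redundant version looks something like this, with invented table and column names - the source SQL already orders the rows and a Sort stage downstream sorts on the same key anyway:

    -- user-defined SQL in the source stage already delivers ordered rows
    SELECT order_id, order_date, amount
    FROM   orders
    ORDER BY order_id
    -- ...followed in the job by a Sort stage keyed on order_id, which still
    -- buffers every row just to rebuild an order it already had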
4. Design single threaded bottlenecks
This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality; a scripted FTP of a file that keeps an entire DataStage parallel engine waiting; a bulk database load via a single update stream; reading a large sequential file from a parallel job without using multiple readers per node.
5. Turn on debugging and forget that it's on
In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!
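The setting I have in mind is the execution mode environment variable. From memory - treat the exact value as an assumption and check the IBM documentation for your release - forcing sequential execution looks like this in the job or project environment:

    # assumed value - verify against the documentation for your version
    APT_EXECUTION_MODE=ONE_PROCESS    # run the whole parallel job as a single process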
6. Let the disks look after themselves
Never look at what is happening with your disk I/O - that's a Pandora's Box of better performance! You can get some beautiful drag and slowdown by ignoring your disk I/O, as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out large sequential files. Disk striping, partitioning, choosing the right disk type and changing the location of your scratch space are all things that stand between you and slower job run times.
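If you ever did want to open that box, plain Unix tools are enough to spot a struggling scratch or dataset disk while a job runs - the device and path names below will differ on your system:

    # watch per-device utilisation and wait times every 5 seconds
    iostat -x 5
    # check how full the scratch file system is getting
    df -h /data/scratch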
7. Keep Writing that Data to Disk
Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed dataset that can be picked up and used by multiple jobs, and it can give you a modular job design. It can also slow down Parallel Jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into dumb ASCII flat files, and then the import and repartitioning needed to pick them up and process them again. Sequential files are the Forrest Gump of file storage: simple and practical but dumb as all hell. It costs time to write to them and time to read and parse them, so designing an end to end process that writes data to sequential files repeatedly will give you massive slowdowns.
8. Validate every field
A lot of data comes from databases. Often DataStage pulls straight out of these databases, or the data is saved to an ASCII file before DataStage processes it. One way to slow down your job, and your ETL development and testing along with it, is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating that a field is a VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). DataStage has implicit validation and conversion of all imported data that checks it really is the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion - such as numbers in VARCHAR fields, dates in string fields and packed fields - but most don't. Adding a layer of validation you don't need should slow those jobs down.
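A concrete example of the redundant kind, using an invented link and column name and, from memory, the parallel Transformer's IsValid function: a derivation like this on a column the source database already declares as a date does nothing but burn CPU on every row:

    If IsValid("date", lnk_src.ORDER_DATE) Then lnk_src.ORDER_DATE Else SetNull()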
9. Write extra steps in database code
The same phrase gets uttered on many an ETL project: "I can write that in SQL", or "I can write that in Java", or "I can do that in an awk script". Yes, we know, we know that just about any programming language can do just about anything - but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be long gone when someone comes to clean it up. This is a sure-fire way to end up with a step in the end to end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".
10. Don't do Performance Testing
Do not take your highest volume jobs into performance testing; just keep the default settings, default partitioning and your first draft design, throw that into production and get the hell out of there.