
Sunday, July 01, 2012

17 Mistakes that ETL Designers make with Very Large Data


Source: 17 Mistakes that ETL Designers make with Very Large Data - Vincent McBurney


These are the mistakes that ETL designers can make when processing scary high data volumes.
Dan Linstedt is a very large data guru and he has a series of outstanding blog posts on very large databases; the latest is ETL Engines: VLDW & Loading / Transforming.  I first came across Dan on ETL/ELT forums, where he has over 20,000 forum posts, before he popped up on the B-Eye-Network blogs.  Dan has had no comments on his latest post series yet, as B-Eye doesn't have a reader-friendly design and that discourages reader participation.  By comparison I just got three comments over a weekend on three old archived ETL posts; ITToolbox is a friendly place for reader participation.
My favorite part of Dan's latest post is the 17 mistakes that ETL Designers make with very large data.  In fact that's the title I would have gone for!  For some reason blog titles with a number in them attract more hits.  I've shown Dan's list of 17 below with my own comments on how each one impacts DataStage developers.  I would love to hear your own contributions for common ETL design mistakes.

1) Incorporating Inserts, Updates, and Deletes into the _same_ data flow / same process.
Agree 100%.  I believe in having at least three bands in your processing: extract from source to file, process the file into a load-ready dataset, and load the load-ready dataset.  For a standard target table load I would have an insert job, an update job, a delete job (if needed) and a bulk load job (for large volumes).
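To make the split concrete, here is a minimal sketch (hypothetical Python, not DataStage) of partitioning a load-ready dataset into separate insert, update and delete sets so each can be handled by its own job; the key field, file layout and helper names are made up for illustration.

    # Hypothetical sketch: split a load-ready dataset into separate
    # insert / update / delete files so each can be loaded by its own job.
    # Key field, file names and row layout are illustrative only.
    import csv

    def split_load_ready(load_ready_rows, existing_keys, key_field="customer_id"):
        """Partition rows into insert/update sets; deletes are keys that
        exist in the target but no longer arrive from the source."""
        inserts, updates = [], []
        incoming_keys = set()
        for row in load_ready_rows:
            incoming_keys.add(row[key_field])
            if row[key_field] in existing_keys:
                updates.append(row)
            else:
                inserts.append(row)
        deletes = existing_keys - incoming_keys
        return inserts, updates, deletes

    def write_rows(path, rows, fieldnames):
        # each output file feeds a dedicated insert / update / delete job
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)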



2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
I can see this working well for smaller volumes - the combination of DataStage accessing data from multiple systems via a Federation Server plugin is intriguing - but this type of cross-database joining would be nasty on very large volumes.  Pulling the smaller data volumes into target tables or lookup datasets first would be faster.
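One way to picture the alternative, as a hypothetical Python sketch rather than anything DataStage-specific: land the small reference extract first, hold it in memory as a lookup, and join to it while streaming the large extract.  File names and columns are illustrative only.

    # Hypothetical sketch: instead of a cross-database join, land the small
    # reference data first and join to it in memory while streaming the
    # large extract. File names and columns are made up.
    import csv

    def load_reference(path, key_field="product_id"):
        """Load the small-side extract into an in-memory lookup."""
        with open(path, newline="") as f:
            return {row[key_field]: row for row in csv.DictReader(f)}

    def enrich_large_extract(large_path, reference, key_field="product_id"):
        """Stream the big extract row by row and attach reference columns."""
        with open(large_path, newline="") as f:
            for row in csv.DictReader(f):
                ref = reference.get(row[key_field], {})
                row["product_name"] = ref.get("product_name", "")
                yield row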

3) Targeting more than 1 or 2 target tables
Agree 100%.  You could take it to two tables if there is a very close parent/child relationship, like a master table and an attribute table, or a header table and a detail table.  But that's the exception; most large-volume ETL jobs should be preparing data for one target table.

4) moving rows that are too wide through a single process
I don't know how you get around this one.  If you have a lot of columns then you gotta get it in there!

5) loading very large data sets to targets WITH INDEXES ON
DataStage makes this easy to manage: on a standard database stage (insert or bulk load activities) you can use the before-SQL tab to turn indexes off and the after-SQL tab to turn them back on.  The statements on those tabs run just once per job (not per row).  You don't need indexes and keys if you have other ways to check your referential integrity.
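Outside of the DataStage tabs, the same pattern looks roughly like the sketch below using the Python DB-API; the table name, index name and DDL syntax are hypothetical and will vary by database.

    # Hypothetical sketch of the before-SQL / after-SQL pattern:
    # drop the index, bulk load, then rebuild the index once at the end.
    # Table, index names and DDL syntax are illustrative and database-specific.

    def bulk_load_without_indexes(conn, rows):
        cur = conn.cursor()
        # "before-SQL": runs once before any rows are loaded
        cur.execute("DROP INDEX ix_sales_fact_customer")
        # the bulk load itself
        cur.executemany(
            "INSERT INTO sales_fact (sale_id, customer_id, amount) VALUES (?, ?, ?)",
            rows,
        )
        # "after-SQL": runs once after the load completes
        cur.execute("CREATE INDEX ix_sales_fact_customer ON sales_fact (customer_id)")
        conn.commit()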

6) not running a cost-based optimizer in the database

7) not keeping statistics up to date in the database

8) not producing the correct indexes on the sources / lookups that need to be accessed
An example of why you need to be in the good books with your DBAs!
"My DataStage job is running slow, I think there is something wrong with the database table."
"You sure it's not DataStage?"
"Can you check the table, maybe run a trace?"
"What is DataStage?"
"Could you just check the table, I've got the job running right now!"
"I ran a query on the table and it's running fine."
"Could you maybe get off Facebook and do a friggen trace!"

9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
DataStage Parallel Jobs need a lot of RAM for the lookup stage.  The IBM Information Server Blade starts with 4 GB of RAM per blade for two dual-core CPUs on each blade.
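As a rough back-of-envelope way to think about lookup cache sizing (an assumed estimate, not an IBM formula): multiply the reference row count by the average row width and allow some overhead.

    # Hypothetical back-of-envelope estimate of lookup cache memory.
    # The 1.3x overhead factor is an assumption, not an IBM figure.

    def estimate_lookup_cache_bytes(row_count, avg_row_bytes, overhead_factor=1.3):
        return int(row_count * avg_row_bytes * overhead_factor)

    # e.g. a 50 million row reference table at ~200 bytes per row needs
    # roughly 12 GiB of RAM before the job even starts
    print(estimate_lookup_cache_bytes(50_000_000, 200) / 1024**3, "GiB")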

10) running on a 32 bit environment which causes significant OS swapping to occur


11) running on a 32 bit environment which causes significant OS swapping to occur


12) running on a 32 bit environment which causes significant OS swapping to occur
I think the treblification of this one is an OS swapping joke.  I am going to be talking about 64-bit processing with the IBM architects when I meet them at IoD 2007 next month.  The Information Server can run on some 64-bit environments, but it will be interesting to find out what plans IBM have for expanding this.

13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
This is a tricky one - kind of like rows that are too wide - sometimes you need to do it and hope the massively parallel architecture is up to it.  DataStage can make almost every type of stage work in parallel, so it can get away with extra steps; however, this does make the job harder to debug and sometimes you get those random out-of-resource errors...

14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false, multi-passing the data can actually improve performance by orders of magnitude. IF parallelism can be increased.
I'm not sure what Dan means by multi-passing and I'll ask in his comments thread.

15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
More detail about this is in Dan's post.  If you try to trap database rejects in a Server Job you use a reject link from a Transformer prior to the database stage.  Dan points out that each reject row slows down the job by a factor of 4 as the ETL job stops processing to handle that reject row. 
Parallel Jobs are more efficient as they use a reject link out of the database stage and on a parallel architecture can push the handling of those rows into a new process.  I haven't seen any overhead in this design if you don't get rejects, and database rejects should be under 0.01% of your rows or else your design is faulty. 
I've never tried to trap bulk load, MultiLoad, TPump or any other type of native database load errors back in DataStage, and I don't think the stages can accept them anyway; I let the database handle them.  On smaller-volume jobs I often use the database reject link and insert instead of bulk load, for more robust auditing of rows.
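A hypothetical sketch of why bounced rows are so expensive: the happy path is one set-based call, but every reject forces a fall-back into row-at-a-time handling.  The table, columns and error handling below are illustrative only.

    # Hypothetical sketch: bulk insert with a per-row fall-back for rejects.
    # Every reject drops you out of the fast set-based path into slow
    # row-at-a-time processing, which is why reject rates must stay tiny.

    def load_with_reject_capture(conn, rows, reject_file):
        cur = conn.cursor()
        try:
            # fast path: one set-based call for the whole batch
            cur.executemany("INSERT INTO orders (order_id, amount) VALUES (?, ?)", rows)
        except Exception:
            conn.rollback()
            # slow path: retry row by row so the bad rows can be captured
            with open(reject_file, "w") as rejects:
                for row in rows:
                    try:
                        cur.execute("INSERT INTO orders (order_id, amount) VALUES (?, ?)", row)
                    except Exception as err:
                        rejects.write(f"{row}\t{err}\n")
        conn.commit()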

16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
A bit hard to avoid in a row-by-row ETL tool!  The parallel architecture and caching and memory sharing and a bunch of other things make it fast.
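The contrast Dan is describing, sketched in hypothetical Python with made-up table and column names: a cursor-style loop issues one statement per row, while a set-based statement does the same work in a single pass.

    # Hypothetical sketch contrasting row-at-a-time ("cursor") thinking
    # with set-based batch thinking. Table and column names are made up.

    def apply_price_rise_row_at_a_time(conn, product_ids):
        cur = conn.cursor()
        for pid in product_ids:  # one statement and round trip per row: slow
            cur.execute("UPDATE product SET price = price * 1.1 WHERE product_id = ?", (pid,))
        conn.commit()

    def apply_price_rise_set_based(conn):
        cur = conn.cursor()
        # one statement; the database does the work in a single pass
        cur.execute("UPDATE product SET price = price * 1.1 WHERE category = 'WIDGET'")
        conn.commit()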

17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).
Fortunately this is very hard to do in a DataStage job, or people would misuse it!  In DataStage, looping on a per-row basis can be done via lookups to database stored procedures or custom code modules, but most standard DataStage stages simply perform one action per row.

Think outside the box.
A lot of ETL sites have templates for building new jobs, design standards or example jobs.  These are excellent for 90% of your ETL work; however, very large data jobs may need custom designs.  You should look at what has been done in other jobs, but also be ready to take a job into performance testing and try out dozens of other configurations.  As Dan says in his post, "performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines".




njoy the simplicity.......
Atul Singh