We have moved to www.dataGenX.net, Keep Learning with us.

Wednesday, March 13, 2013

All about 000 - 421 : DataStage Certification Exam Test Preparation

1. DataStage v8 Configuration (5%)
  1. Describe how to properly configure DataStage V.8.0.
    1. This is kind of vague but focus on how DataStage 8 gets attached to a Metadata Server via the Metadata Console and how security rights are set up. 
    2. Read up on configuring DB2 and Oracle client and ODBC. 
    3. Get to know the dsenv file.  Read the DataStage Installation Guide for post-installation steps.
  2. Identify tasks required to create and configure a project to be used for V.8.0 jobs.
    1. DataStage Parallel Job Advanced Developer Guide has a section on project specific environment variables such as reporting, c++ compile options, tracing and optimisation.  You do not have to memorise them but at least read through each setting. 
    2. Get to know the DataStage Administrator Create Project and Project Settings.
  3. Given a configuration file, identify its components and its overall intended purpose.
    1. You can practice with different configuration files with one, two and four nodes and view Job Monitoring to see how the job is impacting.  Turn on job scoring to see what the job really looks like.  Read up on node pools and memorise what is in the manuals.
2. MetaData (5%)
  1. Demonstrate knowledge of Orchestrate schema.
    1. An orchestrate schema is just a table definition in a file format.  Try saving a couple to file and get to know how they are put together.
    2. Try reading and writing to a dataset using a schema file definition.
  2. Identify the method of importing metadata.
    1. Read my blog post Navigating the many paths of metadata for DataStage 8 - there are actually quite a few different ways to import metadata in version 8.
    2. Try imports through the metadata bridges, DataStage Designer plugin import and the DataStage Designer Connector import.
  3. Given a scenario, demonstrate knowledge of runtime column propagation.
    1. Try an exercise where you drop and regain columns in a job using propagation.  For example generate a set of columns and push them through three copy stages.  Remove two columns from the middle copy stage and add them back again in the third copy stage with column propagation turned on.
3. Persistent Storage (10%)
  1. Given a scenario, explain the process of importing/exporting data to/from framework (e.g., sequential file, external source/target).
    1. Get to know the Sequential File Stage, the Column Import Stage and the External Source Stage.  All three do roughly the same thing but have different options.  They all provide a stream of data to the same import function that turns data into orchestrate parallel data.
  2. Given a scenario, describe proper use of a sequential file.
    1. Read a sequential file.
    2. Read up on Sequential Files stage options such as multiple readers, first row is column names, reject rows.
    3. Compare using a file name to using a file wildcard.
    4. Try reading two files at once by configuring the stage with two file names.
  3. Given a scenario, describe proper usage of CFF (native not plug-in).
    1. It can be hard to practice with a CFF stage if you do not have CFF data!  Read up on it and see if you can find a Cobol Copy Book - try Googling for one, if you can get your hands on one import it as a CFF file definition and try loading the metadata into the CFF stage and pretend you are using it to extract data.  That will show you how the stage works.
    2. Have a look at all the tabs on the CFF stage for data formats, record filters etc.
  4. Describe proper usage of FileSets and DataSets.
    1. These are the hash files of Enterprise Edition.  Try comparing a Lookup Fileset to a Dataset and watch what happens in the temp directory and node scratch directory for each type of job as the job is running. 
    2. See what files are created behind the scenes. 
    3. Compare a very large lookup to a very small lookup to see how filesets and dataset lookups are different.
  5. Describe use of FTP stage for remote data.
    1. You can just read up on this.  Basically DataStage wraps FTP and Sequential File read into a single stage so you read the file as it is transferred.
  6. Identify importing/exporting of XML data.
    1. It is very easy to get your hands on XML data - lots of it around.  Try importing it into DataStage and read the XML Input stage and look for the XML tutorials on my blog post 40 DataStage Learning, Tutorial and Certification Online Resources.  The trick is that the XML Import Stage is looking to receive either a huge chunk of XML data or an XML file name passed into it from a preceding stage.  This means you have to pass it the data from something like a Sequential File Stage or the External Source stage.
    2. Have a look at the XPATH statements created in the Description field of a Table Definition from an XML metadata import.
    3. Read up on how the KEY field of an XML Import determines how an XML file gets flattened to a flat table.
4. Parallel Architecture (10%)
  1. Given a scenario, demonstrate proper use of data partitioning and collecting.
    1. The DataStage Parallel Developers Guide has a very good description of data partitioning that all developers should read.
  2. Given a scenario, demonstrate knowledge of parallel execution.
    1. A good way to learn this is to create a job that generates a sequence of numbers in an ID field and then run that on multiple nodes, use the Monitor tool to watch what happens to Transformers, Lookups and Sequential file stages in this job.
5. Databases (10%)
  1. Given a scenario, demonstrate proper selection of database stages and database specific stage properties.
    1. The questions are likely to be about Oracle, DB2 and SQL Server.  Most likely the first two.  You can read up on them in the Developers Guide and sometimes there is an additional PDF.  You can also add these stages to a job even if you do not have access to the database itself and play with the properties of the stage.
  2. Identify source database options.
    1. Add each type of database stage to a job and play around with the different settings. 
  3. Given a scenario, demonstrate knowledge of target database options.
    1. Compare inserts and updates to loads and bulk loads. 
    2. Compare generated SQL from the different types of upsert options (insert then update, update then insert, replace). 
    3. Use www.dsxchange.com to search for discussions on array size and transaction size for database stages.
  4. Given a scenario, describe how to design v.8.0 ETL job that will extract data from a DBMS, combine with data from another source and load to another DBMS target.
  5. Demonstrate knowledge of working with NLS database sources and targets.
    1. I did not get an NLS question in my version 7.5 exam but it may pay to read the NLS guide at least once so you are familiar with it.  Focus on NLS with databases as this seems to be the only section that mentions NLS.
6. Data Transformation (10%)
  1. Given a scenario, demonstrate knowledge of default type conversions, output mappings, and associated warnings.
    1. Read the DataStage Developers Guide and understand the difference between implicit conversion - where you map a field of one data type to a different data type without any specific conversion code.
    2. Read up on explicit conversion where you use a function like StringtoTimestamp to change the metadata during the mapping. 
    3. Try putting a string field into a numeric, putting an invalid date into a date field, putting a null into a not null to see what job warnings you get.
  2. Given a scenario, demonstrate proper selections of Transformer stage vs. other stages.
    1. Read my blog post Is the DataStage parallel transformer evil? for a discussion on what the Transformer does.
    2. For scenarios when you need the Modify stage try DataStage Tutorial: How to become a Modify Stage Zen Master.
  3. Given a scenario, describe Transformer stage capabilities (including: stage variables, link variables, DataStage macros, constraints, system variables, link ordering, @PART NUM, functions.
    1. A good way to understand partitioning functions is to try and build your own Transformer Parallel Job Counter.
    2. Go into a Transfomer and click in a derivation field and explore every part of the Right Mouse Click menu - try adding job start time, job name, @NULL, @DAY and various other macros to an output link and send it to the Peek stage to read the results.
  4. Demonstrate the use of Transformer stage variables (e.g., to identify key grouping boundaries on incoming data).
    1. Try a Stage Variable scenario where you compare values in the current row to values in the previous row of data by storing values in stage variables.  Have a look at this DSXChange Vertical Pivot thread for ideas.
    2. Try a Stage Variable scenario where you do null handling of input columns in stage variables and then use the results in derivation functions.
  5. Identify process to add functionality not provided by existing DataStage stages. (e.g., wrapper, buildops, user def functions/routines).
    1. Read the DataStage Advanced Parallel Job Developers Guide for a description of wrappers and buildops.
    2. Step through the parallel function tutorial in my post 40 DataStage Learning, Tutorial and Certification Online Resources.  It's good to know how parallel routines work.
  6. Given a scenario, demonstrate proper use of SCD stage
    1. Read up on the SCD stage and have a look at the SCD online tutorial in my post 40 DataStage Learning, Tutorial and Certification Online Resources.  It can be tricky understanding the SCD stage if you have never done star schemas and I wouldn't spend a lot of time trying to learn.  Just learn the dynamics of the stage and save your learning of star schemas for another time.
  7. Demonstrate job design knowledge of using RCP (modify, filter, dynamic transformer).
    1. This is a tough one - dynamic transformer?  I am not even sure what that is. 
    2. Practice column propagation through the modify, filter and transformer stages.
7. Job Components (10%)
  1. Demonstrate knowledge of Join, Lookup and Merge stages.
    1. There is a table in the Parallel Job Developers Guide that compares the three stages.  Try to get to know the differences.
  2. Given a scenario, demonstrate knowledge of SORT stage.
    1. Read my blog post Sorts to the left of me, sorts to the right to understand all the different ways to control sorting in a parallel job.
  3. Given a scenario, demonstrate understanding of Aggregator stage.
    1. Read up on the stage and practice with it.
    2. Try doing a sum and a count at the one time!  Look for the work around where you create a column set to the number 1.
  4. Describe proper usage of change capture/change apply.
    1. Change Capture splits a data stream into inserts, updates, deletes and unchanged rows by adding a tag field by comparing a before set of data to an after set of data.  Change Apply lets you make those changes to a dataset. 
    2. I don't really use Change Apply that much since I'm usually applying changes to a target database table.
  5. Demonstrate knowledge of Real-time components.
    1. This is kind of vague.  I am guessing they are referring to the stages in the Real Time folder such as XML Input and WISD input.  Read up on them and add them to a job to view the properties.
8. Job Design (10%)
  1. Demonstrate knowledge of shared containers.
    1. Create a big job and turn part of it into a shared container. 
    2. Try using that container in two different jobs and examine how job parameters are shared.
    3. Try a shared container that works on a small number of rows and uses column propagation to pass through the other rows.  Use this in multiple jobs.
  2. Given a scenario, describe how to minimize SORTS and repartitions.
    1. Read my blog post Sorts to the left of me, sorts to the right to understand all the different ways to control sorting in a parallel job.
    2. Add several stages to a job that need sorted data - Remove Duplcates, Join and Sequential File.  Compare what happens to the sort symbols on links when you configure these stages and choose different key fields.
  3. Demonstrate knowledge of creating restart points and methodologies.
    1. This one moves into methodology more than tool technical details.  Do a search on www.dsxchange.com for discussions on vertical banding and rollback.
    2. Remember that in most job abort situations you cannot restart a job from where it left off - it is hard to calculate what row number a parallel partitioning job was up to and its usually safest to rollback all changes and start again or overwrite the target file/dataset.
  4. Given a scenario, demonstrate proper use of standards.
    1. Try reading the DataStage standards chapter from IBM's own Excellent DataStage Documentation and Examples in New 660 Page IBM RedBook.
  5. Explain the process necessary to run multiple copies of the source (job multi-instance).
    1. Not many people use multiple instance jobs with parallel jobs since they are able to partition data on the fly.  Read up on Multiple Instance Jobs in the DataStage Designer Guide.
  6. Demonstrate knowledge of real-time vs. batch job design.
    1. Read through the Information Server Services Director just once to get the gist of turning a DataStage job into a Web Service.  The main difference is that a real time job will tend to input and output data as XML so it can share data with other SOA services.
9. Monitoring and Troubleshooting (10%)
  1. Given a scenario, demonstrate knowledge of parallel job score.
    1. See the Advanced Parallel Developers Guide for a description on how to turn on job scoring.
    2. Turn it on and leave it on as you practice for the exam and keep looking at job scores until you understand how it works.
  2. Given a scenario, identify and define environment variables that control DataStage v.8.0 with regard to added functionality and reporting.
    1. DataStage Parallel Job Advanced Developer Guide has a section on project specific environment variables such as reporting, c++ compile options, tracing and optimisation.
  3. Given a process list (scenario), identify conductor, section leader, and player process.
    1. Practice a job on four nodes and watch the scoring to see what happens to that job when it runs.
  4. Given a scenario, identify areas that may improve performance (e.g., buffer size, repartitioning, config files, operator combination, etc.).
    1. Try re-designing a job by adding and removing sorts, remove duplicates, copies, transformers and database stages and keep an eye on the partitioning and repartitioning symbols that show up on the job links to see how they are impacted.
    2. Try a job with a lookup stage and a transformer together and put a reject stage on the transformer.  See what happens when the lookup stage doesn't find a lookup row.  Switch combine operators on and off to compare the difference.
  5. Demonstrate knowledge of runtime metadata analysis and performance monitoring.
    1. Try out the various DataStage performance reports and use the Monitor tool to see how a job is progressing.
10. Job Management and Deployment (10%)
  1. Demonstrate knowledge of advanced find.
    1. Try finding column names, table names and wild card searching of parts of job names.
  2. Given a scenario, demonstrate knowledge and the purpose of impact analysis.
    1. Try some impact analysis search and reporting from inside the DataStage Designer.
  3. Demonstrate knowledge and purpose of job compare.
    1. Try comparing jobs in different projects that have minor changes.
  4. Given a scenario, articulate the change control process.
    1. Since IBM do not have a change control process in DataStage 8 this is kind of a stupid criteria.  IBM decommissioned the change control tool from version 7.5, the new change control tool isn't ready yet and an Excel based change control tool developed internally is not being released to the public.  The only change control in version 8 is the export and import of components manually between projects.
    2. Read up on how to lock a project to be read only.  Lock down a project and see how you can still set job parameter defaults.
11. Job Control and Runtime Management (10%)
  1. Demonstrate knowledge of message handlers.
    1. Find a warning message and an error message in a DataStage job.  Right mouse click on the message and add it to a job message handler and downgrade it to a lower level.  Re-run the job to see what happens.
    2. Read up on the difference between job message handlers and project message handlers.
  2. Identify the use of dsjob command line utility.
    1. Read the DataStage Server Job Developers Guide for a description of the dsjob command. 
    2. Do a search for dsjob on www.dsxchange.com to see all the sample code on how DSJob works.
    3. Use the DataStage Director to create a new DataStage Batch job - add a job to the batch code using the combo box provided to generate your own dsjob code.
  3. Given a scenario, demonstrate ability to use job sequencers (e.g., exception hunting, re-startable, dependencies, passing return value from routing, parameter passing and job status).
    1. Try a scenario where you have several jobs in a row and one of them fails - see what happens when you restart the sequence job.  See what happens when you reset the sequence job.
    2. Compare an aborted sequence job to an aborted parallel job.

Ref :  Vincent McBurney