Nuts & Bolts of DataStage: Managing and Deleting Persistent Data Sets within IBM InfoSphere DataStage

Persistent data sets can take up a large amount of disk space. This technote describes how to obtain information about data sets and how to delete them.

Data sets can be managed using the Data Set Management tool, invoked from the Tools > Data Set Management menu option within DataStage Designer (DataStage Manager in the 7.5 releases). Alternatively, the orchadmin command-line program can be used to perform the same tasks.

The files which store the actual data persist in the locations identified as resource disks in the configuration files. These files are named according to the pattern below:

descriptor.user.host.ssss.pppp.nnnn.pid.time.index.random

descriptor: Name of the data set descriptor file.
user: Your user name.
host: Hostname from which you invoked the job which created the data set.
ssss: 4-digit segment identifier (0000-9999)
pppp: 4-digit partition identifier (0000-9999)
nnnn: 4-digit file identifier (0000-9999) within the partition
pid: Process ID of the job on the host from which you invoked the job that creates the data set.
time: 8-digit hexadecimal time stamp in seconds.
index: 4-digit number incremented for each file.
random: 8 hexadecimal digits containing a random number to ensure unique file names.
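As an illustration of the naming pattern, the sketch below splits a segment file name into the fields listed above. The function name parse_segment_name is hypothetical. Because a descriptor name such as mydata.ds can itself contain dots, the nine fixed fields are counted from the right-hand end of the name. The sketch assumes that the user and host fields contain no dots, which may not hold on systems that use fully qualified host names.

```shell
#!/bin/sh
# Sketch: split a segment file name into the fields of the naming pattern.
# Assumption: the user and host fields contain no dots.
parse_segment_name() {
    basename "$1" | awk -F. '
        NF < 10 { exit 1 }              # too few fields to be a segment file
        {
            # Everything left of the last nine fields is the descriptor name.
            d = $1
            for (i = 2; i <= NF - 9; i++) d = d "." $i
            print "descriptor=" d
            print "user="       $(NF-8)
            print "host="       $(NF-7)
            print "segment="    $(NF-6)
            print "partition="  $(NF-5)
            print "file="       $(NF-4)
            print "pid="        $(NF-3)
            print "time="       $(NF-2)
            print "index="      $(NF-1)
            print "random="     $NF
        }'
}

# Example with a segment file name that follows the pattern.
parse_segment_name /orch/s0/mydata.ds.user1.host1.0000.0001.0000.1fa98.b61345a4.0001.8b3cb144
```

The function prints one field=value line per field and returns a non-zero status, with no output, for a name that has fewer than ten dot-separated parts.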

For example, suppose that your configuration file contains the following node definitions:

{
node node0
{
fastname "host1"
pools ""
resource disk "/orch/s0" {pools ""}
resource scratchdisk "/scratch" {pools ""}
}
node node1
{
fastname "host1"
pools ""
resource disk "/orch/s0" {pools ""}
resource scratchdisk "/scratch" {pools ""}
}
}

A data set named mydata.ds created by a job using this configuration file will contain data in two partitions, one for each processing node declared in the configuration file. Because each processing node has only a single disk specification, each partition is stored in a single file on that node's resource disk. Following the naming convention shown above, the data file for partition 0 would be located on the host1 machine, in the /orch/s0 filesystem, and the file would be named:

/orch/s0/mydata.ds.user1.host1.0000.0000.0000.1fa98.b61345a4.0000.88dc5aef

The data file for partition 1 would be similarly named:

/orch/s0/mydata.ds.user1.host1.0000.0001.0000.1fa98.b61345a4.0001.8b3cb144

It is important to understand that the file referenced in the job, called mydata.ds in our example, does not contain any actual data. It is a data set descriptor file, and it contains information about how the data set is constructed. In order for DataStage jobs to access the data, both the descriptor and the actual segment files must exist.
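Since the segment files live in the resource disk directories, it can help to list those directories straight from the configuration file. The following is a minimal sketch, not an IBM-supplied utility: the function name list_resource_disks is hypothetical, and the sed pattern assumes each resource disk entry sits on its own line in the form shown in the sample configuration above.

```shell
#!/bin/sh
# Sketch: print each distinct "resource disk" directory in a configuration file.
# "resource scratchdisk" lines are not matched, because the pattern requires
# the word "disk" directly after "resource" and whitespace.
list_resource_disks() {
    sed -n 's/^[[:space:]]*resource[[:space:]][[:space:]]*disk[[:space:]][[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Demonstration against a temporary copy of one node from the sample file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
{
node node0
{
fastname "host1"
pools ""
resource disk "/orch/s0" {pools ""}
resource scratchdisk "/scratch" {pools ""}
}
}
EOF
list_resource_disks "$cfg"
rm -f "$cfg"
```

In practice the argument would be the path held in APT_CONFIG_FILE, for example list_resource_disks "$APT_CONFIG_FILE".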

Cleaning up Data Sets

A good plan for managing data sets is to identify the data sets that are no longer required and to use the Data Set Management tool to delete them. If you have the jobs that reference the data sets, you can open each of the data set descriptor files using the Data Set Management tool and then view and delete the data set. If you do not have the jobs, another possible method is to look in the resource disk locations for segment files with very old modification dates. Once you have identified the segment files, you can determine the name of the data set descriptor file from the segment file name.
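The search for segment files with old modification dates can be sketched with the find command. The function name find_old_segments and the 180-day default are illustrative assumptions, not part of the technote. The -name pattern is only a heuristic: it keeps files whose names have at least ten dot-separated parts, which is the minimum the naming pattern allows.

```shell
#!/bin/sh
# Sketch: list files under a resource disk directory that look like segment
# files and have not been modified for more than N days (default 180).
find_old_segments() {
    dir=$1
    days=${2:-180}
    # Ten "*" separated by nine "." matches names with ten or more fields.
    find "$dir" -type f -name '*.*.*.*.*.*.*.*.*.*' -mtime +"$days" -print
}
```

For example, find_old_segments /orch/s0 365 would list candidate files in /orch/s0 that are more than a year old. The function only lists files; it deletes nothing.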

/orch/s0/mydata.ds.user1.host1.0000.0000.0000.1fa98.b61345a4.0000.88dc5aef

In the example segment file name shown above, the leading "mydata.ds" (everything before the user name) is the file name of the data set descriptor. You can then locate this file on your system with the find command:

find /my_projects/datasets/ -name "mydata.ds" -print

Once you have located the descriptor file, you can then use the Data Set Management tool to view and delete the data set. If someone has already deleted the descriptor file, then the segments have been orphaned. There is no utility or function to recreate the descriptor file. In this situation, you can safely delete all the segment files that have "mydata.ds" as the descriptor portion of their names.
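The orphan check described above can be sketched as a small script that derives the descriptor name from each segment file and reports the segments for which no descriptor of that name is found. The function names descriptor_of and list_orphan_candidates are hypothetical, and the sketch assumes that all descriptor files live under a single search root and that user and host names contain no dots. A descriptor stored outside that root would be missed, so treat the output as a list of candidates to review, not as files that are confirmed safe to delete.

```shell
#!/bin/sh
# Sketch: report segment files whose descriptor cannot be found.

# Print the descriptor portion of a segment file name (empty if the name
# has fewer than ten dot-separated fields).
descriptor_of() {
    basename "$1" | awk -F. 'NF >= 10 {
        d = $1
        for (i = 2; i <= NF - 9; i++) d = d "." $i
        print d
    }'
}

# $1 = resource disk directory, $2 = directory tree holding descriptor files.
list_orphan_candidates() {
    disk=$1
    root=$2
    find "$disk" -type f -name '*.*.*.*.*.*.*.*.*.*' -print |
    while IFS= read -r seg; do
        name=$(descriptor_of "$seg")
        [ -n "$name" ] || continue
        # No descriptor of that name under the search root: report it.
        if [ -z "$(find "$root" -type f -name "$name" -print 2>/dev/null)" ]; then
            printf '%s\n' "$seg"
        fi
    done
}
```

With the paths used in this article, the call would be list_orphan_candidates /orch/s0 /my_projects/datasets. The script only prints file names, so any deletion remains a separate, deliberate step.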

Cleaning up Data Sets from the command line

It is also possible to use the orchadmin executable program to delete data sets. This program is located in $APT_ORCHHOME/bin.

To delete a data set using orchadmin, the environment must be set up properly and the descriptor file must exist. Follow these steps to delete a data set:

$ cd $DSHOME
$ . ./dsenv
$ APT_ORCHHOME=$DSHOME/../PXEngine; export APT_ORCHHOME
$ LD_LIBRARY_PATH=$APT_ORCHHOME/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
$ APT_CONFIG_FILE=<config file path>; export APT_CONFIG_FILE
$ PATH=$APT_ORCHHOME/bin:$PATH; export PATH
$ $DSHOME/../PXEngine/bin/orchadmin delete <full path to descriptor file>

Note: Adjust the steps for your platform; for example, use LIBPATH instead of LD_LIBRARY_PATH on AIX.
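The platform adjustment in the note can be sketched as a helper that picks the library path variable from the output of uname -s. The function names lib_path_var and prepend_lib_dir are hypothetical. Only the AIX case comes from the note above; every other platform falls through to LD_LIBRARY_PATH, which is an assumption that may not suit your system.

```shell
#!/bin/sh
# Sketch: choose the shared-library search variable for the current platform.
# An explicit platform name can be passed as $1; otherwise uname -s is used.
lib_path_var() {
    case "${1:-$(uname -s)}" in
        AIX) echo LIBPATH ;;
        *)   echo LD_LIBRARY_PATH ;;   # assumed default for other platforms
    esac
}

# Prepend a directory to that variable and export it.
prepend_lib_dir() {
    var=$(lib_path_var)
    eval "current=\${$var:-}"
    if [ -n "$current" ]; then
        eval "$var=\$1:\$current"
    else
        eval "$var=\$1"
    fi
    export "$var"
}
```

After sourcing dsenv and setting APT_ORCHHOME, the library path step would then become prepend_lib_dir "$APT_ORCHHOME/lib" on any platform.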

Friday, September 06, 2013
