Buffering is a technique used in DataStage jobs to keep data flowing steadily between stages while avoiding deadlocks and fork-join problems. It was implemented in DataStage so that data keeps moving through the process while server memory is used efficiently. According to IBM, the ideal scenario is one in which data flows through the stages without being written to disk. As with buffering in any system, upstream operators should wait for downstream operators to consume their input before creating new records. This is the intention in DataStage too.
In DataStage, buffering is inserted automatically on the links connecting the stages of a job. The buffer tries to pass data smoothly between the linked stages and to keep it from being pushed onto disk. For instance, if the downstream operator is consuming data slowly or not at all, the buffer operator slows the flow of data coming from the upstream stage so that the buffer does not fill to the point where data has to be written to disk. In most projects the default buffering policy is all you need to run your jobs optimally. The default policy tries to keep data in memory and falls back to disk only once the buffer space has filled up in some part of the job. You can see where buffering has been inserted by looking at the job score.
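The throttling described above can be pictured with a small toy model. This is only a sketch and not how the DataStage buffer operator is actually implemented: the function name `simulate`, its parameters and the record counts are all made up for illustration, and the model simply refuses data that does not fit instead of spilling to disk.

```python
def simulate(producer_rate, consumer_rate, capacity, ticks):
    """Toy model of a bounded in-memory buffer on one link.

    All names and numbers are hypothetical; this only illustrates
    how a full buffer slows the upstream operator down.
    """
    buffered = 0
    history = []
    for _ in range(ticks):
        # The downstream operator takes what it can from the buffer.
        consumed = min(consumer_rate, buffered)
        buffered -= consumed
        # The upstream operator may only hand over what still fits.
        # This limit is the "resistance" offered by the buffer.
        accepted = min(producer_rate, capacity - buffered)
        buffered += accepted
        history.append((accepted, consumed, buffered))
    return history


# Upstream wants to send 5 records per tick, downstream reads only 2.
for accepted, consumed, buffered in simulate(5, 2, 10, 5):
    print(f"accepted={accepted} consumed={consumed} buffered={buffered}")
```

Once the buffer is full, the number of records accepted per tick drops to the rate at which the downstream operator consumes them, so nothing ever has to leave memory in this simplified picture.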
The buffering policy is controlled through the DataStage Administrator by setting the APT_BUFFERING_POLICY environment variable. You can also change the buffering settings for an individual stage on the Advanced tab of that stage. The default policy is AUTOMATIC_BUFFERING, which inserts buffers on links only where they are needed to avoid deadlocks. The other two options are FORCE_BUFFERING, which buffers every link, and NO_BUFFERING, which inserts no buffers at all and can therefore lead to deadlocks if used carelessly. If you decide to override the default behaviour, the buffers themselves are tuned with further environment variables, which are also set through the DataStage Administrator.
The available environment variables are as follows:
- APT_BUFFER_MAXIMUM_MEMORY. This variable sets the maximum amount of virtual memory, in bytes, used per buffer. The default is 3145728 (3 MB), so each buffer can hold at most 3 MB in memory. If your job requires 3 buffers, you will have 9 MB of buffer space in total. If a buffer fills to its 3 MB limit while the job is running, the remaining data is written to disk.
- APT_BUFFER_DISK_WRITE_INCREMENT. This variable sets the size, in bytes, of the blocks of data moved to and from disk by the buffering operator. The default is 1048576 (1 MB). Continuing the example above, once the 3 MB buffer limit has been hit, data starts being written to disk in blocks of 1 MB each. Changing this value involves a trade-off. Increasing the block size reduces the number of times the buffer operator has to access the disk, but might decrease performance when data is read or written in smaller units. Decreasing the block size increases throughput, but might increase the number of disk accesses needed to write the data.
- APT_BUFFER_FREE_RUN. This is specified as a fraction of the maximum buffer size. It indicates how much of the available in-memory buffer can be consumed before the buffer offers resistance to new data. As long as the fraction of the buffer in use is below this value, data moves at normal speed, but as soon as the threshold is crossed the buffer starts restricting the data flow. The default is 0.5 (50% of the maximum memory buffer size, which in this case is 1.5 MB). Values normally range from 0.0 to 1.0. IBM's documentation also allows values above 1.0, in which case the buffer keeps storing data up to that multiple of the maximum memory buffer size before writing to disk.
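The arithmetic behind the three variables above can be checked with a few lines of Python. This is a rough sketch that uses the default values quoted in the list. The helper names `buffer_plan` and `disk_blocks_needed` are made up for illustration, and the spill calculation assumes the simplified picture given here, where everything beyond the in-memory limit goes to disk in whole blocks.

```python
# Defaults quoted in the list above, all in bytes except the fraction.
MAX_MEMORY = 3145728            # APT_BUFFER_MAXIMUM_MEMORY (3 MB)
DISK_WRITE_INCREMENT = 1048576  # APT_BUFFER_DISK_WRITE_INCREMENT (1 MB)
FREE_RUN = 0.5                  # APT_BUFFER_FREE_RUN (50%)


def buffer_plan(num_buffers, max_memory=MAX_MEMORY, free_run=FREE_RUN):
    """Hypothetical helper: memory figures for a job with num_buffers buffers."""
    return {
        # Every buffer may grow to max_memory, so the totals add up.
        "total_memory_bytes": num_buffers * max_memory,
        # Below this fill level a buffer accepts data without resistance.
        "free_run_threshold_bytes": int(max_memory * free_run),
    }


def disk_blocks_needed(pending_bytes, max_memory=MAX_MEMORY,
                       increment=DISK_WRITE_INCREMENT):
    """Hypothetical helper: blocks written if pending_bytes arrive at one buffer.

    Simplifying assumption: only the part that does not fit in memory
    is spilled, and it is spilled in whole blocks of `increment` bytes.
    """
    overflow = max(0, pending_bytes - max_memory)
    return -(-overflow // increment)  # integer ceiling division


plan = buffer_plan(3)
print(plan["total_memory_bytes"])        # 9437184 bytes, i.e. 9 MB
print(plan["free_run_threshold_bytes"])  # 1572864 bytes, i.e. 1.5 MB
print(disk_blocks_needed(5 * 1048576))   # 2 blocks for 5 MB of pending data
```

With the defaults, a job with 3 buffers can hold 9 MB in memory, each buffer starts resisting at 1.5 MB, and 5 MB of data waiting at a single buffer would push 2 MB to disk as two 1 MB blocks.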
Similar options are also available on the Advanced tab of the stage editor for customizing the buffering on the link of your choice. I hope this gives you a better understanding of the buffering options in DataStage, the meaning of each variable, and its effect on the job.