We have moved to www.dataGenX.net, Keep Learning with us.

Tuesday, October 30, 2012

ETL process and concepts

ETL is an abbreviation of the three words Extract, Transform and Load. It is an ETL process to extract data, mostly from different types of systems, transform it into a structure that’s more appropriate for reporting and analysis and finally load it into the database and or cube(s).
As ETL stands for extraction, transformation and loading. Etl is a process that involves the following tasks:

E – Extract from source
In this step we extract data from different internal and external sources, structured and/or unstructured. Plain queries are sent to the source systems, using native connections, message queuing, ODBC or OLE-DB middleware. The data will be put in a so-called Staging Area (SA), usually with the same structure as the source. In some cases we want only the data that is new or has been changed, the queries will only return the changes. Some ETL tools can do this automatically, providing a changed data capture (CDC) mechanism.

T – Transform the data
Once the data is available in the Staging Area, it is all on one platform and one database. So we can easily  join and union tables, filter and sort the data using specific attributes, pivot to another structure and make business calculations. In this step of the ETL process, we can check on data quality and cleans the data if necessary. After having all the data prepared, we can choose to implement slowly changing dimensions. In that case we want to keep track in our analysis and reports when attributes changes over time, for example a customer moves from one region to another.

L – Load into the data warehouse
Finally, data is loaded into a data warehouse, usually into fact and dimension tables. From there the data can be combined, aggregated and loaded into datamarts or cubes as is deemed necessary.

The ETL process is also very often referred to as Data Integration process and ETL tool as a Data Integration platform.
The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation.
The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.

Two other things which is integrated with ETL..

Data profiling and data quality control
Profiling the data, wil give direct insight in the data quality of the source systems. It can display how many rows have missing or invalid values, or what the distribution is of the values in a specific column. Based on this knowledge, one can specify business rules in order to cleanse the data, or keep really bad data out of the data warehouse. Doing data profiling before designing your ETL process, you are better able to design a system that is robust and has a clear structure.

Meta data management & ETL
Information about all the data that is processed, from sources to targets by transformations, is often put into a metadata repository; a database containing all the metadata. The entire ETL process can be ‘managed’ with metadata management, for example one can query how a specific target attribute is built-up in the ETL process, called data lineage. Or, you want to know what the impact of a change will be, for example the size of the order identifier (id) is changed, and in which ETL steps this attribute plays a role.

ETL Tools
At present the most popular and widely used ETL tools and applications on the market are:

No.       List of ETL Tools                                                ETL Vendors
1.         Oracle Warehouse Builder (OWB)                      Oracle
2.         Data Services                                                       SAP Business Objects
3.         IBM Information Server (Datastage)                   IBM
4.         SAS Data Integration Studio                                SAS Institute
5.         PowerCenter Informatica                                     Informatica
6.         Elixir Repertoire                                                   Elixir
7.         Data Migrator                                                       Information Builders
8.         SQL Server Integration Services (SSIS)              Microsoft
9.         Talend Open Studio for Data Integration             Talend
10.       DataFlow Manager                                               Pitney Bowes Business Insight
11.       Pervasive Data Integrator                                     Pervasive Software
12.       Open Text Integration Center                               Open Text
13.       Oracle Data Integrator (ODI)                               Oracle
14.       Data Manager/Decision Stream                            IBM (Cognos)
15.       Clover ETL                                                           Javlin
16.       Centerprise                                                           Astera
17.       DB2 Infosphere Warehouse Edition                     IBM
18.       Pentaho Data Integration                                      Pentaho
19         Adeptia Integration Suite                                    Adeptia
20.       DMExpress                                                          Syncsort
21.       Expressor Data Integration                                  QlikTech

njoy the simplicity.......
Atul Singh