What is ETL?
ETL stands for extract, transform and load.
Data plays a critical role in nearly every business operation. To be most usable and valuable, it must be collected from miscellaneous sources (the source systems), organized together, and centralized in another area for analysis, or in an operational system that supports the organization's business processes. This complete process of extracting the data, transforming the data and loading the data into a warehouse is called extract, transform and load.
Loading data into the warehouse is not a one-time process, because the source data in the operational systems keeps changing. It is a regular process: data is fetched from the source systems on a schedule and loaded into the warehouse after the transformations are complete.
ETL has three phases – first the extraction phase, second the transformation phase, and third the loading of the extracted and transformed data into the warehouse. Let’s take a look at each of these three stages in detail.
There are a few data extraction strategies:
- Full Extract – all the data from the operational source systems is extracted into a staging area. A full extract requires keeping a copy of the last extract in the same format in order to be able to identify changes.
- Partial Extract – sometimes the source systems notify us which data has been updated, which has been deleted and which data is new; this is also called the delta. With this strategy we only extract the data that has been modified, which is easy and quick compared to a full extract.
- Incremental Extract – with this strategy we also do not extract the full data set. It is used when the source systems cannot send a notification that an update has occurred, but can identify which records were modified and provide an extract of only those records. One thing to keep in mind while extracting data from a source system is that the extraction must be designed so that it does not negatively affect the system's performance, because the source system is generally where the actual business transactions happen; we should therefore schedule our extraction to run at times when it will not degrade that system.
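The strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: it assumes a hypothetical source table named `orders` with an `updated_at` column, uses an in-memory SQLite database to stand in for the operational system, and detects changes for a full extract by diffing the current snapshot against the previous one.

```python
import sqlite3

def full_extract(conn):
    """Full extract: pull every row from the source table into staging."""
    return conn.execute(
        "SELECT id, name, amount, updated_at FROM orders"
    ).fetchall()

def incremental_extract(conn, last_run):
    """Incremental extract: only rows modified since the last run,
    relying on the source system's updated_at column."""
    return conn.execute(
        "SELECT id, name, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_run,),
    ).fetchall()

def diff_snapshots(previous, current):
    """Compare the last full extract with the new one to identify changes,
    needed when the source system cannot flag deltas itself."""
    prev = {row[0]: row for row in previous}  # keyed by id
    curr = {row[0]: row for row in current}
    inserted = [curr[k] for k in curr.keys() - prev.keys()]
    deleted = [prev[k] for k in prev.keys() - curr.keys()]
    updated = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return inserted, updated, deleted
```

The snapshot diff is what makes the full-extract strategy workable despite pulling everything each run, while `incremental_extract` pushes the change detection down to the source system's own modification timestamps.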
The data extracted into the staging server from the source systems is in a raw format, and we can’t use it as it is. It needs to be cleansed, mapped and transformed to the requirements before it is finally ready for loading into the Data Warehouse. So let’s look at the regular transformation tasks:
- Selection – we select the data which is required to be loaded into the Data Warehouse.
- Matching – we look up data from various local files and then match the data that needs to be transformed.
- Cleansing – the data in our source systems is not clean and standardized; because we fetch data from more than one source system, it has to be standardized or normalized.
- Consolidation or Aggregation – we consolidate and aggregate the data from the source systems, because we do not want to load the same data into the warehouse multiple times; hence we consolidate, summarize and aggregate it.
- Splitting or merging fields – we split a single field into multiple fields, and we combine some fields as well, all based on the requirements.
- Standardizing Data – as mentioned earlier, the data comes from various sources, so it needs to be standardized before loading into the warehouse.
- Handling Character Encodings and Conversions – we also need to convert the data into a defined encoding in the Data Warehouse, because the source systems may not all use the same encoding.
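Several of the tasks above can be combined into a small transformation pass. The sketch below is illustrative only; the field names (`customer_name`, `amount`, `status`) and the rule that cancelled rows are excluded are assumptions, not rules from any particular system. It shows selection, cleansing and standardizing, splitting a field, a simple encoding safeguard, and aggregation.

```python
def transform(raw_rows):
    """Apply common transformation steps to raw staged rows (dicts):
    selection, cleansing, standardizing, field splitting, encoding cleanup."""
    out = []
    for row in raw_rows:
        # Selection: skip rows we do not want in the warehouse
        if row.get("status") == "cancelled":
            continue
        # Cleansing / standardizing: trim whitespace, normalize case
        name = row["customer_name"].strip().title()
        # Splitting: one customer_name field into first and last name
        first, _, last = name.partition(" ")
        # Encoding safeguard: replace any bytes invalid in UTF-8
        first = first.encode("utf-8", errors="replace").decode("utf-8")
        out.append({
            "first_name": first,
            "last_name": last,
            "amount": float(row["amount"]),  # standardize numeric type
        })
    return out

def aggregate(rows):
    """Consolidation / aggregation: summarize amounts per customer."""
    totals = {}
    for r in rows:
        key = (r["first_name"], r["last_name"])
        totals[key] = totals.get(key, 0.0) + r["amount"]
    return totals
```

In a real pipeline each of these steps would usually be driven by mapping rules rather than hard-coded, but the order shown (select, cleanse, split, convert, aggregate) mirrors the task list above.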