ETL tools move data from one place to another by performing three functions:
- Extract data from sources such as ERP or CRM applications: In the extract step, data is collected from several source systems and in multiple file formats, such as delimited flat files (for example, CSV) and XML files. Data may also need to be pulled from legacy systems, which store it in formats that few people understand and that are no longer used elsewhere.
- Transform that data into a format that matches the other data in the warehouse: The transformation process includes many data manipulation steps, such as moving, splitting, translating, merging, sorting, and pivoting.
- Load the data into the data warehouse for analysis: Loading can be performed through batch files or row by row, in real time.
Each of these steps sounds simple, but together they can take days to complete.
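The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a small CSV export and an in-memory SQLite database standing in for the warehouse; the column names, the `customer_totals` table, and the function names are hypothetical and not taken from any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from a source system (for example, a CRM).
RAW_CSV = """customer_id,full_name,amount
1,Ada Lovelace,10.50
2,Alan Turing,20.00
1,Ada Lovelace,4.50
"""

def extract(text):
    """Extract: parse delimited text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types, split the name, and merge rows per customer."""
    totals = {}
    for row in rows:
        cid = int(row["customer_id"])
        first, last = row["full_name"].split(" ", 1)
        entry = totals.setdefault(cid, {"first": first, "last": last, "total": 0.0})
        entry["total"] += float(row["amount"])
    # Sort by customer id so the load order is deterministic.
    return [(cid, e["first"], e["last"], e["total"]) for cid, e in sorted(totals.items())]

def load(conn, records):
    """Load: batch-insert the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals ("
        "customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?, ?, ?)", records)
    conn.commit()

def run_etl(conn, text=RAW_CSV):
    """Run extract, transform, and load, then return the loaded rows."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT * FROM customer_totals ORDER BY customer_id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads one merged row per customer. The load here is a single batch; a row-by-row, real-time load would instead insert each record as it arrives.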
“Power of Hadoop with ETL”
Hadoop brings at least two major advantages to traditional ETL:
- Ingesting huge amounts of data without having to specify a schema on write.
A key property of Hadoop is “no schema on write”. This means you don’t have to define the data schema before loading data into HDFS. This holds true for structured data (such as point-of-sale transactions, call detail records, ledger transactions, and call centre transactions), for unstructured data (such as user comments, doctors’ notes, insurance claim descriptions, and web logs), and for social media data (from sites like Facebook, LinkedIn and Twitter). Whether your input data has explicit or implicit structure, you can quickly load it into HDFS, where it is ready for downstream analytic processing.
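The idea of deferring the schema until read time can be illustrated without a cluster. The sketch below is an assumption-laden stand-in: a temporary local directory plays the role of HDFS, and the file names, the `store|sku|amount` layout, and the reader functions are all hypothetical. The point is that `ingest` writes bytes without checking any structure, and each reader applies its own schema only when the data is consumed.

```python
import json
import tempfile
from pathlib import Path

def ingest(landing_dir, name, raw_text):
    """Write raw text as-is; no schema is checked or enforced on write."""
    path = Path(landing_dir) / name
    path.write_text(raw_text, encoding="utf-8")
    return path

def read_pos(path):
    """Schema on read for point-of-sale lines laid out as 'store|sku|amount'."""
    for line in path.read_text(encoding="utf-8").splitlines():
        store, sku, amount = line.split("|")
        yield {"store": store, "sku": sku, "amount": float(amount)}

def read_weblog(path):
    """Schema on read for JSON-lines web logs; keep only the fields we need."""
    for line in path.read_text(encoding="utf-8").splitlines():
        event = json.loads(line)
        yield {"url": event.get("url"), "status": event.get("status")}

def demo():
    """Ingest two differently shaped files, then interpret each on read."""
    with tempfile.TemporaryDirectory() as landing:
        pos = ingest(landing, "pos.txt", "S1|A100|9.99\nS2|B200|5.00\n")
        log = ingest(landing, "web.jsonl", '{"url": "/home", "status": 200, "ua": "x"}\n')
        # Materialize the results before the temporary directory is removed.
        return list(read_pos(pos)), list(read_weblog(log))
```

Because the schema lives in the readers, a different analysis could later reread the same raw files with a different interpretation, for example by also keeping the `ua` field from the web log.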
- Offloading the transformation of input data through parallel processing at scale.
Once the data is loaded into Hadoop, you can perform traditional ETL tasks such as cleansing, aligning, normalizing, and combining data, using the massive scalability of MapReduce. Hadoop also lets you avoid the transformation bottleneck of traditional ETLT by offloading the ingestion, transformation, and integration of unstructured data into the data warehouse. Because Hadoop lets you work with more data types than before, it enriches your data warehouse in ways that would otherwise not be feasible. Thanks to its scalable performance, you can appreciably speed up ETLT jobs. In addition, because data saved in Hadoop can be retained for much longer, you can provide more granular detail through the EDW (enterprise data warehouse) for high-fidelity analysis.
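The shape of a MapReduce-style cleansing job can be sketched on a single machine. This is not Hadoop's API: it is a plain Python approximation in which a thread pool stands in for the cluster's map tasks, and the `region,amount` record layout and all function names are hypothetical. Threads in CPython do not give real CPU parallelism, so the sketch shows only the structure of the job (map over input splits, shuffle by key, reduce per key), not its performance.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_split(lines):
    """Map: cleanse and normalize each record, emitting (key, value) pairs."""
    pairs = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[1]:
            continue  # cleansing: drop malformed or incomplete records
        region, amount = parts
        try:
            # normalizing: upper-case the key, cast the amount to a number
            pairs.append((region.upper(), float(amount)))
        except ValueError:
            continue  # cleansing: drop records with a non-numeric amount
    return pairs

def shuffle(mapped):
    """Shuffle: group all emitted values by key across the map outputs."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_group(values):
    """Reduce: combine the values for one key into a single total."""
    return sum(values)

def run_job(splits, workers=2):
    """Run map over each input split concurrently, then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_split, splits))
    return {key: reduce_group(vals) for key, vals in sorted(shuffle(mapped).items())}
```

Each inner list passed to `run_job` plays the role of one input split. On a real cluster the splits would be HDFS blocks processed on different nodes, which is where the scalability described above comes from.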