The new wave of big data is undoubtedly creating new opportunities, but it is also creating new challenges for businesses across all industries. Data integration is one of the most important challenges that many IT engineers currently face. The major problem is incorporating data from social media and other unstructured sources into a traditional BI environment.
Here we look at a robust solution for overcoming these data-related challenges.
We are talking about “Hadoop”, a cost-effective and scalable platform for big data analysis. For large, fast-growing data sets, using Hadoop instead of traditional ETL (extraction, transformation and loading) processes can deliver results in less time. Running a Hadoop cluster efficiently means selecting an optimal combination of servers, storage systems, networking devices and software.
Generally, a typical ETL process extracts data from multiple sources, then cleanses, formats, and loads it into a data warehouse for analysis. When the source data sets are large, fast growing, and unstructured, traditional ETL can become the bottleneck, because it is complex, expensive and time consuming to develop, operate and execute.
Fig #1: The traditional ETL process
Fig #2: ETL offload to Hadoop
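To make the traditional ETL steps described above concrete, here is a minimal sketch of an extract, cleanse/format, and load pipeline. It is only an illustration: the two in-memory CSV sources, the `sales` table and all function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: two small CSV "sources" with inconsistent formatting.
SOURCE_A = "id,name,amount\n1, Alice ,10.50\n2,Bob,\n"
SOURCE_B = "id,name,amount\n3,carol,7\n"


def extract(sources):
    """Extract: read raw rows from every source."""
    for text in sources:
        yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Cleanse and format: drop incomplete rows, normalize names and types."""
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # cleansing: skip records with no amount
        yield int(row["id"]), row["name"].strip().title(), float(amount)


def load(records, conn):
    """Load: write the cleaned records into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


# SQLite in memory is an assumption standing in for a real data warehouse.
conn = sqlite3.connect(":memory:")
load(transform(extract([SOURCE_A, SOURCE_B])), conn)
rows = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(rows)
```

Every record passes through a single process here, which hints at why this style of pipeline can become a bottleneck as the source data grows.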
Apache Hadoop for Big Data
Hadoop is an open source, Java-based framework that supports the storage and processing of large data sets in a distributed computing environment. It runs on a cluster of commodity machines. Hadoop allows you to store petabytes of data reliably on a large number of servers, and to scale performance cost-effectively simply by adding inexpensive nodes to the cluster. This scalability comes from its distributed processing framework, known as “MapReduce”.
MapReduce is a method for processing large amounts of data in parallel, in which the developer only has to write two pieces of code: the “Mapper” and the “Reducer”. In the map phase, MapReduce takes the input data and feeds each data element to a mapper. In the reduce phase, the reducer combines the partial, intermediate outputs from all the mappers and produces a final result. MapReduce is an important programming model because it allows engineers to use parallel programming constructs without having to know the complex details of intra-cluster communication, task monitoring, and failure handling.
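The two pieces of developer code can be illustrated with the classic word count example. This is a plain Python sketch of the programming model, not Hadoop's Java API: `mapper`, `reducer` and `run_job` are hypothetical names, and `run_job` is a sequential stand-in for the work the framework would distribute across a cluster.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce: combine all partial counts for one word into a total."""
    return word, sum(counts)


def run_job(lines):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


result = run_job(["big data big cluster", "data node"])
print(result)
```

Only `mapper` and `reducer` contain application logic. Everything in `run_job` corresponds to work that the framework takes off the developer's hands.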
The system breaks the input data set into multiple chunks, each of which is assigned to a map task, and the map tasks process the chunks in parallel. The map function reads its input as (key, value) pairs and produces a transformed set of (key, value) pairs as output. The outputs of the map tasks are then shuffled and sorted, and the intermediate (key, value) pairs are sent to the reduce tasks, which group them into the final results. In classic (Hadoop 1.x) MapReduce, the JobTracker and TaskTracker daemons schedule and monitor the tasks and restart any that fail. In Hadoop 2 and later, YARN takes over this scheduling role.
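The flow described above (split into chunks, parallel map tasks, shuffle and sort, reduce, and restarting failed tasks) can be sketched on a single machine as follows. This is a toy model under stated assumptions: threads stand in for cluster nodes, the retry helper only loosely mimics task restarts, and every name and the sample temperature data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def split_input(records, chunk_size):
    """Break the input data set into fixed-size chunks, one per map task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_task(chunk):
    """Map: turn (line_no, "city,temp") pairs into (city, temp) pairs."""
    out = []
    for _line_no, text in chunk:
        city, temp = text.split(",")
        out.append((city, int(temp)))
    return out


def run_with_retry(task, arg, attempts=3):
    """Re-run a failed task up to `attempts` times, then give up."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts - 1:
                raise


def reduce_task(city, temps):
    """Reduce: collapse all values for one key into a single result."""
    return city, max(temps)


def run_job(records, chunk_size=2):
    chunks = split_input(records, chunk_size)
    with ThreadPoolExecutor() as pool:  # map tasks run concurrently
        map_outputs = list(pool.map(lambda c: run_with_retry(map_task, c), chunks))
    # Shuffle and sort: merge all map outputs and order them by key.
    intermediate = sorted(
        (pair for out in map_outputs for pair in out), key=itemgetter(0)
    )
    # Each reduce call receives one key together with all of its values.
    return [
        reduce_task(city, [temp for _, temp in group])
        for city, group in groupby(intermediate, key=itemgetter(0))
    ]


records = list(enumerate(["pune,31", "delhi,39", "pune,34", "delhi,36", "goa,30"]))
result = run_job(records)
print(result)
```

Sorting by key before grouping is what guarantees that each reduce call sees every value for its key, regardless of which map task produced it.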
The Hadoop framework includes the Hadoop Distributed File System (HDFS), a file system designed specifically for streaming data access and fault tolerance. HDFS stores large amounts of data: it divides the data into blocks (usually 64 or 128 MB) and replicates the blocks across the machines in the cluster. By default, three replicas of each block are maintained. Capacity and performance can be increased by adding DataNodes, while a single NameNode manages the file system metadata.
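The block and replica arithmetic above can be worked through with a short sketch. The function names, the 1 GB example file and the DataNode names are hypothetical. The round-robin placement is a deliberate simplification, since real HDFS uses a rack-aware placement policy rather than this scheme.

```python
import math


def plan_storage(file_size_mb, block_size_mb=128, replication=3):
    """Return (block count, total replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # A partially filled last block only uses the space it needs,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


def place_replicas(blocks, data_nodes, replication=3):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        b: [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
        for b in range(blocks)
    }


blocks, replicas, raw_mb = plan_storage(1024)  # a 1 GB file, 128 MB blocks
print(blocks, replicas, raw_mb)

placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(placement[0])
```

With three replicas, a 1 GB file consumes about 3 GB of raw cluster storage, which is worth factoring in when sizing the DataNodes.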