How to think big with Hadoop (Part I)

Author: Sachin Gupta | Categories: Big Data, Business Intelligence

Big Data, the word of the year from 2012, has now become synonymous to all that’s relevant in the tech world today. Despite its depth and scope, very few actually understand its actual usage. This blog post shares why the world of tomorrow is so hopeful about big data. We would also dig deep to understand the Hadoop implementation.

We are living a world encroached and enabled by data; so much that 90% of the data in the world today has been created in the last two years alone. The looming question remains – how do we draw meaning from this vast pool of data? Almost every industry wants to deliver the best experience to attract new customer and retain existing customers. Analyzing this data can help organizations to better understand the customer behavior, trend, hidden concerns, etc.

Big data is not only about quantity; it is an opportunity to find insights in new and emerging data and to make your business agiler. Big data helps find answers to complex business problems that were previously considered beyond reach. As the companies grow, data also grows at a rapid pace, and makes it difficult to analyze, report and react. Let us cite an example:

Consider a telecom company that wants to understand and analyze the actual reason to why a customer is moving to another provider. Reason could be that the subscriber is –

1) Dissatisfied with the plan

2) Dissatisfied with the services

3) Dissatisfied with the contract.

It is essential to analyze the reason. You can gather feedback from the customer at the time he surrenders the contract, but the feedback might be off the mark. Now the question is how to analyze the data correctly?

The data is humongous, in a disarray and in different systems and formats. Organizations today need a dedicated system that integrates data from different sources, processes them and generates reports. In the example above, you can integrate customer’s transaction data, call logs (with whom customers frequent talk), service center queries and contacts, query resolution etc. By analyzing data, you may arrive at a summary like – ‘Peer recommendation’, the friends from customer’s social networks were moving to a different provider and thus influenced the user decision to switch.

Problems like these can be solved through big data solution, and Hadoop is one of such solutions.

You might be thinking that same can be solved through database approach or data warehousing using BI tools with ETL process. We suggest you to run through the blog post and you would rest assured that Hadoop is not the replacement of any existing solution but rather compliments them with a new repository where structured, non-structured complex data can be combined easily.

Before we move over to understand Hadoop, let’s first clearly understand how to define big data.

“Big data is high-volume, high-velocity and high variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” – Gartner   

big data

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” – O’Reilly

Data volume is huge, say about few hundred Terabyte – Petabyte – Exabyte – Zettabyte

Another important aspect that defines big data infrastructure is the context in which we label it:

  • Traditional RDBMS
  • NoSQL Database Systems
  • Hadoop, MapReduce and massively parallel computing

Hadoop History – Inspired by Google

In order to overcome the problem of how to analyze the ever increasing data collected from various websites, Google designed and built a new data processing solution with two main components:

  • Google File System (GFS): Fault-tolerant, reliable and scalable storage
  • MapReduce: Split work among large numbers of servers and perform processing in parallel.

Open Source developer Doug Cutting was working on a web crawler called Nuch2 and was facing issues with high volumes and indexing speed. At the same time, Google released a paper describing its work to manage large data volumes. Soon after Doug Cutting replaced the data processing technique, basing his new implementation on MapReduce. He named the new software Hadoop, after a stuffed elephant toy that belonged to his young son.

What is Hadoop?

Hadoop is not a type of database, but rather a software ecosystem that allows for large parallel computing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data.

Hadoop consists of two major components –

1) HDFS (Hadoop Distributed File System)

HDFS provides scalable, fault-tolerant storage at low cost. The HDFS software detects and compensates for hardware issues, including disk problems and server failure.

HDFS stores files across a collection of servers in a cluster. Files are decomposed into blocks, and each block is written to more than one of the servers. This replication provides both fault-tolerance and performance.

HDFS ensures data availability by continually monitoring the servers in a cluster.

Why is the file system distributed?

Hadoop uses a distributed file system so that the processing takes place on many distributed servers at once (in parallel). Because Hadoop is distributing the processing task, it can take advantage of cheap commodity hardware and provide good performance on huge data.

Note: In computing, a file system (or files system) is used to control how data is stored and retrieved. Without a file system, information placed in a storage area would be one large body of data with no way to tell where one piece of information stops and the next begins. – Wikipedia

2) MapReduce

Hadoop operates on massive datasets by horizontally scaling the processing across very large numbers of servers through an approach called MapReduce.

MapReduce splits up a problem, sends the sub-problems to different servers, and lets each server solve its sub-problem in parallel. It then merges all the sub-problem solutions together and writes out the solution into files which may in turn be used as inputs into additional MapReduce steps.

MapReduce Software Framework offers clean abstraction between data analysis tasks and the underlying system challenges involved in ensuring reliable large-scale computation.

  • Process large jobs in parallel across nodes and combines results
  • Eliminates the bottlenecks imposed by monolithic storage systems
  • Results are collated and digested into a single output after each piece has been analyzed

The job scheduler (Component of MapReduce) is responsible for choosing the servers that will run each user job, and for scheduling execution of multiple user jobs on a shared cluster.

Go ahead and check out part 2 in this blog series.