How to think big with Hadoop (Part II)
In part one of this blog series, we introduced you to Hadoop. In this follow-up post, we cover the characteristics and architecture of Hadoop, and why it is needed.
Characteristics of Hadoop
- Highly flexible: it can process both structured and unstructured data.
- Provides faster responses by following the write-once, read-many principle: data is written once and then read many times.
- Has built-in fault tolerance. Data is replicated across multiple nodes, and if a node goes down, the required data can be read from another node that holds a copy.
- Provides a reliable shared storage and analysis system.
- Cost-effective: it runs on commodity hardware and does not require expensive high-end machines.
- Highly scalable: capacity grows by adding nodes to the cluster.
- Optimized for large and very large data sets.
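The write-once and replication behaviour described in the list above can be illustrated with a small, self-contained Python sketch. It is not Hadoop code: `ToyCluster`, its methods, and the node and block names are hypothetical, and the replication factor of 3 is an assumption chosen to match the usual HDFS default.

```python
import random


class ToyCluster:
    """Hypothetical in-memory model of block replication (not a Hadoop API)."""

    def __init__(self, node_names, replication=3):
        if replication > len(node_names):
            raise ValueError("replication factor exceeds number of nodes")
        self.replication = replication
        self.blocks = {name: {} for name in node_names}  # node -> {block_id: data}
        self.alive = set(node_names)
        self.written = set()

    def write_block(self, block_id, data):
        """Write a block once, copying it to `replication` distinct live nodes."""
        if block_id in self.written:
            raise ValueError(f"{block_id} already written (write-once)")
        targets = random.sample(sorted(self.alive), self.replication)
        for name in targets:
            self.blocks[name][block_id] = data
        self.written.add(block_id)
        return targets

    def fail(self, name):
        """Mark a node as down; its copies become unreachable."""
        self.alive.discard(name)

    def read_block(self, block_id):
        """Read from any live node that still holds a copy."""
        for name in sorted(self.alive):
            if block_id in self.blocks[name]:
                return name, self.blocks[name][block_id]
        raise IOError(f"no live replica of {block_id}")


cluster = ToyCluster(["node1", "node2", "node3", "node4"])
holders = cluster.write_block("blk_0001", b"hello hadoop")
cluster.fail(holders[0])  # one of the replica holders goes down
served_by, data = cluster.read_block("blk_0001")
print(served_by, data)
```

Even after one holder fails, the read succeeds because two other nodes still have a copy. Real HDFS also re-replicates under-replicated blocks in the background, which this sketch leaves out.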
Hadoop Architecture
HDFS, the Hadoop Distributed File System, is based on a master/slave architecture. An HDFS cluster consists of a single NameNode (a master server that manages the file system namespace) and multiple DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.
The NameNode stores metadata, and the DataNodes store the data blocks.
The NameNode holds information about:
- All the other nodes in the Hadoop cluster
- Files and their locations in the cluster
- Any other information useful for the operation of the Hadoop cluster
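To make the split between metadata and data blocks concrete, here is a minimal Python sketch of the kind of bookkeeping a NameNode does. Everything in it is an assumption for illustration: `ToyNameNode`, the block ID format, and the round-robin placement are hypothetical (real HDFS placement is rack-aware), and the 128 MB block size is a configurable default in recent Hadoop versions rather than a fixed value.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size; configurable in real clusters


@dataclass
class ToyNameNode:
    """Hypothetical metadata store: it knows where blocks live, not their contents."""
    datanodes: list
    replication: int = 3
    block_size: int = BLOCK_SIZE
    files: dict = field(default_factory=dict)      # path -> [block ids]
    locations: dict = field(default_factory=dict)  # block id -> [DataNode names]

    def __post_init__(self):
        if self.replication > len(self.datanodes):
            raise ValueError("replication factor exceeds number of DataNodes")

    def add_file(self, path, size_bytes):
        """Split a file into fixed-size blocks and record replica locations."""
        n_blocks = -(-size_bytes // self.block_size)  # ceiling division
        block_ids = []
        for i in range(n_blocks):
            block_id = f"{path}#blk{i}"
            start = len(self.locations) % len(self.datanodes)
            # Simplified round-robin placement on distinct DataNodes.
            self.locations[block_id] = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            block_ids.append(block_id)
        self.files[path] = block_ids
        return block_ids

    def locate(self, path):
        """Answer a client's question: which DataNodes hold each block of this file?"""
        return [(b, self.locations[b]) for b in self.files[path]]


namenode = ToyNameNode(datanodes=["dn1", "dn2", "dn3", "dn4"])
namenode.add_file("/logs/2013.log", 300 * 1024 * 1024)  # a 300 MB file
for block_id, replicas in namenode.locate("/logs/2013.log"):
    print(block_id, replicas)
```

In this sketch a client would ask the NameNode where the blocks are and then read the bytes from the DataNodes directly, which is why the NameNode only needs to hold metadata.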
The JobTracker keeps track of the individual tasks assigned to each node and coordinates the exchange of information and results.
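The coordination role described above can be sketched as a toy word-count job in plain Python. This is not the Hadoop MapReduce API: `run_job`, `map_task`, `reduce_task`, and the node names are hypothetical, tasks are assigned round-robin as a simplifying assumption, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_task(split):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def reduce_task(word, counts):
    """Reduce phase: add up all the counts collected for one word."""
    return word, sum(counts)


def run_job(splits, nodes):
    """Toy coordinator: assign one map task per split, then gather the results."""
    assignments = {}
    grouped = defaultdict(list)
    for task_id, split in enumerate(splits):
        # Round-robin assignment of map tasks to nodes (an assumption).
        assignments[f"map-{task_id}"] = nodes[task_id % len(nodes)]
        for word, one in map_task(split):
            grouped[word].append(one)  # shuffle step: group values by key
    results = dict(reduce_task(word, counts) for word, counts in grouped.items())
    return assignments, results


assignments, results = run_job(
    splits=["big data big", "data hadoop"],
    nodes=["node1", "node2"],
)
print(assignments)
print(results)
```

A real coordinator would also try to schedule each task on a node that already holds the relevant data block and would reschedule tasks from failed nodes; both are omitted here.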
Why is Hadoop inexpensive?
- Open source – no proprietary licensing fee
- Hadoop doesn’t require special processors or expensive hardware. It can give excellent performance on nodes built from cheap x86 servers.
Why is there a need for Big Data platforms like Hadoop?
Big Data solutions are needed because of the exponential growth of unstructured and semi-structured data.
A Hadoop implementation is useful where the data:
- Is high-velocity in nature
- Combines structured, semi-structured and unstructured data
- Includes enormous volumes
- Involves complexity in data distribution and synchronization
Common terms associated with the Hadoop ecosystem – a ready reckoner
- Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig is a high-level data-flow language and execution framework for parallel computation.
- HBase is inspired by Google’s Bigtable. It is a non-relational, scalable and fault-tolerant database that is layered on top of HDFS. HBase is written in Java.
- ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper is used by HBase and can be used by MapReduce programs.
- Avro is a data serialization system.
Go ahead and check out part 3 in this blog series.