How to think big with Hadoop (Part III)
This is the final post in our introduction to Hadoop. It covers use cases, trends, and more.
Google BigQuery and Hadoop – know the difference
Hadoop is an open-source framework for distributed data storage and processing; its processing layer is an implementation of the MapReduce programming model. Tools built on top of Hadoop, such as Impala, allow fast querying over large datasets.
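To make the MapReduce model concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is only an illustration of the idea, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are our own, and a real Hadoop job would distribute each step across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence (hypothetical local driver)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Because each `map_phase` call looks at one line and each `reduce_phase` call looks at one key, both steps can run in parallel on separate machines, which is the property Hadoop relies on.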
BigQuery is a hosted service that lets you run queries over massive datasets via an API. Users can run ad-hoc, full-table aggregate queries over big datasets (e.g. terabytes), and regular-expression match queries with sums and aggregations over several terabytes of data can complete in under a minute.
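To show what a "regular expression match with a sum" computes, here is a small pure-Python sketch over an in-memory table. It does not use BigQuery or its API; the row layout (URL, bytes served), the function name `bytes_by_section` and the sample pattern are all assumptions made for illustration. A hosted service would run the equivalent scan over terabytes rather than a handful of rows.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical row layout: (page_url, bytes_served)
Row = Tuple[str, int]


def bytes_by_section(rows: Iterable[Row], pattern: str) -> Dict[str, int]:
    """Sum bytes_served per captured group for URLs that match `pattern`.

    `pattern` must contain exactly one capture group; its match is the
    grouping key. Rows whose URL does not match are skipped.
    """
    regex = re.compile(pattern)
    totals: Dict[str, int] = defaultdict(int)
    for url, bytes_served in rows:
        match = regex.search(url)
        if match:
            totals[match.group(1)] += bytes_served
    return dict(totals)


if __name__ == "__main__":
    sample = [
        ("/blog/hadoop", 1200),
        ("/docs/setup", 800),
        ("/blog/bigquery", 300),
        ("/about", 50),
    ]
    print(bytes_by_section(sample, r"^/(blog|docs)/"))
```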
Hadoop uses its own distributed file system, HDFS (Hadoop Distributed File System), while Google MapReduce uses GFS (Google File System).
Hadoop is implemented in Java, whereas Google MapReduce is implemented in C++.
How are large organizations going to market with Hadoop?
- Oracle has launched its Big Data machine. Based on Cloudera, this server is dedicated to storing and using unstructured content.
- Informatica has a tool called HParser that complements PowerCenter. It is built to run Informatica processes in MapReduce mode, distributed across the Hadoop servers.
- Microsoft has a dedicated Hadoop version, supported by Apache, for Microsoft Windows and for Azure (its cloud solution), with extensive native integration with SQL Server 2012.
- Some very large database solutions, such as EMC Greenplum (partnering with MapR), HP Vertica (partnering with Cloudera), Teradata Aster Data (partnering with Hortonworks) and SAP Sybase IQ, are able to connect directly to HDFS.
Customers can choose an appropriate vendor based on their implementation requirements. The following table lists the top 9 Hadoop solution providers, the prominent Big Data vendors, and their products:
Hadoop: Use Cases / Scenarios
- Some large websites use Hadoop to analyze usage patterns from log files or click-stream data that is generated by hundreds or thousands of their web servers.
- The intelligence community needs to analyze vast amounts of data gathered by server farms monitoring phone, email, instant messaging, travel, shipping, etc. to identify potential terrorist threats.
- E-commerce companies need to analyze a vast variety and amount of product and customer data. Flipkart.com is in the process of implementing Hadoop and Cassandra (a NoSQL database). eBay (with 250TB of storage, 6 billion+ writes and 5 billion+ reads per day) has already implemented Hadoop and Cassandra.
- Financial services organizations analyze terabytes of data daily to identify and eliminate fraudulent activity and to improve the effectiveness of their trade execution.
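As an illustration of the log-analysis use case above, here is a minimal sketch in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated "key, value" text lines and the framework sorts the mapper output by key before the reducer sees it. The log line format (`<ip> <method> <path> <status>`) and the helper `run_locally`, which stands in for the cluster's sort step, are assumptions made for this example; real web server logs are usually richer.

```python
import itertools
from typing import Iterable, Iterator, List


def mapper(log_lines: Iterable[str]) -> Iterator[str]:
    """Emit 'path<TAB>1' for each well-formed request line.

    Assumed (hypothetical) line format: '<ip> <method> <path> <status>'.
    Lines that do not have exactly four fields are skipped.
    """
    for line in log_lines:
        fields = line.split()
        if len(fields) == 4:
            yield f"{fields[2]}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each path; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for path, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{path}\t{total}"


def run_locally(log_lines: Iterable[str]) -> List[str]:
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(log_lines))))


if __name__ == "__main__":
    logs = [
        "10.0.0.1 GET /home 200",
        "10.0.0.2 GET /cart 200",
        "10.0.0.3 GET /home 404",
        "malformed line",
    ]
    for result in run_locally(logs):
        print(result)
```

On a cluster, many copies of `mapper` would each read a slice of the log files, and the sort between the two steps is what guarantees that all counts for one path reach the same `reducer` call together.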
When not to use Hadoop
There are a few scenarios in which Hadoop is not the right fit. Some of them are:
- Scenarios that require multiple writers or repeated in-place updates to the same file; HDFS is designed for write-once, read-many access.
- Low-latency or near real-time data access.
- Small datasets, or a large number of small files to be processed. The NameNode holds the file system metadata in memory, so the memory required grows as the number of files increases.
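The small-files point can be made concrete with some rough arithmetic. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object, and a 128 MB block size; both are assumptions that vary by Hadoop version and configuration, and the estimate ignores directories and replication bookkeeping. The function name `namenode_heap_bytes` is our own.

```python
# Assumed rule of thumb: heap cost per file object or block object.
BYTES_PER_OBJECT = 150
# Assumed HDFS block size (128 MiB); older releases defaulted to 64 MiB.
DEFAULT_BLOCK_SIZE = 128 * 1024**2


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def namenode_heap_bytes(total_bytes: int, file_size: int,
                        block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Estimate metadata memory for `total_bytes` stored as equal-sized files.

    Each file costs one file object plus one object per block it occupies.
    """
    n_files = ceil_div(total_bytes, file_size)
    blocks_per_file = ceil_div(file_size, block_size)
    return n_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tib = 1024**4
    small = namenode_heap_bytes(one_tib, file_size=1024**2)  # 1 MiB files
    large = namenode_heap_bytes(one_tib, file_size=1024**3)  # 1 GiB files
    print(f"1 TiB as 1 MiB files: {small / 1024**2:.1f} MiB of metadata")
    print(f"1 TiB as 1 GiB files: {large / 1024**2:.1f} MiB of metadata")
```

Under these assumptions, the same terabyte of data needs roughly 300 MiB of NameNode memory when stored as 1 MiB files but only about 1.3 MiB when stored as 1 GiB files, which is why many small files are a poor fit.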
Disclaimer – Some of the images and/or content abstracts are referenced from online sources.