A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
These clusters run Hadoop's distributed processing software on low-cost commodity computers. Typically, one machine in the cluster is designated as the NameNode and another as the JobTracker; these are known as the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are called slaves. Hadoop clusters are referred to as shared-nothing systems because the only thing shared between nodes is the network that connects them.
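As a sketch of how workers find the masters described above, a classic Hadoop 1.x deployment points every node at the NameNode and JobTracker through two configuration files. The hostnames and ports below (namenode.example.com:9000, jobtracker.example.com:9001) are placeholders, not values from the original text:

```xml
<!-- core-site.xml: tells every node where the NameNode (HDFS master) lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: tells every node where the JobTracker (MapReduce master) lives -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
</configuration>
```

Every slave node carries copies of these files, which is why a single machine can serve as both a DataNode (reporting to the NameNode) and a TaskTracker (reporting to the JobTracker).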
Hadoop clusters boost the speed of data analysis applications. They are also highly scalable: as the volume of data grows, additional nodes can be added to the cluster. Hadoop clusters are also highly resistant to failure because each piece of data is copied onto other cluster nodes, which ensures that the data is not lost if one node fails.
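The fault tolerance described above is driven by HDFS's replication factor, set with the `dfs.replication` property in hdfs-site.xml. The value 3 shown here is HDFS's usual default; this is an illustrative sketch, not configuration from the original text:

```xml
<!-- hdfs-site.xml: each HDFS block is stored on this many DataNodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

With a replication factor of 3, losing one DataNode still leaves two copies of every block, and the NameNode schedules re-replication to restore the full count.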
Facebook has the largest Hadoop cluster in the world.