Yearly Archives: 2015

OOZIE NOTES: Oozie is a workflow scheduler that manages Hadoop and related jobs. It was first developed in Bangalore by Yahoo. Workflows are DAGs (Directed Acyclic Graphs): acyclic means the graph cannot contain any loops, and the action nodes of the graph provide control dependency. Control dependency means a second action cannot run until the first action has completed […]

Oozie Notes
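The control-dependency idea from the Oozie notes above can be illustrated with a small sketch. This is plain Python using the standard-library `graphlib` module, not Oozie's own workflow format; the action names (`ingest`, `clean`, `report`) and the `run_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return one valid execution order.

    `dependencies` maps each action name to the set of actions
    that must complete before it may start.
    Raises CycleError if the graph contains a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-step workflow: each action waits for the previous one.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "report": {"clean"},
}
order = run_order(workflow)

# A graph with a loop is not a DAG, so no valid order exists.
try:
    run_order({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
```

Here `order` lists the actions so that no action appears before the ones it depends on, and the looped graph is rejected, which is the "acyclic" requirement in practice.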

This post collects notes on ZooKeeper commands and scripts. It is mainly useful for Hadoop admins, and all commands are self-explanatory. ZooKeeper is a centralized coordination service for distributed applications. It addresses common issues with distributed applications: maintaining configuration information (sharing config info across all nodes), naming service (allows one node […]

Zookeeper Commands

In Java development, a typical workflow involves restarting the server with every class change, and no one complains about it. That is a fact about Java development. We have worked like that since our first day with Java. But is Java class reloading that difficult to achieve? And could that problem […]

Advanced Java Class Tutorial: A Guide to Class Reloading

Below are a few Hadoop real-time use cases with solutions. Use case 1. Problem. Data description: this gives information about the markets and the products available in different regions based on the seasons. The fields are listed in that file. Problem statement: select any particular county […]

Hadoop Real Time Usecases with Solutions

Cassandra: Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure. NoSQL: The primary objective of a NoSQL database is to have simplicity of design, horizontal scaling, and finer […]

Cassandra Interview Cheat Sheet

Hive: a SQL-based data warehouse application built on top of Hadoop (select, join, group by, ...). It is a platform used to develop SQL-type scripts that perform MapReduce operations. PARTITIONING: partitioning a table changes how Hive structures the data storage. It is used for distributing load horizontally, e.g. PARTITIONED BY (country STRING, state STRING); A subset of a […]

Hadoop and Hive Interview Cheat Sheet

Cassandra Overview: Cassandra is another NoSQL database. Like HBase, it is a distributed column-oriented database that handles big data workloads across multiple nodes, but it can support both the local file system and HDFS, whereas HBase's underlying file system is HDFS. It overcomes single point of failure […]

Cassandra Overview

This is a quick look at Impala commands and functions. Impala accepts basic SQL syntax, and below is a list of a few operators and commands that can be used inside Impala. This is just a quick cheat sheet. Databases: In Impala, a database is a logical container for a group […]

Impala Commands Cheat Sheet

Cloudera provides a separate tool called Impala to overcome the slowness of Hive queries. Syntactically, Impala queries are more or less the same as Hive queries, but they run much faster. Impala provides high-performance, low-latency SQL queries. When we are dealing with medium sized data sets and we […]

Impala Introduction