Hadoop


Installing the UDF Development Package. The output of the build step looks like the following. [cloudera@quickstart impala-udf-samples-master]$ cmake . -- The C compiler identification is GNU 4.4.7 -- The CXX compiler identification is GNU 4.4.7 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc -- works -- […]
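The crayon shortcode above held the actual install commands, which did not survive extraction. As a rough sketch, the sequence on a CDH quickstart VM usually looks like this (the package names are an assumption, not recovered from the post):

sudo yum install -y gcc-c++ cmake boost-devel impala-udf-devel   # build toolchain plus the Impala UDF headers
cmake .   # generate Makefiles inside the cloned impala-udf-samples directory
make      # compile the sample UDFs/UDAFs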

Creating UDF and UDAF for Impala


To install the server locally, use the command line. To start off, we need to set the password of the PostgreSQL user (role) called “postgres”; otherwise we will not be able to access the server externally. As the local “postgres” Linux user, we are allowed to connect […]
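The install commands themselves are in the elided block, but the password step is standard psql. A minimal sketch; \password prompts for the new password interactively, which keeps it out of shell history:

sudo -u postgres psql        # connect as the local "postgres" Linux user
postgres=# \password postgres
postgres=# \q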

Postgres Installation On CentOS


Below are Impala performance tuning options. Pre-execution checklist: data types, partitioning, file format. Data type choices: define integer columns as INT/BIGINT; operations on INT/BIGINT are more efficient than on STRING; convert “external” data to good “internal” types on load, e.g. CAST […]
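As a small illustration of the "convert external data to good internal types on load" advice, run through impala-shell (the table and column names here are hypothetical):

impala-shell -q "
  INSERT INTO sales_internal
  SELECT CAST(order_id AS BIGINT), CAST(quantity AS INT), region
  FROM sales_external;"   # STRING columns are cast to INT/BIGINT once, at load time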

Impala Best Practices



Installing Apache Storm. 1. Install the prerequisites for Storm to work on the machine: a. Download and install ZeroMQ 2.1.7 by running the following commands in a terminal. b. Download and install JZMQ. 2. Download the latest Storm release from http://storm.apache.org/downloads.html. Next, start the Storm cluster by starting the master […]
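The download and start commands live in the elided blocks; as a sketch, bringing up a single-node cluster after unpacking Storm usually amounts to the following (the /usr/local/storm path is an assumption):

/usr/local/storm/bin/storm nimbus &      # master daemon
/usr/local/storm/bin/storm supervisor &  # worker daemon
/usr/local/storm/bin/storm ui &          # web UI, default port 8080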

Apache Storm Integration With Apache Kafka


OOZIE NOTES. Oozie is a workflow scheduler to manage Hadoop and related jobs, first developed by Yahoo in Bangalore. Workflows are DAGs (Directed Acyclic Graphs): acyclic means the graph cannot have any loops, and the action members of the graph provide control dependency. Control dependency means a second action cannot run until the first action is completed […]
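For context, once a workflow.xml and job.properties are in place, driving a workflow from the Oozie CLI looks roughly like this (the server URL is the usual default, assumed here):

oozie job -oozie http://localhost:11000/oozie -config job.properties -run   # submit and start; prints a job id
oozie job -oozie http://localhost:11000/oozie -info <job-id>                # check workflow status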

Oozie Notes


This post contains notes on ZooKeeper commands and scripts. It is mainly useful for Hadoop admins, and all the commands are self-explanatory. ZooKeeper is a distributed, centralized coordination service. ZooKeeper addresses common issues with distributed applications: maintaining configuration information (sharing config info across all nodes) and providing a naming service (allows one node […]
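A few representative commands of the kind the post walks through (the znode name and data are hypothetical):

zkServer.sh status                                             # is this node the leader or a follower?
zkCli.sh -server localhost:2181 create /app_config "db=prod"   # share a config value across all nodes
zkCli.sh -server localhost:2181 get /app_config                # read it back from any node
zkCli.sh -server localhost:2181 ls /                           # list znodes under the root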

Zookeeper Commands



Below are a few Hadoop real-time use cases with solutions. Use case 1. Problem: Data description: this file gives information about the markets and the products available in different regions, based on the seasons. The fields in that file are listed in the post. Problem statement: select any particular county […]
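The actual field list sits in the elided block, so any query can only be a sketch; assuming a Hive table over the file with hypothetical column names, the problem statement maps to something like:

hive -e "
  SELECT market, product, season
  FROM market_data
  WHERE county = 'Baldwin';"   # filter the data set down to one particular county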

Hadoop Real Time Usecases with Solutions



Hive is a SQL-based data warehouse app built on top of Hadoop (select, join, group by, …). It is a platform used to develop SQL-style scripts to do MapReduce operations. PARTITIONING: partitioning changes how Hive structures the data storage and is used for distributing load horizontally, e.g. PARTITIONED BY (country STRING, state STRING); a subset of a […]
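The PARTITIONED BY clause quoted above slots into a CREATE TABLE like so (the table name and non-partition columns are hypothetical):

hive -e "
  CREATE TABLE customers (id BIGINT, name STRING)
  PARTITIONED BY (country STRING, state STRING);"   # each country/state pair becomes its own directory in HDFS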

Hadoop and Hive Interview Cheat Sheet



Hadoop Testing Tools. MRUnit: a Java framework that helps developers unit test Hadoop MapReduce jobs. Mockito: a Java mocking framework, often used alongside MRUnit when unit testing Hadoop MapReduce jobs. PigUnit: a Java framework that helps developers unit test Pig scripts. HiveRunner: an open source unit test framework for Hadoop Hive queries, based […]

Hadoop Testing Tools



Hadoop Performance Tuning. There are many ways to improve the performance of Hadoop jobs. In this post, we provide a few MapReduce properties that can be set at various MapReduce phases to improve performance. There is no one-size-fits-all technique for tuning Hadoop jobs, because of the architecture of […]
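As one common pattern, per-job properties can be passed on the command line with -D, provided the driver uses ToolRunner. The property names below are standard Hadoop 2 settings, but the values are illustrative, not recommendations from the post:

hadoop jar myjob.jar MyDriver \
  -D mapreduce.task.io.sort.mb=256 \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.job.reduces=20 \
  input/ output/
# io.sort.mb: bigger map-side sort buffer, fewer spills to disk
# map.output.compress: shrinks intermediate data shuffled to reducers
# job.reduces: controls reducer parallelism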

Hadoop Performance Tuning



Hadoop Best Practices. Avoid small files (smaller than one HDFS block, typically 128 MB), where one map task ends up processing a single small file. Maintain an optimal HDFS block size, generally >= 128 MB, to avoid tens of thousands of map tasks when processing large data sets. Use combiners wherever applicable/suitable to […]
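For instance, the block-size advice can be applied per file at write time (dfs.blocksize takes bytes; 268435456 is 256 MB; the file and target path are hypothetical):

hdfs dfs -D dfs.blocksize=268435456 -put bigfile.csv /data/   # fewer blocks, so fewer map tasks downstream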

Hadoop Best Practices


Formula to calculate HDFS node storage (H). Below is the formula to calculate the HDFS storage size required when building a new Hadoop cluster: H = C * R * S / (1 - i) * 120%, where C = compression ratio, which depends on the type of compression used (Snappy, LZOP, …) and the size of the data. When no […]
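As a worked example with purely illustrative numbers (and assuming, per the truncated definitions, that R is the replication factor, S the initial data size, and i the intermediate-data factor): with C = 1 (no compression), R = 3, S = 500 TB, and i = 0.25, H = 1 * 3 * 500 / (1 - 0.25) * 120% = 2,400 TB of raw HDFS capacity.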

Formula to Calculate HDFS nodes storage


In this post, we briefly discuss the steps for RHadoop installation on an Ubuntu 14.04 machine with Hadoop 2.6.0. We also cover the procedure for R and RStudio installation on an Ubuntu machine. All these installations are done on a single-node Hadoop machine. RStudio Installation on the Hadoop Machine: before proceeding […]
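The base R install on Ubuntu, plus the environment variables the RHadoop packages (rmr2/rhdfs) look for, typically comes down to the following; the Hadoop paths are assumptions for a default Hadoop 2.6.0 tarball layout:

sudo apt-get update
sudo apt-get install -y r-base r-base-dev   # R itself plus headers for compiling packages
export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar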

RHadoop Installation on Ubuntu