Hadoop


In this post, we introduce Hadoop Streaming using the term frequency and inverse document frequency (TF-IDF) algorithm. Hadoop Streaming: By default, the MapReduce framework is written in Java and supports writing MapReduce programs in the Java programming language, but Hadoop provides an API for writing MapReduce programs in languages other than Java […]

Introduction to Hadoop Streaming


Most Popular Hadoop Distributions: Currently there are a lot of Hadoop distributions available in the big data market, but the major free open source distribution is from the Apache Software Foundation. The remaining Hadoop distribution companies also provide free versions of Hadoop, along with customized Hadoop distributions suitable for client organization […]

Most Popular Hadoop Distributions




In the previous post, we discussed the basics of log files and the architecture of log analysis in Hadoop. In this post, we go into much deeper detail on processing logs in Pig. As discussed in the previous post, there will be three types of log […]

Processing Logs in Pig


In this post, we will discuss various log file types and log analysis in Hadoop. Log Files: Logs are computer-generated files that capture network and server operations data. They are useful during various stages of software development, mainly for debugging and profiling purposes, and also for managing network operations. Need […]

Log Analysis in Hadoop


In this post, we discuss the basic details of Tableau software and Tableau integration with Hadoop. Tableau Overview: What is Tableau? Tableau is a visualization tool based on breakthrough technology that provides drag-and-drop features for analyzing large amounts of data easily and quickly. The […]

Tableau Integration with Hadoop



In this post, we will discuss Hadoop installation in the cloud. Though there are a number of posts available across the internet on this topic, we are documenting the procedure for Cloudera Manager installation on Amazon EC2 instances, with some of our practical views on installation and tips and hints to […]

Cloudera Manager Installation on Amazon EC2


In this post, we will discuss the configuration required for Hive connectivity with Hunk, the Hadoop flavor of Splunk, the well-known visualization tool. Splunk Overview: Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, dashboards, and visualizations. Splunk released a product […]

Hive Connectivity With Hunk (Splunk)


Apache Tez Overview: What is Apache Tez? Apache Tez is another execution framework project from the Apache Software Foundation, built on top of Hadoop YARN. It is considered a more flexible and powerful successor to the MapReduce framework. Apache Tez Features: Tez provides performance gain over MapReduce […]

Apache Tez – Successor of Mapreduce Framework



This post is a continuation of the previous post on the small files issue. In the previous post, we merged a huge number of small files in an HDFS directory into a SequenceFile; in this post, we will merge a huge number of small files on the local file system into an Avro file on HDFS […]

Merging Small Files Into Avro File


In this post, we will discuss one of the well-known use cases of SequenceFiles: merging a large number of small files into a SequenceFile. This requirement arises mainly from the lack of efficient processing of large numbers of small files in Hadoop or MapReduce. Need For […]

Merging Small Files into SequenceFile


This post is a continuation of the previous post on Hadoop sequence files. In this post, we will discuss examples of reading and writing a SequenceFile using the Apache Hadoop 2 API. Writing Sequence File Example: As discussed in the previous post, we will use the static method SequenceFile.createWriter(conf, opts) to create a SequenceFile.Writer instance and […]

Reading and Writing SequenceFile Example



In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. In this post, we will discuss the basic details and format of Hadoop sequence files, with examples. Hadoop […]

Hadoop Sequence Files example


In this post, we will provide a proof of concept for Flume data collection into HDFS with Avro serialization, using the HDFS sink and the Avro serializer on sequence files with Snappy compression. We will also use the formatting escape sequences to store the events on the HDFS path. In this post, we will create […]

Flume Data Collection into HDFS with Avro Serialization


In this post, we will discuss the setup of an agent for Flume data collection into HDFS. We will set up an agent with a Sequence Generator source, an HDFS sink, and a memory channel, then start that agent and verify its functionality. Flume Data Collection into HDFS: Flume Agent – Sequence Generator […]

Flume Data Collection into HDFS