In this post, we will cover the basics of Apache Pig.
What is Apache Pig?
Pig is a scripting platform for easily exploring huge data sets, gigabytes or terabytes in size. Pig provides an engine for executing data flows in parallel on Hadoop. Pig is made up of two main components:
- Pig Latin: Language for expressing data flows
- Pig Engine: Execution Environment to run Pig Latin programs. It has two modes
- Local Mode: Local execution in a single JVM; all files are read and written using the local host and file system.
- MapReduce Mode: Distributed execution on a Hadoop cluster; this is the default mode.
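As a sketch, the execution mode is chosen when launching Pig; the script itself is the same in both modes (the input path below is a hypothetical example):

```pig
-- Launching Pig from the command line:
--   pig -x local      -- local mode: single JVM, local file system
--   pig               -- MapReduce mode (default): runs on the Hadoop cluster
-- Only where the data lives and where the work runs differs:
lines = LOAD 'input/sample.txt' AS (line:chararray);  -- hypothetical path
DUMP lines;
```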
Pig Latin Features:
- Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.)
- Pig Latin is extensible so that users can develop their own functions for reading, processing, and writing data.
- A Pig Latin script is made up of a series of operations, or transformations, that are applied to the input data to produce output.
- Pig Latin programs can be executed either in Interactive mode through Grunt shell or in Batch mode via Pig Latin Scripts.
- The Pig Engine converts these Pig Latin operators and transformations into a series of MapReduce jobs.
- Pig does not support random reads or queries in the order of tens of milliseconds.
- Pig does not support random writes to update small portions of data, all writes are bulk, streaming writes, just like MapReduce.
- Low-latency queries are not supported in Pig; thus, it is not suitable for OLAP or OLTP workloads.
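To make the operators above concrete, here is a minimal Pig Latin sketch; the file name, field layout, and threshold are assumptions for illustration:

```pig
-- Load tab-separated page-view records (schema is assumed)
views = LOAD 'pageviews.tsv' AS (user:chararray, url:chararray, time:long);

-- Traditional data operations expressed as Pig Latin operators
recent = FILTER views BY time > 1000000L;            -- filter
by_url = GROUP recent BY url;                        -- group
counts = FOREACH by_url GENERATE group AS url,
                                 COUNT(recent) AS hits;
ranked = ORDER counts BY hits DESC;                  -- sort

-- Each of these transformations is compiled into MapReduce jobs
STORE ranked INTO 'top_urls';
```

Run interactively line by line in the Grunt shell, or saved to a file and executed in batch mode.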
Apache Pig History:
The name “Pig” comes from the domestic animal; it is not an acronym. This entertaining nomenclature led to some silly names within the Pig project, such as Pig Latin for its language and Grunt for its interactive shell.
Apache Pig is a top-level project of the Apache Software Foundation. It was originally started by Yahoo! researchers, who later contributed it to the Apache open-source community in 2010.
The Pig architecture is shown in the figure below, which conveys that:
- Pig Latin scripts or Pig commands from Grunt shell will be submitted to Pig Engine.
- The Pig Engine parses, compiles, and optimizes the script, then submits the resulting MapReduce jobs.
- MapReduce accesses HDFS and returns the results.
One of the common use cases of Pig is building data pipelines. A typical example is web companies bringing in logs from their web servers, cleansing the data, and pre-computing common aggregates before loading it into their data warehouse. In this case, the data is loaded onto the grid, and Pig is used to clean out records from bots and records with corrupt data. It is also used to join web event data against user databases so that user cookies can be connected with known user information.
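That pipeline might be sketched in Pig Latin as follows; the file names, schemas, and the bot list are all assumptions:

```pig
-- A sketch of the log-cleansing pipeline described above
logs  = LOAD 'weblogs'    AS (cookie:chararray, url:chararray, agent:chararray);
users = LOAD 'user_db'    AS (cookie:chararray, user_id:long);
bots  = LOAD 'known_bots' AS (agent:chararray);

-- Clean out corrupt records and records from known bots
valid    = FILTER logs BY cookie IS NOT NULL AND url IS NOT NULL;
not_bots = JOIN valid BY agent LEFT OUTER, bots BY agent;
clean    = FILTER not_bots BY bots::agent IS NULL;

-- Connect user cookies with known user information
events = JOIN clean BY valid::cookie, users BY cookie;
STORE events INTO 'warehouse_staging';
```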
What is the need for Pig when we already have MapReduce?
MapReduce is a low-level data-processing paradigm, whereas Pig provides a high-level abstraction for processing large data sets. Although both MapReduce and Pig are used for processing data sets, and Pig transformations are converted into a series of MapReduce jobs, below are the major differences between the MapReduce processing framework and the Pig framework.
Pig Latin provides all of the standard data-processing operations, such as join, filter, group by, order by, and union. MapReduce provides the group by operation directly, but order by, filter, projection, and join are not provided and must be written by the user.
| MapReduce | Pig |
|---|---|
| It is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. | High-level programming. |
| The development cycle is very long: writing mappers and reducers, compiling and packaging the code, submitting jobs, and retrieving the results is a time-consuming process. | No need to compile or package code; Pig operators are converted into map or reduce tasks internally. |
| Preferable for extracting a small portion of data from large datasets. | Not suitable for fetching small portions of data from a large dataset, since it is set up to scan the whole dataset, or at least large portions of it. |
| Not easily extendable; functions must be written from scratch. | UDFs tend to be more reusable than the libraries developed for writing MapReduce programs. |
| Needed when we want very deep, fine-grained control over the way our data is processed. | Sometimes it is not very convenient to express exactly what we need in terms of Pig and Hive queries. |
| Performing data set joins is very difficult. | Joins are simple to achieve. |
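The last row of the table is easy to see in code: a join that would require a full custom MapReduce job is a single statement in Pig Latin (the relations and schemas below are assumptions):

```pig
orders    = LOAD 'orders'    AS (order_id:long, cust_id:long, total:double);
customers = LOAD 'customers' AS (cust_id:long, name:chararray);

-- An inner join, which in raw MapReduce would require custom
-- mapper and reducer code to tag and merge the two inputs
joined = JOIN orders BY cust_id, customers BY cust_id;
DUMP joined;
```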
Difference Between Hive and Pig:
Hive can be treated as a competitor to Pig in some cases, and Hive also operates on HDFS similar to Pig, but there are some significant differences. HiveQL is a query language based on SQL, whereas Pig Latin is not a query language; it is a data-flow scripting language.
Since Pig Latin is procedural, it fits very naturally into the pipeline paradigm. HiveQL, on the other hand, is declarative.
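The procedural/declarative contrast shows up clearly in a small example: the single declarative HiveQL statement in the comment below corresponds to an explicit, step-by-step Pig Latin data flow (table and field names are assumptions):

```pig
-- Declarative HiveQL equivalent (one statement):
--   SELECT url, COUNT(*) FROM views GROUP BY url;

-- Procedural Pig Latin: each step names an intermediate relation,
-- which fits naturally into a pipeline
views  = LOAD 'views' AS (user:chararray, url:chararray);
by_url = GROUP views BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(views) AS n;
STORE counts INTO 'url_counts';
```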