Processing Logs in Pig

In the previous post we covered the basics of log files and the architecture of log analysis in Hadoop. In this post we go deeper into processing logs in Pig.

As discussed in the previous post, there are three major types of log files:

  • Web Server Access Logs
  • Web Server Error Logs
  • Application Server Logs

Each of these log files will be in one of the three formats below:

  • Common Log File format
  • Combined Log File format
  • Custom Log File Format

In the following sections we discuss each of these log formats and how to process it in Pig.

Common Log Format Files:

The Common Log Format is also called Apache's Common Log Format, as it originated in Apache Web Server logs. Each line records the client address, the remote log name, the user, the request time, the request line, the status code and the number of bytes sent.

An example log line is shown below:

Sample Common Log file for testing: common_access_log
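To make the field layout concrete, here is a minimal sketch in Python (outside Pig) of splitting a Common Log Format line with a regular expression. The pattern and the field names are assumptions chosen for illustration, not the expression used by Piggybank, and the sample line follows the style of the Apache documentation rather than the common_access_log file.

```python
import re

# Assumed pattern for the Common Log Format:
# host, log name, user, [timestamp], "method uri protocol", status, bytes.
COMMON_LOG_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (.+?) (HTTP[^"]*)" (\S+) (\S+)$'
)

# Hypothetical field names, one per capture group.
FIELDS = ("addr", "logname", "user", "time", "method", "uri", "proto", "status", "bytes")


def parse_common_log_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = COMMON_LOG_RE.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


# Illustrative line only; it is not taken from the sample file.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_common_log_line(sample)
print(record["addr"], record["method"], record["uri"], record["status"])
```

Lines that do not match the pattern return None, so a caller can skip or log them instead of failing.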

We already met Load functions in the previous post on Pig Load Functions, and this section is a good example of a custom Load function. Fortunately Piggybank, a repository of user-submitted UDFs, contains a custom loader function, CommonLogLoader, that loads Apache's Common Log Format files into Pig. This Java class extends the RegExLoader class, which is itself a custom UDF for the Load function.

To know more about writing UDFs for a custom Load function, refer to the post.

CommonLogLoader uses the regular expression below to parse Common Log Format files:

Example Use case of CommonLogLoader:

Let's put the above Apache Common Log file into the HDFS location /in/, register the piggybank jar file, define a temporary function for CommonLogLoader, and use it to parse the common_access_log file.

From the Pig-0.13.0 release, the piggybank.jar file is included in the Pig lib directory, so we do not need to register it each time we log in to the Pig grunt shell. On earlier releases we need to register this jar file as shown below to use any UDFs from Piggybank.

REGISTER '/path_to_piggybank/piggybank.jar';

Let's put the Pig Latin commands in a script file, common_log_process_script.pig, and run it.

We are trying to display the counts of addresses/host names from the log file:

Figure: Apache's Common Log processing in Pig
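The group-and-count step behind this report can be sketched in Python as follows. This is only an illustration of the logic, not the Pig script itself; it assumes the client address is the first whitespace-separated field of each line, and the sample lines are hypothetical.

```python
from collections import Counter


def count_hosts(lines):
    """Count requests per client address (the first whitespace-separated field)."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 1)
        if parts:  # skip blank lines
            counts[parts[0]] += 1
    return counts


# Hypothetical log lines for illustration only.
sample_lines = [
    '10.0.0.1 - - [17/Nov/2014:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [17/Nov/2014:10:00:01 +0000] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [17/Nov/2014:10:00:02 +0000] "GET /b HTTP/1.1" 404 0',
]
host_counts = count_hosts(sample_lines)
for host, n in host_counts.most_common():
    print(host, n)
```

In Pig the same idea is expressed by grouping the loaded records on the address field and counting each group.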

Below is the output of the above Pig script run:

Figure: Pig DUMP output

We can verify the count (270) of the IP address highlighted above against the input log file on the local file system, to confirm that the log was parsed correctly, with the command below:

Figure: verify count

So we can confirm that Apache's Common Log files are processed successfully in Pig. These results can be stored in an HDFS file and fed to a Hive external table, making the data available to any visualization tool, such as Hunk or Tableau.

Combined Log Format Files:

The Combined Log Format has two extra fields, referrer and user agent, compared to the Common Log Format.

An example log line is shown below:

Sample Combined Log file for testing: combined_access_log

Similar to CommonLogLoader, Piggybank provides a custom loader function, CombinedLogLoader, to load Combined Log Format files into Pig. This Java class also extends the RegExLoader class.

CombinedLogLoader uses the regular expression below to parse Combined Log Format files:

Example Use case of CombinedLogLoader:

As in the CommonLogLoader use case above, let's copy the sample combined log file into the HDFS location /in/ and execute the Pig Latin script below to get the top 10 referrers and their counts in the access log file.

Push the sample combined log file into HDFS and execute the above pig script.
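The top-referrers computation can be sketched in Python as below. The regular expression is an assumed pattern for the Combined Log Format (the Common Log fields followed by a quoted referrer and a quoted user agent), not the one CombinedLogLoader uses, and the sample lines and example.com URLs are hypothetical.

```python
import re
from collections import Counter

# Assumed Combined Log Format pattern: group 8 is the referrer, group 9 the user agent.
COMBINED_LOG_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\S+) (\S+) "([^"]*)" "([^"]*)"$'
)


def top_referrers(lines, n=10):
    """Return the n most frequent referrers as (referrer, count) pairs."""
    counts = Counter()
    for line in lines:
        match = COMBINED_LOG_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the format
            counts[match.group(8)] += 1
    return counts.most_common(n)


# Hypothetical lines for illustration only.
sample_lines = [
    '10.0.0.1 - - [17/Nov/2014:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "http://example.com/a" "Mozilla/5.0"',
    '10.0.0.2 - - [17/Nov/2014:10:00:01 +0000] "GET /x HTTP/1.1" 200 128 "http://example.com/b" "Mozilla/5.0"',
    '10.0.0.3 - - [17/Nov/2014:10:00:02 +0000] "GET /y HTTP/1.1" 200 256 "http://example.com/a" "curl/7.0"',
    'this line is malformed',
]
result = top_referrers(sample_lines)
for referrer, count in result:
    print(referrer, count)
```

The equivalent Pig pipeline groups by the referrer field, counts each group, orders by the count in descending order and limits the result to 10 rows.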

Verify Output:

For your information, the combined log file provided above is the access log of this site for 17th Nov 2014. Below are the top 10 referrer pages for the site on that day.

Figure: top 10 referrers in log file

As shown above, we can use this log processing mechanism to get real-world statistics out of the log files of any web server. In this example we found the top 10 referrers of the site; similarly, we can prepare many other statistics, such as date-wise user counts or date-wise top 10 page counts. This data can be saved in a Hive table with two columns (page: string, count: int) and connected to Tableau for visualization.

So this section gives an example of a real-world project that processes log files in Hadoop and produces visualizations of the resulting statistics. For connecting Tableau to Hive data, refer to the post Tableau with Hive.

Custom Log Format Files:

Custom Log Format files are log files in any format other than the above two. As with the regular expressions above, we need to write our own regular expressions to parse these log files. One example of such custom log files is the logs generated by Hadoop/YARN daemons on cluster nodes.

Sample Custom Log file for testing: hadooplogs

For processing custom log format files, Piggybank provides the MyRegExLoader class, which extends the RegExLoader class. It is similar to CommonLogLoader, but we need to provide the regular expression pattern used to parse the custom log files. This regular expression is passed as an argument in Pig Latin.

Example Call to MyRegExLoader class:

This would parse lines like

into arrays like

Use Case: Parsing Hadoop Daemon Logs:

Let's upload the hadooplogs file provided above into the HDFS directory /in/ and list only the WARN and ERROR messages, so that we can debug error scenarios and improve the performance of the Hadoop nodes.

Save the Pig Latin script below into the hadoop_log_process_script.pig file.

Figure: hadoop log process in pig
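As a rough sketch of the filtering idea, the Python snippet below parses daemon log lines with a custom regular expression and keeps only the WARN and ERROR records. It assumes a log4j-style layout (timestamp, level, logger class, message); the pattern, the function name split_by_level and the sample messages are hypothetical and are not taken from the hadooplogs file.

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger.Class: message".
HADOOP_LOG_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+(\S+): (.*)$'
)


def split_by_level(lines, levels=("WARN", "ERROR")):
    """Group parsed (time, level, logger, message) tuples by the requested levels."""
    selected = {level: [] for level in levels}
    for line in lines:
        match = HADOOP_LOG_RE.match(line.rstrip("\n"))
        # Lines that do not match, such as stack trace lines, are skipped.
        if match and match.group(2) in selected:
            selected[match.group(2)].append(match.groups())
    return selected


# Hypothetical lines for illustration only.
sample_lines = [
    "2014-11-17 10:15:32,123 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting DataNode",
    "2014-11-17 10:15:33,456 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write",
    "2014-11-17 10:15:34,789 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock",
    "    at java.lang.Thread.run(Thread.java:745)",
]
by_level = split_by_level(sample_lines)
for level, records in by_level.items():
    print(level, len(records))
```

Keeping one list per level mirrors the idea of writing WARN and ERROR messages to two separate outputs.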

Verify Output:

Let's verify the output in the two output files below:

Figure: hadoop log process output

Thus we have successfully parsed the Hadoop logs in a single file. The same approach can be applied cluster-wide to produce statistics on the number of error and warning messages per day per node. Based on these we can track the health of a large production Hadoop cluster and improve its performance by analyzing the WARN and ERROR messages. This is another real-world use case for parsing logs and deriving practical benefit from log processing.

About Siva

Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.