This post is a continuation of the previous post on Hadoop sequence files. In this post we will discuss reading and writing sequence files using the Apache Hadoop 2 API, with examples.
Writing Sequence File Example:
As discussed in the previous post, we will use the static method SequenceFile.createWriter(conf, opts) to create a SequenceFile.Writer instance, and the append(key, value) method on that instance to insert each record into the sequence file.
In the example program below, we read the contents of a text file (syslog) on the local file system and write them to a sequence file on Hadoop. We use an integer counter as the key and each line from the input file as the value in the sequence file's <key, value> format.
To verify the (key, value) pairs in the sequence file, we print the first 50 records to the console. Copy the code snippet below into a SequenceFileWrite.java program file.
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Metadata;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileWrite {

    public static void main(String[] args) throws IOException {
        String uri = args[1];                     // output sequence file path on Hadoop
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        File infile = new File(args[0]);          // input text file on the local file system

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    Writer.file(path),
                    Writer.keyClass(key.getClass()),
                    Writer.valueClass(value.getClass()),
                    Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size", 4096)),
                    Writer.replication(fs.getDefaultReplication()),
                    Writer.blockSize(1073741824),
                    Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
                    Writer.progressable(null),
                    Writer.metadata(new Metadata()));

            int ctr = 100;
            for (String line : FileUtils.readLines(infile)) {
                key.set(ctr++);
                value.set(line);
                if (ctr <= 150)                   // echo the first 50 records for verification
                    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
Compile this program and build a jar file (say, Seq.jar); we will use this jar file to run the SequenceFileWrite program on Hadoop.
Run it with the command below.
$ hadoop jar Seq.jar SequenceFileWrite syslog /out/syslog.seq
Verify Output:
Verify the output sequence file /out/syslog.seq with the hadoop fs -cat command. With this command we can check whether it is a sequence file (the first three bytes will be SEQ), and also see the writable classes of the key and value, the compression type, and the codec class used in this sequence file.
From the screenshot below, we can see that /out/syslog.seq is a sequence file, as its first three bytes are SEQ, and
Key class is – org.apache.hadoop.io.IntWritable
Value class is – org.apache.hadoop.io.Text
Compression Codec – org.apache.hadoop.io.compress.DefaultCodec
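The same header details can also be checked programmatically. The sketch below (a minimal example, assuming the /out/syslog.seq file produced by the write program is passed as the first argument) opens a SequenceFile.Reader and prints the key class, value class, compression type, and codec recorded in the file header.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;

public class SequenceFileHeaderCheck {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SequenceFile.Reader reader = null;
        try {
            // Open the sequence file written earlier; args[0] is its path
            reader = new SequenceFile.Reader(conf, Reader.file(new Path(args[0])));
            // These values are read straight from the SEQ header
            System.out.println("Key class        : " + reader.getKeyClassName());
            System.out.println("Value class      : " + reader.getValueClassName());
            System.out.println("Compression type : " + reader.getCompressionType());
            System.out.println("Compression codec: " + reader.getCompressionCodec().getClass().getName());
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
```

For the file created above, this should report IntWritable, Text, BLOCK compression, and DefaultCodec.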
Reading SequenceFile Example:
Now we will see how to read the sequence file created above through the Hadoop 2 API. We will create a SequenceFile.Reader instance and use its next(key, value) method to iterate over each record in the sequence file.
Note that in the program below we do not specify the compression type or codec that was used when creating the sequence file. By default, the reader instance gets these details from the file header itself and decompresses the file with the codec recorded there. Also note that we use the getKeyClass() and getValueClass() methods on the reader instance to retrieve the classes of the (key, value) pairs in the sequence file.
In the program below we read the contents of the sequence file and print them to the console. Copy the code snippet below into a SequenceFileRead.java program file.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileRead {

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(conf,
                    Reader.file(path),
                    Reader.bufferSize(4096),
                    Reader.start(0));
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            //long position = reader.getPosition();
            //reader.seek(position);
            while (reader.next(key, value)) {
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s]\t%s\t%s\n", syncSeen, key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
Compile this program and build a jar file (say, Seq.jar); we will use this jar file to run the SequenceFileRead program on Hadoop.
Run it with the command below.
$ hadoop jar Seq.jar SequenceFileRead /out/syslog.seq
Below is a screenshot of the first 10 lines of output from the above command.
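The commented-out getPosition() and seek() calls in the reader program hint at random access. As a rough sketch (the method names are real Hadoop 2 API, but the class name and the byte offset are illustrative), seek(position) works only when the position is an exact record boundary, while sync(position) is safer because it advances to the next sync marker at or after the given offset.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileSeek {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(conf, Reader.file(new Path(args[0])));
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            long pos = reader.getPosition(); // a valid record boundary (first record)
            reader.next(key, value);         // read the first record
            reader.seek(pos);                // seek() must land on an exact boundary...
            reader.next(key, value);         // ...so re-reading the same record works

            reader.sync(1024);               // 1024 is an arbitrary offset; sync() moves
                                             // to the next sync marker at or after it
            while (reader.next(key, value)) {
                System.out.printf("%s\t%s\n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
```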
Reading SequenceFile with Command-line Interface:
There is an alternative way to view the contents of a sequence file from the command-line interface. Hadoop provides the hadoop fs -text command to display the contents of a sequence file in text format.
This command looks at a file's magic number to detect the type of the file and convert it to text appropriately. It can recognize gzipped files and sequence files; otherwise, it assumes the input is plain text.
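The detection idea can be sketched in plain Java. The helper below is illustrative only (not Hadoop's actual implementation): it classifies a file's leading bytes as a sequence file (magic bytes SEQ), a gzip stream (magic bytes 0x1f 0x8b), or plain text, mirroring what hadoop fs -text conceptually does before converting a file to text.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class MagicDetect {

    // Classify a header byte array by its magic number:
    // "SEQ" -> sequence file, 0x1f 0x8b -> gzip, anything else -> plain text.
    public static String detect(byte[] header) {
        if (header.length >= 3
                && header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
            return "sequencefile";
        }
        if (header.length >= 2
                && (header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b) {
            return "gzip";
        }
        return "text";
    }

    public static void main(String[] args) throws IOException {
        byte[] header = new byte[3];
        try (FileInputStream in = new FileInputStream(args[0])) {
            int n = in.read(header);
            System.out.println(detect(Arrays.copyOf(header, Math.max(n, 0))));
        }
    }
}
```

Running this against /out/syslog.seq (copied locally) would report it as a sequence file, since the file starts with the three bytes SEQ.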