In this post, we will give a basic introduction to Avro serialization.
What is Avro Serialization?
Avro is one of the most widely used data serialization and deserialization frameworks, and it integrates well with almost all Hadoop platforms. Avro was created by Doug Cutting, the creator of Hadoop, and is now a full-fledged project under the Apache Software Foundation.
Need for Avro Serialization:
Hadoop's native library provides Writables for data serialization (converting objects into a byte stream) and deserialization (converting a byte stream back into objects), and it also supports SequenceFiles, which store data in a binary format. These are the only two serialization mechanisms Hadoop provides out of the box.
The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API; they cannot be written or read in any other language.
So files created in Hadoop with these two mechanisms cannot be read by programs written in any other language, which turns Hadoop into a closed box. To address this drawback, Doug Cutting created Avro, a language-independent data serialization system.
Avro Serialization Features:
- Avro is a language-neutral data serialization system, and its data can be processed in many languages (currently C, C++, C#, Java, Python, and Ruby).
- Avro creates a structured binary format that is both compressible and splittable, so it can be used efficiently as the input to Hadoop MapReduce jobs.
- Avro provides rich data structures. For example, we can create a record that contains an array, an enumerated type, and a sub-record. These can be created in any language, processed in Hadoop, and the results fed to a third language.
- Avro schemas are defined in JSON. This facilitates implementation in languages that already have JSON libraries.
- An Avro data file stores the schema along with the data in a metadata section, which makes the file self-describing.
- Avro is also used for RPC (Remote Procedure Call); there, the client and server exchange schemas during the connection handshake.
- Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation.
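For concreteness, here is what an Avro schema looks like in JSON. The record name and fields below are purely illustrative (our own example, not part of any standard schema); it shows a record containing a primitive field, an array, and an enumerated type, echoing the rich data structures mentioned above:

```json
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "name",   "type": "string"},
    {"name": "skills", "type": {"type": "array", "items": "string"}},
    {"name": "grade",  "type": {"type": "enum", "name": "Grade",
                                "symbols": ["JUNIOR", "SENIOR", "LEAD"]}}
  ]
}
```

Because this is plain JSON, any language with a JSON library can parse and generate such schemas, which is exactly why Avro chose JSON as its schema language.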
Comparison with Other Cross-Language Serialization Frameworks:
There are other serialization frameworks that provide a language-independent serialization mechanism, notably Protocol Buffers (by Google) and Thrift (by Apache).
- These frameworks require code to be generated from the schema in order to read or write data files; in Avro, code generation is optional.
- The schema is not stored with the data in Thrift or Protocol Buffers, but it is in Avro. Since the schema is present when the data is read, considerably less type information needs to be encoded with the data.
- Avro has rich schema resolution capabilities. The schema used to read data need not be identical to the schema that was used to write the data. For example, a new, optional field may be added to a record by declaring it in the schema used to read the old data. New and old clients alike will be able to read the old data, while new clients can write new data that uses the new field. Conversely, if an old client sees newly encoded data, it will gracefully ignore the new field and carry on processing as it would have done with old data.
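To illustrate schema resolution with a hypothetical example of our own, suppose data was originally written with this schema:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"}
  ]
}
```

A newer reader can declare an additional optional field with a default value. Old data read with this schema gets email filled in from the default, while an old client reading new data simply ignores the extra field:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```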
Avro installation is a simple and straightforward process. All we need to do is download the required binary jar files onto our cluster and add them to the classpath. In this post, we will show the installation of Avro for a Java environment.
Avro mainly requires the avro-tools and avro-mapred jar files to be present on the classpath. These jar files contain all the classes for the compiler, hadoop, mapred, and mapreduce packages.
In this section, we will go through the below steps to install Avro on an Ubuntu machine.
- Download the latest stable versions of the above jar files from the Apache download mirrors. At the time of writing this post, avro-1.7.7 was the latest stable version, but hadoop-2.3.0 shipped with avro-1.7.4.jar in its $HADOOP_HOME/share/hadoop/common/lib directory, so the same 1.7.4 version is used to describe the installation process in this post.
- Copy the avro-mapred-x.y.z-hadoop2.jar and avro-tools-x.y.z.jar files into the Hadoop distribution folders, usually $HADOOP_HOME/share/hadoop/tools/lib and $HADOOP_HOME/share/hadoop/common/lib, which contain the jar files for many other tools.
- Add the above folders to the classpath in the .bashrc file.
$ cp avro-mapred-1.7.4-hadoop2.jar avro-tools-1.7.4.jar $HADOOP_HOME/share/hadoop/tools/lib/
$ cp avro-mapred-1.7.4-hadoop2.jar $HADOOP_HOME/share/hadoop/common/lib/
$ gedit ~/.bashrc
# AVRO Setup in .bashrc
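The entries added to ~/.bashrc can be sketched as below. The variable name AVRO_HOME is our own choice for readability, and the jar versions should be adjusted to match your installation:

```shell
# Add the Avro jars to the Java classpath
export AVRO_HOME=$HADOOP_HOME/share/hadoop/tools/lib
export CLASSPATH=$CLASSPATH:$AVRO_HOME/avro-tools-1.7.4.jar:$AVRO_HOME/avro-mapred-1.7.4-hadoop2.jar
```

After editing the file, run `source ~/.bashrc` so the changes take effect in the current shell.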
Note: Avro depends on Paranamer and Jackson JSON, but the jar files required for these (paranamer-*.jar, jackson-core-asl-*.jar and jackson-mapper-asl-*.jar) are already included in the $HADOOP_HOME/share/hadoop/tools/lib folder in the Hadoop 2 distribution. If not, we need to download them and place them in this folder.
Available tools in Avro:
cat            Extracts samples from files.
compile        Generates Java code for the given schema.
concat         Concatenates Avro files without re-compressing.
fragtojson     Renders a binary-encoded Avro datum as JSON.
fromjson       Reads JSON records and writes an Avro data file.
fromtext       Imports a text file into an Avro data file.
getmeta        Prints out the metadata of an Avro data file.
getschema      Prints out the schema of an Avro data file.
idl            Generates a JSON schema from an Avro IDL file.
idl2schemata   Extracts JSON schemata of the types from an Avro IDL file.
induce         Induces a schema/protocol from a Java class/interface via reflection.
jsontofrag     Renders a JSON-encoded Avro datum as binary.
random         Creates a file with randomly generated instances of a schema.
recodec        Alters the codec of a data file.
rpcprotocol    Outputs the protocol of an RPC service.
rpcreceive     Opens an RPC server and listens for one message.
rpcsend        Sends a single RPC message.
tether         Runs a tethered MapReduce job.
tojson         Dumps an Avro data file as JSON, one record per line or pretty-printed.
totext         Converts an Avro data file to a text file.
totrevni       Converts an Avro data file to a Trevni file.
trevni_meta    Dumps a Trevni file's metadata as JSON.
trevni_random  Creates a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.
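These tools are invoked through the avro-tools jar. As an illustrative session (the file name employees.avro is hypothetical, and the jar must be present in the current directory or on the path), the following commands print a data file's schema and then dump its records as JSON:

```shell
$ java -jar avro-tools-1.7.4.jar getschema employees.avro
$ java -jar avro-tools-1.7.4.jar tojson employees.avro
```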