Monthly Archives: February 2016

Spark RDD
What is an RDD? A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, […]
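To make the idea of logical partitions concrete, here is a minimal pure-Python sketch. It does not use Spark: the helper names `partition` and `map_partitions` are hypothetical, and a thread pool stands in for the cluster nodes. It only illustrates splitting an immutable collection into partitions and computing each one independently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def partition(items: Sequence[T], num_partitions: int) -> Tuple[Tuple[T, ...], ...]:
    """Split items into contiguous, immutable chunks (hypothetical helper, not a Spark API)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    size, extra = divmod(len(items), num_partitions)
    chunks = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(tuple(items[start:end]))
        start = end
    return tuple(chunks)


def map_partitions(
    partitions: Sequence[Tuple[T, ...]], fn: Callable[[T], U]
) -> Tuple[Tuple[U, ...], ...]:
    """Apply fn to every element, one partition per worker; returns new tuples."""

    def apply(chunk: Tuple[T, ...]) -> Tuple[U, ...]:
        return tuple(fn(x) for x in chunk)

    # Threads stand in for cluster nodes; the input partitions are never modified.
    with ThreadPoolExecutor() as pool:
        return tuple(pool.map(apply, partitions))


parts = partition(range(10), 3)
squared = map_partitions(parts, lambda x: x * x)
print(parts)
print(squared)
```

Because the partitions are tuples, a transformation always produces a new collection rather than changing the old one, which mirrors the immutability described above.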

Resilient Distributed Dataset

Below are Impala performance tuning options. Pre-execution checklist: data types, partitioning, file format. Data type choices: define integer columns as INT/BIGINT; operations on INT/BIGINT are more efficient than on STRING; convert “external” data to good “internal” types on load, e.g. CAST […]
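The "convert on load" advice can be sketched as follows. This is an illustration only: it uses Python's built-in SQLite as a stand-in for Impala, the table and column names (`raw_events`, `events`, `user_id`, `amount`) are made up, and in Impala the target columns would be declared INT or BIGINT rather than SQLite's INTEGER.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption: not Impala).
conn = sqlite3.connect(":memory:")

# "External" data arrives with every column typed as text.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("101", "25"), ("102", "40"), ("101", "5")],
)

# "Internal" table uses integer columns; values are cast once, at load time.
conn.execute("CREATE TABLE events (user_id INTEGER, amount INTEGER)")
conn.execute(
    "INSERT INTO events "
    "SELECT CAST(user_id AS INTEGER), CAST(amount AS INTEGER) FROM raw_events"
)

# Later queries aggregate and group on integers instead of strings.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()

# Column 2 of PRAGMA table_info is the declared type of each column.
column_types = [row[2] for row in conn.execute("PRAGMA table_info(events)")]
print(totals, column_types)
```

Casting once during the load means later queries do not repeat the string-to-number conversion on every scan.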

Impala Best Practices

Installing Apache Storm. Prerequisites for Storm to work on the machine: a. Download and install ZeroMQ 2.1.7 by running the following commands in a terminal: […] b. Download and install JZMQ: […] 2. Download the latest Storm release from […] Second, start the Storm cluster by starting the master […]

Apache Storm Integration With Apache Kafka

While developing Kafka, the main focus was to provide the following: an API for producers and consumers to support custom implementations; low network and storage overhead, with message persistence on disk; and high throughput supporting millions of messages for both publishing and subscribing, for example, real-time log […]

Kafka Design