Siva


Profile photo of Siva

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.


Impala Conditions with Example Impala supports the following conditional functions for testing equality, comparison operators, and nullity: ‘Case’ Example: 1)  If else select case when 20 > 10 then 20 else 15 end; Output:  20 2) If else if select case when 9 > 10 then 20 when 1 > […]

Impala Miscellaneous Functions


PMD5
PMD (Programming Mistake Detector) What is PMD? PMD aka Programming Mistake Detector is Java Source Code Analyzer. It is used to clean erroneous code in our java projects based on predefined set of rules. PMD supports the ability to write custom rules. Issues reported by PMD may not be true […]

PMD (Programming Mistake Detector)


­HBase & Solr – Near Real time indexing and search Requirement: A. HBase Table B. Solr collection on HDFS C. Lily HBase Indexer. D. Morphline Configuration file Once Solr server ready then we are ready to configure our collection (in solr cloud); which will be link to HBase table. Add […]

HBase & Solr Search Integration



Spark RDD
What is an RDD? A Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, […]

Resilient Distributed Dataset


impala-explain
Below are Impala performance tuning options: Pre-execution Checklist    Data types    Partitioning    File Format Data Type Choices      Define integer columns as INT/BIGINT      Operations on INT/BIGINT more efficient than STRING      Convert “external” data to good “internal” types on load      e.g. CAST […]

Impala Best Practices


kafka+storm1
Installing Apache Storm The prerequisite for storm to work on the machine. a. Download and installation commands for ZeroMQ 2.1.7: Run the following commands on terminals [crayon-5a31e45a24b97198189566/] b. Download and installation commands for JZMQ:  [crayon-5a31e45a24b9f395918503/]   2. Download latest storm from http://storm.apache.org/downloads.html  [crayon-5a31e45a24ba2858324996/] Second start Storm Cluster by starting master […]

Apache Storm Integration With Apache Kafka



kafka_arch
While developing Kafka, the main focus was to provide the following:   An API for producers and consumers to support custom implementation   Low overheads for network and storage with message persistence on disk   A high throughput supporting millions of messages for both publishing and subscribing—for example, real-time log […]

Kafka Design



Production issue: when we are trying to write a select query with 8 lacks ids “in condition “. then we got faced below issue,    To solve the above exception, we used distributed calls in Java client as shown below, [crayon-5a31e45a25a6c447048369/] Few Production configurations in cassandra RetryPolicy Three scenarios you can control […]

Cassandra production scenarios/issues



dml_caching-reads_12
Storage engine Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database that uses a B-Tree. Cassandra avoids reading before writing. Read-before-write, especially in a large distributed system, can produce stall in read performance and other problems. Cassandra never re-writes or re-reads existing data, […]

Cassandra write and read process


arc_vnodes_compare
Cassandra is designed in such a way that, there will not be any single point of failure. There is no master- slave architecture in cassandra. cassandra addresses the problem of SPOF by employing a peer-to-peer distributed system across homogeneous nodes where data is distributed among all nodes in the cluster. In cassandra all […]

Cassandra Architecture




Oo
OOZIE NOTES Workflow scheduler to manage hadoop and related jobs Developed first in Banglore by Yahoo DAG(Direct Acyclic Graph) Acyclic means a graph cannot have any loops and action members of the graph provide control dependency. Control dependency means a second job cannot run until a first action is completed […]

Oozie Notes


This post is about some notes on Zookeeper commands and scripts. This is mainly useful for Hadoop Admins and all commands are self explanotry. ZooKeeper is a distributed centralized co-ordination service Zookeeper addresses issues with distributed applications: Maintain configuration information (share config info across all nodes) Naming Service(allows one node […]

Zookeeper Commands