
Hive Get the Max/Min

Most of the time, we need to find the max or min value of a particular column along with the values of other columns. For example, suppose we have the following employee table.

DESC employee;
+---------------+-------------------------------+----------+--+
|   col_name    |           data_type           | comment  |
+---------------+-------------------------------+----------+--+
| name          | string                        |          |
| work_place    | array<string>                 |          |
| gender_age    | struct<gender:string,age:int> |          |
| skills_score  | map<string,int>               |          |
| depart_title  | map<string,array<string>>     |          |
+---------------+-------------------------------+----------+--+
5 rows selected (0.
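The full article works in HiveQL; as a quick language-neutral sketch of the underlying problem (the employee rows below are invented for illustration), note how Scala's maxBy/minBy keep the whole row together with the extreme value, which is exactly the "max value plus the other columns" question:

```scala
// Hypothetical in-memory analogue of the employee table.
// maxBy/minBy return the entire row carrying the extreme value,
// not just the value itself.
case class Employee(name: String, age: Int)

val employees = List(
  Employee("Michael", 35),
  Employee("Will", 30),
  Employee("Lucy", 27)
)

val oldest = employees.maxBy(_.age)   // whole row, not just the max age
val youngest = employees.minBy(_.age)

println(s"${oldest.name} is the oldest at ${oldest.age}")
```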

Continue reading

Hive RowID Generation

Introduction It is quite often that we need a unique identifier for each single row in an Apache Hive table. This is quite useful when you need such columns as surrogate keys in a data warehouse, as primary keys for the data, or as system natural keys. There are several ways of doing that in Hive. ROW_NUMBER() Hive has a couple of internal functions to achieve this, such as the ROW_NUMBER function, which can generate a row number for each partition of data.
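As a rough sketch of what ROW_NUMBER does per partition (plain Scala standing in for Hive here; the table and column names are made up), the idea is to group rows into partitions, order each partition, and number its rows from 1:

```scala
// Plain-Scala sketch of ROW_NUMBER() OVER (PARTITION BY dept ORDER BY name).
case class Row(dept: String, name: String)

val rows = List(
  Row("HR", "Lucy"), Row("IT", "Will"), Row("HR", "Michael"), Row("IT", "Adam")
)

// Group into partitions, sort within each, then assign 1-based numbers.
val numbered: Map[String, List[(Row, Int)]] =
  rows.groupBy(_.dept).map { case (dept, group) =>
    dept -> group.sortBy(_.name).zipWithIndex.map { case (r, i) => (r, i + 1) }
  }

numbered("HR").foreach { case (r, n) => println(s"$n ${r.name}") }
```

Note that this only produces a unique identifier within a partition; a surrogate key for the whole table also needs the partition identity folded in.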

Continue reading

GIT Tips

1. Git Cheat Sheet 2. Check in Git Modified But Untracked Content Recently, I migrated this site to Hexo. I downloaded the theme from GitHub into the Hexo project folder. I also keep the source code on GitHub in case I lose it locally. However, when I ran git add . and git status, git showed messages saying the theme folder's content is not tracked. Most of the time, I did not check the git status - a bad habit.

Continue reading

Apache Kafka Overview

Big data processing started with a focus on batch processing. Distributed data storage and querying tools like MapReduce, Hive, and Pig were all designed to process data in batches rather than continuously. More recently, enterprises have discovered the power of analyzing and processing data and events as they happen, instead of in batches. Most traditional messaging systems, such as RabbitMQ, neither scale to handle big data in real time nor integrate well with the big data ecosystem.

Continue reading

Scala Apply Method

The apply method in Scala is a nice piece of syntactic sugar. It allows us to define semantics like Java array access for an arbitrary class. For example, we create a RiceCooker class with a cook method to cook rice. Whenever we need to cook rice, we can call this method.

class RiceCooker {
  def cook(cup_of_rice: Rice) = {
    cup_of_rice.isDone = true
    cup_of_rice
  }
}

val my_rice_cooker: RiceCooker = new RiceCooker()
my_rice_cooker.
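A minimal self-contained sketch of the sugar described above (the RiceCooker and cook names follow the excerpt; the Rice class and the apply wiring are assumptions about where the article goes):

```scala
class Rice { var isDone: Boolean = false }

class RiceCooker {
  def cook(cupOfRice: Rice): Rice = {
    cupOfRice.isDone = true
    cupOfRice
  }
  // apply lets an instance be "called" like a function: cooker(rice)
  def apply(cupOfRice: Rice): Rice = cook(cupOfRice)
}

val cooker = new RiceCooker()
val rice = new Rice
cooker(rice)          // desugars to cooker.apply(rice)
println(rice.isDone)  // the rice is now cooked
```

This is the same mechanism that makes Array and Map element access (arr(0), map(key)) look like built-in syntax.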

Continue reading

One Platform Initiatives for Spark

In early September this year, Cloudera's Chief Strategy Officer Mike Olson announced Cloudera's next important initiative - One Platform - to advance their investment in Apache Spark. Spark was originally created by the people who went on to found Databricks. Later, Spark caught the attention of big data communities and companies with its high-performance in-memory computing framework, which can run on top of Hadoop YARN.

Continue reading

Constructor - Scala vs. Java

1. Constructor With Parameters

Java Code

public class Foo {
    public Bar bar;
    public Foo(Bar bar) {
        this.bar = bar;
    }
}

Scala Code

class Foo(val bar: Bar)

2. Constructor With Private Attribute

Java Code

public class Foo {
    private final Bar bar;
    public Foo(Bar bar) {
        this.bar = bar;
    }
}

Scala Code

class Foo(private val bar: Bar)

3. Call Super Constructor

Java Code

public class Foo extends SuperFoo {
    public Foo(Bar bar) {
        super(bar);
    }
}

Scala Code
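The three patterns can be checked in one compilable Scala sketch (SuperFoo and Bar are stub classes invented here; in Scala the super constructor is called in the extends clause):

```scala
// Stubs standing in for the Bar and SuperFoo types from the comparison.
class Bar
class SuperFoo(val bar: Bar)

// 1. Constructor with a public parameter
class Foo1(val bar: Bar)

// 2. Constructor with a private attribute
class Foo2(private val bar: Bar)

// 3. The super constructor is invoked in the extends clause
class Foo3(bar: Bar) extends SuperFoo(bar)

val b = new Bar
println(new Foo3(b).bar eq b)  // the same Bar instance reached SuperFoo
```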

Continue reading

When to Disable Speculative Execution

Backgrounds This is the link from Wikipedia about what Speculative Execution is. In Hadoop, the following parameters control this setting: mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution. They are true by default. When to Disable Most of the time, it helps. However, I am here to collect some scenarios where we do not need it. Of course, whenever your cluster is really short of resources, or for the purpose of an experiment, we can disable them by setting them to false, since speculative execution is a big resource consumer. It is generally advisable to turn off speculative execution for mapred jobs that use HBase as a source.
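Using the two property names from the excerpt, a minimal sketch of disabling both in a Hadoop configuration file (these are the old-style pre-YARN names; where exactly you set them, e.g. mapred-site.xml or per job, depends on your setup):

```xml
<!-- Disable speculative execution for both map and reduce tasks. -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```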

Continue reading