Blogs

HBase Shell Reference

We use this page to collect commonly used HBase shell commands for reference. The HBase shell is an extensible, JRuby-based (JIRB) shell for running commands against HBase; each command represents one piece of functionality. HBase shell commands fall into six main categories, as follows. More examples will be added here over time.

1. General Information

status
Show cluster status. Can be 'summary', 'simple', or 'detailed'. The default is 'summary'.

hbase> status
hbase> status 'simple'
hbase> status 'summary'
hbase> status 'detailed'

version
Output this HBase version.

Continue reading

Naive Bayes Algorithm

Background

It would be difficult, and in practice impossible, to manually classify every web page, document, email, or other lengthy text. This is where the Naïve Bayes classifier, a machine learning algorithm, comes to the rescue. A classifier is a function that assigns each element of a population to one of the available categories. For instance, spam filtering is a popular application of the Naïve Bayes algorithm: the spam filter is a classifier that assigns the label Spam or Not Spam to each email.
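As a rough illustration of the spam filter idea, here is a minimal sketch of a multinomial Naïve Bayes classifier in plain Python. The training messages, the function names train and classify, and the use of Laplace (add-one) smoothing are assumptions made for this example; they are not taken from the post.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        # Start from the log prior P(label).
        score = math.log(count / total)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, for illustration only.
training = [
    ("win cash prize now", "Spam"),
    ("cheap prize offer click now", "Spam"),
    ("meeting agenda for monday", "Not Spam"),
    ("lunch with the project team", "Not Spam"),
]
model = train(training)
print(classify(model, "claim your cash prize"))
print(classify(model, "agenda for the team meeting"))
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log likelihoods can simply be added.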

Continue reading

ML Overview

Background

Machine learning is a field of computer science that gives computer systems the ability to "learn" from data without being explicitly programmed. Machine learning can be broken down into three broad categories: Recommender, Classification, and Clustering.

Recommender: Recommender systems suggest items based on past behavior or interests. These items can be other users in a social network, or products and services on retail websites. Algorithms used here include measures such as Pearson correlation and Euclidean distance.
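To make the two measures concrete, the sketch below computes Euclidean distance and Pearson correlation between users' rating vectors. The users and their ratings are hypothetical, and this shows only the similarity step, not a full recommender.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Correlation is undefined for a constant vector; treat as no signal.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Hypothetical ratings of the same four items by three users.
alice = [5, 3, 4, 1]
bob = [4, 2, 3, 0]    # same taste as alice, but rates everything one point lower
carol = [1, 4, 2, 5]  # roughly the opposite taste

print(euclidean_distance(alice, bob))
print(pearson_correlation(alice, bob))
print(pearson_correlation(alice, carol))
```

Pearson correlation ignores a constant offset between two users' ratings, while Euclidean distance does not, so the two measures can rank the same neighbors differently.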

Continue reading

Hive Get the Max/Min

Most of the time, we need to find the max or min value of a particular column together with other columns. For example, we have the following employee table.

DESC employee;
+---------------+------------------------------+----------+--+
| col_name      | data_type                    | comment  |
+---------------+------------------------------+----------+--+
| name          | string                       |          |
| work_place    | array<string>                |          |
| gender_age    | struct<gender:string,age:int>|          |
| skills_score  | map<string,int>              |          |
| depart_title  | map<string,array<string>>    |          |
+---------------+------------------------------+----------+--+
5 rows selected (0.

Continue reading

Big Data Books Reviews

Learning Spark SQL
Level Ent. Level Mid. Level Adv.
Published in Sep. 2017. Start reading it.

Learning Apache Flink
Level Ent. Level Mid.
There are very few books about Apache Flink. Besides the official documentation, this is a good one for people who want to get to know Flink quickly. This book, published in early 2017, covers most of the core Flink topics with examples.

Continue reading

2017 Winter Release

Summary

Before Christmas, DataFibers completed its winter release, which includes more than 40 change requests. This release features the first demo that combines data landing and transformation in real time, with a new web interface. In addition, a preview version of batch processing (using Spark) is ready.

Details

Below is the list of key changes in this release. New web admin UI released, built with the ReactJS-based AOR.

Continue reading

Hive RowID Generation

Introduction

We often need a unique identifier for each row in an Apache Hive table. This is useful when you need such a column as a surrogate key in a data warehouse, as the primary key for the data, or as a system natural key. There are the following ways of doing that in Hive.

ROW_NUMBER()

Hive has a couple of built-in functions to achieve this. One is the ROW_NUMBER function, which generates a row number within each partition of the data.
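To show what per-partition numbering means, here is a small Python sketch that mimics the effect of ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) on an in-memory list of rows. It is an illustration of the semantics only, not how Hive implements it; the row_number helper and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def row_number(rows, partition_key, order_key):
    """Number rows 1..n within each partition, ordered by order_key."""
    # Sort by partition first so groupby sees each partition as one run.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    numbered = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        # Numbering restarts at 1 for every partition.
        for n, row in enumerate(group, start=1):
            numbered.append({**row, "row_num": n})
    return numbered


# Hypothetical sample rows, for illustration only.
rows = [
    {"dept": "sales", "name": "Cara"},
    {"dept": "it", "name": "Bob"},
    {"dept": "sales", "name": "Abe"},
    {"dept": "it", "name": "Dee"},
]
result = row_number(rows, partition_key="dept", order_key="name")
for row in result:
    print(row["dept"], row["name"], row["row_num"])
```

Because the numbering restarts in every partition, the row number alone is unique only within a partition; a table-wide identifier would need the partition key as well, or a single partition covering all rows.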

Continue reading

GIT Tips

1. Git Cheat Sheet

2. Check in Git Modified But Untracked Content

Recently, I migrated this site to Hexo. I downloaded the theme from GitHub into the Hexo project folder. I also keep the source code on GitHub in case I lose it. However, when I ran git add . and git status, it showed error messages saying that the theme folder was untracked content. Most of the time I did not check git status, which is a bad habit.

Continue reading

Apache Kafka Overview

Big data processing started out focused on batch processing. Distributed data storage and querying tools such as MapReduce, Hive, and Pig were all designed to process data in batches rather than continuously. More recently, enterprises have discovered the power of analyzing and processing data and events as they happen rather than in batches. Most traditional messaging systems, such as RabbitMQ, neither scale to handle big data in real time nor integrate well with the big data ecosystem.

Continue reading

Scala Apply Method

The apply method in Scala provides a nice piece of syntactic sugar. It lets us define semantics like Java array access for an arbitrary class. For example, we create a class RiceCooker with a method cook to cook rice. Whenever we need to cook rice, we can call this method.

class RiceCooker {
  def cook(cup_of_rice: Rice) = {
    cup_of_rice.isDone = true
    cup_of_rice
  }
}

val my_rice_cooker: RiceCooker = new RiceCooker()
my_rice_cooker.

Continue reading