Blogs

Hive Get the Max/Min

Most of the time, we need to find the max or min value of a particular column along with other columns. For example, we have the following employee table.

DESC employee;
+---------------+--------------------------------+----------+--+
| col_name      | data_type                      | comment  |
+---------------+--------------------------------+----------+--+
| name          | string                         |          |
| work_place    | array<string>                  |          |
| gender_age    | struct<gender:string,age:int>  |          |
| skills_score  | map<string,int>                |          |
| depart_title  | map<string,array<string>>      |          |
+---------------+--------------------------------+----------+--+
5 rows selected (0.
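The excerpt stops before the technique itself; one common approach (not necessarily the exact one the full post uses) is a window function that ranks the rows, so the max comes back together with the other columns. Below is a minimal sketch issued from Scala through a Hive-enabled SparkSession (assumed available as spark), against the employee table above:

val oldestPerGender = spark.sql(
  """SELECT name, gender, age
    |FROM (
    |  SELECT name, gender_age.gender AS gender, gender_age.age AS age,
    |         RANK() OVER (PARTITION BY gender_age.gender
    |                      ORDER BY gender_age.age DESC) AS rnk
    |  FROM employee
    |) ranked
    |WHERE rnk = 1""".stripMargin)
oldestPerGender.show()  // the max-age employee(s) per gender, name included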

Continue reading

Big Data Books Reviews

Learning Spark SQL (Level: Ent. / Mid. / Adv.) Published in Sep. 2017. Starting to read it. Learning Apache Flink (Level: Ent. / Mid.) There are very few books about Apache Flink. Besides the official documentation, this is a good one for people who want to get to know Flink quickly. The book, published in early 2017, covers most of Flink's core topics with examples.

Continue reading

2017 Winter Release

Summary
Before Christmas, DataFibers completed the winter release, with more than 40 change requests applied. In this release, DataFibers features its first demo combining both data landing and transformation in real time with the new web interface. In addition, the preview version of batch processing (by Spark) is ready.

Details
Below is the list of key changes in this release.
- New web admin UI released, using the ReactJS-based AOR

Continue reading

Hive RowID Generation

Introduction
It is quite common that we need a unique identifier for each row in an Apache Hive table. This is quite useful when you need such a column as a surrogate key in a data warehouse, as a primary key for the data, or as a system natural key. There are the following ways of doing that in Hive.

ROW_NUMBER()
Hive has a couple of internal functions to achieve this. The ROW_NUMBER function can generate a row number for each partition of data.
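As a sketch of the ROW_NUMBER approach just mentioned (not necessarily the post's exact statement), the same query can be issued from Scala through a Hive-enabled SparkSession, assumed available as spark; the ORDER BY column is an arbitrary choice for illustration:

val numbered = spark.sql(
  """SELECT name, work_place,
    |       ROW_NUMBER() OVER (ORDER BY name) AS row_id
    |FROM employee""".stripMargin)
numbered.show()  // each row gets a sequential row_id; add PARTITION BY to
                 // restart numbering within each partition, as the post notes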

Continue reading

GIT Tips

1. Git Cheat Sheet 2. Check in Git Modified But Untracked Content Recently, I migrated this site to Hexo. I downloaded the theme from GitHub into the Hexo project folder. I also keep the source code on GitHub in case I lose it. However, when I ran git add . and git status, it showed a message saying the theme folder contains modified but untracked content. Most of the time, I did not check git status - a bad habit.

Continue reading

Apache Kafka Overview

Big data processing started with a focus on batch processing. Distributed data storage and querying tools like MapReduce, Hive, and Pig were all designed to process data in batches rather than continuously. Recently, enterprises have discovered the power of analyzing and processing data and events as they happen, instead of in batches. Most traditional messaging systems, such as RabbitMQ, neither scale to handle big data in real time nor integrate well with the big data ecosystem.

Continue reading

Scala Apply Method

The apply method in Scala provides a nice piece of syntactic sugar. It allows us to define Java-array-like access semantics for an arbitrary class. For example, we create a RiceCooker class with a cook method to cook rice. Whenever we need to cook rice, we can call this method.

class RiceCooker {
  def cook(cup_of_rice: Rice) = {
    cup_of_rice.isDone = true
    cup_of_rice
  }
}

val my_rice_cooker: RiceCooker = new RiceCooker()
my_rice_cooker.
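The excerpt cuts off here; below is a minimal sketch of where it is heading, with apply delegating to cook so the cooker can be called like a function (the Rice class is assumed from the excerpt's isDone field):

class Rice {
  var isDone: Boolean = false
}

class RiceCooker {
  def cook(cup_of_rice: Rice): Rice = {
    cup_of_rice.isDone = true
    cup_of_rice
  }
  // The syntactic sugar: my_rice_cooker(rice) expands to my_rice_cooker.apply(rice)
  def apply(cup_of_rice: Rice): Rice = cook(cup_of_rice)
}

val my_rice_cooker: RiceCooker = new RiceCooker()
val dinner = my_rice_cooker(new Rice())  // calls apply, which cooks the rice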

Continue reading

2017 Summer Release

Summary
A little bit late, but DataFibers has completed the 2017 summer release at about the right time. In this release, we have applied 30+ change requests. DataFibers now features a preview of the new web interface. In addition, a couple of connectors were added/updated in preparation for the later demo.

Details
Below is the list of key changes in this release.
- Support Flink Table API and SQL API
- Support Flink upgrade to v1.

Continue reading

Simplify Big Data Streaming

Here is our free training offered during the 2017 summer meetup in Toronto, Canada.

Continue reading

Spark Word Count Tutorial

It is quite common to set up an Apache Spark development environment through an IDE. Since I do not cover much IDE setup detail in my Spark course, here are detailed steps for developing the well-known Spark word count example using the Scala API in Eclipse.

Environment
- Apache Spark v1.6
- Scala 2.10.4
- Eclipse Scala IDE

Download Software Needed
- Download the proper Scala version and install it
- Download the Eclipse Scala IDE from the link above

Create a Scala Project
Open Scala Eclipse IDE.
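For reference, here is a minimal sketch of the word count program the tutorial builds toward, written against the Spark 1.6 RDD API listed above; the input path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process for development; change for a cluster
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("input.txt")            // placeholder input path
      .flatMap(_.split("\\s+"))         // split each line into words
      .map(word => (word, 1))           // pair each word with a count of 1
      .reduceByKey(_ + _)               // sum the counts per word
      .collect()
      .foreach(println)

    sc.stop()
  }
}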

Continue reading