Spark

Spark SQL in Depth

In this article, we'll look in depth at how Spark SQL processes data queries.

Checking Execution Plan

Data Preparing

create database if not exists test;
create table if not exists test.t_name (name string);
insert into test.t_name values ('test1'),('test2'),('test3');

Test Code Preparing

The Scala code below is used for testing. It blocks on standard input at the end, which keeps the application alive so that we can see more details in the Spark Web UI.
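The article's own test code is not part of this excerpt. As a rough illustration only, a harness of the kind described might look like the sketch below. It assumes the test.t_name table from the data preparation step already exists in a Hive metastore that Spark can reach, that the spark-hive module is on the classpath, and that the application runs in local mode. The object name ExplainDemo and the sample query are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.io.StdIn

// Hypothetical harness: prints the plans for one query, then waits on stdin.
object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-explain-demo")
      .master("local[*]")      // assumption: local mode, for experimenting only
      .enableHiveSupport()     // assumption: test.t_name lives in a Hive metastore
      .getOrCreate()

    // Hypothetical query against the table created in the data preparation step.
    val df = spark.sql("select name from test.t_name where name = 'test1'")

    // extended = true prints the parsed, analyzed, and optimized logical plans
    // as well as the physical plan.
    df.explain(true)
    df.show()

    // Block on standard input so the application (and its Web UI) stays alive.
    println("Press Enter to exit ...")
    StdIn.readLine()
    spark.stop()
  }
}
```

While the program waits for input, the Web UI of a local application is normally served on port 4040 of the driver host, where the SQL tab shows the same plans graphically.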

Continue reading

Apache Spark 3.1.1 Released :)

Apache Spark 3.1.1 was released on March 2, 2021. It is a milestone release for Spark in 2021. This version keeps making Spark more efficient and stable. Below are the highlighted new features and changes.

- Python usability
- ANSI SQL compliance
- Query optimization enhancements
- Shuffle hash join improvements
- History Server support of structured streaming

Project Zen

Project Zen was initiated in this release to improve PySpark’s usability in these three ways:

Continue reading

Spark SQL Read/Write HBase

Apache Spark and Apache HBase are very commonly used big data frameworks. In many scenarios, we need to use Spark to query and analyze large volumes of data in HBase. Spark has wide support for reading data as a dataset from many kinds of data sources. To read from HBase, Spark can use HBase's TableInputFormat, which has the following disadvantages.

- There is only one scan triggered in each task to read from HBase
- TableInputFormat does not support BulkGet
- It cannot leverage the optimizations of Spark SQL Catalyst

Considering the points above, there is another choice: the Hortonworks/Cloudera Apache Spark - Apache HBase Connector (SHC for short).

Continue reading

Naive Bayes Algorithm

Background

It would be difficult and practically impossible to classify a web page, a document, an email, or any other lengthy text manually. This is where the Naïve Bayes classifier, a machine learning algorithm, comes to the rescue. A classifier is a function that assigns each element of a population to one of the available categories. For instance, spam filtering is a popular application of the Naïve Bayes algorithm. A spam filter, here, is a classifier that assigns the label Spam or Not Spam to every email.
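To make the idea of a classifier concrete, here is a minimal sketch of a multinomial Naïve Bayes spam filter in plain Scala. It is an illustration under stated assumptions, not the implementation from the full article: the training messages are made-up toy data, the labels "spam" and "ham" (not spam) are arbitrary, and add-one (Laplace) smoothing is one common choice among several.

```scala
object NaiveBayesSketch {
  // Hypothetical toy training data: (label, message text).
  val training: Seq[(String, String)] = Seq(
    "spam" -> "win money now",
    "spam" -> "free money offer",
    "ham"  -> "meeting at noon",
    "ham"  -> "lunch at noon tomorrow"
  )

  // Lower-case the text and split it on non-word characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  final case class Model(
    logPrior: Map[String, Double],             // log P(label)
    wordCounts: Map[String, Map[String, Int]], // per-label word frequencies
    totalWords: Map[String, Int],              // per-label total word count
    vocabSize: Int                             // distinct words over all labels
  )

  def train(data: Seq[(String, String)]): Model = {
    val byLabel = data.groupBy(_._1)
    val logPrior = byLabel.map { case (label, docs) =>
      label -> math.log(docs.size.toDouble / data.size)
    }
    val wordCounts = byLabel.map { case (label, docs) =>
      val words = docs.flatMap { case (_, text) => tokenize(text) }
      label -> words.groupBy(identity).map { case (w, ws) => w -> ws.size }
    }
    val totalWords = wordCounts.map { case (label, counts) =>
      label -> counts.values.sum
    }
    val vocabSize = wordCounts.values.flatMap(_.keys).toSet.size
    Model(logPrior, wordCounts, totalWords, vocabSize)
  }

  // Pick the label with the highest log P(label) + sum of log P(word | label),
  // using add-one smoothing so unseen words do not zero out a label.
  def classify(model: Model, text: String): String = {
    val tokens = tokenize(text)
    model.logPrior.keys.maxBy { label =>
      val counts = model.wordCounts(label)
      val denom = model.totalWords(label) + model.vocabSize
      model.logPrior(label) + tokens.map { w =>
        math.log((counts.getOrElse(w, 0) + 1).toDouble / denom)
      }.sum
    }
  }

  def main(args: Array[String]): Unit = {
    val model = train(training)
    println(classify(model, "free money"))       // prints "spam"
    println(classify(model, "meeting tomorrow")) // prints "ham"
  }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters once messages get longer than these toy examples.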

Continue reading

Hive Get the Max/Min

Most of the time, we need to find the max or min value of particular columns along with the values of other columns. For example, we have the following employee table.

DESC employee;
+---------------+------------------------------+----------+--+
| col_name      | data_type                    | comment  |
+---------------+------------------------------+----------+--+
| name          | string                       |          |
| work_place    | array<string>                |          |
| gender_age    | struct<gender:string,age:int>|          |
| skills_score  | map<string,int>              |          |
| depart_title  | map<string,array<string>>    |          |
+---------------+------------------------------+----------+--+
5 rows selected (0.

Continue reading

Big Data Books Reviews

Learning Spark SQL

Level Ent. Level Mid. Level Adv.

Published in Sep. 2017. Start reading it.

Learning Apache Flink

Level Ent. Level Mid.

There are very few books about Apache Flink. Besides the official documentation, this is a good one for people who want to get to know Flink quickly. This book, published in early 2017, covers most of the core topics for Flink with examples.

Continue reading

Simplify Big Data Streaming

Here is our free training offered during the summer 2017 meetup in Toronto, Canada.

Continue reading

Spark Word Count Tutorial

It is quite common to set up an Apache Spark development environment through an IDE. Since I do not cover IDE setup in much detail in my Spark course, I am here to give detailed steps for developing the well-known Spark word count example using the Scala API in Eclipse.

Environment

- Apache Spark v1.6
- Scala 2.10.4
- Eclipse Scala IDE

Download Software Needed

- Download the proper Scala version and install it
- Download the Eclipse Scala IDE from the link above

Create A Scala Project

Open Scala Eclipse IDE.
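For orientation, the word count program that such a project typically ends up with can be sketched as follows. This is a minimal sketch against the Spark 1.6 RDD API, not the tutorial's own listing: the object name WordCount, the helper countWords, the local master setting, and the in-memory sample lines are all assumptions made to keep the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  // Split each line on whitespace, pair every word with 1, and sum per word.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally inside the IDE, using all available cores.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical sample input; a real job would usually call sc.textFile(path).
    val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))

    countWords(lines)
      .collect()
      .sortBy(_._1)
      .foreach { case (word, count) => println(s"$word\t$count") }

    sc.stop()
  }
}
```

Keeping the transformation in countWords, separate from the SparkContext setup, makes it easy to swap the sample lines for a file on disk or HDFS later.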

Continue reading

One Platform Initiatives for Spark

Early this September, Mike Olson, the Chief Strategy Officer of Cloudera, announced Cloudera's next important initiative, One Platform, which advances the company's investment in Apache Spark. Spark was originally created by a few people who went on to found Databricks. Since then, Spark has caught most of the attention of big data communities and companies with its high-performance in-memory computing framework, which can run on top of Hadoop YARN.

Continue reading