Spark

Spark SQL in Depth

In this article, we’ll look in depth at how Spark SQL processes data queries.

Checking Execution Plan

Data Preparation

create database if not exists test;
create table if not exists test.t_name (name string);
insert into test.t_name values ('test1'),('test2'),('test3');

Test Code Preparation

The Scala code below is used for testing. It blocks on standard input at the end, which keeps the application running so that we can see more details in the Spark Web UI.
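The original test listing is not part of this excerpt. As a rough sketch of what such test code might look like, the Scala program below prepares the same data, prints the extended execution plan, and then waits on standard input. The object name, the sample query, the `using parquet` clause, and the temporary warehouse directory are assumptions made so the sketch runs in local mode without a Hive metastore; they are not taken from the article.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.io.StdIn

// Hypothetical sketch, not the article's original listing.
object SparkSqlPlanSketch {

  // Local session with a fresh temporary warehouse directory (assumption),
  // so repeated runs do not collide with leftover table files.
  def newSession(): SparkSession =
    SparkSession.builder()
      .appName("spark-sql-plan-sketch")
      .master("local[*]")
      .config("spark.sql.warehouse.dir",
        Files.createTempDirectory("spark-warehouse").toString)
      .getOrCreate()

  // Same data as the SQL statements above; "using parquet" is added
  // (assumption) so that no Hive support is required.
  def prepareData(spark: SparkSession): Unit = {
    spark.sql("create database if not exists test")
    spark.sql("create table if not exists test.t_name (name string) using parquet")
    spark.sql("insert into test.t_name values ('test1'),('test2'),('test3')")
  }

  // A sample query chosen for illustration only.
  def sampleQuery(spark: SparkSession): DataFrame =
    spark.sql("select name from test.t_name where name = 'test1'")

  def main(args: Array[String]): Unit = {
    val spark = newSession()
    prepareData(spark)

    val df = sampleQuery(spark)
    df.explain(true) // parsed, analyzed, optimized, and physical plans
    df.show()

    // Block here so the application stays alive and the Spark Web UI
    // (port 4040 by default) can be inspected.
    println("Press Enter to stop the application ...")
    StdIn.readLine()
    spark.stop()
  }
}
```

While the program waits for input, the SQL tab of the Web UI shows the executed query together with its plan.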

Apache Spark 3.1.1 Released :)

in digest

March 10, 2021

Apache Spark 3.1.1 was released on March 2, 2021. It is a milestone release for Spark in 2021. This version continues to make Spark more efficient and stable. Below are the highlighted new features and changes:

- Python usability
- ANSI SQL compliance
- Query optimization enhancements
- Shuffle hash join improvements
- History Server support of structured streaming

Project Zen

Project Zen was initiated in this release to improve PySpark’s usability in these three ways:

Spark SQL Read/Write HBase

in digest

January 1, 2020

Apache Spark and Apache HBase are very commonly used big data frameworks. In many scenarios, we need to use Spark to query and analyze large volumes of data in HBase. Spark has wide support for reading data as a dataset from many kinds of data sources. To read from HBase, Spark can use HBase's TableInputFormat, which has the following disadvantages:

- Only one scan is triggered in each task to read from HBase
- TableInputFormat does not support BulkGet
- It cannot leverage the optimizations of the Spark SQL Catalyst optimizer

Considering the points above, another choice is the Hortonworks/Cloudera Apache Spark - Apache HBase Connector (SHC for short).

Naive Bayes Algorithm

in digest

March 10, 2018

Background

It would be difficult and practically impossible to classify a web page, a document, an email, or any other lengthy text manually. This is where the Naïve Bayes classifier, a machine learning algorithm, comes to the rescue. A classifier is a function that assigns each element of a population to one of the available categories. For instance, spam filtering is a popular application of the Naïve Bayes algorithm. A spam filter is a classifier that assigns the label Spam or Not Spam to each email.
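To make the spam filter idea concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier in plain Scala. The add-one (Laplace) smoothing, the tokenizer, and the tiny training set are assumptions chosen for illustration; this is not the implementation from the article.

```scala
// Hypothetical sketch of a multinomial Naive Bayes text classifier.
object NaiveBayesSketch {

  // Lower-case the text and split on anything that is not a letter or digit.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  final case class Model(
    logPrior: Map[String, Double],             // log P(label)
    wordCounts: Map[String, Map[String, Int]], // per-label word counts
    totalWords: Map[String, Int],              // per-label total word count
    vocabSize: Int)                            // distinct words over all labels

  // docs are (label, text) pairs.
  def train(docs: Seq[(String, String)]): Model = {
    val byLabel = docs.groupBy(_._1)
    val logPrior = byLabel.map { case (label, ds) =>
      label -> math.log(ds.size.toDouble / docs.size)
    }
    val wordCounts = byLabel.map { case (label, ds) =>
      label -> ds.flatMap(d => tokenize(d._2))
        .groupBy(identity)
        .map { case (word, occurrences) => word -> occurrences.size }
    }
    val totalWords = wordCounts.map { case (label, wc) => label -> wc.values.sum }
    val vocabSize = wordCounts.values.flatMap(_.keys).toSet.size
    Model(logPrior, wordCounts, totalWords, vocabSize)
  }

  // Pick the label with the highest log posterior, using add-one smoothing
  // so that unseen words do not produce a zero probability.
  def classify(model: Model, text: String): String = {
    val tokens = tokenize(text)
    model.logPrior.keys.maxBy { label =>
      val counts = model.wordCounts(label)
      val denominator = (model.totalWords(label) + model.vocabSize).toDouble
      val logLikelihood =
        tokens.map(t => math.log((counts.getOrElse(t, 0) + 1) / denominator)).sum
      model.logPrior(label) + logLikelihood
    }
  }

  // Made-up training emails, for illustration only.
  val trainingDocs: Seq[(String, String)] = Seq(
    "Spam"     -> "win money now",
    "Spam"     -> "cheap money offer",
    "Not Spam" -> "meeting at noon",
    "Not Spam" -> "project status meeting")

  def main(args: Array[String]): Unit = {
    val model = train(trainingDocs)
    println(classify(model, "win cheap money"))
    println(classify(model, "status meeting"))
  }
}
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.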

Hive Get the Max/Min

in article

February 2, 2018

Big Data Books Reviews

in review

January 10, 2018

Learning Spark SQL

Level: Ent., Mid., Adv.

Published in Sep. 2017. Start reading it.

Learning Apache Flink

Level: Ent., Mid.

There are very few books about Apache Flink. Besides the official documentation, this is a good one for people who want to get to know Flink quickly. This book, published in early 2017, covers most of the core topics of Flink with examples.

Simplify Big Data Streaming

in training

July 20, 2017

Here is the free training we offered during the summer 2017 meetup in Toronto, Canada.

Spark Word Count Tutorial

in tutorial

July 1, 2017

It is quite common to set up an Apache Spark development environment in an IDE. Since I do not cover IDE setup in much detail in my Spark course, here are detailed steps for developing the well-known Spark word count example using the Scala API in Eclipse.

Environment

- Apache Spark v1.6
- Scala 2.10.4
- Eclipse Scala IDE

Download Software Needed

- Download the proper Scala version and install it
- Download the Eclipse Scala IDE from the link above

Create A Scala Project

Open the Scala Eclipse IDE.
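For orientation, the word count program that such a project typically ends up with could look like the sketch below. It uses the RDD API that is available in Spark 1.6; the object name, the in-memory sample lines, and the local master setting are assumptions for illustration and do not come from the tutorial.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of the classic word count using the RDD API.
object WordCountSketch {

  // Split each line on whitespace, pair every word with 1, and sum per word.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside the IDE process (assumption for this sketch).
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample input for illustration; a real job would usually call sc.textFile.
      val counts = countWords(sc, Seq("to be or not to be", "that is the question"))
      counts.toSeq
        .sortBy { case (_, count) => -count }
        .foreach { case (word, count) => println(s"$word\t$count") }
    } finally {
      sc.stop()
    }
  }
}
```

Running `main` from the IDE requires the Spark core library that matches the project's Scala version to be on the build path.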

One Platform Initiatives for Spark

in article

June 24, 2017

Early this September, Mike Olson, the Chief Strategy Officer of Cloudera, announced Cloudera's next important initiative, One Platform, to advance its investment in Apache Spark. Spark was originally invented by a few people who went on to found Databricks. Later, Spark caught the attention of big data communities and companies with its high-performance in-memory computing framework, which can run on top of Hadoop YARN.
