Article

Apache Airflow Overview

What is Airflow? Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines. It composes a Directed Acyclic Graph (DAG) of multiple tasks that can be executed independently. The Airflow scheduler executes the tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
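
To make the idea of "following the specified dependencies" concrete, here is a minimal sketch in plain Python. It does not use Airflow's API; the task names and the `execution_order` helper are hypothetical, and the standard-library `graphlib` module stands in for the scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# "clean" and "validate" are independent of each other, so a real scheduler
# could hand them to different workers at the same time.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Any order the sketch prints starts with `extract` and ends with `load`; a cycle in `dependencies` would raise `graphlib.CycleError`, which is why the graph has to be acyclic.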

Flink Windows Explained

Overview: Apache Flink supports data analysis over specific ranges in terms of windows. It supports two ways to create windows: by time and by count. A time window is defined by a specific time range. A count window is defined by a specific number of events. In addition, a window has two attributes. size: how long the window itself lasts. interval: how long between the starts of consecutive windows. When the window size equals the interval, the window is called a tumbling window.
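
The relationship between size and interval can be sketched in plain Python. This is not Flink's API; `tumbling_window` and `sliding_windows` are hypothetical helpers that only show which windows an event timestamp falls into, assuming integer timestamps and windows aligned to multiples of the interval. The overlapping case (interval smaller than size) is included for contrast and is not described in the text above.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window containing ts when size == interval."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, interval):
    """Return every [start, end) window containing ts.

    Window starts are multiples of `interval`; each window lasts `size`.
    When interval < size the windows overlap, so one event can belong
    to several windows. Starts may be negative for small timestamps.
    """
    last_start = ts - ts % interval
    starts = range(last_start, ts - size, -interval)
    return sorted((s, s + size) for s in starts)


print(tumbling_window(7, 5))      # size == interval
print(sliding_windows(7, 10, 5))  # interval < size, windows overlap
print(sliding_windows(7, 5, 5))   # size == interval again
```

With size equal to interval, `sliding_windows` returns exactly one window, the same one `tumbling_window` returns, which is the tumbling case.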

The Complete SQL Tuning

Most of these practices come from MySQL server, but they apply to other relational databases as well. Avoid full table scans and try to create indexes on the columns used in WHERE or ORDER BY clauses. Avoid checking for NULL in the WHERE clause. You can set NULL as the default value when creating tables. However, in most cases we should declare columns NOT NULL or use a special value, such as 0 or -1, instead.
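
The effect of an index on a WHERE column can be checked from the query plan. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs anywhere; the `orders` table, its columns, and the `query_plan` helper are hypothetical, and MySQL's own `EXPLAIN` output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns are NOT NULL with a default of 0 instead of allowing NULL.
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL DEFAULT 0,"
    " amount REAL NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)


def query_plan(connection, sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the detail text


query = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = query_plan(conn, query)  # no index yet: full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, query)   # index on the WHERE column is used

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of the table and the second names `idx_orders_customer`.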

Big Data Stack Compare

1. Batch Processing. ETL + ELK: ELK stands for Elasticsearch, Logstash, and Kibana, and is a powerful tool for real-time log analysis. Performance depends on the amount of RAM in the cluster. If the full index is in RAM, search will have close to zero latency. This solution also supports storing similar information in one cluster to enhance speed. ELK can be hard to maintain once the index grows large, but scaling is achieved by adding new nodes.

What Does Big Data Engineer Do?

As big data has matured, data engineering has emerged as a separate but related role that works in concert with data scientists. The big data engineer, or data engineer, is becoming an increasingly important role in big data organizations. This role is quite like the ETL developer role in data warehousing or the database developer role in database development. However, it focuses more on scenarios in Applied Big Data.

All About Big Data Interviews

Quite often, we get the chance to attend big data interviews or to interview candidates. Most of the time, we can add some short questions in addition to the whiteboard coding. Here, we collect a few areas to focus on during the interview or when preparing for upcoming interviews. Concept 1. What's the reason to use Deque instead of Stack in Java? When a Deque is converted to a list through streams, the LIFO order is kept, while with Stack it is not.

NoSQL Overview

Overview: NoSQL (Not Only SQL) means “not just SQL”. Modern computing systems generate a huge amount of data on the network every day. A large part of this data is handled by relational database management systems (RDBMSs). Their mature relational theory foundation makes data modeling and application programming easier. However, with the wave of informatization and the rise of the Internet, traditional RDBMSs have started to run into problems in some particular domains.

Run Hive 1 and 2 Together

Overview: The latest HDP 2.6.x has both Hive version 1 and version 2 installed together. However, it does not allow users to run the Hive version 2 command directly, only through beeline. The lab_dev repository here provides a demo VirtualBox image with both Hive versions configured properly. Conf. Changes: The trick to making both Hive versions work is to no longer add any settings in .profile. See below, where I commented out all previous Hive settings.

HBase Shell Reference

We use this page to collect commonly used HBase shell commands for reference. The HBase shell is an extensible JRuby-based (JIRB) shell for executing commands in HBase, where each command represents one piece of functionality. HBase shell commands fall into six main categories, as follows. More examples will be added over time.

1. General Information

status
Show cluster status. Can be ‘summary’, ‘simple’, or ‘detailed’. The default is ‘summary’.

    hbase> status
    hbase> status 'simple'
    hbase> status 'summary'
    hbase> status 'detailed'

version
Output this HBase version.

ML Overview

Background: Machine learning is a field of computer science that gives computer systems the ability to “learn” from data without being explicitly programmed. Machine learning can be broken down into three broad categories: Recommender, Classification, and Clustering. Recommender: Recommender systems suggest items based on past behavior or interest. These items can be other users in a social network, or products and services on retail websites. Common similarity measures include Pearson correlation and Euclidean distance.
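
As a rough illustration of the two measures named above, the sketch below compares rating vectors in plain Python. The `pearson` and `euclidean` helpers and the sample ratings are hypothetical; a real recommender would typically rely on a numerical library and handle items that only one user has rated.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between two equal-length rating vectors.

    Returns 0.0 when either vector has no variance, since the
    correlation is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


# Hypothetical ratings by two users for the same four items.
alice = [5, 3, 4, 1]
bob = [4, 2, 3, 0]  # same taste as alice, but rates everything one point lower

print("pearson:  ", pearson(alice, bob))
print("euclidean:", euclidean(alice, bob))
```

The two measures can disagree: here the users' ratings move together exactly, so the correlation is at its maximum, while the Euclidean distance is still nonzero because one user rates consistently lower.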
