Big Data

Apache Airflow Overview

What is Airflow? Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines. It models each workflow as a Directed Acyclic Graph (DAG) of tasks that can be executed independently. The Airflow scheduler executes the tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
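
To make the idea of "following the specified dependencies" concrete, here is a minimal sketch that does not use Airflow itself. It uses Python's standard-library `graphlib` as a stand-in for a scheduler, and the task names (`extract`, `transform`, `validate`, `load`) are hypothetical. Tasks that land in the same batch have no dependency on each other, so they could run independently.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_batches(deps):
    """Group tasks into batches whose dependencies are already satisfied."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


batches = execution_batches(dependencies)
print(batches)  # [['extract'], ['transform', 'validate'], ['load']]
```

A real Airflow DAG declares the same kind of dependencies between operators; the sketch only illustrates the ordering logic, not how Airflow distributes work to its workers.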

Continue reading

The Complete SQL Tuning

Most of these practices come from MySQL Server, but they apply to other relational databases as well. Avoid full table scans by creating indexes on the columns used in WHERE or ORDER BY clauses. Avoid checking for NULL in the WHERE clause. You can set NULL as the default value when creating tables. However, in most cases we should declare columns NOT NULL, or use a special value such as 0 or -1 instead.
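
The effect of an index on a WHERE column can be checked with a small experiment. The sketch below is an assumption-laden illustration: it uses SQLite through Python's built-in `sqlite3` module rather than MySQL, and the `orders` table, its columns, and the index name are hypothetical. The exact wording of the query plan varies between SQLite versions, so the code prints it rather than promising a specific string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table. Columns are NOT NULL with a default of 0,
# so queries never need an IS NULL check on them.
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL DEFAULT 0,
        amount REAL NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


query = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = query_plan(query)  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(query)  # the planner can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

In MySQL the equivalent check is to run `EXPLAIN` on the query before and after creating the index and compare the reported access type.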

Continue reading

Big Data Stack Compare

1. Batch Processing: ETL + ELK. ELK stands for Elasticsearch, Logstash, and Kibana, and is a powerful toolset for real-time log analysis. Performance depends on the amount of RAM available to the cluster. If the full index fits in RAM, searches have close to zero latency. This solution also supports storing similar information in one cluster to improve speed. ELK can be hard to maintain as the index grows large, but it scales by adding new nodes.

Continue reading

What Does Big Data Engineer Do?

As big data has matured, data engineering has emerged as a separate but related role that works in concert with data scientists. The big data engineer, or data engineer, is becoming an increasingly important role in big data organizations. This role is much like the ETL developer role in data warehousing or the database developer role in database development. However, it focuses more on scenarios in applied big data.

Continue reading

All About Big Data Interviews

Quite often, we get the chance to go to big data interviews or to interview candidates. Most of the time, we can add some short questions in addition to the whiteboard coding. Here, we collect a few areas to focus on during an interview or when preparing for upcoming interviews. Concept 1. What's the reason to use Deque instead of Stack in Java? A Deque used as a stack keeps LIFO order when it is streamed or converted to a list, while Stack does not.

Continue reading