Blogs

Flink Windows Explained

Overview Apache Flink supports data analysis over specific ranges by means of windows. It supports two ways to create windows: by time and by count. A time window defines windows by a specific time range. A count window defines windows by a specific number of events. In addition, a window has two attributes. size: how long the window itself lasts. interval: how long between the start of one window and the start of the next. Whenever the window size = interval, the windows are called tumbling windows.
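As a rough illustration of how size and interval relate, the sketch below assigns one event timestamp to its windows in plain Python. It is not Flink's API: the function name `assign_windows`, the integer timestamps, and the alignment of windows to timestamp 0 are all assumptions made for illustration.

```python
def assign_windows(timestamp, size, interval):
    """Return the (start, end) windows that contain the given timestamp.

    Hypothetical helper, not part of Flink. Windows start at multiples of
    `interval` and each one covers the half-open range [start, start + size).
    """
    # Start of the latest window that begins at or before the timestamp.
    start = timestamp - (timestamp % interval)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= interval
    return windows


# size == interval: tumbling windows, the event falls into exactly one window.
tumbling = assign_windows(7, size=5, interval=5)

# size > interval: windows overlap, so the event falls into several of them.
overlapping = assign_windows(7, size=10, interval=5)

print(tumbling)     # [(5, 10)]
print(overlapping)  # [(5, 15), (0, 10)]
```

When size equals interval, the loop runs exactly once, which is why a tumbling window never shares events with its neighbours.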

The Complete SQL Tuning

in article

September 2, 2019

Most of these practices come from MySQL server, but they apply to other relational databases as well. Avoid full table scans and try to create indexes on the columns used after where or order by. Avoid checking for null in the where clause. You can set null as the default value when creating tables. However, in most cases we should use not null columns or a special value, such as 0 or -1, instead.
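The effect of an index on a column used after where can be observed with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in, since it needs no server; the post itself is about MySQL, where the equivalent check would be done with EXPLAIN. The table `orders`, its columns, and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# NOT NULL with a default value instead of nullable columns.
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL DEFAULT 0,"
    " amount INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index the engine has to scan the whole table.
plan_before = query_plan(conn, sql)

# With an index on the filtered column it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql)

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan reports a scan of the table and the second a search using the index.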

Big Data Stack Compare

in article

August 3, 2019

1. Batch Processing ETL + ELK ELK stands for Elasticsearch, Logstash, Kibana and is a powerful tool for real-time log analysis. Performance depends on the amount of RAM in the cluster. If the full index is in RAM, search will have close to zero latency. This solution also supports storing similar information in one cluster to enhance speed. ELK can be hard to maintain if the index grows big, but scaling is achieved by adding new nodes.

What Does Big Data Engineer Do?

in article

July 1, 2019

As big data has matured, data engineering has emerged as a separate but related role that works in concert with data scientists. The big data engineer, or data engineer, has become a more and more important role in big data organizations. This role is quite like the ETL developer role in data warehousing or the database developer role in database development. However, it focuses more on scenarios in applied big data.

All About Big Data Interviews

in article

June 2, 2019

Quite often, we get chances to go for big data interviews or to interview candidates. Most of the time, we can add some short questions in addition to the whiteboard coding. Here, we collect a few areas to focus on during the interview or when preparing for coming interviews. Concept 1. What’s the reason to use Deque instead of Stack in Java? A Deque keeps LIFO order when it is converted to a list through streams, while a Stack does not.

Use Redis Lock for SecKill

in digest

May 20, 2019

What is Seckill? When associated with online shopping, “seckill” refers to the quick sellout of newly advertised goods. If you look at the transaction records, you will find that each transaction is made within seconds. It sounds inconceivable, but it is true. This is called “seckill”. A typical seckill system has the following features. * A large number of users shop at the same time during the quick sale, and the website traffic increases dramatically.

Apache Kafka Consumers

in digest

August 4, 2018

A Kafka consumer is what we use quite often to read data from Kafka. Here, we use this article to explain some key concepts and topics regarding consumer architecture in Kafka. Consumer Groups We can always group consumers into a consumer group by use case or function of the group. One consumer group might be responsible for delivering records to high-speed, in-memory microservices while another consumer group is streaming those same records to Hadoop.
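The idea that two groups read the same records independently can be modelled in a few lines of plain Python. This is a simplified sketch, not the Kafka client API and not Kafka's actual partition assignor: the function `assign_partitions`, the round-robin strategy, and the group and consumer names are assumptions for illustration.

```python
def assign_partitions(partitions, consumers):
    """Spread the partitions over the consumers of one group, round-robin.

    Hypothetical helper. Within a group, each partition goes to exactly
    one consumer, so a record is processed once per group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two independent groups reading the same topic (names are made up).
groups = {
    "microservices": ["ms-1", "ms-2"],
    "hadoop-loader": ["hdfs-1"],
}

# Every group gets its own full assignment of all partitions.
plan = {
    group: assign_partitions(partitions, members)
    for group, members in groups.items()
}

print(plan["microservices"])  # {'ms-1': [0, 2], 'ms-2': [1, 3]}
print(plan["hadoop-loader"])  # {'hdfs-1': [0, 1, 2, 3]}
```

Each group covers all four partitions on its own, which is what lets the Hadoop loader and the microservices consume the same records without interfering with each other.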

NoSQL Overview

in article

June 17, 2018

Overview NoSQL (NoSQL = Not Only SQL) means “not just SQL”. Modern computing systems generate a huge amount of data every day on the network. A large part of this data is handled by relational database management systems (RDBMSs). Their mature relational theory foundation makes data modeling and application programming easier. However, with the wave of informatization and the rise of the Internet, traditional RDBMSs have started to experience problems in some particular domains.

Run Hive 1 and 2 Together

in article

May 30, 2018

Overview The latest HDP 2.6.x has both Hive version 1 and 2 installed together. However, it does not allow users to run Hive version 2 commands directly, only through beeline. The lab_dev repository here provides a demo VirtualBox image with both Hive versions configured properly. Conf. Changes The trick to making both Hive versions work is to not add any settings in the .profile anymore. See below, I comment out all previous hive settings.

HBase Shell Reference

in article

April 28, 2018

We use this place to collect commonly used HBase shell commands for reference. HBase shell is an extensible JRuby-based (JIRB) shell used to execute commands (each command represents one functionality) in HBase. HBase shell commands are mainly categorized into 6 parts as follows. We will keep adding more examples here.

1. General Information

status
Show cluster status. Can be ‘summary’, ‘simple’, or ‘detailed’. The default is ‘summary’.

hbase> status
hbase> status 'simple'
hbase> status 'summary'
hbase> status 'detailed'

version
Output this HBase version.

