Blogs

Apache Kafka Producers

Kafka producers send records, sometimes called messages, to topics. For each record, the producer picks which partition of the topic to send it to. The producer can distribute records round-robin, or it can implement a priority scheme by routing records to particular partitions according to each record’s priority. Generally speaking, producers choose the partition based on the record’s key. The default partitioner for Java hashes the record’s key to choose the partition; if the record has no key, it uses a round-robin strategy (since Kafka 2.4, it instead sticks to one partition per batch).
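
The key-based choice described above can be sketched in a few lines. This is only an illustration, not Kafka's implementation: the class name `SketchPartitioner` is hypothetical, and CRC32 stands in for the hash that the real Java client uses.

```python
import itertools
import zlib


class SketchPartitioner:
    """Illustrative partition chooser; not the Kafka client's code."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # Endless 0, 1, ..., n-1, 0, 1, ... sequence for keyless records.
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key):
        """Return a partition index for a record key (bytes or None)."""
        if key is None:
            # No key: spread records evenly across partitions.
            return next(self._round_robin)
        # Keyed record: the same key always maps to the same partition.
        # CRC32 is an assumption here, chosen only because it is in the stdlib.
        return zlib.crc32(key) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
same_key = [partitioner.partition(b"user-42") for _ in range(3)]
no_key = [partitioner.partition(None) for _ in range(4)]
```

Because a key always hashes to the same partition, all records for one key keep their relative order, which is the usual reason to set a key.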

Continue reading

Flink Windows Explained

Overview: Apache Flink supports analyzing data over specific ranges, called windows. It offers two ways to create windows: by time and by count. A time window covers a specific time range. A count window covers a specific number of events. In addition, a time window has two attributes. size: how long the window itself lasts. interval: how long between the starts of consecutive windows. When the window size equals the interval, the windows are called tumbling windows.
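
To make size and interval concrete, here is a minimal sketch in plain Python rather than the Flink API. The function name `assign_windows` is hypothetical, and it assumes windows start at multiples of the interval and cover the half-open range [start, start + size).

```python
def assign_windows(timestamp, size, interval):
    """Return every (start, end) window that contains the timestamp.

    Assumes windows start at multiples of `interval` and span
    [start, start + size). Plain Python, not the Flink API.
    """
    # Latest window start that is not after the timestamp.
    start = timestamp - (timestamp % interval)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= interval
    return windows


# size == interval: tumbling windows, each event lands in exactly one window.
tumbling = assign_windows(7, size=5, interval=5)
# size > interval: windows overlap, so one event lands in several windows.
overlapping = assign_windows(7, size=10, interval=5)
```

With size equal to interval, the event at time 7 falls only in the window from 5 to 10. With size 10 and interval 5, it falls in both the 5 to 15 and the 0 to 10 windows.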

Continue reading

The Complete SQL Tuning

Most of these practices come from MySQL server, but they apply to other relational databases as well. Avoid full table scans by creating indexes on the columns used in WHERE or ORDER BY clauses. Avoid checking for NULL in the WHERE clause. You can set NULL as the default value when creating tables, but in most cases it is better to declare columns NOT NULL and use a special value, such as 0 or -1, instead.
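
The effect of indexing a filtered column can be checked from the query plan. The sketch below uses SQLite through Python's standard `sqlite3` module as a stand-in for MySQL, so the plan text differs from MySQL's EXPLAIN output. The table `orders`, the column `status`, and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the plan description SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row holds the human-readable detail.
    return " ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# NOT NULL with a default of 0 instead of allowing NULL, as advised above.
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "status INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [(i % 3,) for i in range(1000)],
)

sql = "SELECT id FROM orders WHERE status = 1"
plan_before = query_plan(conn, sql)  # no index yet: full table scan
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")
plan_after = query_plan(conn, sql)   # the index is now available
```

Printing `plan_before` and `plan_after` should show a scan of the whole table turning into a search that names the index.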

Continue reading

Big Data Stack Compare

1. Batch Processing: ETL + ELK. ELK stands for Elasticsearch, Logstash, and Kibana, and is a powerful toolset for real-time log analysis. Performance depends on the amount of RAM in the cluster. If the full index fits in RAM, searches have close to zero latency. This solution also supports storing similar information in one cluster to improve speed. ELK can be hard to maintain as the index grows large, but it scales by adding new nodes.

Continue reading

What Does Big Data Engineer Do?

As big data has matured, data engineering has emerged as a separate but related role that works in concert with data scientists. The big data engineer, or data engineer, is becoming an increasingly important role in big data organizations. The role is much like the ETL developer in data warehousing or the database developer in database development. However, it focuses more on applied big data scenarios.

Continue reading

All About Big Data Interviews

Quite often, we get the chance to go to big data interviews or to interview candidates. Most of the time, we can add some short questions in addition to whiteboard coding. Here, we collect a few areas to focus on during the interview or when preparing for upcoming interviews. Concept 1. What is the reason to use Deque instead of Stack in Java? A Deque used as a stack keeps LIFO order when it is streamed or converted to a list, while Stack does not, because Stack extends Vector and iterates from the bottom element up.

Continue reading

Use Redis Lock for Seckill

What is Seckill? In online shopping, “seckill” refers to newly advertised goods selling out quickly. If you look at the transaction records, you will find that each transaction completes within seconds. It sounds inconceivable, but it is true. This is called “seckill”. A typical seckill system has the following features. * A large number of users shop at the same time during the quick sale, and website traffic increases dramatically.

Continue reading

Apache Kafka Consumers

A Kafka consumer is what we use most often to read data from Kafka. This article explains some key concepts and topics regarding consumer architecture in Kafka. Consumer Groups: We can group consumers into a consumer group by use case or function. One consumer group might be responsible for delivering records to high-speed, in-memory microservices while another consumer group streams those same records to Hadoop.

Continue reading

NoSQL Overview

Overview: NoSQL stands for “Not Only SQL”. Modern computing systems generate a huge amount of data on the network every day. A large part of this data is handled by relational database management systems (RDBMSs). Their mature relational theory foundation makes data modeling and application programming easier. However, with the wave of informatization and the rise of the Internet, traditional RDBMSs have started to experience problems in some particular domains.

Continue reading

Run Hive 1 and 2 Together

Overview: The latest HDP 2.6.x has both Hive version 1 and 2 installed together. However, it does not let users run the Hive version 2 command directly; they can only use beeline. The lab_dev repository here provides a demo VirtualBox image with both Hive versions configured properly. Conf. Changes: The trick to making both Hive versions work is to no longer add any settings to .profile. See below, where I commented out all previous Hive settings.

Continue reading