Blogs

Big Data Stack Compare

1. Batch Processing ETL + ELK. ELK stands for Elasticsearch, Logstash, and Kibana, and is a powerful stack for real-time log analysis. Performance depends on the amount of RAM available to the cluster: if the full index fits in RAM, search latency is close to zero. Storing related data together in one cluster can also improve speed. ELK can be hard to maintain as the index grows large, but it scales by adding new nodes.

What Does a Big Data Engineer Do?

As big data has matured, data engineering has emerged as a separate but related role that works in concert with data scientists. The big data engineer, or data engineer, is becoming an increasingly important role in big data organizations. This role is much like the ETL developer role in data warehousing or the database developer role in database development. However, it focuses more on applied big data scenarios.

All About Big Data Interviews

Quite often, we get the chance to go for big data interviews or to interview candidates. Most of the time, we can add some short questions in addition to the whiteboard coding. Here, we collect a few areas to focus on during an interview or when preparing for upcoming interviews. Concept 1. What is the reason to use Deque instead of Stack in Java? When used as a stack, a Deque can be streamed and converted to a list with LIFO order preserved, while Stack iterates in insertion order and so does not.

Use Redis Lock for SecKill

What is Seckill? In online shopping, “seckill” refers to the rapid sellout of newly advertised goods. If you look at the transaction records, you will find that each transaction is completed within seconds. It sounds inconceivable, but it is true. This is called “seckill”. A typical seckill system has the following features. * A large number of users shop at the same time during the sale, and website traffic increases dramatically.

Apache Kafka Consumers

The Kafka consumer is what we most often use to read data from Kafka. This article explains some key concepts and topics in Kafka's consumer architecture. Consumer Groups. We can group consumers into a consumer group by use case or function. One consumer group might be responsible for delivering records to high-speed, in-memory microservices while another consumer group streams those same records to Hadoop.
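
The way two consumer groups can read the same records independently can be sketched with a toy model. This is plain Python, not the Kafka client API: `ToyTopic`, `poll`, and the group names `microservices` and `hadoop-loader` are hypothetical names chosen for illustration. The only Kafka behavior it imitates is that each group keeps its own read position (offset) in the log.

```python
from collections import defaultdict


class ToyTopic:
    """A single append-only log standing in for one topic partition.

    Illustrative only; this is not how the Kafka client is used.
    """

    def __init__(self):
        self.records = []
        # group id -> offset of the next record that group will read
        self.offsets = defaultdict(int)

    def produce(self, record):
        self.records.append(record)

    def poll(self, group_id, max_records=10):
        """Return the next batch for a group and advance only that group's offset."""
        start = self.offsets[group_id]
        batch = self.records[start:start + max_records]
        self.offsets[group_id] = start + len(batch)
        return batch


topic = ToyTopic()
for i in range(5):
    topic.produce(f"r{i}")

# Each group has its own position, so both eventually see every record.
fast_first = topic.poll("microservices", max_records=2)
hadoop_all = topic.poll("hadoop-loader", max_records=5)
fast_rest = topic.poll("microservices", max_records=10)
```

Reading with one group does not move the other group's offset, which is why the second group still receives the full set of records.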

NoSQL Overview

Overview. NoSQL (NoSQL = Not Only SQL) means “not just SQL”. Modern computing systems generate a huge amount of data on the network every day. A large part of this data is handled by relational database management systems (RDBMSs), whose mature foundation in relational theory makes data modeling and application programming easier. However, with the wave of informatization and the rise of the Internet, traditional RDBMSs have started to run into problems in some particular domains.

Run Hive 1 and 2 Together

Overview. The latest HDP 2.6.x has both Hive version 1 and version 2 installed together. However, it does not allow users to run the Hive version 2 command directly, only through beeline. The lab_dev repository here provides a demo VirtualBox image with both Hive versions configured properly. Conf. Changes. The trick to making both Hive versions work is to not add any Hive settings to .profile anymore. See below, where I comment out all previous Hive settings.

HBase Shell Reference

We use this place to collect commonly used HBase shell commands for reference. The HBase shell is an extensible JRuby-based (JIRB) shell for running commands against HBase, where each command represents one piece of functionality. The commands fall into six main categories, as follows. More examples will be added over time.

1. General Information

status
Show cluster status. Can be ‘summary’, ‘simple’, or ‘detailed’. The default is ‘summary’.

    hbase> status
    hbase> status 'simple'
    hbase> status 'summary'
    hbase> status 'detailed'

version
Output this HBase version.

Naive Bayes Algorithm

Background. It would be difficult, and practically impossible, to manually classify a web page, a document, an email, or any other lengthy text. This is where the Naïve Bayes classifier, a machine learning algorithm, comes to the rescue. A classifier is a function that assigns each element of a population to one of the available categories. For instance, spam filtering is a popular application of the Naïve Bayes algorithm. A spam filter is a classifier that assigns the label Spam or Not Spam to every email.
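
As a rough illustration of the spam filter idea, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. The function names `train` and `classify`, the whitespace tokenization, and the four training messages are assumptions made up for this example; a real filter would need far more data and better preprocessing.

```python
import math
from collections import Counter


def train(messages):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training messages
    vocab = set()
    for text, label in messages:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior P(label)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # add-one smoothing avoids a zero probability
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
TRAINING = [
    ("win free money now", "Spam"),
    ("free prize claim now", "Spam"),
    ("meeting agenda for monday", "Not Spam"),
    ("lunch on monday", "Not Spam"),
]
model = train(TRAINING)
print(classify(model, "claim your free money"))
print(classify(model, "monday meeting"))
```

The "naïve" part is the assumption that words are independent given the label, which is what lets the score be a simple sum of per-word log probabilities.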

ML Overview

Background. Machine learning is a field of computer science that gives computer systems the ability to “learn” from data without being explicitly programmed. Machine learning can be broken down into three broad categories: Recommender, Classification, and Clustering. Recommender: recommender systems suggest items based on past behavior or interest. These items can be other users in a social network, or products and services on retail websites. Common similarity measures used here include Pearson correlation and Euclidean distance.
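
The two measures named above can be sketched in a few lines. This is a minimal illustration using only the standard library; the rating vectors for `alice`, `bob`, and `carol` are invented, and returning 0.0 when one vector has no variance is a convention chosen here, not a standard.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Linear correlation in [-1, 1] between two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / math.sqrt(var_a * var_b)


# Hypothetical ratings of the same four items by three users.
alice = [5, 3, 4, 1]
bob = [4, 2, 3, 0]    # same preferences as alice, but rates one point lower
carol = [1, 4, 2, 5]  # roughly opposite preferences

print(euclidean_distance(alice, bob), pearson_correlation(alice, bob))
print(euclidean_distance(alice, carol), pearson_correlation(alice, carol))
```

Pearson correlation ignores a user's overall rating level, so alice and bob come out as perfectly correlated even though their raw ratings differ, while Euclidean distance still registers the gap between them.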
