Blogs

Databricks Under the Hood: Dissecting the Lakehouse Engine for Performance and Governance

Databricks has established itself as a cornerstone of modern data architectures, unifying data warehousing and data lakes into the powerful “Lakehouse” paradigm. But beyond the marketing and high-level promises, what truly powers Databricks? How does it deliver on its guarantees of performance, reliability, and governance? This deep dive will pull back the curtain, exploring its core architecture, underlying technologies, and practical operational patterns.

Harness Engineering: Deep Dive into Orchestration Logic with Harness CD

In the realm of modern software delivery, orchestration is king. As deployments become more complex, involving microservices, multi-cloud environments, and intricate rollback strategies, simply pushing code is no longer sufficient. This is where Harness Engineering, specifically its Continuous Delivery (CD) module, shines. This deep dive will move beyond surface-level introductions and explore the architectural patterns, practical challenges, and “under-the-hood” mechanics of how Harness CD empowers sophisticated deployment orchestration.

Beyond the GUI: Understanding Harness CD’s Core Abstractions

While Harness boasts a powerful UI, its true strength lies in the declarative definition of deployment strategies.

Demystifying Databricks: Under the Hood of Delta Lake, Photon, and Unity Catalog

Databricks has become a cornerstone of modern data platforms, offering a unified approach to data engineering, machine learning, and analytics. While its intuitive notebooks and managed Spark clusters are widely appreciated, the true power of Databricks lies in its underlying architecture. This deep dive will pull back the curtain on key components like Delta Lake, the Photon Engine, and Unity Catalog, revealing how they work together to deliver performance, reliability, and governance.

Leading Cloud Tech. Stack Comparison

Introduction

In today’s digital era, businesses are increasingly adopting cloud computing to scale their operations, enhance flexibility, and reduce costs. Among the major cloud service providers, Amazon Web Services (AWS), Microsoft Azure, Google Cloud, and Oracle Cloud have emerged as dominant players in the market. Each offers a comprehensive cloud technology stack tailored to meet different business needs. In this blog, we’ll conduct a thorough comparison of these leading cloud technology stacks to help you make an informed decision when choosing the best-fit cloud provider for your organization.

Embracing Kubernetes, Goodbye Spring Cloud

I believe many developers, after getting familiar with microservices, thought they had successfully built a microservices architecture empowered by Spring Cloud. But after Kubernetes (K8s) became popular, they grew curious and excited about building cloud-native microservices.

The Era of Spring Boot and Cloud

In October 2012, Mike Youngstrom created a feature request in Spring Jira to support a containerless web application architecture in the Spring Framework.

Spark SQL in Depth

In this article, we’ll look in depth at how Spark SQL processes data queries.

Checking the Execution Plan

Data Preparation

create database if not exists test;
create table if not exists test.t_name (name string);
insert into test.t_name values ('test1'),('test2'),('test3');

Test Code Preparation

The Scala code below is used for testing. It blocks on standard input at the end, so the application keeps running and we can see more details in the Spark Web UI.

Apache Spark 3.1.1 Released :)

Apache Spark 3.1.1 was released on March 2, 2021. It is a milestone release for Spark in 2021. This version continues to make Spark more efficient and stable. Below are the highlighted new features and changes:

- Python usability
- ANSI SQL compliance
- Query optimization enhancements
- Shuffle hash join improvements
- History Server support of structured streaming

Project Zen

Project Zen was initiated in this release to improve PySpark’s usability in these three ways:

Apache Superset:)

On January 21, 2021, the Apache Software Foundation officially announced that Apache® Superset™ had become a top-level project. Apache® Superset™ is a modern big data exploration and visualization platform that allows users to build dashboards quickly and easily using a simple code-free visualization builder and an advanced SQL editor. The project was launched at Airbnb in 2015 and entered the Apache incubator in May 2017. Apache Superset is a big data-related BI visualization tool.

Spark SQL Read/Write HBase

Apache Spark and Apache HBase are very commonly used big data frameworks. In many scenarios, we need to use Spark to query and analyze large volumes of data in HBase. Spark can read data as a dataset from many kinds of data sources. To read from HBase, Spark can use HBase's TableInputFormat, which has the following disadvantages:

- Only one scan is triggered in each task to read from HBase
- TableInputFormat does not support BulkGet
- It cannot leverage the optimizations of the Spark SQL Catalyst optimizer

Considering the points above, another choice is the Hortonworks/Cloudera Apache Spark-Apache HBase Connector (SHC for short).

Apache Airflow Overview

What is Airflow?

Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines. A workflow is composed as a Directed Acyclic Graph (DAG) of multiple tasks that can be executed independently. The Airflow scheduler executes the tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
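As a rough illustration of what "executing tasks while following the specified dependencies" means, here is a minimal sketch that uses only Python's standard-library `graphlib` module. It is not Airflow's API and says nothing about how the Airflow scheduler is implemented; the task names and the `ready_batches` helper are hypothetical, chosen only for this example.

```python
from graphlib import TopologicalSorter


def ready_batches(dependencies):
    """Group tasks into batches that could run in parallel.

    `dependencies` maps each task name to the set of tasks it depends on.
    Every task in a batch has all of its upstream tasks in earlier batches.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unblocking downstream tasks
    return batches


# Hypothetical pipeline: "clean" and "validate" both depend only on "extract",
# so they are independent of each other and land in the same batch.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
}

print(ready_batches(pipeline))
# prints [['extract'], ['clean', 'validate'], ['load']]
```

Tasks within one batch have no dependency on each other, which is the property that lets a scheduler hand them to different workers at the same time.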
