3.5 Musketeers to Reshape Data Lake

Eric Sun
Mar 16, 2020 · 8 min read

“Data Lake” has become a buzzword since 2016, and lately we have even invented the phrase “Lakehouse.” Did we simply upload the web logs and dump tables from MPP databases into HDFS or cloud storage, and then call that a Data Lake? The recent rise of Data Lake is actually a wake-up response to yesterday’s marketing pitches, such as:

“storage is very cheap, you can store anything; compute is super scalable, so why bother updating or deleting a block in a file? Just make a new file.”

“schema and data modeling are unnecessarily rigid; JSON and schemaless are the future of big data.”

“everybody can directly use the raw data to derive insights, generate features, and train models; big data has democratized data for everybody, so no more hassle with data engineering or data warehouses.”

Before I try to demystify them, let’s quickly clarify some storage patterns:

(* the InfiniBand/RDMA part is not really a file system, but it is a critical piece of the storage architecture. We may need to imagine how the Mellanox acquisition will supercharge the next generation of NVIDIA GPUDirect clusters.)

Web logs and IoT events are huge and “typically” immutable, but after companies migrated from MPP analytical systems to Hadoop, they found that the mutable-yet-very-critical data (customer, product, order, inventory, shipping, balance…) could not be handled properly in an immutable architecture. Dumping from OLTP/RDBMS to HDFS does not scale because of the slow throughput on the OLTP side; dumping from HBase/Cassandra is complicated, and very few companies can hire the right engineers to tame such NoSQL beasts.

The {incremental CDC + on-the-fly merge + periodic compaction} pattern became the way to support mutable datasets. Hudi is one such solution, while Apple, LinkedIn, and a few others have had similar in-house solutions for years as well.
When even a tiny portion of the gigantic web/IoT logs needs to be mutable (either for deduplication or for the infamous GDPR purpose), the “big data is all about immutable data” marketing pitch immediately falls apart, which is why Hive ACID suddenly gained a lot of adoption in 2018.
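
To make the pattern concrete, here is a minimal PySpark sketch of the {incremental CDC + on-the-fly merge + periodic compaction} idea. The paths, the key column id, and the change timestamp updated_at are hypothetical, and this is a generic illustration rather than any specific vendor’s implementation:

```python
# A generic sketch of {incremental CDC + on-the-fly merge + periodic compaction}.
# Paths, the key column "id", and the timestamp "updated_at" are hypothetical.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("cdc-merge-sketch").getOrCreate()

base = spark.read.parquet("s3://lake/orders/snapshot/")       # previous snapshot
cdc = spark.read.parquet("s3://lake/orders/cdc/2020-03-15/")  # incremental change records

# Merge: union the old snapshot with the new change records, then keep only
# the latest version of each key.
latest = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
merged = (base.unionByName(cdc)
              .withColumn("rn", F.row_number().over(latest))
              .where("rn = 1")
              .drop("rn"))

# Compaction: rewrite the merged result as the new snapshot, coalescing the
# many small CDC files into a manageable number of larger ones.
(merged.repartition(256)
       .write.mode("overwrite")
       .parquet("s3://lake/orders/snapshot_new/"))
```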

Secondly, many companies realized that even though Hadoop was cheap to set up initially, it required hiring much more expensive engineers to write optimal code; otherwise the user community could easily and quickly pour inefficient workloads onto the big data system, which then led to SLA, scalability, and cost-to-serve challenges. It is still cheaper than relying mainly on Redshift/Teradata/Exadata/Netezza/Vertica, but the rationale for Netflix to create Iceberg is obvious:
One month of time-series metrics is stored across 2.7 million Parquet files in 2,688 partitions in S3: it takes 9.6 minutes of wall time just to plan the query.
(* Don’t get me wrong, S3 is absolutely awesome, but it still lacks a few critical features to make big data processing and ML efficient.)

Thirdly, the schemaless myth over-indexed on the fast track of dumping everything into HDFS and S3, while leaving all the deep complexity and messy pain to the reader/consumer side. Never forget the fact that this data is supposed to be read hundreds of thousands of times. The cost of dealing with schema evolution and data backfill has kept climbing ever since.

The innovation and adoption of the Parquet, Arrow, and ORC formats should have proven that JSON should probably stay in REST APIs; big data really needs computation and storage efficiency rather than human readability.
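
As a small, hypothetical illustration of that point: schema-on-write means declaring the schema once on the producer side and storing columnar Parquet, instead of leaving every downstream reader to re-infer it from raw JSON. Paths and field names below are made up:

```python
# A hypothetical schema-on-write sketch: declare the schema once, store Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

spark = SparkSession.builder.appName("schema-on-write-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", LongType(), True),
    StructField("event_ts", TimestampType(), True),
    StructField("payload", StringType(), True),
])

# The schema is enforced once here, instead of being re-inferred (and re-debated)
# by every one of the hundreds of thousands of downstream reads against raw JSON.
events = spark.read.schema(event_schema).json("s3://lake/raw/events/2020-03-15/")
events.write.mode("overwrite").parquet("s3://lake/conformed/events/dt=2020-03-15/")
```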

Lastly, the unexpected super success of Spark in the ETL (a.k.a. data processing) area also shows that many paradigms from the traditional data warehouse are indeed still critical and useful. We cannot simply give them up just to fit into the pre-2017 myth of data lake architecture.
(* the Delta Lake vision below clearly resembles the 3-tier DW model [staging : conformed : agg/feature] on top of the raw “data lake”)

Databricks’ Delta Lake: Staging -> Conformed -> Aggregate/Feature Tier

The useful and manageable Data Lake finally emerged in 2017. One of the key differences is that the new Data Lake brings some RDBMS and data warehouse features back. To name a few: mutable tables with batch DML capability, simple ACID that maintains multiple versions to isolate reads from writes, much stronger metadata and schema management, and much better attention and tooling to model and organize the derived data.

Delta Lake from Databricks is one of the leaders that identified the pitfalls of the old data lake and put forward a solid solution to the rescue. It truly reinstates the must-have traits of a data warehouse. Since the term “data warehouse” has by and large been smeared by the Hadoop-based data lake marketing campaign, Databricks invented a new term, Lakehouse, to get our attention.

Similar to Tesla Model S vs. Model 3, the paid/enterprise version of Delta Lake (bundled within the Databricks Runtime) is packed with more really powerful features; the open-source version is slightly trimmed down, yet it already gives us DML operations, streaming ingestion/reads, and multiple snapshots to isolate batch reads from writes.
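
Here is a minimal sketch of those open-source capabilities using the Delta Lake Python API. The table path, source path, and column names are hypothetical, and it assumes the delta-core package is on the Spark classpath:

```python
# A minimal sketch of open-source Delta Lake: batch DML via MERGE, plus
# reading an older snapshot (time travel). Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

updates = spark.read.parquet("s3://lake/orders/cdc/2020-03-15/")
orders = DeltaTable.forPath(spark, "s3://lake/delta/orders/")

# Upsert the change records into the Delta table in one ACID transaction.
(orders.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Snapshot isolation: a batch reader can pin an older version while writers
# keep committing new ones.
orders_v0 = (spark.read.format("delta")
                  .option("versionAsOf", 0)
                  .load("s3://lake/delta/orders/"))
```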

Iceberg does not receive the same level of public awareness as Delta Lake, but it has a very elegant and open design, especially its schema/metadata management, partition expressions, and bucketed table abstraction.

  • Iceberg stores all metadata in Avro files (while Delta Lake uses Parquet), but the bigger difference is that Delta Lake needs a Spark session to read/write its metadata files, so it serves as candy to stick users to the Spark engine and DataFrame schema.
  • Iceberg’s partitioning does not rely on the Hive-style partition_key=value directory structure. Because Netflix pioneered and represents the best practice of big data on AWS, Iceberg’s hidden partitioning is highly tailored to cloud blob storage and SQL filter notation (see the DDL sketch after this list).
  • Shuffle has already taken the crown as the biggest performance bottleneck across Spark, M/R, and Presto. Bucketing (a.k.a. sorted hash partitioning) is the default data layout in pretty much every MPP system and in HBase/Cassandra; Hive first brought it into Hadoop to reduce shuffle, and later implemented ACID DML on top of it. The sad news is that, even today, we still cannot store the data once and then take advantage of quick joins/lookups across the popular big data computation engines, because HBase/Cassandra/Kafka each has its own hash function (ingress/egress requires repartitioning), and Hive/Presto use a different hash function than Spark. One Chinese internet giant even modified the Spark source code in order to optimally read/write Hive bucketed tables :-) Iceberg shows the world a truly open-minded bucket transform which, IMHO, should be seriously considered by Spark.
  • Iceberg has not yet implemented DML or streaming; Hudi and Delta Lake have pulled ahead on this track so far.
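
To illustrate the hidden partitioning and bucket transform called out above, here is a hypothetical Iceberg DDL sketch issued through Spark SQL. The catalog, table, and column names are made up, and it assumes an Iceberg-enabled Spark session with the catalog already configured:

```python
# A hypothetical DDL sketch of Iceberg's hidden partitioning and bucket transform.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-ddl-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE lake_catalog.db.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP,
        payload  STRING)
    USING iceberg
    -- Hidden partitioning: readers simply filter on event_ts and user_id;
    -- no Hive-style key=value directories or extra derived partition columns.
    PARTITIONED BY (days(event_ts), bucket(64, user_id))
""")
```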

Hudi might not have the same ambition or broad foundation as Delta or Iceberg; rather, it is a very sharp and precise spear to improve the SLA of mutable datasets and support record-level deletion/purge for GDPR. It has one highlight that convinces me the most about why Uber is a fast data company:
- bulk insert optimization
- different storage and query options to optimize for either latency or query speed (there is no quadrant where you get both; see the write-config sketch below)
# low latency: write changes to separate delta files; merge-on-read combines the snapshot and the delta on the fly
# high query speed: query only the snapshot; copy-on-write compacts the snapshot with the delta at write time

Over the years, Hudi is the only solution I have observed with such a clear API to emphasize the difference between latency and throughput. All the others are very “intentionally” blurry about the read penalty of the merge-on-read style. Many big data batch processes only need source datasets up to a certain cut-off time, typically the last hour, x hours ago, or last night. These batch jobs also need the so-called Z-Ordering or columnar segment skipping to drastically reduce wasteful I/O and deserialization, in order to achieve optimal container utilization and execution speed. The merge-on-read approach may have to sacrifice significant performance unless a comprehensive set of push-down optimization APIs is implemented.
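
The write-config sketch below shows how Hudi surfaces that latency-vs-query-speed choice explicitly as a table type at write time. The paths, record key, and precombine field are hypothetical, and it assumes the hudi-spark bundle is on the classpath:

```python
# A minimal sketch of Hudi's explicit latency vs. query-speed trade-off.
# Paths, record key, and precombine field are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()
upserts = spark.read.parquet("s3://lake/orders/cdc/2020-03-15/")

(upserts.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    # MERGE_ON_READ favors ingest latency (append delta logs, merge on read);
    # COPY_ON_WRITE favors query speed (compact into columnar files on write).
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .mode("append")
    .save("s3://lake/hudi/orders/"))
```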

Before I get into the gravity analogy, I would like to pay tribute to Hive ACID and MetaStore, whose relevance has been weakened lately, even though they both still work fine and support streaming ingest/mutation. After the failure of the MetaStore-on-HBase initiative, Hive missed several opportunities to embrace the cloud and to evolve MetaStore to manage large volumes of file-level, segment-level, field-level, and multi-version metadata.

The rise of Iceberg, Hudi, and Delta Lake is a kind of disappointment with Hive’s sluggish response to the true Data Lake needs, whether on-premises or in the cloud. Migrating away from Hive to Spark and Presto has already become a priority for many companies.

Momentum-wise, Delta Lake is currently second to none, which mainly comes from the overwhelming gravity of Spark in ETL & ML. Hadoop vendors created a zoo filled with too many animals (mostly not tamed for enterprises) which do not collaborate with each other well. Except for a few internet giants who can afford to hire savvy engineers to keep their Data Lake going, most companies were very frustrated until SparkSQL and Presto became viable. Such gravity may effectively determine the trend for the next 2~4 years:

  • Even though Hudi has some deeply optimized implementations that I feel excited about, its cool features can be easily absorbed by Databricks.
  • Academically, Iceberg’s design is cleaner, more unselfish (it does not exist to serve Spark’s gravity), and more forward-thinking than Delta Lake’s. Interestingly, similar to Flink, which is super popular in China yet has only a small presence in Silicon Valley, Iceberg is adopted mainly by LinkedIn, Alibaba (as the top choice for Flink to read/write the Data Lake), and Tencent. The success of Iceberg outside Netflix still requires tight integration with a few strong and successful computation engines.
    (* Google Dataproc supports all 3 already, but we all know what really gets promoted in that piece of cloud :-)
  • Hive still looks like a big mansion from a distance, but it may unfortunately fade away and soon remain only as a monument.
  • Kudu was once a very promising project for mutable OLAP, yet its adoption has been concerning; I wonder what Cloudera can do if Databricks ever comes up with an LLVM-based mutable OLAP engine.
    SnappyData was one of the contenders for mutable big data before it was acquired by TIBCO.
    CitusDB (Hyperscale) is part of Azure now. It leverages its gravity around Postgres HTAP and time-series analytics, yet it cannot compete with Synapse from the mothership.
    TiDB, with its TiSpark + TiFlash, is another rising star for hundred-TB-scale mutable HTAP. It draws strong gravity from the MySQL ecosystem.
  • Most likely, Delta Lake will be the winning musketeer until the next round of evolution comes.

There is a deep-dive comparison (in Chinese) of Delta Lake, Iceberg, and Hudi by one of the Alibaba engineers, but I still cannot find similar articles in English. My friend Junping Du will soon speak at Spark Summit 2020 NA about Tencent’s fresh benchmark of all 3 Musketeers. Domas Jautakis has kindly written very detailed hands-on instructions; if you are interested in trying them out, I strongly recommend his post.

(* Disclaimer: The views expressed in this article are those of the author and do not reflect any policy or position of the employers of the author.)

