3.5 Musketeers to Reshape Data Lake

Databricks’ Delta Lake: Staging->Conformed->Aggregate/Feature Tier
  • Iceberg keeps its manifests in Avro files (while Delta Lake checkpoints its transaction log in Parquet), but the biggest difference is that Delta Lake needs a Spark session to read/write its metadata, so it works as a piece of candy that keeps users stuck to the Spark engine and the DataFrame schema.
  • Iceberg’s partitioning does not rely on the Hive-style partition_key=value directory structure. Because Netflix pioneered, and still represents, the best practice of big data on AWS, Iceberg’s hidden partitioning is highly tailored to cloud-based blob storage and SQL filter notation.
  • Shuffle has already taken the crown as the biggest performance bottleneck across Spark, M/R, and Presto. Bucketing (a.k.a. sorted hash partitioning) is the default data layout in pretty much every MPP system and in HBase/Cassandra; Hive first brought it into Hadoop to reduce shuffle, and later implemented ACID DML on top of it. The sad news is that, even today, we still cannot store the data once and then take advantage of quick joins/lookups across the popular big data computation engines, because HBase/Cassandra/Kafka each have their own hash function (ingress/egress requires a repartition), and Hive/Presto use a different hash function than Spark. One of the Chinese internet giants even modified the Spark source code in order to optimally read/write Hive bucketed tables :-) Iceberg shows the world a truly open-minded bucket transform, which should be seriously considered by Spark, IMHO.
  • Iceberg has not yet implemented DML or streaming.
    Hudi and Delta Lake have pulled ahead in this track so far.
  • Even though Hudi has some deeply optimized implementations that I feel excited about, its cool features can be easily absorbed by Databricks.
  • Academically, Iceberg’s design is cleaner, less selfish (Delta Lake’s serves Spark’s gravity), and more forward-thinking than Delta Lake’s. Interestingly, similar to Flink, which is super popular in China yet has only a small presence in Silicon Valley, Iceberg has been adopted only by LinkedIn, Alibaba (as the top choice for Flink to read/write the Data Lake), and Tencent. The success of Iceberg outside Netflix still requires tight integration with a few strong and successful computation engines.
    (* Google Dataproc supports all 3 already, but we all know what really gets promoted in that piece of cloud :-)
  • Hive still looks like a big mansion from a distance, but it may unfortunately fade away and soon remain only as a monument.
  • Kudu was once a very promising project for mutable OLAP. Yet its adoption has been concerning; I wonder what Cloudera can do if Databricks ever comes up with an LLVM-based mutable OLAP.
    was one of the contenders for mutable big data before it was acquired by TIBCO.
    CitusDB (Hyperscale) is part of Azure now. It leverages its gravity around Postgres HTAP and time-series analytics. Yet it can’t compete with Synapse from the mothership.
    TiDB, with its TiSpark + TiFlash, is another rising star for hundred-TB-scale mutable HTAP. It creates strong gravity from the MySQL ecosystem.
  • Most likely, Delta Lake will be the winning musketeer until the next round of evolution comes.
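
To make the bucketing complaint above a bit more concrete, here is a minimal Python sketch of why a table bucketed by one engine cannot simply be reused by another. It assumes Iceberg's bucket transform as I understand it from the Iceberg table spec (32-bit Murmur3 with seed 0 over the 8-byte little-endian value, then a positive modulo), and compares it with a simplified, legacy Hive-style rule where an int hashes to itself. The function names (murmur3_32, iceberg_bucket, hive_style_bucket) are made up for illustration and are not any engine's real API.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Plain 32-bit Murmur3 (x86 variant); returns an unsigned 32-bit int."""
    mask = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    n = len(data)
    full = n - (n % 4)

    # Body: mix one little-endian 4-byte block at a time.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization (avalanche).
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def iceberg_bucket(value, num_buckets: int) -> int:
    """Assumed Iceberg rule: ints hash as 8-byte little-endian longs,
    strings as UTF-8 bytes; bucket = (hash & Integer.MAX_VALUE) % N."""
    if isinstance(value, int):
        raw = value.to_bytes(8, "little", signed=True)
    else:
        raw = str(value).encode("utf-8")
    return (murmur3_32(raw, seed=0) & 0x7FFFFFFF) % num_buckets


def hive_style_bucket(value: int, num_buckets: int) -> int:
    """Simplified legacy Hive-style rule (assumption): an int hashes to itself."""
    return (value & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    for key in (34, 35, 1000):
        print(key, iceberg_bucket(key, 16), hive_style_bucket(key, 16))
```

The same key generally lands in different bucket numbers under the two rules, so a join between the two layouts still needs a shuffle. A hash that is written down in an open spec, rather than buried in one engine's source code, is what would let several engines agree on the layout.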

Advocate best practice of big data technologies. Challenge the conventional wisdom. Peel off the flashy promise in architecture and scalability.

Eric Sun
