It is ByteDance. Hive plays a critical role at most Chinese internet companies, so their motivation to migrate to Spark SQL without giving up Hive features and investments is way ahead of that of their US peers.

The reference in English is https://databricks.com/session_na20/bucketing-2-0-improve-spark-sql-performance-by-removing-shuffle. ByteDance's Chinese-language tech blog has covered the details of what they've done since 2019.

Similarly, Facebook solved the Hive bucketed-table incompatibility issue internally as early as 2017, but the Spark community has been very hesitant to move https://issues.apache.org/jira/browse/SPARK-19256 forward.

Both Facebook and ByteDance have huge amounts of data to crunch, and both have invested heavily in Spark and Hive (and Presto), so they have to deploy a bucketing mechanism that works across Spark, Hive and Presto. Databricks has wanted to avoid the mod-based hash() used by Hive bucketed tables since day one, and that is still the case.
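
To make the incompatibility concrete, here is a minimal Python sketch of how the two engines can assign the same integer key to different buckets. It rests on assumptions: the Hive side models only the legacy mod-based scheme for int keys (the hash of an int is the value itself, masked to be non-negative, then taken modulo the bucket count), and the Spark side models a Murmur3 hash of a single int column with seed 42 followed by a non-negative modulo. The function names are hypothetical, and other column types, multi-column keys, and newer Hive bucketing versions are not covered.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit unsigned value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _to_signed32(x):
    """Reinterpret a 32-bit unsigned value as a signed Java-style int."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def murmur3_hash_int(value, seed=42):
    """Murmur3 x86_32 of one 4-byte int (assumed to mirror Spark's int hashing)."""
    k1 = ((value & MASK32) * 0xCC9E2D51) & MASK32
    k1 = _rotl32(k1, 15)
    k1 = (k1 * 0x1B873593) & MASK32
    h1 = (seed & MASK32) ^ k1
    h1 = _rotl32(h1, 13)
    h1 = (h1 * 5 + 0xE6546B64) & MASK32
    # Finalization mix; 4 is the input length in bytes.
    h1 ^= 4
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return _to_signed32(h1)


def hive_bucket_id(value, num_buckets):
    """Legacy Hive-style bucket for an int key: (hash & Integer.MAX_VALUE) % n."""
    return (value & 0x7FFFFFFF) % num_buckets


def spark_bucket_id(value, num_buckets):
    """Spark-style bucket for an int key: non-negative mod of the Murmur3 hash."""
    # Python's % already yields a non-negative result for a positive divisor.
    return murmur3_hash_int(value) % num_buckets


if __name__ == "__main__":
    for key in range(8):
        print(key, hive_bucket_id(key, 4), spark_bucket_id(key, 4))
```

Under these assumptions, a table bucketed by one engine cannot simply be read as pre-bucketed by the other: the files would hold different keys per bucket, so a join would still need a shuffle or would return wrong results if the shuffle were skipped.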
