It is ByteDance.

1 min read · Mar 30, 2020

It is ByteDance. Hive plays a critical role at most Chinese internet companies, so their push to migrate to SparkSQL without giving up Hive features and investments is well ahead of that of their US peers.

The English reference is https://databricks.com/session_na20/bucketing-2-0-improve-spark-sql-performance-by-removing-shuffle. ByteDance's Chinese-language tech blog has had the details of what they’ve done since 2019.

Similarly, Facebook solved the Hive bucketed-table incompatibility issue internally as early as 2017, but the Spark community has been very hesitant to move https://issues.apache.org/jira/browse/SPARK-19256 forward.

Both Facebook and ByteDance have huge amounts of data to crunch, and both have invested heavily in Spark and Hive (and Presto), so they have to deploy a bucketing mechanism that works across Spark, Hive, and Presto. Databricks has wanted to avoid the mod-based hash() used by Hive bucketed tables since day one, and that is still the case.
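
To make the incompatibility concrete, here is a rough sketch of why two engines cannot share bucketed tables unless they agree on the hash. It is not the actual Hive or Spark code. The first function assumes a Hive-style scheme for integer keys (mask the sign bit, then take the modulus), and the second uses CRC32 purely as a stand-in for "an engine with a different hash function". All function names are hypothetical.

```python
import zlib


def hive_style_bucket(key: int, num_buckets: int) -> int:
    """Assumed Hive-style bucketing for an integer key: clear the sign
    bit of the hash (here the key itself), then take it mod num_buckets."""
    return (key & 0x7FFFFFFF) % num_buckets


def other_engine_bucket(key: int, num_buckets: int) -> int:
    """Stand-in for an engine that hashes differently. CRC32 is used only
    for illustration; it is NOT the hash any of these engines really uses."""
    digest = zlib.crc32(key.to_bytes(8, "little", signed=True))
    return digest % num_buckets


def mismatched_keys(keys, num_buckets: int) -> list:
    """Return the keys that the two schemes place in different buckets."""
    return [
        k for k in keys
        if hive_style_bucket(k, num_buckets) != other_engine_bucket(k, num_buckets)
    ]


if __name__ == "__main__":
    bad = mismatched_keys(range(1000), 8)
    print(f"{len(bad)} of 1000 keys land in different buckets")
```

Any key that lands in different buckets under the two schemes means a table written by one engine cannot be read as "already bucketed" by the other, so the shuffle that bucketing was supposed to remove comes back.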


Written by Eric Sun
