Eric Sun
1 min read · Mar 30, 2020


It is ByteDance. Hive plays a critical role in most Chinese internet companies, so their motivation (to migrate to SparkSQL without giving up Hive features and existing investments) is well ahead of their US peers.

https://databricks.com/session_na20/bucketing-2-0-improve-spark-sql-performance-by-removing-shuffle is the English-language reference. ByteDance's Chinese-language tech blog has had the details of what they've done since 2019.

Similarly, Facebook solved the Hive bucketed-table incompatibility issue internally as early as 2017, but the Spark community has been hesitant to move https://issues.apache.org/jira/browse/SPARK-19256 forward.

Both Facebook and ByteDance have huge amounts of data to crunch, and both have invested heavily in Spark and Hive (and Presto), so they need a bucketing mechanism that works across all three engines. Databricks wanted to avoid Hive's mod-based hash() bucketing from day one, and that is still the case; the sketch below contrasts the two schemes.
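For context on why the two bucketing schemes are incompatible, here is a minimal, illustrative sketch in Scala. Hive derives the bucket id from the column's Java hashCode modulo the bucket count, while Spark uses a Murmur3-based hash (seed 42) with a positive modulus. The helper names below are hypothetical, and Scala's MurmurHash3 merely stands in for Spark's internal Murmur3Hash expression, so the exact bucket ids won't match a real Spark table; the point is only that the two schemes disagree, which is what forces a shuffle (or custom work like SPARK-19256) when the engines are mixed.

```scala
// Illustrative comparison of Hive-style vs Spark-style bucket assignment.
// Not the actual engine code paths; helper names are made up for this sketch.
object BucketIdSketch {

  // Hive-style: for an int key, hashCode is the value itself,
  // then (hash & Int.MaxValue) % numBuckets picks the bucket.
  def hiveBucketId(key: Int, numBuckets: Int): Int =
    (key.hashCode() & Integer.MAX_VALUE) % numBuckets

  // Spark-style: a Murmur3 hash of the key (Spark SQL's hash() uses seed 42),
  // then a positive modulus. Scala's MurmurHash3 is used here as a stand-in
  // for Spark's internal Murmur3Hash expression, so ids differ from Spark's.
  def sparkBucketId(key: Int, numBuckets: Int): Int = {
    val h = scala.util.hashing.MurmurHash3.productHash(Tuple1(key), 42)
    ((h % numBuckets) + numBuckets) % numBuckets
  }

  def main(args: Array[String]): Unit = {
    // The same key routinely lands in different buckets under the two schemes,
    // so a Hive-bucketed and a Spark-bucketed table can't be joined
    // shuffle-free without extra compatibility work.
    for (k <- Seq(1, 42, 1000)) {
      println(s"key=$k hive=${hiveBucketId(k, 8)} spark=${sparkBucketId(k, 8)}")
    }
  }
}
```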

