Eric Sun
Oct 19, 2020

Sure, a smaller value for fs.s3a.readahead.range makes sense. But you would have to modify the Spark source code to switch to a small value for any metadata read, and then switch back to a big value for data-block reads, right?

Or do you create a dedicated metadata-reader connection pool, separate from the data-block access connections?

I assume the former approach. Since this issue applies to both ORC and Parquet, I am still wondering whether there should already have been a PR for both S3 and Azure Storage to do exactly what you are doing.
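For context, the knobs in question are plain Hadoop S3A configuration, so they can be set without touching Spark source. A hedged sketch of how one might pass them at submit time (the property names come from the Hadoop S3A documentation; the values and the job name are illustrative only):

```shell
# Sketch: tune S3A read behavior for columnar formats (values illustrative).
# fs.s3a.readahead.range controls how many bytes each seek reads ahead;
# a small value suits the scattered footer/metadata reads of ORC and Parquet.
# fs.s3a.experimental.input.fadvise=random asks S3A to optimize for random
# I/O instead of full-file streaming, which sidesteps flipping
# readahead.range back and forth between metadata and data-block reads.
spark-submit \
  --conf spark.hadoop.fs.s3a.readahead.range=64K \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  your_job.py   # hypothetical job
```

Spark forwards any `spark.hadoop.*` property into the Hadoop configuration, which is why no code change is needed for a cluster-wide or per-job setting; a per-read-type switch, as discussed above, would still require changes inside the reader itself.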



