
Are We Taking Only Half of the Advantage of Columnar File Formats?

Eric Sun
8 min read · Mar 16, 2020


(* originally posted on LinkedIn in 2018)

Columnar file formats have become the primary storage choice for big data systems, yet when I Googled related topics this weekend, I found that most articles focus on simple query benchmarks and storage-footprint comparisons between a particular columnar format and row formats. Sorting is also a critical feature of columnar formats, but its benefits and effective practice have not been emphasized or explained in detail so far. IMHO, using columnar formats without proper sorting is like taking only half the advantage of the underlying file format. I'd like to share my insights on this topic.

Log/track data generated by Internet and e-commerce companies was the major reason for the rise of big data systems, and IoT will spread this pattern to wider industries. Compared with a highly-normalized data model, such log/track data is much easier to produce: no lookup procedure is needed to normalize dimensional attributes into IDs or surrogate keys, and the data models are sometimes super wide so that various types of loggers can write to the same event/log format (in a sparsely-populated way). The loggers can therefore keep a simpler codebase and produce data at extremely high QPS/TPS. Soon the footprint of such data exceeded multiple hundreds of PB on S3/BigTable/HDFS. Then columnar formats (Dremel/Parquet/ORC/Arrow/CarbonData) came to the rescue.

The first half of the advantage of columnar formats is: values are clustered by column, so compression is more efficient (shrinking the storage footprint), and the query engine can push down column projections (reducing read I/O from network and disk by skipping unwanted columns). A minimal sketch of this half follows below.
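Here is a minimal sketch using PyArrow (the file name, column names, and values are hypothetical, just for illustration): each column is compressed independently at write time, and a reader can request only the columns it needs, so the rest are never fetched.

```python
# A minimal sketch with PyArrow; file/column names below are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

# Each column is stored and compressed as its own chunk of values.
table = pa.table({
    "uuid": ["a1", "b2", "c3"],
    "country": ["US", "US", "DE"],
    "latency_ms": [120, 85, 240],
})
pq.write_table(table, "events.parquet", compression="zstd")

# Column-projection pushdown: read only the columns the query needs;
# the bytes for "country" are skipped entirely.
subset = pq.read_table("events.parquet", columns=["uuid", "latency_ms"])
print(subset.column_names)  # ['uuid', 'latency_ms']
```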
The second half of the advantage is: when the row order is properly sorted, the compression ratio is pushed to the next level, and the query engine can push down filter predicates (to further reduce read I/O) and apply vectorized computation; see the sketch below. You may think that converting from compressed CSV/Avro to Parquet/ORC already realizes 90% of the benefit of columnar formats, but let me try to convince you that the overhead of sorting the data is absolutely worth it.
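And a sketch of the second half, under the same hypothetical schema: sort the rows before writing so that each row group's min/max statistics become selective, then let the reader push a filter predicate down and skip row groups whose statistics cannot match. (The tiny row_group_size is only to make the skipping visible on toy data.)

```python
# A minimal sketch with PyArrow; names and values are again hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["DE", "US", "US", "FR", "US", "DE"],
    "latency_ms": [240, 120, 85, 60, 95, 300],
})

# Sorting clusters equal values into long runs, which compresses better
# and tightens the per-row-group min/max statistics in the file footer.
sorted_table = table.sort_by([("country", "ascending")])
pq.write_table(sorted_table, "events_sorted.parquet",
               row_group_size=2, compression="zstd")

# Predicate pushdown: row groups whose [min, max] range excludes "US"
# can be skipped without being read or decompressed.
us_only = pq.read_table("events_sorted.parquet",
                        filters=[("country", "=", "US")])
print(us_only.num_rows)  # 3
```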

Here is a case to explain. A typical website log dataset contains: uuid, timestamp



Written by Eric Sun

Advocate of best practices in big data technologies. Challenge the conventional wisdom. Peel off the flashy promises in architecture and scalability.
