Are We Taking Only Half Of The Advantage Of Columnar File Formats?
(originally posted on LinkedIn in 2018)
Columnar file formats have become the primary storage choice for big data systems, but when I Googled related topics this weekend, I found that most articles focus on simple query benchmarks and storage footprint comparisons between a particular columnar format and row formats. Sorting is also a critical feature of columnar formats, yet its benefits and effective practice have rarely been emphasized or explained in detail. IMHO, using columnar formats without proper sorting is like taking only half of the advantage of the underlying file format. I'd like to share my insights on this topic.
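To make the point concrete before diving in, here is a minimal sketch, assuming pandas and pyarrow (my tool choice for illustration, not something prescribed by any particular format), that writes the same synthetic event data to Parquet twice: once unsorted and once sorted by a low-cardinality column. Sorting groups repeated values into long runs, which dictionary and run-length encodings compress far better.

```python
# Minimal sketch: compare the on-disk size of the same data written to
# Parquet unsorted vs. sorted by a low-cardinality column. The column
# names and value ranges are made up for illustration.
import os
import numpy as np
import pandas as pd

rows = 1_000_000
df = pd.DataFrame({
    # low-cardinality dimension, shuffled: typical of raw event logs
    "country": np.random.choice(["US", "CN", "DE", "IN", "BR"], size=rows),
    "event_ts": np.random.randint(1_500_000_000, 1_540_000_000, size=rows),
    "payload": np.random.randint(0, 1_000_000, size=rows),
})

# requires pyarrow (or fastparquet) installed as the Parquet engine
df.to_parquet("events_unsorted.parquet")
df.sort_values(["country", "event_ts"]).to_parquet("events_sorted.parquet")

for path in ("events_unsorted.parquet", "events_sorted.parquet"):
    print(path, os.path.getsize(path), "bytes")
# On typical runs the sorted file is noticeably smaller, because sorted
# columns yield long runs that RLE/dictionary pages encode compactly.
```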
Log/track data generated by Internet and e-commerce companies is a major reason for the rise of big data systems, and IoT will spread this pattern to wider industries. Compared with a highly normalized data model, such log/track data is much easier to produce: there is no lookup procedure to normalize dimensional attributes into IDs or surrogate keys, and the data models are sometimes so wide that many types of loggers can write into the same event/log format (in a sparsely populated way). The loggers can therefore keep a simpler codebase and produce data at extremely high QPS/TPS. The footprint of such data soon exceeded hundreds of petabytes on S3/BigTable/HDFS, and then columnar formats (Dremel/Parquet/ORC/Arrow/CarbonData) came to the rescue.
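As a hedged illustration of that "one super-wide event format" pattern (all field names below are hypothetical), the sketch logs three different event types into a single sparsely populated schema; because a columnar format stores each column separately, the mostly-null columns cost almost nothing on disk:

```python
# Hypothetical wide event schema shared by several kinds of loggers.
# Each producer fills only the columns it knows about.
import pandas as pd

events = pd.DataFrame([
    # a web logger fills web-specific fields, leaves the rest null
    {"event_type": "page_view", "url": "/home", "user_id": 42},
    # a mobile logger fills device fields instead
    {"event_type": "app_open", "device_model": "Pixel 2", "user_id": 7},
    # an IoT sensor fills telemetry fields
    {"event_type": "sensor", "temperature_c": 21.5},
])

# No lookup/normalization step: dimensional attributes (url, device_model)
# are logged as-is, which keeps the producers simple and fast.
events.to_parquet("wide_events.parquet")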