Airflow and Prefect are probably the most popular schedulers in 2021. Both are more data-aware than traditional orchestration software. This article describes an additional service architecture that makes data dependency the enabling pattern for effective and efficient orchestration in complex big data environments, which may span multiple data centers, cloud vendors, and hybrid topologies.

It’s common to schedule a data flow/DAG with a cron expression such as
# “0 0 2,14 ? * *” : every day at 2AM and 2PM
# “0 0 */6 ? * *” : every 6 hours (at 00:00, 06:00, 12:00 and 18:00)
Yet flow owners sometimes notice that the upstream/input tables have not been updated by the expected time. The stale data may not fail the flow directly, but the output data quality score and downstream metrics will eventually reveal the problem and trigger painful backfills. …
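As a rough illustration of the problem, a flow could check how fresh its inputs are before running, instead of trusting the clock alone. The sketch below is only an assumption about what such a check might look like: the table names, the `stale_inputs` helper, and the 12-hour threshold are hypothetical, and the timestamps are hard-coded rather than read from a real metadata service.

```python
from datetime import datetime, timedelta, timezone


def stale_inputs(last_updated, now, max_age):
    """Return the sorted names of upstream tables last updated more than max_age ago."""
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


# Hypothetical schedule time and upstream update timestamps.
now = datetime(2021, 6, 1, 2, 0, tzinfo=timezone.utc)
last_updated = {
    "orders": datetime(2021, 6, 1, 1, 30, tzinfo=timezone.utc),
    "customers": datetime(2021, 5, 31, 1, 0, tzinfo=timezone.utc),
}

stale = stale_inputs(last_updated, now, timedelta(hours=12))
if stale:
    print("skip run, stale inputs:", stale)
else:
    print("inputs are fresh, run the flow")
```

With these sample values, only the hypothetical `customers` table is older than the threshold, so the run would be skipped rather than producing output from stale data.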

(* originally posted on LinkedIn in 2018)

Columnar file formats have become the primary storage choice for big data systems, but when I Googled related topics this weekend, I found that most articles only covered simple query benchmarks and storage footprint comparisons between a particular columnar format and row formats. Sorting is also a critical feature of columnar formats, but its benefits and effective practices have not been emphasized or explained in detail so far. IMHO, using a columnar format without proper sorting is like taking only half of the advantage of the underlying file format. …
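One way to see why sorting matters is run-length encoding, which columnar formats commonly apply to column values. The toy sketch below is not taken from the article and does not use any real file format. It only counts the runs in a small, made-up `country` column before and after sorting, with a hypothetical `run_length_encode` helper.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Made-up column values in arrival order.
country = ["US", "DE", "US", "FR", "DE", "US", "FR", "DE"]

unsorted_runs = run_length_encode(country)
sorted_runs = run_length_encode(sorted(country))

print("runs without sorting:", len(unsorted_runs))
print("runs after sorting:", len(sorted_runs))
```

In this sample, no two neighbouring values are equal in arrival order, so every row becomes its own run, while the sorted column collapses to one run per distinct value.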

This week (2020-11-10) was really big for System on a Chip: first the Apple M1, followed by the MediaTek MT8195/MT8192. But what on earth do these have to do with Lego, microservices, and even data models? It is a topic I have put off for a few years.

Probably all of us would nominate Lego as the most powerful tool or toy ever invented. In any survey or multiple-choice quiz, as long as “Lego” appears, its flexibility and limitless creativity always stand out. Whenever I joined a discussion about the “next generation of data processing platform” or “what is the best big data architecture”, Lego was often mentioned as the perfect analogy for the ideal design. …

“Data Lake” has been a buzzword since 2016, and we have even invented the phrase “Lakehouse” lately. If we simply upload web logs and dump tables from an MPP database into HDFS or cloud storage, can we call that a Data Lake? The recent rise of the Data Lake is actually a wake-up response to yesterday’s marketing pitches, such as

“storage is very cheap, you can store anything; compute is super scalable, why bother updating or deleting a block in a file, just simply make a new file.”

“schema and data modeling are unnecessarily rigid, JSON and schemaless is the future of big data.” …


Eric Sun

Advocate best practice of big data technologies. Challenge the conventional wisdom. Peel off the flashy promise in architecture and scalability.
