Airflow and Prefect are probably the most popular schedulers in 2021. Both are more data-aware than traditional orchestration software. This article describes an additional service architecture that makes data dependency the enabling pattern for effective and efficient orchestration in complex big data environments, which may span multiple data centers, cloud vendors, and hybrid topologies.

It’s common to schedule a data flow/DAG with a cron expression such as:
# "0 0 2,14 ? * *" : every day at 2AM and 2PM
# "0 0 */6 ? * *" : every 6 hours (at 00:00, 06:00, 12:00, and 18:00)
Yet the flow…
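
To make the two schedules above concrete, here is a minimal sketch that expands the hour field of each expression into the clock times at which it fires on a given day. It assumes the expressions are Quartz-style (six fields, seconds first, with `?` for day-of-month) and that the second and minute fields are plain numbers; the helper names `expand_hour_field` and `daily_fire_times` are hypothetical and not part of any scheduler's API.

```python
from typing import List


def expand_hour_field(field: str) -> List[int]:
    """Expand a cron hour field such as "2,14" or "*/6" into sorted hours (0-23)."""
    hours = set()
    for part in field.split(","):
        if part == "*":
            hours.update(range(24))
        elif part.startswith("*/"):
            # A step value: start at hour 0 and fire every `step` hours.
            step = int(part[2:])
            hours.update(range(0, 24, step))
        else:
            hours.add(int(part))
    return sorted(hours)


def daily_fire_times(cron: str) -> List[str]:
    """Return the HH:MM:SS fire times for one day.

    Assumption: Quartz-style field order (second, minute, hour, ...),
    with numeric second and minute fields.
    """
    second, minute, hour = cron.split()[:3]
    return [
        f"{h:02d}:{int(minute):02d}:{int(second):02d}"
        for h in expand_hour_field(hour)
    ]


print(daily_fire_times("0 0 2,14 ? * *"))  # ['02:00:00', '14:00:00']
print(daily_fire_times("0 0 */6 ? * *"))   # ['00:00:00', '06:00:00', '12:00:00', '18:00:00']
```

Both schedules are tied to the wall clock only: nothing in either expression says whether the data the flow reads has actually arrived.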

(* originally posted on LinkedIn in 2018)

Columnar file formats have become the primary storage choice for big data systems, but when I Googled related topics this weekend, I found that most articles only covered simple query benchmarks and storage footprint comparisons between a particular columnar format and row formats. Sorting is also a critical feature of columnar formats, but its benefits and effective practices have not been emphasized or explained in detail so far. IMHO, using columnar formats without proper sorting is like taking only half of the advantage of the underlying file format…

This week (2020-11-10) was really big for System on a Chip: first the Apple M1, followed by the MediaTek MT8195/MT8192. But what on earth do these have to do with Lego, microservices, and even data models? It is a topic I have put off for a few years.

Probably all of us would nominate Lego as the most powerful tool or toy ever invented. In any survey or multiple-choice quiz, as long as “Lego” appears, its great flexibility and limitless creativity always stand out. Whenever I joined a discussion about “next generation of data processing…

“Data Lake” has been a buzzword since 2016, and we even invented the phrase “Lakehouse” lately. Do we simply upload web logs and dump tables from an MPP database into HDFS or cloud storage, and then call that a Data Lake? The recent rise of the Data Lake is actually a wake-up response to yesterday’s marketing pitches, such as

“storage is very cheap, you can store anything; compute is super scalable, why bother updating or deleting a block in a file, just make a new file.”

“schema and data modeling are unnecessarily rigid, JSON and schemaless is the future of big…

Eric Sun

Advocate best practice of big data technologies. Challenge the conventional wisdom. Peel off the flashy promise in architecture and scalability.
