Published in The Startup · Pinned · Data Dependency Driven Orchestration · Airflow and Prefect are probably the most popular schedulers in 2021. Both are more data-aware than traditional orchestration software. This article describes an additional service architecture that makes data dependency the enabling pattern for effective and efficient orchestration in a complex big data environment… · Orchestration · 10 min read
Published in Analytics Vidhya · Pinned · Are We Taking Only Half Of The Advantage Of Columnar File Format? · (Originally posted on LinkedIn in 2018.) Columnar file formats have become the primary storage choice for big data systems, but when I Googled related topics this weekend, I found that most articles only covered simple query benchmarks and storage footprint comparisons between a particular columnar… · Columnar · 8 min read
Published in The Startup · Nov 22, 2020 · Lego vs SoC, Apple M1 + MT8195, Microservices and Big Data Model · This week (2020-11-10) was really big for System on a Chip: first the Apple M1, then the MediaTek MT8195/MT8192. But why on earth would these have anything to do with Lego, microservices, and even data models? It is a topic I have put off for a few years. Probably, all… · Soc · 6 min read
Published in The Startup · Mar 16, 2020 · 3.5 Musketeers to Reshape Data Lake · “Data Lake” has been a buzzword since 2016, and lately we even invented the term “Lakehouse”. Did we simply upload the web logs and dump tables from an MPP database into HDFS or cloud storage, and then call that a Data Lake? … · Delta Lake · 8 min read