GroupBy #8: Demystifying the Parquet File, the future of the data engineer, intro to data modeling.

Plus: Building a Data Engineering Project in 20 Minutes.


This issue was originally published in the GroupBy newsletter.


NOTE

GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.

So, if you find my work valuable and want to receive a weekly issue, subscribe here:

👉 vutr.substack.com

It's FREE.


👋 Hi, my name is Vu Trinh, a data engineer currently working at a mobile game company. I enjoy reading good stuff (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet.

🎯 Side Project

Get your hands dirty.

⤷ Building a Data Engineering Project in 20 Minutes

📖 | ✍ Simon Späti

Author's note: the project uses a very old version of Dagster (0.9).

So, you might find it a little difficult to get it up and running the first time. But a little challenge can be fun, right?

(Yeah, Simon and I had a short conversation. 😄)

You'll learn to scrape real-estate listings from the web, upload them to S3, process them with Spark and Delta Lake, add data science with Jupyter, ingest the data into Druid, visualise it with Superset, and manage everything with Dagster.

💡What you will learn

✅ Scraping with Beautiful Soup.

✅ Change Data Capture (CDC) with Scraping.

✅ How to use an S3-Gateway / Object Storage.

✅ UPSERTs and ACID Transactions.

✅ Automatic Schema Evolution.

✅ Integrating Jupyter Notebooks the right way.

✅ Learning about Apache Druid.

✅ Open-Source dashboarding with Apache Superset.

✅ DevOps with Kubernetes.

✅ Introduction to features of Dagster.
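
To give a taste of the "Change Data Capture (CDC) with Scraping" item above, here is a minimal sketch. It is not the project's actual code: it assumes each scraped listing is a dict with a stable `id` field, and it detects new or changed listings by comparing content fingerprints between two scrapes. The function names, field names, and sample data are all hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a stable hash of a record's content (key order does not matter)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, scraped: list) -> tuple:
    """Compare a fresh scrape against the fingerprints kept from the last run.

    previous: maps listing id -> fingerprint from the previous scrape.
    scraped:  list of listing dicts, each with a hypothetical "id" key.
    Returns (changed_records, new_state).
    """
    changed = []
    new_state = {}
    for record in scraped:
        fp = fingerprint(record)
        new_state[record["id"]] = fp
        # A listing counts as changed if it is new or its content hash differs.
        if previous.get(record["id"]) != fp:
            changed.append(record)
    return changed, new_state


# Made-up data standing in for two consecutive scrapes.
first_scrape = [
    {"id": "a1", "price": 500000},
    {"id": "b2", "price": 750000},
]
second_scrape = [
    {"id": "a1", "price": 480000},  # price changed
    {"id": "b2", "price": 750000},  # unchanged
    {"id": "c3", "price": 620000},  # new listing
]

_, state = detect_changes({}, first_scrape)
changed, state = detect_changes(state, second_scrape)
print([r["id"] for r in changed])  # → ['a1', 'c3']
```

Only the changed records would then need to be uploaded and merged downstream, which is where the UPSERT item in the list comes in. Note that this sketch does not handle listings that disappear between scrapes.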


🚀 Engineering

Engineering is the practice of using natural science, mathematics, and the engineering design process to solve technical problems, increase efficiency and productivity, and improve systems. — Wikipedia —

⤷ Dremio┆Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi

📖 | ✍ Alex Merced

⤷ Meta┆The future of the data engineer

📖 | ✍ Analytics at Meta

⤷ Demystifying the Parquet File Format

📖 | ✍ Michael Berk

⤷ Udemy┆Introducing Hot and Cold Retries on Apache Kafka®

📖 | ✍ Berat Cankar

⤷ Metaphor┆The Grand Rewrite of DataHub

📖 | ✍ Mars Lan

⤷ Agoda┆Python GIL. Past and Future

📖 | ✍ Agoda Engineering


✏ Data

the paradox, the strange, the reality

The one thing that this job has taught me is that truth is stranger than fiction. — Predestination (2014) —

⤷ Preset┆Intro to Data Modeling

📺 | ✍ Shreesham Mukherjee

⤷ Is Kimballs Dimensional Modelling dead in 2022? Is OBT (one big table) the way to go?

🤖 | Reddit discussion

⤷ PayPal┆The next generation of Data Platforms is the Data Mesh

📖 | ✍ Jean-Georges Perrin

⤷ Data Entropy: More Data, More Problems?

📖 | ✍ Salma Bakouk

⤷ Airbnb┆Experiment Reporting Framework

📖 | ✍ AirbnbEng


🔥 Catch up

…Next Saturday night, we're sending you back to the future! — Dr. Emmett Brown, Back to the Future (1985) —

google cloud

⚡The BigQuery Data Transfer Service can now transfer data from Azure Blob Storage into BigQuery. [📖]

⚡BigQuery now supports change data capture (CDC) by processing and applying streamed changes in real time to existing data. [📖]

⚡BigQuery users can use cached results from the same query issued by other users in the same project when using the Enterprise or Enterprise Plus edition.

dbt

About the dbt clone command [📖]

  • An additional article from a dbt engineer.

mlflow

Release of MLflow 2.8.0 [📖]


Hasta la vista, baby. — T-800, Terminator 2: Judgment Day (1991) —

I apologize, but I can't resist quoting from movies.



Some words from me

🚀 I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.

🚀 I compile these resources every week into the GroupBy newsletter, which I publish first on Substack.

Then, I deliver it again on LinkedIn to make it more accessible to all of you.

So, if you want to learn and grow with me, subscribe to my Substack here:

👉 vutr.substack.com

😇 It will motivate me a lot.

