The Movie Database - Extract, Load & Orchestration Project
I am a Senior Business Analyst and Data Engineer who recently decided to pursue a more technical track in my career. To complement my expertise in Data Modeling, Data Warehousing, ELT pipelines and Advanced SQL, I have begun developing a wider array of technical skills, with a strong focus on cloud architecture. Leveraging my 15 years of professional experience, I devised a realistic data engineering use-case and set about solving it using a modern tech stack.
Overview
The purpose of this project is to connect to The Movie Database (TMDB) via provided APIs, and extract data as flat files to a central location. A second application reads those files and loads them to a database. Both of these applications are orchestrated by an Airflow instance through custom DAG jobs. Additional functionality and documentation can be found at the relevant GitHub repositories. Below is a high-level architecture diagram showing the interaction between the various components. Note that this is just one configuration, as the components are designed to be highly modularized and customizable.
Components
TMDB Data Retriever
Technologies: Python 3, Flask, Pandas, AWS S3, AWS RDS, Postgres, Docker
Functionality:
Accepts a number of command-line arguments to download various TMDB data sets, such as movies updated yesterday, weekly trending movies, cast details, movie keywords, etc. It also optionally connects to the target database to retrieve information like already-loaded titles, and missing cast members or keywords.
From the command-line, it can also be launched into an API listener mode via Flask and Waitress. It will then accept the same functions and arguments through the HTTP endpoint on the specified host server.
Dockerfile instructions are provided both for local and cloud-based deployment, which uses S3FS Fuse to link an S3 bucket to the application file system.
Flat File Loader
Technologies: Python 3, Flask, Pandas, SQL Alchemy, Openpyxl, AWS S3, AWS RDS, Postgres, Docker
Recommended by LinkedIn
Functionality:
Scans a folder for flat files to load to a Postgres database. Uses a number of configuration files, Excel specs and SQL scripts to customize each file load. Includes logging functionality and JSON-structured results summaries of each run. From the command-line, it can also be launched into an API listener mode via Flask and Waitress.
Includes companion functionality for analyzing a flat file, and generating the relevant spec, table creation and load scripts and other files for use by the program. Based on changes to the spec, such as to field names, data types, and primary key indications, associated scripts can be automatically rebuilt to reflect the changes.
Dockerfile instructions are provided both for local and cloud-based deployment, which uses S3FS Fuse to link an S3 bucket to the application file system.
Airflow Orchestration
Technologies: Python 3, Docker, Apache Airflow via Astro
Functionality:
Orchestration layer for scheduling and monitoring the data extraction and loading of the other two applications via their API interfaces. Designed to handle dependent tasks, such as downloading titles updated yesterday, loading them to the target database, determining missing cast, extracting them, loading the data, etc.
Nexus Utilities
Technologies: Python 3, Pandas, SQL Alchemy, Boto3, AWS Parameter Store
Functionality:
Collection of my custom utilities published to PyPI. Focus is on data engineering activities, such as flat file analysis and cleansing, database connection testing, string manipulation, and secure password retrieval via methods such as keyring, AWS Parameter Store, and AWS Secrets Manager.
Unique functionality includes cleaning and separating SQL statements, analyzing distinct values in a Pandas data frame, expressing time difference as a string, determining likely date format given a list of dates, and advanced string transformations (eg. “CompanySECFilings” to “company_sec_filings” or “Amount (USD)” to “AMOUNT_USD”).
Business Intelligence Analyst at CoreLogic
1yThanks for posting the link to this in Discord 👌
VP Finance / Accounting Dynamic, highly technical, hands-on Finance and Accounting Manager with a wealth of knowledge and experience analyzing business processes, implementing new technology/IT
1yVery impressive