🚀 Clean PySpark SQL Without Temp Views? Yes, Please.
If you're working with a Medallion Architecture (Bronze → Silver → Gold) in PySpark, you've probably used .createOrReplaceTempView() to run SQL over DataFrames.
But in PySpark 3.4+, there's a cleaner way: ✅ Pass DataFrames to spark.sql() as named keyword arguments and reference them in the query with {name} placeholders – no temp views, no global state.
Here's a basic event log pipeline using that technique 👇
🧱 Bronze Layer: Raw Events
bronze_df = spark.read.json("s3://your-bucket/bronze/events.json")
🥈 Silver Layer: Clean + Type Events
silver_df = spark.sql("""
SELECT
b.user_id,
b.event,
to_timestamp(b.ts) AS event_time
FROM {bronze_data} AS b
WHERE b.event IS NOT NULL AND b.user_id IS NOT NULL
""", bronze_data=bronze_df)
silver_df.write.mode("overwrite").parquet("s3://your-bucket/silver/events_cleaned/")
🥇 Gold Layer: Aggregate Metrics
gold_df = spark.sql("""
SELECT
user_id,
COUNT(*) AS total_events,
COUNT(DISTINCT event) AS unique_event_types,
MIN(event_time) AS first_event,
MAX(event_time) AS last_event
FROM {silver_data}
GROUP BY user_id
""", silver_data=silver_df)
gold_df.write.mode("overwrite").parquet("s3://your-bucket/gold/user_event_summary/")
📌 Why it matters:
👷‍♂️ Clean, simple, production-ready PySpark.
🔁 This approach stays fully SQL-based and modular while avoiding session-wide temp views, which helps with testability and scalability.
#PySpark #DataEngineering #SparkSQL #MedallionArchitecture #Lakehouse #ApacheSpark #ETL