🚀 Clean PySpark SQL Without Temp Views? Yes, Please.

If you're working with a Medallion Architecture (Bronze → Silver → Gold) in PySpark, you've probably used .createOrReplaceTempView() to run SQL over DataFrames.

But in PySpark 3.4+, there's a cleaner way: ✅ Pass DataFrames to spark.sql() as named arguments and reference them in the query with {placeholders} – no temp views, no global state.

A basic event log pipeline:

  • Bronze: Raw JSON logs
  • Silver: Cleaned and typed
  • Gold: Aggregated metrics (e.g., event counts per user)

Here’s a full pipeline using that technique 👇


🧱 Bronze Layer: Raw Events

bronze_df = spark.read.json("s3://your-bucket/bronze/events.json")        

🥈 Silver Layer: Clean + Type Events

silver_df = spark.sql("""
    SELECT
        b.user_id,
        b.event,
        to_timestamp(b.ts) AS event_time
    FROM {bronze_data} AS b
    WHERE b.event IS NOT NULL AND b.user_id IS NOT NULL
""", bronze_data=bronze_df)

silver_df.write.mode("overwrite").parquet("s3://your-bucket/silver/events_cleaned/")

🥇 Gold Layer: Aggregate Metrics

gold_df = spark.sql("""
    SELECT
        user_id,
        COUNT(*) AS total_events,
        COUNT(DISTINCT event) AS unique_event_types,
        MIN(event_time) AS first_event,
        MAX(event_time) AS last_event
    FROM {silver_data}
    GROUP BY user_id
""", silver_data=silver_df)

gold_df.write.mode("overwrite").parquet("s3://your-bucket/gold/user_event_summary/")

📌 Why it matters:

  • 💡 No temp views = less coupling, better testability
  • 🔁 SQL pipelines become modular and isolated
  • ⚙️ No global state = fewer bugs in production

👷‍♂️ Clean, simple, production-ready PySpark.

🔁 The result: a fully SQL-based, modular pipeline with no global temp views – easier to test, reason about, and scale.

#PySpark #DataEngineering #SparkSQL #MedallionArchitecture #Lakehouse #ApacheSpark #ETL
