How to Connect Apache Spark with PostgreSQL on an RDS Instance within the Same VPC in SageMaker

If you're working in SageMaker and need to connect Spark to a PostgreSQL RDS instance in the same VPC, here's a step-by-step guide, including setting up the required JDBC driver.

Steps:

  1. Ensure VPC and Security Group Configuration: Make sure your SageMaker instance and your RDS PostgreSQL instance are in the same VPC, and that the RDS instance's security group allows inbound traffic on the PostgreSQL port (5432 by default) from the SageMaker instance's security group.
  2. Download and Place the JDBC Driver: Download the PostgreSQL JDBC driver JAR and place it in a directory your SageMaker environment can read, so Spark can load it onto its classpath.
  3. Set Up the Spark Session: Configure the Spark session with the resource settings you need and point it at the JDBC driver.
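For step 1, the check boils down to: does some ingress rule on the RDS security group cover TCP port 5432? A minimal sketch of that rule check, assuming rule dicts shaped like the `IpPermissions` entries that boto3's `ec2.describe_security_groups` returns (the group ID below is a hypothetical placeholder):

```python
# Sketch: decide whether a security group's ingress rules open the
# PostgreSQL port. Rule dicts mirror the IpPermissions entries returned
# by boto3's ec2.describe_security_groups; the GroupId is a placeholder.

def allows_port(ip_permissions, port=5432):
    """Return True if any ingress rule covers the given TCP port."""
    for rule in ip_permissions:
        if rule.get("IpProtocol") not in ("tcp", "-1"):  # -1 = all traffic
            continue
        from_port = rule.get("FromPort", 0)       # absent for "-1" rules
        to_port = rule.get("ToPort", 65535)
        if from_port <= port <= to_port:
            return True
    return False

# Example: a rule admitting traffic from the SageMaker security group
rules = [{"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
          "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}]}]
print(allows_port(rules))  # True
```

In a real check you would fetch `ip_permissions` with boto3 and also confirm the rule's source (CIDR or security group) actually matches your SageMaker instance.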

Setting Up the JAR in SageMaker:

  1. Place the PostgreSQL JDBC JAR in the same directory as your SageMaker notebook or script.
  2. Reference the JAR in your Spark Configuration: Point spark.driver.extraClassPath (and spark.jars, so the executors also receive the driver) at the JAR's location.

Example Spark + PostgreSQL Setup in SageMaker:



from pyspark.sql import SparkSession

# Initialize SparkSession in SageMaker
spark = SparkSession.builder \
    .appName("Read from PostgreSQL") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.memory", "8g") \
    .config("spark.driver.cores", "2") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.default.parallelism", "100") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.memory.offHeap.enabled", "false") \
    .config("spark.driver.extraClassPath", "<path_to_postgresql_driver_jar>") \
    .config("spark.jars", "<path_to_postgresql_driver_jar>") \
    .getOrCreate()

# JDBC URL and database properties
jdbc_url = "jdbc:postgresql://<your_rds_instance>:5432/<your_database>"
connectionProperties = {
    "user": "<your_username>",
    "password": "<your_password>",
    "driver": "org.postgresql.Driver"
}

# Query to push down to PostgreSQL (the subquery must be aliased)
query = "(SELECT * FROM <your_table>) AS xyz"

# Read data from PostgreSQL
df = spark.read.jdbc(url=jdbc_url, table=query, properties=connectionProperties)
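The read above pulls the whole table through a single JDBC connection. For large tables, Spark's JDBC source can split the scan across partitions using a numeric column. A hedged sketch of the option set (the `id` column and its bounds are illustrative assumptions; use a numeric, evenly distributed column from your own table):

```python
# Sketch: JDBC options for a parallel read. partitionColumn, lowerBound
# and upperBound below are assumptions about the source table.

def partitioned_read_options(jdbc_url, table, column, lower, upper, n):
    """Options for spark.read.format('jdbc') that split the scan into
    n partitions over [lower, upper] of a numeric column."""
    return {
        "url": jdbc_url,
        "dbtable": table,
        "driver": "org.postgresql.Driver",
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(n),
    }

opts = partitioned_read_options(
    "jdbc:postgresql://<your_rds_instance>:5432/<your_database>",
    "<your_table>", "id", 1, 1_000_000, 8)
# df = spark.read.format("jdbc").options(**opts) \
#          .option("user", "<your_username>") \
#          .option("password", "<your_password>").load()
```

Each of the 8 partitions then issues its own range query, which usually improves throughput markedly over a single-connection read.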
        





More articles by Hashir Khan

  • Deploying Llama2 Locally with Docker for OCR and Text Summarization

  • Neural Networks

