How to Connect Apache Spark with PostgreSQL on an RDS Instance within the Same VPC in SageMaker
If you're working in SageMaker and need to connect Spark to a PostgreSQL RDS instance in the same VPC, here's a step-by-step guide, including setting up the required JDBC driver.
Steps:

1. Set up the PostgreSQL JDBC driver JAR on the SageMaker notebook instance.
2. Configure a SparkSession with the driver on its classpath and read from the RDS instance over JDBC.

Because both services sit in the same VPC, no public access is needed, but the RDS instance's security group must allow inbound traffic on port 5432 from the SageMaker instance.

Setting Up the JAR in SageMaker:
Spark talks to PostgreSQL through the JDBC driver, so the driver JAR must be on the Spark driver's classpath. Download the PostgreSQL JDBC driver JAR from jdbc.postgresql.org, place it somewhere the notebook instance can read (for example, under /home/ec2-user/SageMaker/), and note its full path; you will pass that path to spark.driver.extraClassPath in the configuration below.
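As a sketch, you can fetch the JAR straight from the notebook. The driver version (42.7.3) below is an assumption; check jdbc.postgresql.org for the current release.

```python
# Sketch: download the PostgreSQL JDBC driver JAR. The version number is an
# assumption -- substitute the current release from jdbc.postgresql.org.
import urllib.request
from pathlib import Path


def driver_url(version: str = "42.7.3") -> str:
    """Build the download URL for a given driver version."""
    return f"https://jdbc.postgresql.org/download/postgresql-{version}.jar"


def fetch_postgres_driver(dest_dir: str = ".", version: str = "42.7.3") -> Path:
    """Download the driver JAR into dest_dir and return its local path."""
    dest = Path(dest_dir) / f"postgresql-{version}.jar"
    urllib.request.urlretrieve(driver_url(version), str(dest))
    return dest


# On a notebook instance you might call, e.g.:
# jar_path = fetch_postgres_driver("/home/ec2-user/SageMaker")
```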
Example Spark + PostgreSQL Setup in SageMaker:
# Import and initialize a SparkSession in SageMaker
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read from PostgreSQL") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.memory", "8g") \
    .config("spark.driver.cores", "2") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.default.parallelism", "100") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.memory.offHeap.enabled", "false") \
    .config("spark.driver.extraClassPath", "<path_to_postgresql_driver_jar>") \
    .getOrCreate()
# JDBC URL and database connection properties
jdbc_url = "jdbc:postgresql://<your_rds_instance>:5432/<your_database>"
connectionProperties = {
    "user": "<your_username>",
    "password": "<your_password>",
    "driver": "org.postgresql.Driver"
}
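Hardcoding credentials in a notebook is easy to leak. A safer sketch reads them from environment variables; the variable names PGUSER and PGPASSWORD here are borrowed from the libpq convention and are an assumption, not something Spark requires (a secrets store such as AWS Secrets Manager is another option).

```python
# Sketch: build the connection properties from environment variables instead
# of hardcoding credentials. PGUSER / PGPASSWORD are assumed variable names.
import os


def connection_properties() -> dict:
    return {
        "user": os.environ.get("PGUSER", "<your_username>"),
        "password": os.environ.get("PGPASSWORD", "<your_password>"),
        "driver": "org.postgresql.Driver",
    }
```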
# Query to read from PostgreSQL: a subquery must be wrapped in parentheses
# and given an alias so Spark can treat it as a table
query = "(SELECT * FROM <your_table>) AS xyz"

# Read data from PostgreSQL
df = spark.read.jdbc(url=jdbc_url, table=query, properties=connectionProperties)
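The table argument to spark.read.jdbc accepts either a plain table name or a parenthesized subquery with an alias; in the latter case PostgreSQL runs the subquery server-side, so only the result set crosses the network. A small helper makes the wrapping explicit (the function name is ours, not part of the Spark API):

```python
# Sketch: wrap arbitrary SQL in the "(...) AS alias" form that
# spark.read.jdbc expects for its table argument. The helper name and the
# example query are assumptions for illustration.
def as_jdbc_table(sql: str, alias: str = "subq") -> str:
    return f"({sql}) AS {alias}"


weekly = as_jdbc_table("SELECT * FROM orders WHERE order_date > current_date - 7")
# weekly == "(SELECT * FROM orders WHERE order_date > current_date - 7) AS subq"
# df = spark.read.jdbc(url=jdbc_url, table=weekly, properties=connectionProperties)
```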