Make Data Reports Easier with Amazon S3 and AWS Glue
This article shows how to simplify data reporting with Amazon S3. It walks through converting a CSV file to Parquet, comparing the performance of the two formats, and working with a sample dataset from a fictional B2B SaaS company. The dataset, SaaS-Sales.csv, is a simple table of customer orders provided in Comma Separated Values (CSV) format and can be downloaded here: https://static.us-east-1.prod.workshops.aws/public/17cd925c-9947-46e3-a4cb-be4803b06e6c/static/Data/different_datasets_and_their_use_case/data/SaaS-Sales.csv.
Now, let's dive into the process. We will use an AWS Glue crawler to explore the dataset, transform the CSV file into Parquet, and compare the performance of querying the Parquet and CSV data with AWS Glue and Amazon Athena. This reduces the resources needed to query and store the data.
We will use Amazon S3 to store assets used and created in this article, Amazon Athena to query data stored in Amazon S3, and AWS Glue to discover and transform data.
Finally, we will automate data deletion using Amazon S3 Lifecycle configuration to optimize data storage resources.
This article uses US East (N. Virginia) (us-east-1) as its region, but any region where AWS Glue and Amazon Athena are available can be used.
To get started, we must create a new Amazon S3 bucket to store the resources used and created in this article, and an IAM (Identity and Access Management) role that permits AWS Glue to access the bucket. Here are the steps to create our bucket and upload our dataset (a scripted equivalent follows the list):
1. Open the Amazon S3 console. We can do this by using the search bar or clicking on the "Services" link in the upper left-hand corner of the screen. Select S3 under the "Storage" heading.
2. Click the "Create Bucket" button. We will be taken to the "Create bucket" page. Our bucket name must be unique among all existing bucket names in Amazon S3. A straightforward way to make it unique is to add your initials and your organization's name to the name. For example, [your initials]—[your org]-sust-data-lab.
3. Select the region where we want to deploy the lab from the "AWS Region" drop-down list. Ensure that AWS Glue is available there by checking the AWS Regional Services List.
4. On the "Block Public Access settings for this bucket" page, leave "Block all public access" checked.
5. Leave the rest of the settings at their defaults and click the "Create bucket" button.
6. We will return to the page displaying all our S3 buckets. Click on the name of the bucket we just created.
7. Create a folder for our dataset by clicking "Create folder".
8. Name our folder "csv" and click "Create folder".
9. Go to our new bucket's overview page, click on the folder we created, and then click "Upload" under the "Objects" tab.
10. Click the "Add files" button and select the dataset file "SaaS-Sales.csv" from our device. Once the file is selected, click "Upload" at the bottom of the dialogue.
11. When the upload is complete, we should see "Upload succeeded". Click on "Close" to return to the bucket overview page.
12. We can view information about the uploaded object, including the type of object (.csv), size in MB (1.6MB), and storage class (Amazon S3 Standard) on our new bucket overview page.
13. Return to the bucket, select the csv folder, and copy its path (e.g., s3://[BUCKET_NAME]/csv/). We will use this path later.
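For readers who prefer to script this setup, here is a minimal sketch using boto3, the AWS SDK for Python. It assumes AWS credentials are already configured; the bucket name and local file path are placeholders to replace with our own values.

import boto3

BUCKET_NAME = "your-initials-your-org-sust-data-lab"  # placeholder, must be globally unique
REGION = "us-east-1"

s3 = boto3.client("s3", region_name=REGION)

# In us-east-1 the CreateBucketConfiguration block must be omitted;
# in any other region, pass the LocationConstraint.
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET_NAME)
else:
    s3.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# Upload the dataset under a csv/ prefix, mirroring the "csv" folder created in the console.
s3.upload_file("SaaS-Sales.csv", BUCKET_NAME, "csv/SaaS-Sales.csv")
print(f"Dataset uploaded to s3://{BUCKET_NAME}/csv/")

Newly created buckets block public access by default, which matches the console setting in step 4.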
The second step is to set up the permissions required to scan, transform, and store data on Amazon S3 using AWS Glue. To achieve this, we must create a service role that will provide AWS Glue with the necessary permissions.
Follow these steps to create the service role:
1. Go to IAM in the AWS Console.
2. Select Policies on the left menu and click Create Policy.
3. Choose the JSON tab and replace the content with the policy statement below, replacing [BUCKET_NAME] with the name of our bucket:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::[BUCKET_NAME]*"
            ]
        }
    ]
}
This policy lets the principal get and put objects on our Amazon S3 bucket. Click Next.
4. Give our policy a name such as "module-1-s3-policy" and enter a description: "Read and write on sb-org-module-1-lab bucket." Click on Create policy.
5. Select Roles on the left menu and then click Create Role.
6. Keep AWS service selected and choose Glue from the Use cases for other AWS services drop-down menu, then click Next.
7. In the "Filter policies by property or policy name" field, search for AWSGlueServiceRole and select the AWS managed policy with that name.
8. Search again for the name we gave our new policy and select it. Click Next.
9. Set a name for our Role such as "AWSGlueRole-module-1". Make sure both policies are selected and click Create Role.
We have now created the necessary service role, which we will use in the next steps of the article.
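The same policy and role can also be created programmatically. The boto3 sketch below assumes the names used in the steps above and attaches the AWS managed AWSGlueServiceRole policy alongside our bucket policy.

import json
import boto3

BUCKET_NAME = "your-initials-your-org-sust-data-lab"  # placeholder
iam = boto3.client("iam")

# Customer-managed policy granting GetObject/PutObject on the lab bucket.
policy = iam.create_policy(
    PolicyName="module-1-s3-policy",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET_NAME}*"],
        }],
    }),
)

# Role that the AWS Glue service can assume.
iam.create_role(
    RoleName="AWSGlueRole-module-1",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

# Attach the AWS managed Glue service policy plus our new bucket policy.
iam.attach_role_policy(
    RoleName="AWSGlueRole-module-1",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.attach_role_policy(
    RoleName="AWSGlueRole-module-1",
    PolicyArn=policy["Policy"]["Arn"],
)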
Scan the data with an AWS Glue crawler:
Before exploring our data, we need to create an AWS Glue crawler to scan it. In the following steps, we will explore the SaaS-Sales.csv dataset using AWS Glue and Amazon Athena.
AWS Glue is a tool for integrating data for analysis, machine learning, and application development. It allows us to easily prepare, discover, and combine data. We will use AWS Glue to set up a Crawler to scan our dataset and obtain its schema.
Amazon Athena is a serverless tool for analyzing data in Amazon Simple Storage Service (S3) using Python or standard SQL. It enables us to start analyzing data immediately. We do not even need to load our data into Athena; it works directly with data stored in Amazon S3.
To scan our data and understand its schema, we will use an AWS Glue crawler and register the results in our AWS Glue Data Catalog. A crawler connects to a data store, extracts the schema and other statistics, and populates the Data Catalog. Crawling the dataset creates a table that we can query with Athena.
To create and run the crawler, follow these steps (a scripted sketch follows the list):
1. Go to Services, search for Glue, and click on AWS Glue to open the AWS Glue console.
2. On the Glue console left menu, expand Data Catalog, and select Crawlers.
3. Choose Create Crawler.
4. Enter a name for our Crawler, for example, "crawler-module-1" and click Next.
5. Click Add data source.
6. Paste the "bucket name" that we copied earlier, for example, "s3://[BUCKET_NAME]/csv/". Then, click Add an S3 data source and click Next.
7. From the dropdown menu, select the IAM role we created at the beginning of the lab: AWSGlueRole-module-1. This role allows our Crawler to scan the data on our bucket. Click Next.
8. Click Add database. This opens a new window for creating a Glue database. Create a database with a name like "lab-module-1-database" and click Create.
9. Once created, return to the Glue crawler tab and choose our new database from the dropdown menu. Use the refresh button next to the database dropdown if it is not shown in the list. Then, click Next.
10. Review the details and click Create Crawler.
11. Once successfully created, select the created Crawler, and click Run crawler.
12. Wait for the crawler to finish crawling our dataset stored on Amazon S3. Once it is finished, we should see a new table in the Glue Data Catalog. In the left menu, click on Data Catalog > Databases. Then, select the database we created in step 8, "lab-module-1-database".
13. Click on our new table name to view details about our dataset, including its location (the S3 bucket where it is stored) and schema.
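As an alternative to the console, the crawler can be created and run with boto3. The sketch below assumes the database, crawler, and role names used in the steps above.

import time
import boto3

BUCKET_NAME = "your-initials-your-org-sust-data-lab"  # placeholder
glue = boto3.client("glue")

glue.create_database(DatabaseInput={"Name": "lab-module-1-database"})

glue.create_crawler(
    Name="crawler-module-1",
    Role="AWSGlueRole-module-1",
    DatabaseName="lab-module-1-database",
    Targets={"S3Targets": [{"Path": f"s3://{BUCKET_NAME}/csv/"}]},
)
glue.start_crawler(Name="crawler-module-1")

# Wait until the crawler returns to the READY state, then list the tables it created.
while glue.get_crawler(Name="crawler-module-1")["Crawler"]["State"] != "READY":
    time.sleep(30)
tables = glue.get_tables(DatabaseName="lab-module-1-database")
print([t["Name"] for t in tables["TableList"]])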
Convert data from CSV to Parquet:
Follow these steps to convert our data from CSV to Parquet using AWS Glue Studio. This conversion can improve performance and reduce Amazon Athena's query costs by 30–90 percent.
1. Click on ETL Jobs on the AWS Glue left bar.
2. Click Create after selecting Visual with a source and target.
3. Select the Data source - S3 bucket node, choose the Data Catalog table radio button, and select the table created by the crawler.
4. Select the Data target - S3 bucket node. Change the Format to Parquet in the dropdown, select Uncompressed from the Compression Type dropdown, and browse to the bucket we created for this lab. In the textbox, append parquet/ to the S3 URL (e.g., s3://[BUCKET_NAME]/parquet/).
5. Remove the Apply mapping step.
6. Enter a name for the job and select the role we created for the IAM Role.
7. Disable the Job bookmark and save.
8. Run the job and check the job run monitoring to see a list of running jobs, success/failure rates, and statistics.
9. Check our Amazon S3 bucket for the new Parquet file.
10. To include the new dataset in our Glue Data Catalog, repeat the crawler steps from the previous section, this time pointing the data source at s3://[BUCKET_NAME]/parquet/ and targeting a second database (e.g., "lab-module-1-databaseb").
11. Once the crawler finishes crawling our dataset stored on Amazon S3, we should see a new table in the Glue Data Catalog under the database we created.
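Behind the visual editor, AWS Glue Studio generates a PySpark script for the job. The following is a simplified sketch of what such a script could look like for this conversion; it is not the exact generated code, and the database, table, and bucket names follow the examples in this lab.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV table registered by the crawler.
source = glueContext.create_dynamic_frame.from_catalog(
    database="lab-module-1-database",
    table_name="csv",
)

# Write the same data back to Amazon S3 as uncompressed Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://[BUCKET_NAME]/parquet/"},
    format="parquet",
)

job.commit()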
Comparing CSV and Parquet:
Here is a summary of what we have done so far. We uploaded our CSV dataset to Amazon S3, defined a database on AWS Glue, and used a crawler to explore data in an Amazon S3 bucket. Then, we created a table from it. Next, we transformed the CSV file into Parquet.
Now that we have the same dataset in both CSV and Parquet formats, let us compare them in terms of storage and performance.
Storage:
Let us compare the size of the SaaS-Sales.csv object and its Parquet transformation. If we go to our Amazon S3 bucket and explore both objects, we can see that the Parquet dataset is significantly smaller than the CSV dataset.
Here are the sizes of both objects:
- CSV dataset size: 1.6 MB
- Parquet dataset size: 614.6 KB
By using a columnar format, customers can save on storage resources and reduce their infrastructure footprint.
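We can reproduce this comparison locally to see where the savings come from. The short sketch below assumes pandas and pyarrow are installed and that SaaS-Sales.csv is in the working directory; exact sizes will vary with the compression settings used.

import os
import pandas as pd

df = pd.read_csv("SaaS-Sales.csv")
df.to_parquet("SaaS-Sales.parquet", index=False)  # uses pyarrow under the hood

print(f"CSV size:     {os.path.getsize('SaaS-Sales.csv') / 1024:.1f} KB")
print(f"Parquet size: {os.path.getsize('SaaS-Sales.parquet') / 1024:.1f} KB")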
Query Performance:
Let us use Amazon Athena to compare how these two formats perform when querying the data.
As noted earlier, Athena is serverless and works directly with data stored in Amazon S3, so there is no infrastructure to set up or manage and no need to load data before querying it.
Here is how we can compare the two formats:
1. Open the Athena console by typing Athena in the AWS Console search bar.
2. Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in our AWS account. In the Athena console, under Data, choose AwsDataCatalog as the Data source. Select the database we have been working with, lab-module-1-database; under Tables, we should see the table created by our Glue crawler that corresponds to the CSV file. Then select the second database, lab-module-1-databaseb; under Tables, we should see the table that corresponds to the Parquet file.
3. Let us inspect the data. Click on the three dots on the right of each table's name to open a menu. Then, click on Preview Table. This will run a query to preview the first ten rows of the dataset automatically with the following query: SELECT * FROM "lab-module-1-database"."csv" limit 10;
4. The data shows that this is a dataset listing customer orders. Let us run a test query to obtain the orders in a specific country.
5. In the query editor, write the following query:
- CSV data: SELECT * FROM "lab-module-1-database"."csv" WHERE country = 'Spain';
6. Take note of the query results and check the runtime and amount of data scanned.
7. In the query editor, write the following query:
- Parquet data: SELECT * FROM "lab-module-1-databaseb"."parquet" WHERE country = 'Spain';
8. Take note of the query results and check the runtime and amount of data scanned.
As we can see, when running the same query, Athena needs to scan significantly less data on the Parquet dataset than on the CSV dataset. This means that to query our Parquet dataset, we need fewer resources. This information will help us analyze data more efficiently.
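The same comparison can be automated with boto3, which exposes the runtime and data-scanned statistics that the Athena console displays. The query-results location below is a placeholder prefix in our lab bucket.

import time
import boto3

athena = boto3.client("athena")

def run_and_measure(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://[BUCKET_NAME]/athena-results/"},
    )["QueryExecutionId"]
    # Poll until the query finishes, then report its statistics.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    stats = execution["Statistics"]
    print(sql)
    print(f"  data scanned: {stats['DataScannedInBytes'] / 1024:.1f} KB, "
          f"runtime: {stats['EngineExecutionTimeInMillis']} ms")

run_and_measure('SELECT * FROM "lab-module-1-database"."csv" WHERE country = \'Spain\'')
run_and_measure('SELECT * FROM "lab-module-1-databaseb"."parquet" WHERE country = \'Spain\'')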
In this module, we will create a new Glue job to transform the CSV file into partitioned Parquet, crawl the result, and compare the amount of data scanned across three datasets: CSV, Parquet, and partitioned Parquet.
To create a new Glue Job, follow these steps:
1. Visit the AWS Glue Studio management console and click on ETL Jobs.
2. Choose the Visual with a source and target option and click Create.
3. Select the Data source - S3 bucket node at the top of the graph.
4. Choose the lab-module-1-database database (or the one we have used throughout this lab) and select the Data Catalog table radio button.
5. Select the csv table.
6. Select the Data target - S3 bucket node at the bottom of the graph.
7. Change the Format to Parquet in the dropdown.
8. Under Compression Type, select Uncompressed from the dropdown.
9. In the textbox, append /parquet_partitioned/ to the S3 URL of the bucket we used in this lab; do not forget the "/" at the end. The URL should look like s3://[BUCKET_NAME]/parquet_partitioned/.
10. Click on Add a partition key.
11. Select the first partition key from the drop-down menu, for example, country.
12. Add a second partition key, for example, industry.
13. On the Job details tab, give the job a name such as Glue-CSV-to-Parquet-partitioned-Module-1.
14. For IAM Role, select the AWSGlueRole-module-1 role we created at the beginning of the lab; we can leave the rest of the settings as default.
15. Press the Save button in the top right-hand corner to create the job. If we get an error saving the job, make sure that, under Advanced properties, the script filename matches the job name.
16. Once we see the Successfully created job message in the banner, click the Run button to start the job.
17. Once the job succeeds, we can check that the new Parquet files are stored in our Amazon S3 bucket.
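Compared with the script generated for the first job, the only substantive change is the write step. A partitioned version might look roughly like the sketch below, with country and industry becoming folder levels in the S3 path (for example, parquet_partitioned/country=Spain/industry=Energy/).

# Continuing the earlier script sketch (glueContext and source are defined there).
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={
        "path": "s3://[BUCKET_NAME]/parquet_partitioned/",
        "partitionKeys": ["country", "industry"],
    },
    format="parquet",
)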
To crawl the new dataset, follow these steps:
1. Head to the AWS Glue console and select Crawlers in the left menu.
2. Choose Create Crawler.
3. Enter a name for our crawler (e.g., crawler-module-1-parquet-partitioned) and click Next.
4. Click Add data source.
5. Browse S3 to find the path to our Parquet object (e.g., s3://[BUCKET_NAME]/parquet_partitioned/) and click Add an S3 data source.
6. Select the AWSGlueRole-module-1 role we created at the beginning of the lab from the drop-down menu.
7. Click Next.
8. Choose lab-module-1-databaseb, the database we created when crawling the Parquet dataset, as our target database.
9. Click Next.
10. Review the details and click Create Crawler.
11. Once we create the crawler successfully, select the created crawler and click Run crawler.
12. Wait for the crawler to finish crawling our dataset stored on Amazon S3. Once finished, we should see a new table in the Glue Data Catalog under the lab-module-1-databaseb database.
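Optionally, we can confirm that the crawler registered the partitions by querying the Data Catalog with boto3. The table name below is an assumption; Glue typically names the table after the S3 folder it crawled.

import boto3

glue = boto3.client("glue")
partitions = glue.get_partitions(
    DatabaseName="lab-module-1-databaseb",
    TableName="parquet_partitioned",
)["Partitions"]
print(f"{len(partitions)} partitions registered")
print(partitions[0]["Values"])  # e.g. ['Spain', 'Energy'] for the country/industry keys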
To query our data in Athena, run the following simple queries:
Query 1: Querying the table that points at the CSV dataset.
SELECT SUM(sales)
FROM "lab-module-1-database"."csv"
WHERE country = 'Spain' and industry = 'Energy';
Query 2: Querying the table that points at the Parquet dataset without partitions.
SELECT SUM(sales)
FROM "lab-module-1-databaseb"."parquet"
WHERE country = 'Spain' and industry = 'Energy';
Query 3: Querying the table that points to the partitioned Parquet dataset.
SELECT SUM(sales)
FROM "lab-module-1-databaseb"."parquet_partitioned"
WHERE country = 'Spain' and industry = 'Energy';
Comparing the amount of data scanned by each query shows that the partitioned Parquet dataset requires by far the fewest resources, since Athena only reads the partition matching the WHERE clause. Organizations can optimize their resources by using a partitioned, columnar format to query their data. This article uses a small amount of data, but partitioning provides significant benefits when working with larger datasets.
The Sustainability Pillar of the AWS Well-Architected Framework recommends using lifecycle policies to delete unused data. Defining lifecycle policies for all of our data and automating their enforcement minimizes the storage our workload requires.
In our scenario, the CSV dataset was only needed for a month, and the Parquet dataset now backs our reports, so the CSV copy is no longer necessary. Because the data department can provide the file again if needed, it is safe to delete.
To store our objects efficiently throughout their lifecycle, we can configure an Amazon S3 Lifecycle: a set of rules that define the actions Amazon S3 applies to a group of objects. We can set up different rules for different data types in the same bucket.
To automatically delete the CSV file after 7 days, we can create a lifecycle rule as follows:
1. Go to Amazon S3 and click on the bucket we use for this article.
2. Choose the Management tab and click on Create lifecycle rule.
3. Enter a name for the Lifecycle rule (e.g., delete_CSV).
4. Select the option to limit the scope of this rule using one or more filters.
5. Add "/csv" (or the prefix of the folder where we have our CSV dataset) in the Filter type.
6. Select the option to expire current versions of objects.
7. In the "Days after object creation" field, enter "7".
8. Click on Create rule.
With this rule in place, the unused CSV file is deleted automatically, keeping our data management simple and our storage footprint small.
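The same rule can also be applied programmatically. Here is a minimal boto3 sketch that mirrors the console steps above; the bucket name is a placeholder.

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="your-initials-your-org-sust-data-lab",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "delete_CSV",
                "Filter": {"Prefix": "csv/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)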