Make Data Reports Easier with Amazon S3 and AWS Glue

This article shows how to simplify data reports using Amazon S3 and AWS Glue. It provides practical steps for converting CSV files to Parquet, comparing the performance of the two formats, and working with a sample dataset from a fictional B2B SaaS company. The dataset, SaaS-Sales.csv, is provided in Comma Separated Values (CSV) format and can be downloaded from this link: https://static.us-east-1.prod.workshops.aws/public/17cd925c-9947-46e3-a4cb-be4803b06e6c/static/Data/different_datasets_and_their_use_case/data/SaaS-Sales.csv. The data is tabular, with each row representing a customer order.

Now, let's dive into the process. We'll use an AWS Glue crawler to explore the dataset, transform the CSV file into Parquet, and compare the performance of querying the Parquet and CSV data using AWS Glue and Amazon Athena. This approach reduces the resources needed to query and store the data.

We will use Amazon S3 to store assets used and created in this article, Amazon Athena to query data stored in Amazon S3, and AWS Glue to discover and transform data. 

Finally, we will automate data deletion using Amazon S3 Lifecycle configuration to optimize data storage resources. 

This article is flexible: it uses US East (N. Virginia) (us-east-1) as the region, but any other region where AWS Glue and Amazon Athena are available will work.

To get started, we must create a new Amazon S3 bucket to store the resources used and created in the article. We must also create an IAM (Identity and Access Management) role to permit AWS Glue to access our bucket. Here are the steps to create our bucket and upload our dataset: 

1. Open the Amazon S3 console. We can do this by using the search bar or clicking on the "Services" link in the upper left-hand corner of the screen. Select S3 under the "Storage" heading. 

2. Click the "Create bucket" button. We will be taken to the "Create bucket" page. Our bucket name must be unique among all existing bucket names in Amazon S3. A straightforward way to make it unique is to add our initials and our organization's name, for example, [your initials]-[your org]-sust-data-lab.

3. Select the same region where we want to deploy the lab from the "Region" drop-down list. Ensure that AWS Glue is available by checking the AWS Regional Services List. 

 4. On the "Block Public Access settings for this bucket" page, leave "Block all public access" checked. 

5. Leave the rest of the settings at their defaults and click the "Create bucket" button.

6. We will return to the page displaying all our S3 buckets. Click on the name of the bucket we just created. 

7. Create a folder for our dataset by clicking "Create folder". 

8. Name our folder "csv" and click "Create folder". 

9. Go to our new bucket's overview page, click on the folder we created, and then click "Upload" under the "Objects" tab. 

10. Click the "Add files" button and select the dataset file "SaaS-Sales.csv" from our device. After selecting the file, click "Upload" at the bottom of the dialog.

11. When the upload is complete, we should see "Upload succeeded". Click on "Close" to return to the bucket overview page. 

12. On the new bucket's overview page, we can view information about the uploaded object, including its type (.csv), size (1.6 MB), and storage class (Amazon S3 Standard).

13. Return to the bucket, select the csv folder, and copy its path (e.g., s3://[BUCKET_NAME]/csv/). We will use this path later. 
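
For readers who prefer to script this setup, the console steps above map to a few calls in the AWS SDK for Python. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the bucket name and local file path are placeholders to replace with our own values.

import boto3

BUCKET_NAME = "your-initials-your-org-sust-data-lab"  # placeholder, must be globally unique
REGION = "us-east-1"

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket. In us-east-1 no LocationConstraint is needed;
# in any other region, pass CreateBucketConfiguration.
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET_NAME)
else:
    s3.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# Upload the dataset under the csv/ prefix, mirroring the folder created in the console.
s3.upload_file("SaaS-Sales.csv", BUCKET_NAME, "csv/SaaS-Sales.csv")
print(f"Dataset uploaded to s3://{BUCKET_NAME}/csv/")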

The second step is to set up the permissions required to scan, transform, and store data on Amazon S3 using AWS Glue. To achieve this, we must create a service role that will provide AWS Glue with the necessary permissions.

Follow these steps to create the service role: 

1. Open the IAM (Identity and Access Management) console by searching for IAM in the AWS Console search bar.

2. Select Policies on the left menu and click Create Policy. 

3. Choose the JSON tab and replace the content with the policy statement below, replacing [BUCKET_NAME] with the name of our bucket: 

    "Version": "2012-10-17", 

    "Statement": [ 

        { 

            "Effect": "Allow", 

            "Action": [ 

                "s3:GetObject", 

                "s3:PutObject" 

            ], 

            "Resource": [ 

                "arn:aws:s3:::[BUCKET_NAME]*" 

            ] 

        } 

    ] 

This policy lets the principal get and put objects on our Amazon S3 bucket. Click Next. 

4. Give our policy a name such as "module-1-s3-policy" and enter a description: "Read and write on sb-org-module-1-lab bucket." Click on Create policy. 

5. Select Roles on the left menu and then click Create Role. 

6. Keep AWS service selected and choose Glue from the Use cases for other AWS services drop-down menu, then click Next. 

7. In the "Filter policies by property or policy name" field, search for AWSGlueServiceRole, press Enter, and select the AWS managed policy with that name.

8. Search again for the name we gave our new policy and select it. Click Next. 

9. Set a name for our Role such as "AWSGlueRole-module-1". Make sure both policies are selected and click Create Role. 

We have now created the necessary service role, which we will use in the next steps of the article. 
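
The same IAM setup can be scripted. Here is a minimal boto3 sketch under the same assumptions as before (credentials with IAM permissions, placeholder bucket name); it creates the bucket policy, the role with a trust relationship for AWS Glue, and the two policy attachments.

import json

import boto3

BUCKET_NAME = "your-initials-your-org-sust-data-lab"  # placeholder
iam = boto3.client("iam")

# Customer-managed policy allowing get/put on the lab bucket.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET_NAME}*"],
        }
    ],
}
policy = iam.create_policy(
    PolicyName="module-1-s3-policy",
    PolicyDocument=json.dumps(bucket_policy),
    Description=f"Read and write on the {BUCKET_NAME} bucket",
)

# Trust policy letting the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
iam.create_role(
    RoleName="AWSGlueRole-module-1",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed Glue service policy and our new bucket policy.
iam.attach_role_policy(
    RoleName="AWSGlueRole-module-1",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.attach_role_policy(
    RoleName="AWSGlueRole-module-1",
    PolicyArn=policy["Policy"]["Arn"],
)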

Before exploring our data, we need to create an AWS Glue crawler to scan it. In the following steps, we will discover the schema of our SaaS-Sales.csv dataset using AWS Glue and query it with Amazon Athena. The dataset is in Comma Separated Values (CSV) format.

AWS Glue is a tool for integrating data for analysis, machine learning, and application development. It allows us to easily prepare, discover, and combine data. We will use AWS Glue to set up a Crawler to scan our dataset and obtain its schema. 

Amazon Athena is a serverless tool for analyzing data in Amazon Simple Storage Service (S3) using Python or standard SQL. It enables us to start analyzing data immediately. We do not even need to load our data into Athena; it works directly with data stored in Amazon S3. 

To scan our data and understand its schema, we will use an AWS Glue crawler and register its output in our AWS Glue Data Catalog. A crawler connects to a data store and extracts the data's schema and other statistics. By crawling the dataset, we create tables in the Data Catalog that Athena can query.

To set up and run the crawler, follow these steps:

1. Go to Services, search for Glue, and click on AWS Glue to open the AWS Glue console.

2. On the Glue console left menu, expand Data Catalog, and select Crawlers. 

3. Choose Create Crawler. 

4. Enter a name for our Crawler, for example, "crawler-module-1" and click Next. 

5. Click Add data source. 

6. Paste the S3 path we copied earlier, for example, "s3://[BUCKET_NAME]/csv/". Then, click Add an S3 data source and click Next.

7. From the dropdown menu, select the IAM role we created at the beginning of the lab: AWSGlueRole-module-1. This role allows our Crawler to scan the data on our bucket. Click Next. 

8. Click Add database. This opens a new window for creating a Glue database. Create a database with a name such as "lab-module-1-database" and click Create.

9. Once created, return to the Glue crawler tab and choose our new database from the dropdown menu. Use the refresh button next to the database dropdown if our database is not shown in the list. Then, click Next.

10. Review the details and click Create Crawler. 

11. Once successfully created, select the created Crawler, and click Run crawler. 

12. Wait for the crawler to finish crawling our dataset stored on Amazon S3. Once it is finished, we should see a new table in the Glue Data Catalog. In the left menu, click Data Catalog > Databases, then select the database we created in step 8, "lab-module-1-database".

13. Click on our new table name to view details about our dataset, including its location (the S3 bucket where it is stored) and schema. 
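
The crawler can also be created and run with boto3. The sketch below is illustrative and uses the same names as the console steps; it creates the database explicitly (the console wizard does this in step 8) and polls until the crawler returns to the READY state.

import time

import boto3

BUCKET_NAME = "your-initials-your-org-sust-data-lab"  # placeholder
glue = boto3.client("glue", region_name="us-east-1")

# Create the target database and the crawler pointing at the csv/ prefix.
glue.create_database(DatabaseInput={"Name": "lab-module-1-database"})
glue.create_crawler(
    Name="crawler-module-1",
    Role="AWSGlueRole-module-1",  # Glue accepts the role name or its ARN
    DatabaseName="lab-module-1-database",
    Targets={"S3Targets": [{"Path": f"s3://{BUCKET_NAME}/csv/"}]},
)

# Run the crawler and wait until it goes back to READY.
glue.start_crawler(Name="crawler-module-1")
while glue.get_crawler(Name="crawler-module-1")["Crawler"]["State"] != "READY":
    time.sleep(15)

# List the tables the crawler created.
tables = glue.get_tables(DatabaseName="lab-module-1-database")
print([table["Name"] for table in tables["TableList"]])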

Convert data from CSV to Parquet:

Follow these steps to convert our data from CSV to Parquet using AWS Glue Studio. This conversion can improve performance and reduce Amazon Athena's query costs by 30–90 percent. 

1. Click on ETL Jobs on the AWS Glue left bar. 

2. Click Create after selecting Visual with a source and target. 

3. Select the Data source - S3 bucket node, choose the Data Catalog table radio button, and select the table created by the crawler.

4. Select the Data target - S3 bucket node. Change the Format to Parquet in the dropdown, select Uncompressed from the Compression Type dropdown, and browse to the bucket we created for this lab. In the textbox, append parquet/ to the S3 URL.

5. Remove the Apply mapping step. 

6. Enter a name for the job and select the role we created for the IAM Role. 

7. Disable the Job bookmark and save. 

8. Run the job and check the job run monitoring to see a list of running jobs, success/failure rates, and statistics. 

9. Check our Amazon S3 bucket for the new Parquet file. 

 

10. To include the new dataset in our Glue Data Catalog, repeat the crawler steps from the previous section to create a new crawler that points at the parquet/ prefix.

11. Once the crawler finishes crawling our dataset stored on Amazon S3, we should see a new table in the Glue Data Catalog under the database we created. 
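
Under the hood, the visual job generates a Glue ETL (PySpark) script. The generated code will differ in detail, but a hand-written equivalent looks roughly like the sketch below; the database, table, and bucket names are the ones used in this lab and may need adjusting, and compression options are omitted for brevity.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate; the awsglue libraries are available inside a Glue job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the table the crawler created from s3://[BUCKET_NAME]/csv/.
source = glue_context.create_dynamic_frame.from_catalog(
    database="lab-module-1-database",
    table_name="csv",
)

# Target: Parquet files under the parquet/ prefix of the same bucket.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://[BUCKET_NAME]/parquet/"},
    format="parquet",
)

job.commit()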

Comparing CSV and Parquet:

Here is a summary of what we have done so far. We uploaded our CSV dataset to Amazon S3, defined a database on AWS Glue, and used a crawler to explore data in an Amazon S3 bucket. Then, we created a table from it. Next, we transformed the CSV file into Parquet.  

Now that we have the same dataset in both CSV and Parquet formats, let us compare them in terms of storage and performance.

Storage:

Let us compare the size of the SaaS-Sales.csv object and its Parquet transformation. If we go to our Amazon S3 bucket and explore both objects, we can see that the Parquet dataset is significantly smaller than the CSV dataset. 

Here are the sizes of both objects: 

- CSV dataset size: 1.6 MB 

- Parquet dataset size: 614.6 KB 

By using a columnar format, customers can save on storage resources and reduce their infrastructure footprint.

Query Performance:

Let us use Amazon Athena to compare how these two formats perform when querying the data. 

Amazon Athena is an interactive analytics service that makes it easier to analyze data in Amazon Simple Storage Service (S3) using Python or standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and we can start analyzing data immediately. We do not need to load our data into Athena; it works directly with data stored in Amazon S3.

Here is how we can compare the two formats: 

1. Open the Athena console by typing Athena in the AWS Console search bar.

2. Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in our AWS account. In the Athena console, under Data, choose AwsDataCatalog as the Data source. Select the database we have been working with, lab-module-1-database; under Tables, we should see the table created by our Glue crawler, corresponding to the CSV file. Then select the other database, lab-module-1-databaseb; under Tables, we should see the table corresponding to the Parquet file.

3. Let us inspect the data. Click the three dots to the right of each table name to open a menu, then click Preview Table. This automatically runs a query that previews the first ten rows of the dataset: SELECT * FROM "lab-module-1-database"."csv" limit 10;

4. The data shows that this is a dataset listing customer orders. Let us run a test query to obtain the orders in a specific country. 

 5. In the query editor, write the following query: 

   - CSV data: SELECT * FROM "lab-module-1-database"."csv" WHERE country = 'Spain'; 

6. Take note of the query results and check the runtime and amount of data scanned. 

  • Run time: 475 ms 
  • Data scanned: 1.56 MB 

7. In the query editor, write the following query: 

   - Parquet data: SELECT * FROM "lab-module-1-databaseb"."parquet" WHERE country = 'Spain';  

8. Take note of the query results and check the runtime and amount of data scanned. 

  • Run time: 419 ms 
  • Data scanned: 619.15 KB 

As we can see, when running the same query, Athena needs to scan significantly less data on the Parquet dataset than on the CSV dataset. This means that to query our Parquet dataset, we need fewer resources. This information will help us analyze data more efficiently. 
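
The runtime and data-scanned figures shown in the console can also be collected programmatically, which makes it easy to repeat the comparison. Here is a minimal boto3 sketch; the query-results location is a placeholder, and the two table names are the ones used in this lab.

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")
RESULTS = "s3://[BUCKET_NAME]/athena-results/"  # placeholder: any writable S3 prefix

def run_and_measure(sql):
    # Start the query and poll until it reaches a terminal state.
    query_id = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": RESULTS},
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    stats = execution["Statistics"]
    return stats["EngineExecutionTimeInMillis"], stats["DataScannedInBytes"]

# Run the same filter against the CSV table and the Parquet table.
for table in ('"lab-module-1-database"."csv"', '"lab-module-1-databaseb"."parquet"'):
    runtime_ms, scanned = run_and_measure(f"SELECT * FROM {table} WHERE country = 'Spain'")
    print(f"{table}: {runtime_ms} ms, {scanned / 1024:.1f} KB scanned")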

Next, we will create a new Glue job to transform the CSV file into partitioned Parquet, crawl the resulting data, and compare how much data is scanned across three datasets: CSV, Parquet, and partitioned Parquet.

To create a new Glue Job, follow these steps: 

1. Visit the AWS Glue Studio management console and click on ETL Jobs. 

2. Choose the Visual with a source and target option and click Create. 

4. Select the Data source - S3 bucket at the top of the graph. 

5. Choose the lab-database-module-1 database (or a similar one we use throughout this lab) and select the Data Catalog table radio button. 

6. Select the csv table. 

7. Select the Data target - S3 bucket node at the bottom of the graph. 

8. Change the Format to Parquet in the dropdown. 

9. Under Compression Type, select Uncompressed from the dropdown. 

10. In the textbox, append /parquet_partitioned/ to the S3 URL of the bucket we used in this lab - do not forget the "/" at the end. The URL should look like s3://[BUCKET_NAME]/parquet_partitioned/. 

11. Click on Add a partition key. 

12. Select the first partition key from the drop-down menu, for example, country. 

13. Add a second partition key, for example, industry. 

14. On the Job details tab, give the job a name, for example, Glue-CSV-to-Parquet-partitioned-Module-1.

15. For IAM Role, select the AWSGlueRole-module-1 role we created at the beginning of the lab; we can leave the rest of the settings as default.

16. Press the Save button in the top right-hand corner to create the job. If we get an error saving the job, make sure that, under Advanced properties, the script filename matches the job name. 

17. Once we see the Successfully created job message in the banner, click the Run button to start the job. 

18. Once the job succeeds, we can check if our new Parquet file is stored in our Amazon S3 bucket. 
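
In script form, the only change from the earlier CSV-to-Parquet sketch is the partitionKeys option on the write step; everything else in the job stays the same. The snippet below is again a sketch using the lab's placeholder bucket name.

# Replace the write step of the earlier Glue ETL sketch with this call;
# "source" is the dynamic frame read from the csv table in the Data Catalog.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={
        "path": "s3://[BUCKET_NAME]/parquet_partitioned/",
        "partitionKeys": ["country", "industry"],
    },
    format="parquet",
)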

To crawl the new dataset, follow these steps: 

1. Head to the AWS Glue console and select Crawlers on the left bar.

2. Choose Create Crawler. 

3. Enter a name for our crawler (e.g., crawler-module-1-parquet-partitioned) and click Next. 

4. Click Add data source. 

5. Browse S3 to find the path to our Parquet object (e.g., s3://[BUCKET_NAME]/parquet_partitioned/) and click Add an S3 data source. 

6. Select the AWSGlueRole-module-1 role we created at the beginning of the lab from the drop-down menu.

7. Click Next. 

8. Choose lab-module-1-databaseb as our target database (the database that holds the Parquet table, used in the Comparing CSV and Parquet section under Query Performance, step 2).

9. Click Next. 

10. Review the details and click Create Crawler. 

11. Once we create the crawler successfully, select the created crawler and click Run crawler. 

12. Wait for the crawler to finish crawling our dataset stored on Amazon S3. Once finished, we should see a new table in the Glue Data Catalog under the lab-module-1-databaseb database.

To query our data in Athena, run the following simple queries: 

Query 1: Querying the table that points at the CSV dataset. 

SELECT SUM(sales) FROM "lab-module-1-database"."csv" WHERE country = 'Spain' AND industry = 'Energy';

Query 2: Querying the table that points at the Parquet dataset without partitions. 

SELECT SUM(sales) FROM "lab-module-1-databaseb"."parquet" WHERE country = 'Spain' AND industry = 'Energy';

Query 3: Querying the table that points to the partitioned Parquet dataset. 

SELECT SUM(sales) FROM "lab-module-1-databaseb"."parquet_partitioned" WHERE country = 'Spain' AND industry = 'Energy';

Comparing the amount of data scanned for each query, the partitioned Parquet table scans the least data, followed by the unpartitioned Parquet table, with the CSV table scanning the most.

Partitioning Parquet data results in significantly higher query performance. Organizations can optimize their resources by using a columnar format to query their data. This article uses a small amount of data, but partitioning can provide significant benefits when working with larger datasets. 

Within the Sustainability Pillar of the Well-Architected Framework, a key best practice for managing data is to use lifecycle policies to delete unused data. Defining lifecycle policies for all of our data, and automating their enforcement, lets us delete any unused data and minimizes the storage our workload requires.

In our scenario, the CSV dataset was only needed for a month, and we now build our reports from the Parquet dataset, so the CSV dataset is no longer necessary. If it is ever needed again, it can be recovered from the data department, so it is safe to delete.

To store our objects efficiently throughout their lifecycle, we can configure an Amazon S3 Lifecycle configuration: a set of rules that define the actions Amazon S3 applies to a group of objects. We can set up different rules for different data types in the same bucket.

To automatically delete the CSV file after 7 days, we can create a lifecycle rule as follows: 

1. Go to Amazon S3 and click on the bucket we use for this article. 

2. Choose the Management tab and click on Create lifecycle rule. 

3. Enter a name for the Lifecycle rule (e.g., delete_CSV). 

4. Select the option to limit the scope of this rule using one or more filters. 

5. Add "csv/" (or the prefix of the folder that contains our CSV dataset) as the prefix filter.

 6. Select the option to expire current versions of objects. 

7. In the "Days after object creation" field, enter "7". 

8. Click on Create rule. 

Following these steps creates a rule that automatically deletes our unused CSV file, streamlining data management and ensuring that storage space is used efficiently.
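
The same rule can also be applied with boto3. A minimal sketch follows; note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so any existing rules we want to keep must be included in the call. The bucket name is a placeholder.

import boto3

s3 = boto3.client("s3")

# Expire objects under the csv/ prefix 7 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="your-initials-your-org-sust-data-lab",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "delete_CSV",
                "Filter": {"Prefix": "csv/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)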
