Data preparation using Amazon Redshift with AWS Glue DataBrew | Amazon Web Services

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases. You can choose from more than 250 built-in transformations to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an Amazon Redshift table, and transform and use different feature engineering techniques to prepare data to build a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.

Use case overview

For our use case, we use mock student datasets that contain student details like school, student ID, name, age, student study time, health, country, and marks. The following screenshot shows an example of our data.

For our use case, the data scientist uses this data to build an ML model to predict a student’s score in an upcoming annual exam. However, this raw data requires cleaning and transformation. A data engineer must perform the required data transformations so the data scientist can use the transformed data to build the model in SageMaker.

Solution overview

The following diagram illustrates our solution architecture.

The workflow includes the following steps:

  1. Create a JDBC connection for Amazon Redshift and a DataBrew project.
  2. DataBrew queries sample student performance data from Amazon Redshift and performs transformation and feature engineering to prepare the data for building an ML model.
  3. The DataBrew job writes the final output to our S3 output bucket.
  4. The data scientist builds the ML model in SageMaker to predict student marks in an upcoming annual exam.

We cover steps 1–3 in this post.

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use a mock dataset. You can download the DDL and data files from GitHub.

  1. Create the Amazon Redshift cluster to capture the student performance data.
  2. Set up a security group for Amazon Redshift.
  3. Create a schema called student_schema and a table called study_details. You can use DDL.sql to create the database objects.
  4. We recommend using the COPY command to load a table in parallel from data files on Amazon S3. However, for this post, you can use study_details.sql to insert the data in the tables.
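If you prefer to script step 4, the COPY statement can be issued through the Redshift Data API with boto3. The sketch below is illustrative only: the cluster identifier, S3 path, and IAM role ARN are hypothetical placeholders, so the live call is left commented out.

```python
def build_copy_statement(table, s3_path, iam_role):
    """Build a COPY statement that loads a CSV file from Amazon S3 in parallel."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )


def run_copy(sql, cluster_id, database, db_user):
    """Submit the statement via the Redshift Data API (needs AWS credentials)."""
    import boto3  # deferred so build_copy_statement stays usable offline

    client = boto3.client("redshift-data")
    return client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )


# Hypothetical names -- substitute your cluster, bucket, and IAM role:
sql = build_copy_statement(
    "student_schema.study_details",
    "s3://my-input-bucket/study_details.csv",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
# run_copy(sql, "student-cluster", "dev", "awsuser")
```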

Create an Amazon Redshift connection

To create your Amazon Redshift connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, student-db-connection).
  4. For Connection type, select JDBC.
  5. Provide other parameters like the JDBC URL and login credentials.
  6. In the Network options section, choose the VPC, subnet, and security groups of your Amazon Redshift cluster.
  7. Choose Create connection.
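These console steps can also be automated. A DataBrew JDBC connection is backed by an AWS Glue connection (which is why the console prefixes the name with AwsGlueDatabrew-), so a boto3 sketch might look like the following. The JDBC URL, credentials, and network IDs are placeholders, and the live call is commented out.

```python
def build_connection_input(name, jdbc_url, username, password,
                           subnet_id, security_group_ids, availability_zone):
    """Shape the ConnectionInput payload for glue.create_connection()."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Match the VPC, subnet, and security groups of the Redshift cluster
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": security_group_ids,
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(connection_input):
    """Create the backing Glue connection (requires AWS credentials)."""
    import boto3  # deferred so the builder above stays usable offline

    boto3.client("glue").create_connection(ConnectionInput=connection_input)


# Placeholder values -- substitute your cluster endpoint and VPC details:
conn = build_connection_input(
    "student-db-connection",
    "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
    "awsuser", "my-password",
    "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"], "us-east-1a",
)
# create_connection(conn)
```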

Create datasets

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, student).
  3. For Your JDBC source, choose the connection you created (AwsGlueDatabrew-student-db-connection).
  4. Select the study_details table.
  5. For Enter S3 destination, enter an S3 bucket for Amazon Redshift to store the intermediate result.
  6. Choose Create dataset.

You can also configure a lifecycle rule to automatically clean up old files from the S3 bucket.
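The same dataset can be created with the DataBrew API. This is a sketch assuming the Glue connection from the previous section already exists; the intermediate bucket name is a placeholder and the live call is commented out.

```python
def build_dataset_request(name, glue_connection, table, temp_bucket):
    """Shape the create_dataset request for a JDBC-backed DataBrew dataset."""
    return {
        "Name": name,
        "Input": {
            "DatabaseInputDefinition": {
                "GlueConnectionName": glue_connection,
                "DatabaseTableName": table,
                # S3 location where Amazon Redshift stages intermediate results
                "TempDirectory": {"Bucket": temp_bucket},
            }
        },
    }


def create_dataset(request):
    """Create the DataBrew dataset (requires AWS credentials)."""
    import boto3  # deferred so the builder above stays usable offline

    boto3.client("databrew").create_dataset(**request)


# Placeholder bucket -- use the connection and bucket you created:
req = build_dataset_request(
    "student", "AwsGlueDatabrew-student-db-connection",
    "student_schema.study_details", "my-intermediate-bucket",
)
# create_dataset(req)
```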

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, on the Projects page, choose Create project.
  2. For Project Name, enter student-proj.
  3. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. Select the student dataset.
  3. For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
  4. Choose Create project.

You can see a success message along with our Amazon Redshift study_details table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.
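Programmatically, the equivalent is a single create_project call. This sketch assumes the named recipe already exists (the console creates one for you); the role ARN is a placeholder and the live call is commented out.

```python
def build_project_request(name, dataset_name, recipe_name, role_arn):
    """Shape the create_project request for the DataBrew API."""
    return {
        "Name": name,
        "DatasetName": dataset_name,
        "RecipeName": recipe_name,
        "RoleArn": role_arn,
    }


def create_project(request):
    """Create the DataBrew project (requires AWS credentials)."""
    import boto3  # deferred so the builder above stays usable offline

    boto3.client("databrew").create_project(**request)


# The role ARN is a placeholder -- use the IAM role you granted to DataBrew:
req = build_project_request(
    "student-proj", "student", "student-proj-recipe",
    "arn:aws:iam::123456789012:role/DataBrewRole",
)
# create_project(req)
```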

Create a profiling job

DataBrew helps you evaluate the quality of your data by profiling it to understand data patterns and detect anomalies.

To create your profiling job, complete the following steps:

  1. On the DataBrew console, choose Jobs in the navigation pane.
  2. On the Profile jobs tab, choose Create job.
  3. For Job name, enter student-profile-job.
  4. Choose the student dataset.
  5. Provide the S3 location for job output.
  6. For Role name, choose the role to be used with DataBrew.
  7. Choose Create and run job.

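The profiling job can likewise be created and started through the API. The output bucket and role ARN below are placeholders; the live calls are commented out.

```python
def build_profile_job_request(name, dataset_name, output_bucket, role_arn):
    """Shape the create_profile_job request for the DataBrew API."""
    return {
        "Name": name,
        "DatasetName": dataset_name,
        "OutputLocation": {"Bucket": output_bucket},
        "RoleArn": role_arn,
    }


def create_and_run_profile_job(request):
    """Create the profile job and kick off a run (requires AWS credentials)."""
    import boto3  # deferred so the builder above stays usable offline

    databrew = boto3.client("databrew")
    databrew.create_profile_job(**request)
    databrew.start_job_run(Name=request["Name"])


# Placeholder bucket and role ARN:
req = build_profile_job_request(
    "student-profile-job", "student",
    "my-profile-output-bucket", "arn:aws:iam::123456789012:role/DataBrewRole",
)
# create_and_run_profile_job(req)
```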
Wait for the job to complete.

  1. Choose the Columns statistics tab.

You can see that the age column has some missing values.

You can also see that the study_time_in_hr column has two outliers.

Build a transformation recipe

All ML algorithms use input data to generate outputs. Input data comprises features, usually in structured columns. For the algorithms to work well, those features need specific characteristics. This is where feature engineering comes in. In this section, we apply some feature engineering techniques to prepare our dataset for building the model in SageMaker.

Let’s drop the columns from our dataset that aren’t required for model building.

  1. Choose Column and choose Delete.
  2. For Source columns, choose the columns school_name, first_name, and last_name.
  3. Choose Apply.

We know from the profiling report that the age value is missing in two records. Let’s fill in the missing value with the median age of other records.

  1. Choose Missing and choose Fill with numeric aggregate.
  2. For Source column, choose age.
  3. For Numeric aggregate, choose Median.
  4. For Apply transform to, select All rows.
  5. Choose Apply.
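Outside DataBrew, the same median imputation can be sketched in a few lines of plain Python. The rows below are made-up sample data; only the age column name mirrors the dataset.

```python
from statistics import median


def fill_missing_with_median(rows, column):
    """Replace missing (None) values in `column` with the median of the
    values that are present -- the effect of DataBrew's
    'Fill with numeric aggregate' -> Median transform."""
    present = [r[column] for r in rows if r[column] is not None]
    fill = median(present)
    return [dict(r, **{column: fill if r[column] is None else r[column]})
            for r in rows]


rows = [{"age": 15}, {"age": None}, {"age": 17}, {"age": 16}]
filled = fill_missing_with_median(rows, "age")
# The missing age becomes the median of 15, 17, and 16, i.e. 16
```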

We know from the profiling report that the study_time_in_hr column has two outliers, which we can remove.

  1. Choose Outliers and choose Remove outliers.
  2. For Source column, choose study_time_in_hr.
  3. Select Z-score outliers.
  4. For Standard deviation threshold, choose 3.
  5. Select Remove outliers.
  6. Under Remove outliers, select All outliers.
  7. Under Outlier removal options, select Delete outliers.
  8. Choose Apply.
  9. Choose Delete rows, then choose Apply.
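The Z-score rule applied above can be approximated in plain Python as follows. The sample study times are made up for illustration.

```python
from statistics import mean, pstdev


def remove_zscore_outliers(rows, column, threshold=3.0):
    """Drop rows whose value lies more than `threshold` standard deviations
    from the column mean -- an approximation of DataBrew's Z-score
    outlier removal."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all values identical: nothing to remove
        return list(rows)
    return [r for r in rows if abs((r[column] - mu) / sigma) <= threshold]


# Eleven typical study times and one extreme value:
rows = [{"study_time_in_hr": 5}] * 11 + [{"study_time_in_hr": 1000}]
cleaned = remove_zscore_outliers(rows, "study_time_in_hr")
# Only the extreme row is dropped, leaving 11 rows
```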

The next step is to convert the categorical value to a numerical value for the gender column.

  1. Choose Mapping and choose Categorical mapping.
  2. For Source column, choose gender.
  3. For Mapping options, select Map top 1 values.
  4. For Map values, select Map values to numeric values.
  5. For M, choose 1.
  6. For Others, choose 2.
  7. For Destination column, enter gender_mapped.
  8. For Apply transform to, select All rows.
  9. Choose Apply.
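The categorical mapping above has a simple plain-Python equivalent: listed values map to their numbers and everything else falls back to a catch-all value, just like the Others option.

```python
def map_categorical(rows, column, mapping, other_value, dest_column):
    """Map listed category values to numbers and everything else to
    `other_value`, like DataBrew's 'Map top N values' option."""
    return [dict(r, **{dest_column: mapping.get(r[column], other_value)})
            for r in rows]


rows = [{"gender": "M"}, {"gender": "F"}, {"gender": "M"}]
mapped = map_categorical(rows, "gender", {"M": 1}, 2, "gender_mapped")
# M maps to 1; every other value maps to 2
```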

ML algorithms often can’t work on label data directly; they require the input variables to be numeric. One-hot encoding is a technique that converts categorical data with no ordinal relationship into numeric indicator columns.

To apply one-hot encoding, complete the following steps:

  1. Choose Encode and choose One-hot encode column.
  2. For Source column, choose health.
  3. For Apply transform to, select All rows.
  4. Choose Apply.
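One-hot encoding itself is straightforward to sketch: each distinct value in the column becomes its own 0/1 indicator column. The health values below are made-up samples.

```python
def one_hot_encode(rows, column):
    """Expand `column` into one 0/1 indicator column per distinct value."""
    categories = sorted({r[column] for r in rows})
    encoded_rows = []
    for r in rows:
        encoded = dict(r)
        for c in categories:
            encoded[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded_rows.append(encoded)
    return encoded_rows


rows = [{"health": "good"}, {"health": "bad"}, {"health": "good"}]
encoded = one_hot_encode(rows, "health")
# Each row gains health_bad and health_good indicator columns
```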

The following screenshot shows the full recipe that we applied to our dataset before we can use it to build our model in SageMaker.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter student-performance.

We use CSV as the output format.

  1. For File type, choose CSV.
  2. For Role name, choose an existing role or create a new one.
  3. Choose Create and run job.
  4. Navigate to the Jobs page and wait for the student-performance job to complete.
  5. Choose the Destination link to navigate to Amazon S3 to access the job output.
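The recipe job can also be created and run through the DataBrew API. A job created from a project reuses that project's dataset and recipe. The output bucket and role ARN are placeholders; the live calls are commented out.

```python
def build_recipe_job_request(name, project_name, output_bucket, role_arn):
    """Shape the create_recipe_job request for a project-based job."""
    return {
        "Name": name,
        "ProjectName": project_name,
        "Outputs": [{"Location": {"Bucket": output_bucket}, "Format": "CSV"}],
        "RoleArn": role_arn,
    }


def create_and_run_recipe_job(request):
    """Create the recipe job and start a run (requires AWS credentials)."""
    import boto3  # deferred so the builder above stays usable offline

    databrew = boto3.client("databrew")
    databrew.create_recipe_job(**request)
    databrew.start_job_run(Name=request["Name"])


# Placeholder bucket and role ARN:
req = build_recipe_job_request(
    "student-performance", "student-proj",
    "my-output-bucket", "arn:aws:iam::123456789012:role/DataBrewRole",
)
# create_and_run_recipe_job(req)
```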

Clean up

Delete the following resources that might accrue cost over time:

  • The Amazon Redshift cluster
  • The recipe job student-performance
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project student-proj and its associated recipe student-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an Amazon Redshift data warehouse. We learned how to use this connection to create a DataBrew dataset for an Amazon Redshift table. We also saw how easily we can bring data from Amazon Redshift into DataBrew, seamlessly apply transformations and feature engineering techniques, and run recipe jobs that refresh the transformed data for ML model building in SageMaker.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.